Single-Shot Multi-Person 3D Body Pose Estimation From Monocular RGB Input
We propose a new efficient single-shot method for multi-person 3D pose estimation in general scenes from a monocular RGB camera. Our fully convolutional DNN-based approach jointly infers 2D and 3D joint locations on the basis of an extended 3D location-map formulation supported by body part associations. This new formulation enables the readout of full body poses from a subset of visible joints without the need for explicit bounding box tracking. It therefore succeeds even under strong partial body occlusions by other people and objects in the scene. We also contribute the first training data set showing real images of sophisticated multi-person interactions and occlusions. To this end, we leverage multi-view video-based performance capture of individual people for ground truth annotation and a new image-compositing approach for user-controlled synthesis of large corpora of real multi-person images. We also propose a new video-recorded multi-person test set with ground truth 3D annotations. Our method achieves state-of-the-art performance on challenging multi-person scenes.
Single-person pose estimation, both 2D and 3D, from monocular RGB input is a challenging and widely studied problem in vision [andriluka_pictorial_cvpr09, andriluka_mpii2d_cvpr14, mehta_mono_3dv17, VNect_SIGGRAPH2017, bulat_convpart_eccv16, chen_nips14, li_maximum_iccv2015, newell_stacked_hourglass_eccv16]. It has many practical applications, for instance in activity recognition, human–machine interaction, or content creation for graphics. There has been much progress in single-person 2D pose estimation, and some methods have tackled the more challenging 2D multi-person setting. However, 3D pose estimation methods have mostly been restricted to single, unoccluded subjects. Many natural human activities take place in groups, with multiple persons and in cluttered scenes. Monocular input with multiple people therefore not only exhibits self-occlusions of the body, but also strong inter-person occlusions or occlusions by objects, which compound the already under-constrained problem of inferring 3D pose from monocular RGB. Further, it is very hard to manually annotate or compute 3D ground truth for multi-person training image sets. Consequently, only very few methods have approached this more general 3D multi-person pose estimation problem [rogez_lcr_cvpr17], and it remains largely unsolved.
††This work was funded by the ERC Starting Grant project CapReal (335545). We would also like to thank Hyeongwoo Kim and Eldar Insafutdinov for their assistance in the project.
This paper proposes a new learning-based method to estimate the 3D pose of multiple persons in general scenes from monocular input, as well as a new way of creating realistic training data at a large scale. While there are many single-person datasets with ground truth 3D annotations, there are no multi-person datasets that contain realistic human–human interaction with person and background diversity. Computing, let alone manually annotating, such data at scale is difficult because of occlusions and the sheer number of annotations required. Previous approaches to the dataset problem propose using 2D pose data augmented with 3D poses from motion capture datasets [rogez_lcr_cvpr17], or enforce 3D consistency across 2D part annotations from multi-view images [simon2017hand]. In this work, we transform the MPI-INF-3DHP single-person dataset [mehta_mono_3dv17] into the first multi-person set with complex interactions, ground truth 3D, and real images of people. This dataset, which we call MuCo-3DHP, is created by compositing multiple 2D person images with ground truth 3D pose from multi-view capture, together with varying backgrounds, into a single frame. This allows us to controllably generate combinatorially large amounts of real image data for training. There are also very few multi-person test datasets with ground truth annotation showing more than 2 people [elhayek_convmocap_TPAMI2016]. We therefore captured a new multi-person 3D test set with indoor and outdoor scenes, challenging occlusions and interactions, and varying backgrounds, to evaluate our method. All datasets will be made publicly available.
Recent attempts at the 3D multi-person pose estimation problem [rogez_lcr_cvpr17] employ a detection framework to obtain bounding box proposals for each person. This complicates reasoning under occlusion and strong inter-person interaction, and furthermore induces a runtime penalty when scaling to many persons in a scene. We propose a single-shot DNN-based method to extract multi-person 3D pose. It reasons about all people in a scene jointly and does not require explicit bounding box proposals [rogez_lcr_cvpr17], which may be unreliable under strong occlusions and expensive to compute in dense multi-person scenes. Our fully convolutional method infers 2D and 3D joint locations together. It uses an enhanced 3D location-map representation [VNect_SIGGRAPH2017] specially tailored to the multi-person case. It allows the readout of a full 3D pose at a detected 2D torso root location, with articulation refinement at selected other joints further down the kinematic chain, and can thus jointly infer the full 3D pose of multiple people even under partial occlusion. Our main insight is that not all body parts need to be visible for a complete pose inference, but when limbs are visible, they can be used to improve the pose read out at the torso. Quantitative evaluation shows that estimating 3D pose at the torso root and then refining it at the limbs produces much better pose estimates than other approaches. To sum up, we contribute:
A learning-based single-shot multi-person pose estimation method that predicts both 2D and 3D joint locations without the need for bounding box extraction. Our method is tailored for scenes with occlusion by objects or other people.
The first multi-person dataset of real person images with 3D ground truth that contains complex inter-person occlusions and motion. Our compositing approach enables us to synthesize large amounts of data under user control, for learning based approaches.
A real in-the-wild test set for evaluating multi-person 3D pose estimation methods that contains challenging multi-person interactions, occlusions, and motion.
2 Related Work
In this review, we focus on most directly related work, namely estimating the pose of multiple people in 2D or single person in 3D from monocular RGB. [sarafianos_posesurvey_cviu2016] provide a more comprehensive review. With the exception of [rogez_lcr_cvpr17] ours is the first method for monocular multiple person 3D pose estimation.
Multi-Person 2D Pose Estimation: A common approach to multi-person 2D pose estimation is to first detect individual persons and then predict the 2D pose for each detection [pishchulin_reshape_cvpr12, gkioxari2014using, sun2011articulated, iqbal2016multi, papandreou2017towards]. Unfortunately, these methods fail when the detectors fail, which is likely to happen in multi-person scenarios with strong occlusions. Hence, a body of work first localizes the joints of all persons with CNN-based detectors and then finds the correct association between joints and subjects in a post-processing step. The associations are obtained by solving a fully connected graph in [pishchulin_deepcut_cvpr16]. This involves solving an NP-hard integer linear program, which easily takes hours per image. The work of [insafutdin_arttrack_cvpr17] improves the performance of [pishchulin_deepcut_cvpr16] by including image-based pairwise terms and using stronger detectors based on ResNet [he_resnet_cvpr2016]. This approach takes minutes instead of hours, but it is still computationally very expensive and can only handle a limited number of proposals. Cao et al. [cao_affinity_2017] detect joint locations and Part Affinity Fields (PAFs), which are 2D vectors indicating the direction of bones in the skeleton. Using PAFs and greedy part association, they achieve real-time multi-person 2D pose estimation. Others simultaneously predict joint locations and their associations [newell_associative_nips17] using a stacked hourglass CNN [newell_stacked_hourglass_eccv16].
Single-Person 3D Pose Estimation: Many monocular single-person 3D methods show good performance on standard benchmarks such as [ionescu_human36_pami14, sigal_humaneva_ijcv10]. Many train a discriminative predictor that regresses directly to 3D pose [bo_twin_ijcv10]. However, they often do not generalize well to natural scenes with varied poses, appearances, backgrounds and occlusions, because most of the aforementioned 3D datasets are restricted to indoor setups with limited backgrounds. The advent of large real-world image datasets with 2D annotations made 2D monocular pose estimation in the wild remarkably accurate; annotating images with 3D poses is much harder. Hence, recent works have focused on leveraging 2D image datasets for 3D human pose estimation. Some works split the problem in two: they first estimate 2D joints and then lift them to 3D. The seminal works of [taylor_articulated_cvpr00, sminchisescu2003kinematic] achieve this by reasoning about kinematic depth ambiguities; [chen_2d_match_cvpr17, yasin_dual_source_cvpr16] match detected 2D joints against a database of 3D poses; [moreno_distance_matrix_cvpr17] regresses pose from a 2D joint distance matrix. Another option is to exploit pose and geometric priors for lifting [zhou_sparseness_deepness_cvpr15, akhter_pose_conditioned_cvpr15, simo_joint_CVPR2013, Jahangiri2017]; [martinez20173dbaseline] trains a feed-forward network to directly predict 3D pose from 2D joints; [bogo_smpl_eccv16, Lassner:UP:2017] fit a recently released human body model [SMPL:2015] to 2D detections.
Other works leverage the features learned by a 2D pose estimation CNN for 3D pose estimation, assuming that features discriminative for 2D estimation should be useful in the 3D case as well. For example, [tekin_fusion_arxiv16] learns to merge features from a 2D joint prediction network and a 3D joint prediction network. Another approach is to train a network with separate 2D and 3D losses for the different data sources [popa2017deep, zhou2017towards, sun2017compositional]; the advantage of such methods is that they can be trained end to end. A simpler yet very effective approach is to refine a network trained for 2D pose estimation for the task of 3D pose estimation [VNect_SIGGRAPH2017, mehta_mono_3dv17]. A major limitation of methods that rely on 2D joint detections, directly or via bounding boxes, is that they easily fail under body occlusion or when some of the 2D detections are incorrect, both of which are common in multi-person scenes. In contrast, our approach is more robust to occlusions, since the complete global 3D pose can be read out at the first non-occluded location among pelvis, spine, or neck. As shown in [VNect_SIGGRAPH2017], 3D joint prediction works best when the prediction is centered at the 2D joint of interest.
Multi-Person 3D Pose Estimation: To our knowledge, only Rogez et al. [rogez_lcr_cvpr17] tackle multi-person 3D pose estimation from a single image. Their method uses a pipeline consisting of localization, classification and regression. They first identify bounding box proposals likely to contain a person using [ren_faster_rcnn_nips15]. Instead of regressing pose directly, they then classify each bounding box into a set of K poses, similar to [posebits_cvpr14]. These poses are scored by a classifier and refined by a regressor. All three components share the convolutional feature layers and are trained jointly. However, the method still reasons with bounding boxes internally and produces multiple proposals per subject that need to be accumulated and fused, and results with severe person-person occlusions are not shown. In contrast, our approach uses a fully convolutional network and produces multi-person 2D joint locations and 3D location maps in a single shot, from which the 3D pose can be inferred after grouping the 2D joint detections by person.
3D Pose Datasets: Existing pose datasets cover either a single person in 3D [ionescu_human36_pami14, sigal_humaneva_ijcv10, trumble2017total, vonPon2016a, mehta_mono_3dv17] or multiple persons with only 2D pose annotations [andriluka_mpii2d_cvpr14, lin_coco_eccv14]. One exception is the MARCOnI dataset [elhayek_convmocap_TPAMI2016], which features 5 sequences but contains at most 2 persons simultaneously and no close interactions. We choose to leverage the person segmentation masks available in MPI-INF-3DHP [mehta_mono_3dv17] to generate annotated multi-person 3D pose images of real people through compositing. The ground truth annotations for each person were obtained through multi-view marker-less motion capture [mehta_mono_3dv17]. We then compose images of multiple people by stacking person layers, simulating person-person occlusions; see Section 3.
3 Multi-Person Dataset
As discussed, single-person image datasets with 3D pose annotation have been generated at scale and with sufficient appearance diversity. Previous work used a combination of transfer learning [mehta_mono_3dv17, zhou2017towards] and appearance augmentation [mehta_mono_3dv17] with marker-based [ionescu_human36_pami14] and marker-less [mehta_mono_3dv17] indoor motion capture. At first it may seem trivial to extend these concepts to the multi-person case, i.e., to use a combination of in-the-wild multi-person 2D pose data [andriluka_mpii2d_cvpr14, lin_coco_eccv14] and multi-person multi-view motion capture for 3D annotation. However, multi-person 3D motion capture under strong occlusions and difficult interactions is still challenging even for commercial multi-view systems. In such scenes, manual pose correction is often needed, and 3D accuracy is thus constrained. This severely limits the scale at which real multi-person data can be captured and processed.
Hence, we employ multi-view marker-less motion capture only to create the 20 sequences of the first expressive in-the-wild test set for multi-person 3D pose estimation. For the much larger training set MuCo-3DHP, however, we resort to a new compositing and augmentation scheme that leverages the single-person image data of real people in MPI-INF-3DHP[mehta_mono_3dv17] to composite an arbitrary number of real multi-person interaction images with captured ground truth 3D under user control.
3.1 MuCo-3DHP: Compositing-Based Training Set
The recently released MPI-INF-3DHP [mehta_mono_3dv17] single-person 3D pose dataset provides marker-less motion-capture-based annotations for real images of 8 subjects, each captured with 2 clothing sets, using 14 cameras at different elevations. We leverage the person segmentation masks to create per-camera composites with 1 to 4 subjects, with frames randomly selected from the sequences available per camera. Since we have ground truth 3D skeleton pose for each video subject in the same space, compositing can be done in a 3D-aware way, resulting in correct depth ordering and overlap of the composited subjects, without any interpenetration of 3D bounding boxes. We refer to this composited training set as the Multi-person Composited 3D Human Pose (MuCo-3DHP) dataset. Example composites are shown in Fig. 1. The compositing process results in plausible images covering a range of simulated inter-person overlap and activity scenarios. User control over the desired pose and occlusion distributions during synthesis, and further FG/BG augmentation using the masks provided with MPI-INF-3DHP, is possible. For details on further processing applied to MuCo-3DHP while training, please refer to the supplementary document. Even though the synthesized composites may not simulate all fine-grained aspects of human-human interaction, our approach trained on these data generalizes well to the real-world scenes in our test set.
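The 3D-aware compositing described above can be sketched as a painter's-algorithm pass over the per-person segmentation masks, rendering subjects back to front by their ground-truth root depth. The function and data layout below are illustrative assumptions, not the exact MuCo-3DHP pipeline:

```python
import numpy as np

def composite_people(background, people):
    """Composite person crops onto a background, farthest first,
    so that nearer subjects correctly occlude farther ones.

    people: list of (rgb, mask, root_depth) tuples, where rgb is an
    (H, W, 3) image, mask an (H, W) binary segmentation mask, and
    root_depth the subject's ground-truth pelvis depth in camera space."""
    frame = background.copy()
    # Painter's algorithm: paste back-to-front using the 3D root depth.
    for rgb, mask, _ in sorted(people, key=lambda p: -p[2]):
        m = mask.astype(bool)
        frame[m] = rgb[m]
    return frame
```

Because every subject was recorded in the same calibrated camera space, sorting by root depth suffices to get plausible depth ordering without per-pixel depth maps.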
3.2 Test Set
We provide a new filmed, not composited, multi-person test set comprising 20 general real-world sequences with ground truth 3D pose for up to three subjects, obtained with a multi-view marker-less motion capture system [Captury]. The set covers 5 indoor and 15 outdoor settings, with stationary and moving backgrounds, trees, office buildings, road, people, vehicles, and other distractors in the background. Additionally, some of the outdoor footage has challenging elements such as drastic illumination changes and lens flare. The indoor sequences use footage at px resolution at 30 fps; the outdoor sequences were captured with GoPros at px resolution at 60 fps. The test set consists of 8000 frames, split among the 20 sequences, with 8 subjects in a variety of clothing styles, poses, interactions, and activities. A key feature is that the test sequences do not resemble the training sequences, and include real interaction scenarios.
Evaluation Metric: We use the robust 3DPCK evaluation metric proposed in [mehta_mono_3dv17]. It treats a joint's prediction as correct if it lies within a 15 cm ball centered at the ground truth joint location, and is evaluated for the common minimum set of 14 joints marked in green in Figure 3. We report the 3DPCK numbers per sequence, averaged over the subjects for which GT reference is available. Occluded joints and subjects are not excluded from the evaluation.
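The 3DPCK metric reduces to a simple per-joint distance threshold; a minimal sketch (function name and millimetre units are our assumptions) is:

```python
import numpy as np

def pck3d(pred, gt, threshold_mm=150.0):
    """Fraction of joints whose predicted 3D location lies within a
    fixed radius (15 cm by default) of the ground-truth location.

    pred, gt: (n_joints, 3) arrays of 3D joint positions in millimetres."""
    dist = np.linalg.norm(pred - gt, axis=1)  # Euclidean error per joint
    return float(np.mean(dist <= threshold_mm))
```

In the paper's protocol this is computed over the 14-joint common set and averaged per sequence over all annotated subjects, occluded joints included.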
Location Maps: Previous work [VNect_SIGGRAPH2017] observed that 3D pose inference can be linked more strongly to image evidence by inferring 3D joint positions at the respective 2D joint pixel locations using a fully convolutional neural network. This forces the network to focus on the image evidence around the 2D joint when inferring its 3D counterpart, and is achieved using location maps. A location map for a joint is a 2D feature channel in which each pixel stores the most likely x, y, or z coordinate for that joint, conditional on the 2D prediction for that joint being at that pixel. Three location maps per joint (one each for the x, y, and z coordinates), predicted at the same spatial resolution as the 2D heatmaps, store the 3D locations of all joints. Location maps are trained to produce reliable 3D predictions at image locations where 2D joints are detected; hence, at test time, the 3D joint location is read out at the corresponding 2D detection. Per-joint location inference as proposed in [VNect_SIGGRAPH2017] enables full 3D pose inference only if the person is completely visible. It therefore breaks down when joints are occluded, which happens often in general scenes: self-occlusions, person-person occlusions, occlusions by objects, and body truncation at frame boundaries are common in multi-person scenes. In the following, we detail our formulation and our solution to these challenges.
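The location-map read-out can be sketched as follows: find the joint's 2D detection as the heatmap maximum, then index the three coordinate channels at that pixel. Names and the argmax-based detection are illustrative simplifications:

```python
import numpy as np

def read_joint_3d(heatmap, loc_x, loc_y, loc_z):
    """Read one joint's 3D coordinate at its 2D detection.

    heatmap: (H, W) 2D detection confidence for the joint.
    loc_x, loc_y, loc_z: (H, W) location maps holding the joint's
    3D coordinate, trained to be valid near the 2D joint location."""
    # 2D detection = pixel of maximum heatmap response.
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    conf = heatmap[v, u]
    # Read the 3D coordinate from the location maps at that pixel.
    return np.array([loc_x[v, u], loc_y[v, u], loc_z[v, u]]), conf
```

This makes the failure mode plain: if the 2D detection is wrong or missing (occlusion), the 3D read-out is taken at an invalid pixel, which motivates the occlusion-robust formulation below.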
Given a monocular RGB image, we seek to estimate the 3D pose of each person in the image, i.e., the 3D locations of that person's body joints. The joint locations are encoded relative to their reference joints, marked with arrows in Figure 3. We make use of 2D joint heatmaps predicted by our network to encode the detection confidence of each joint type in the image. Additionally, we predict part affinity fields, which encode a 2D vector field for each body part denoting the direction pointing from the parent joint to its child [cao_affinity_2017]. This facilitates the association of 2D detections to person identities when there are multiple people in the scene.
The 3D locations of each joint that our network predicts are encoded in per-joint location maps for the x, y, and z coordinates. Note that we predict a fixed number of maps (one heatmap and three location maps per joint, and one part affinity field per body part), irrespective of the number of persons in the scene, which makes our method scale without additional processing.
Occlusion-Robust Location Maps: At the core of our method is a carefully designed encoding of pose for multiple persons which we call Occlusion-Robust Location Maps (ORLM). ORLMs have two special features: (1) they support a special read out scheme (see Section 4.2) that makes our method robust to partial occlusions of the body, (2) they encode the pose of multiple persons without needing a variable number of outputs.
To support our special read out scheme, we decompose the body into torso, four limbs, and head (see Figure 3). We denote as full pose the vector containing all joint locations. We denote as limb pose the part of the pose parameters corresponding to the limb, e.g., the limb-pose of the left arm is a vector of 6 parameters consisting of two 3D vector offsets: shoulder–elbow, and elbow–wrist. Given this decomposition, the ORLM are trained such that (see Figure 4):
At the root and neck location the full pose can be read out, and
At the wrist, elbow, ankle and knee locations the corresponding limb pose can be read out.
Notice that ORLM have redundancy built in to better deal with occlusions at inference time. For instance, the 3D location of the left elbow can be read out at four different pixel locations, namely at the 2D locations of the neck, root, left wrist, and the elbow itself (see Figure 4). Therefore, if a particular joint is occluded in the image, we read the pose information at a different joint, as explained in Section 4.2.
While in the original formulation location maps encode the 3D pose of only a single person, ORLM encode the 3D pose of all persons jointly without adding more channels. As shown in Figure 4, during training we encode the full 3D pose of multiple persons within the location maps. This ensures efficient pose inference even when multiple persons are visible, without needing variable outputs.
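The training-time encoding can be sketched as writing a person's full root-relative pose into the map channels around each of that person's read-out pixels; the channel layout, neighbourhood radius, and function name below are our assumptions, not the exact training target construction:

```python
import numpy as np

def write_orlm(maps, readout_px, full_pose, radius=2):
    """Write one person's full pose into the ORLM channels in a small
    neighbourhood around one of their read-out pixels (e.g. the neck).

    maps: (3 * n_joints, H, W) location-map channels, x/y/z per joint.
    readout_px: (u, v) pixel of the read-out joint.
    full_pose: (n_joints, 3) root-relative 3D joint locations."""
    u, v = readout_px
    _, H, W = maps.shape
    lo_v, hi_v = max(v - radius, 0), min(v + radius + 1, H)
    lo_u, hi_u = max(u - radius, 0), min(u + radius + 1, W)
    for j, (x, y, z) in enumerate(full_pose):
        # Every joint's coordinates are written at this one pixel region,
        # so the whole pose is recoverable from a single visible read-out joint.
        maps[3 * j + 0, lo_v:hi_v, lo_u:hi_u] = x
        maps[3 * j + 1, lo_v:hi_v, lo_u:hi_u] = y
        maps[3 * j + 2, lo_v:hi_v, lo_u:hi_u] = z
    return maps
```

Repeating this for each person at their own read-out pixels encodes all poses in one fixed-size set of channels, as the paragraph above describes.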
4.2 Pose Inference
3D pose inference of multiple people from ORLM is predicated on successful 2D joint location inference and association.
2D Pose Inference: We infer the 2D joint locations and joint detection confidences for each person in the image. Explicit 2D joint-to-person association is done with the predicted heatmaps and part affinity fields, using the approach of Cao et al. [cao_affinity_2017].
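In the association scheme of Cao et al., a candidate parent–child pairing is scored by the line integral of the part affinity field along the candidate bone; a simplified sketch (sampling count and names are our assumptions, and the original additionally uses a greedy bipartite matching over these scores) is:

```python
import numpy as np

def paf_score(paf_u, paf_v, p_parent, p_child, n_samples=10):
    """Score a candidate parent->child pairing by averaging the dot
    product between the predicted PAF and the unit vector of the
    candidate bone, sampled along the connecting segment.

    paf_u, paf_v: (H, W) x- and y-components of the PAF for this limb.
    p_parent, p_child: candidate 2D joint locations as (u, v)."""
    p0 = np.asarray(p_parent, dtype=float)
    p1 = np.asarray(p_child, dtype=float)
    d = p1 - p0
    norm = np.linalg.norm(d)
    if norm < 1e-8:
        return 0.0
    d /= norm  # unit direction of the candidate bone
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        u, v = np.round(p0 + t * (p1 - p0)).astype(int)
        # Alignment between the predicted field and the candidate bone.
        score += paf_u[v, u] * d[0] + paf_v[v, u] * d[1]
    return score / n_samples
```

High scores indicate that the field consistently points along the candidate bone, so the two detections likely belong to the same person.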
3D Pose Inference with ORLM: We use the 2D body joint locations and the joint detection confidences to infer the 3D pose of all persons in the scene. Algorithm 1 describes the 3D pose inference process, which is also visually explained in Figure 3. Since occlusions occur, naively reading the 3D joint locations at the 2D detections fails. We therefore propose two strategies to handle occlusions: (1) read-out priority and (2) 2D joint validation.
Read-Out Priority: By virtue of the ORLM we can read out joint predictions at different pixel locations, which makes us robust to occlusions. We denote the wrists and ankles as extremity joints, and the elbows and knees as middle joints. The 2D detections of the neck and the root joint are usually reliable, since these joints are most often not occluded and lie in the middle of the body. Therefore, we start by reading the full pose at the neck location. If the neck is invalid (as defined below), the full pose is read at the pelvis. If both joints are invalid, we consider the person not visible in the scene and do not predict their pose. While robust, full poses read at the pelvis and neck tend to be closer to the average pose in the training data. Therefore, for every limb, we continue by reading out the limb pose at the extremity joint. If the extremity joint is valid, the limb pose replaces the corresponding elements of the full pose. If the extremity joint is invalid, we try to read out the limb pose at the middle joint; if the middle joint is valid, the limb pose replaces the corresponding elements of the full pose. If the middle joint is also invalid, the prediction for the limb comes from the neck/pelvis full-pose read-out. The procedure is illustrated in Figure 3. This prioritized inference strategy makes us robust to occlusions.
2D Joint Validation: We check a selected 2D joint and mark it as valid if it satisfies two conditions: (1) it is unoccluded, i.e., its confidence value is higher than a threshold, and (2) it is sufficiently far away from similar joints of another person. If both conditions are satisfied, we look up the corresponding 3D pose based on the read-out priority. Otherwise, we fall back to a limb joint higher up in the hierarchy (e.g., ankle for leg, elbow for arm) until the above conditions are satisfied. The redundancies and fallbacks incorporated in our pose inference algorithm ensure reliable pose estimation in the presence of strong inter-person occlusions. Even if none of the limb joints are visible, we can still estimate a reasonable 3D pose based solely on the torso read-out.
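The priority-based read-out can be sketched as follows. The joint names, the `valid` predicate (standing in for the two-condition validation above), and the read-out callables are illustrative assumptions, not the paper's exact Algorithm 1:

```python
def read_pose(valid, read_full, read_limb, limbs):
    """Priority-based pose read-out over the ORLM.

    valid: dict joint_name -> bool, result of 2D joint validation.
    read_full: callable(joint_name) -> dict of all joint predictions
        read from the ORLM at that joint's 2D location.
    read_limb: callable(joint_name, limb_name) -> dict of limb-pose
        predictions read at that joint's 2D location.
    limbs: dict limb_name -> (extremity_joint, middle_joint)."""
    # Full pose: prefer the neck, then fall back to the pelvis (root).
    if valid.get('neck'):
        pose = read_full('neck')
    elif valid.get('pelvis'):
        pose = read_full('pelvis')
    else:
        return None  # person considered not visible in the scene
    # Refine each limb: extremity joint first, else middle joint,
    # else keep the torso read-out for that limb.
    for limb, (extremity, middle) in limbs.items():
        if valid.get(extremity):
            pose.update(read_limb(extremity, limb))
        elif valid.get(middle):
            pose.update(read_limb(middle, limb))
    return pose
```

Even with all limb joints invalid, the torso read-out alone still yields a complete (if average-biased) pose, matching the behaviour described above.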
4.3 Network Details
Our network is based on ResNet-50 [he_resnet_cvpr2016]. The original architecture is preserved up to res4f, after which we split it into two streams: a 2DPose+Affinity stream and a 3DPose stream. Architectural specifics of the two streams can be found in the supplemental document. The 2DPose+Affinity stream predicts the 2D heatmaps for the MS-COCO body joint set, and the part affinity fields.
The 3DPose stream predicts the 3D location maps for the x, y, and z coordinates, as well as 2D heatmaps for the VNect (MPI-INF-3DHP [mehta_mono_3dv17]) joint set, which has some overlap with the MS-COCO joint set, but does not include facial keypoint annotations and adds annotations for hands, toes and spine. For the limb-pose read-out locations described in the preceding section, we restrict ourselves to the common minimum joint set between the two, as indicated by the circles in Figure 3.
Training: We start with a ResNet-50-based architecture trained for single-person 2D pose estimation on the LSP [johnson_lsp_bmvc10, johnson_lspet_cvpr11] and MPI [andriluka_mpii2d_cvpr14] datasets, and use it to initialize the network up to res4f. We train without the 3DPose branch on MS-COCO [lin_coco_eccv14] multi-person 2D pose data, following Cao et al. [cao_affinity_2017]. We then freeze the weights of the core network and the 2DPose+Affinity branch, and train the 3DPose branch on our MuCo-3DHP data for 360k iterations with a batch size of 6. More details on training can be found in the supplementary document.