Learning to Reconstruct People in Clothing from a Single RGB Camera
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds with a reconstruction accuracy of 5 mm. Our model learns to predict the parameters of a statistical body model and instance displacements that add clothing and hair to the shape. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both bottom-up and top-down streams (one per view), allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, the model can take a variable number of frames as input, and is able to reconstruct shapes even from a single image with an accuracy of 6 mm. Results on 3 different datasets demonstrate the efficacy and accuracy of our approach.
The automatic acquisition of detailed 3D human shape and appearance, including clothing and facial details, is required for many applications such as VR/AR, gaming, virtual try-on, and cinematography.
A common way to acquire such models is with a scanner or a multi-view studio [3, 46]. The cost and size of such setups prevent their widespread use. Therefore, numerous works address capturing body shape and pose with more practical setups, e.g. from a low number of video cameras, or using one or more depth cameras, either specifically for the human body [9, 77, 84] or for general free-form surfaces [88, 51, 54, 37, 26, 70]. The most practical but also most challenging setting is capturing from a single monocular RGB camera. Some methods attempt to infer the shape parameters of a body model from a single image [41, 53, 10, 24, 8, 32, 86, 39, 55], but reconstructed detail is constrained to the model shape space, and thus does not capture personalized shape detail and clothing geometry. Recent work [6, 5] estimates more detailed shape, including clothing, from a video sequence of a person rotating in front of a camera while holding a rough A-pose. While the reconstructed models have high quality, the approach takes minutes for the shape component alone and requires a small amount of manual interaction. This is impractical for many applications that require fast acquisition, such as telepresence and gaming. The main bottleneck is the pre-processing step, which requires fitting the SMPL model to each of the frame silhouettes using time-consuming non-linear optimization.
In this work, we address these limitations and introduce Octopus, a convolutional neural network (CNN) based model that learns to predict 3D human shapes in a canonical pose given a few frames of a person rotating in front of a single camera. Octopus predicts using both bottom-up and top-down streams (one per view), allowing information to flow in both directions. Its bottom-up predictions take milliseconds per view and are then refined top-down using the same images within seconds. Inference, both bottom-up and top-down, is performed fully automatically using the same model. Octopus is therefore easy to use and more practical than previous work. Learning relies only on synthetic 3D data, and on semantic segmentation images and keypoints derived from synthesized video sequences. Consequently, Octopus can be trained without paired data (real images with ground truth 3D shape annotations), which is very difficult to obtain in practice.
Octopus predicts SMPL body model parameters, which represent the undressed shape and the pose, plus additional 3D vertex offsets that model clothing, hair, and details beyond the SMPL space. Specifically, a CNN encodes frames of the person (in different poses) into latent codes that are fused to obtain a single shape code. From the shape code, two separate network streams predict the SMPL shape parameters and the 3D vertex offsets in the canonical T-pose space, giving us the “unposed” shape or T-shape. Predicting the T-shape forces the latent codes to be pose-invariant, which is necessary to fuse the shape information contained in each frame. Octopus also predicts a pose for each frame, which allows us to “pose” the T-shape and render a silhouette to evaluate the overlap against the input images in a top-down manner during both training and inference. Specifically, since bottom-up models do not have a feedback loop, the feed-forward 3D predictions are correct but do not perfectly align with the input images. Consequently, we refine the prediction top-down by optimizing the poses, the T-shape, and the vertex offsets to maximize silhouette overlap and minimize joint re-projection error.
Experiments on a newly collected dataset (LifeScans), the publicly available PeopleSnapshot dataset, and the dataset of a prior RGB-D based method demonstrate that our model infers shapes with a reconstruction accuracy of 5 mm in less than 10 seconds. In summary, Octopus is faster than purely optimization-based fitting approaches, it combines the advantages of bottom-up and top-down methods in a single model, and it can reconstruct detailed shapes and clothing from a few video frames. Examples of reconstruction results are shown in the teaser figure. To foster further research in this direction, we made Octopus available for research purposes (http://virtualhumans.mpi-inf.mpg.de/octopus/).
2 Related Work
Methods for 3D human shape and pose reconstruction can be broadly classified as top-down or bottom-up. Top-down methods either fit a free-form surface or a statistical body model (model-based). Bottom-up methods directly infer a surface or body model parametrization from sensor data. We will review bottom-up and top-down methods for human reconstruction.
Free-form methods non-rigidly deform meshes [14, 22, 12] or volumetric shape representations [36, 4]. These methods are based on multi-view stereo reconstruction, and therefore require multiple RGB or depth cameras, which is a practical barrier for many applications. Using depth cameras, KinectFusion [38, 52] approaches reconstruct 3D scenes by incrementally fusing frame geometry and appearance in a canonical frame. Several methods build on KinectFusion for body scanning [64, 47, 82, 20]. The problem is that these methods require the person to stand still while the camera is moved around them. DynamicFusion generalized KinectFusion to non-rigid objects by combining non-rigid tracking and fusion. Although template-free approaches [52, 37, 65] are flexible, they can only handle very careful motions. Common ways to add robustness are pre-scanning the template, or using multiple Kinects [26, 54] or multi-view setups [67, 44, 19]. These methods, however, do not register the temporal 3D reconstructions to the same template and focus on other applications such as streaming or telepresence. Estimating shape by compensating for pose changes can be traced back to Cheung et al. [17, 18], who align visual hulls over time to improve shape estimation. To compensate for articulation, they merge shape information in a coarse voxel model. However, they need to track each body part separately and require multi-view input. All free-form works require multi-view input or depth cameras, or cannot handle moving humans.
Model-based methods exploit a parametric body model consisting of pose and shape [7, 33, 48, 89, 57, 40] to regularize the fitting process. Some depth-based methods [77, 34, 79, 84, 9] exploit the temporal information by optimizing a single shape and multiple poses (jointly or sequentially). This leads to expensive optimization problems. Using multi-view input, some works achieve fast performance [60, 61] at the cost of using a coarser body model based on Gaussians, or a pre-computed template. Early RGB-based methods were restricted to estimating the parameters of a body model, and required multiple views or manually clicked points [30, 86, 39, 63]. Shape and clothing have been recovered from RGB images [31, 15], depth, or scan data, but these approaches require manual intervention or limit clothing to a pre-defined set of templates. A fuzzy vertex association from clothing to body surface has also been introduced, which allows complex clothing to be modeled as body offsets. Some works are in between free-form and model-based methods. In [27, 76], the authors pre-scan a template and insert a skeleton, and other authors combine the SMPL model with a volumetric representation to track the clothed human body from a depth camera.
Learning of features for multi-view photo-consistency, and auto-encoders combined with visual hulls [28, 72], have been shown to improve free-form performance capture. These works, however, require more than one camera view. Very few works learn to predict personalized human shape from images: the lack of training data and the lack of a feedback loop between feed-forward predictions and the images make the problem hard. Variants of random forests and neural networks have been used [24, 23, 25, 75] to regress shape from silhouettes. The problem here is that predictions tend to look over-smooth, are confined to the model shape space, and do not comprise clothing. Garments have been predicted from a single image, but a separate model needs to be trained for every new garment, which makes the approach hard to use in practice. Recent pure bottom-up approaches to human analysis [50, 49, 58, 87, 69, 71, 62] typically predict shape represented as a coarse stick figure or bone skeleton, and cannot estimate body shape or clothing.
A recent trend of works combines bottom-up and top-down approaches, a combination that has already been exploited in earlier works. The most straightforward way is to fit a 3D body model to 2D pose detections [10, 43]. These methods, however, cannot capture clothing and details beyond the model space. Clothing, hair and shape [6, 5] can be inferred by fusing dynamic silhouettes (predicted bottom-up) of a video to a canonical space. Even with good 2D predictions, these methods are susceptible to local minima when not initialized properly, and are typically slow. Furthermore, the 2D prediction network and the model fitting are decoupled. Starting with a feed-forward 3D prediction, semantic segmentation, keypoints and scene constraints have been integrated top-down in order to predict the pose and shape of multiple people. Other recent works integrate the SMPL model, or a voxel representation, as a layer within a network architecture [41, 55, 53, 73]. This has several advantages: (i) predictions are constrained by a shape space of humans, and (ii) bottom-up 3D predictions can be verified top-down using 2D keypoints and silhouettes during training. However, the shape estimates are confined to the model shape space and tend to be close to the average. The focus of these works is rather on robust pose estimation, while we focus on personalized shapes. We also integrate SMPL within our architecture, but our work is different in several aspects. First, our architecture fuses the information of several images of the same person in different poses. Second, our model incorporates a fast top-down component during training and at test time. As a result, we can predict clothing, hair and personalized shapes using a single camera.
The goal of this work is to create a 3D model of a subject from a few frames of a monocular RGB video in less than 10 seconds. The model should comprise body shape, hair, and clothing, and should be animatable. We take inspiration from previous work and focus on the cooperative setting with videos of people rotating in front of a camera holding a rough A-pose. This motion is easy and fast to perform, and ensures that non-rigid motion of clothing and hair is not too large. In contrast to previous work, we aim for fast and fully automatic reconstruction. To this end, we train a novel convolutional neural network to infer a 3D mesh model of a subject from a small number of input frames. Additionally, we train the network to reconstruct the 3D pose of the subject in each frame. This allows us to refine the body shape by utilizing the decoder part of the network for instance-specific optimization (Fig. 1).
In Sec. 3.1 we describe the shape representation used in this work, followed by its integration into the predictor (Sec. 3.2). In Sec. 3.3 we explain the losses that are used in the experiments. We conclude by describing the instance-specific top-down refinement of results (Sec. 3.4).
3.1 Shape representation
Similar to previous work [83, 6], we represent shape using the SMPL statistical body model, which represents the undressed body, and a set of offsets modeling instance-specific details including clothing and hair.
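SMPL is a function $M(\cdot)$ that maps pose $\boldsymbol{\theta}$ and shape $\boldsymbol{\beta}$ to a mesh of vertices. By adding offsets $\mathbf{D}$ to the template $\mathbf{T}$, we obtain a posed shape instance as follows:
$$M(\boldsymbol{\beta}, \boldsymbol{\theta}, \mathbf{D}) = W\big(T(\boldsymbol{\beta}, \boldsymbol{\theta}, \mathbf{D}),\, J(\boldsymbol{\beta}),\, \boldsymbol{\theta},\, \mathcal{W}\big) \tag{1}$$
$$T(\boldsymbol{\beta}, \boldsymbol{\theta}, \mathbf{D}) = \mathbf{T} + B_s(\boldsymbol{\beta}) + B_p(\boldsymbol{\theta}) + \mathbf{D} \tag{2}$$
where linear blend-skinning $W(\cdot)$ with weights $\mathcal{W}$, together with pose-dependent deformations $B_p(\boldsymbol{\theta})$, allows us to pose the T-shape ($\mathbf{T} + B_s(\boldsymbol{\beta}) + \mathbf{D}$) based on its skeleton joints $J(\cdot)$. SMPL plus offsets, denoted as SMPL+D, is fully differentiable with respect to pose $\boldsymbol{\theta}$, shape $\boldsymbol{\beta}$ and free-form deformations $\mathbf{D}$. This allows us to directly integrate SMPL as a fixed layer in our convolutional architecture.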
3.2 Model and data representation
Given a set of images depicting a subject from different sides with corresponding 2D joints , we learn a predictor that infers the body shape , personal and scene specific body features , and 3D poses along with 3D positions for each image. is a CNN parametrized by network parameters .
Input modalities. Images of humans are highly diverse in appearance, requiring large datasets of annotated images in the context of deep learning. Therefore, to abstract away as much information as possible while still retaining shape and pose signal, we build on previous work [29, 13] to simplify each RGB image to a semantic segmentation and 2D keypoint detections. This allows us to train the network using only synthetic data and generalize to real data.
Model parametrization. By integrating the SMPL+D model (Sec. 3.1) into our network formulation, we can utilize its mesh output in the training of the predictor. Concretely, we supervise predicted SMPL+D parameters in three ways: by imposing a loss directly on the mesh vertices, on the predicted joint locations and their projections on the image, and densely on a rendering of the mesh using a differentiable renderer.
The T-shape () in Eq. 2 is now predicted from the set of semantic images with the function:
where are the regressors to be learned. Similarly, the mesh posed is predicted from the image and 2D joints with the function:
from which the 3D Joints are predicted with the linear regressor :
The joint regressor has been trained to output joint locations consistent with the Body_25 keypoint ordering. The estimated posed mesh can be rendered in uniform color with the image formation function parametrized by the camera:
Similarly, we can project the joints to the image plane by perspective projection:
All these operations are differentiable, which we can conveniently use to formulate suitable loss functions.
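To make the joint projection step concrete, the following is a minimal sketch of a pinhole perspective projection. It assumes joints are given in camera coordinates with positive depth and that the camera has a single focal length and a principal point in pixels. The function name `project_joints` and all numeric values are illustrative and not taken from the paper.

```python
def project_joints(joints_3d, focal_length, principal_point):
    """Project 3D joints (camera coordinates, z > 0) to 2D pixel coordinates."""
    cx, cy = principal_point
    projected = []
    for x, y, z in joints_3d:
        if z <= 0:
            raise ValueError("joint must lie in front of the camera (z > 0)")
        # Pinhole model: divide by depth, scale by focal length, shift to image center.
        projected.append((focal_length * x / z + cx, focal_length * y / z + cy))
    return projected


# Hypothetical values: two joints at 2 m depth, a 1080 px focal length.
joints = [(0.0, 0.0, 2.0), (0.5, -0.25, 2.0)]
pixels = project_joints(joints, focal_length=1080.0, principal_point=(540.0, 540.0))
print(pixels)
```

Because the projection only uses multiplication, division and addition, it stays differentiable with respect to the joint positions, which is what a 2D joint loss relies on.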
3.3 Loss functions
Our architecture permits two sources of supervision: (i) 3D supervision (in our experiments, from synthetic data derived by fitting SMPL+D to static scans), and (ii) 2D supervision from video frames alone. In this section, we discuss the different loss functions used to train the predictors.
Losses on body shape and pose. For a paired sample in the dataset we use the following losses between our estimated model and the ground truth model scan:
Per-vertex loss in the canonical T-pose. This loss provides useful 3D supervision on shape independently of pose:
Per-vertex loss in posed space. This loss supervises both pose and shape in Euclidean space:
Silhouette overlap loss. This loss compares the rendering produced by the image formation function defined in Eq. 7 against the binary segmentation mask. It is a weakly supervised loss, as it does not require 3D annotations and can be estimated directly from RGB images. In the experiments, we investigate whether such a self-supervised loss can reduce the amount of 3D supervision required (see Sec. 4.4). Additionally, we show that it can be used at test time to refine the bottom-up predictions and capture instance-specific details in a top-down manner (see Sec. 3.4).
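A comparison between a rendered silhouette and a segmentation mask can be sketched as below. The sketch assumes a squared per-pixel difference with a factor of one half; the exact form of the loss is not reproduced in this copy, so this choice and the name `silhouette_loss` are assumptions. Images are plain nested lists to keep the example dependency-free; a real pipeline would use a differentiable renderer and tensor operations.

```python
def silhouette_loss(rendered, mask):
    """Half the sum of squared per-pixel differences between two images."""
    same_height = len(rendered) == len(mask)
    same_width = all(len(r) == len(m) for r, m in zip(rendered, mask))
    if not (same_height and same_width):
        raise ValueError("rendered silhouette and mask must have the same size")
    return 0.5 * sum(
        (r - m) ** 2
        for rendered_row, mask_row in zip(rendered, mask)
        for r, m in zip(rendered_row, mask_row)
    )


# Toy 2x3 images: 1 marks a foreground pixel, 0 background.
rendered = [[0, 1, 1], [0, 1, 0]]
mask = [[0, 1, 0], [0, 1, 1]]
loss = silhouette_loss(rendered, mask)
print(loss)
```

The loss is zero exactly when the rendering and the mask agree on every pixel, so minimizing it pulls the posed mesh toward the observed outline without any 3D annotation.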
Per-vertex SMPL undressed body loss:
The aforementioned losses only penalize the final SMPL+D 3D shape. It is useful to include an “undressed-body” loss to force the shape parameters to be close to the ground truth
where are vectors of length . This also prevents the offsets from explaining the overall shape of the person.
Pose specific losses. In addition to the posed space and silhouette overlap losses, we train for the pose using a direct loss on the predicted parameters
where are vectorized rotation matrices of the 24 joints. Similar to [53, 43, 55], we use differentiable SVD to force the predicted matrices to lie on the manifold of rotation matrices. This term makes the pose part of the network converge faster.
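The projection onto the rotation manifold is not spelled out in the text. One standard construction, given here as an assumption about what is meant rather than as the authors' exact layer, is the orthogonal Procrustes solution: take the SVD of a predicted $3 \times 3$ matrix $\hat{\mathbf{R}}$ and rebuild it with unit singular values, flipping the sign of the last one if needed so that the determinant is $+1$.

```latex
\hat{\mathbf{R}} = \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^{\top}
\quad\Longrightarrow\quad
\mathbf{R} = \mathbf{U}\,\operatorname{diag}\!\big(1,\; 1,\; \det(\mathbf{U}\mathbf{V}^{\top})\big)\,\mathbf{V}^{\top}
```

The resulting $\mathbf{R}$ is the rotation matrix closest to $\hat{\mathbf{R}}$ in the Frobenius norm, so the direct pose loss is always evaluated on valid rotations.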
Losses on joints. We further regularize the pose training by imposing a loss on the joints in Euclidean space:
Similar to the 2D image projection loss on model (Eq. 11), we also have a weakly supervised 2D joint projection loss
3.4 Instance-specific top-down optimization
The bottom-up predictions of the neural model can be refined top-down at test time to capture instance-specific details. It is important to note that this step requires no 3D annotation, as the network fine-tunes using only 2D data. Specifically, at test time, given a subject’s images and 2D joints, we optimize a small set of layers of the network using image and joint projection losses (see Sec. 4.1). By fixing most layers of the network and optimizing only latent layers, we find a compromise between the manifold of shapes learned by the network and new features that have not been learned. We further regularize this step using Laplacian smoothness, face landmarks, and symmetry terms. Table 1 illustrates the performance of the pipeline before and after optimization (see Sec. 4.2 and 4.3).
The following section focuses on the evaluation of our method. In Sec. 4.1 we introduce technical details of the dataset and network architecture. The subsequent sections describe experiments for quantitative and qualitative evaluation as well as ablation and parameter analysis.
4.1 Experimental setup
To alleviate the lack of paired data, we use 615 static 3D scans of people in clothing. We purchased 150 scans from renderpeople.com; 465 scans were kindly provided by Twindom (https://web.twindom.com/). Unfortunately, the 615 scans do not contain enough variation in pose and shape to learn a model that generalizes. Hence, we generate synthetic 3D data by non-rigidly registering SMPL+D to each of the scans. This allows us to change the underlying body shape and pose of the scan using SMPL, see Fig. 2. As in previous work, we focus on a cooperative scenario where the person is turning around in front of the camera. Therefore, we animate the scans with turn-around poses and random shapes and render video sequences from them. We call the resulting dataset LifeScans, which consists of rendered images paired with 3D animated scans in various shapes and poses. Since the static scans are from real people, the generated images are close to photo-realistic, see Fig. 2. To prevent overfitting, we use semantic segmentation together with keypoints as an intermediate image representation, which preserves shape and pose signatures while abstracting away appearance. This reduces the amount of appearance variation required for training. To be able to render synthetic semantic segmentation, we first render the LifeScans subjects from different viewpoints and segment the output with an existing segmentation method. Then we project the semantic labels back into the SMPL texture space and fuse different views using graph cut-based optimization. This final step enables fully synthetic generation of paired training data.
Scale is an inherent ambiguity in monocular imagery. Three factors determine the size of an object in an image: distance to the camera, camera intrinsics, and the size of the object. As it is not possible to resolve this ambiguity in a monocular set-up with moving objects, we fix two factors and regress one. In other works [53, 41, 55], the authors have assumed a fixed distance to the camera. We cannot make this assumption, as we leverage multiple images of the same subject, where the distances to the camera may vary. Consequently, we fix the size of the subject to average body height. Precisely, we make SMPL height-independent by multiplying the model by a fixed reference length in meters divided by the y-axis distance between the vertices describing ankles and eyes. Finally, we fix the focal length to the sensor height.
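The height normalization described above amounts to a uniform rescaling of the mesh. The sketch below illustrates it under simplifying assumptions: a single vertex stands in for the ankles and a single vertex for the eyes, and `target_distance` is a hypothetical parameter because the reference length used in the paper is not reproduced in this copy.

```python
def normalize_height(vertices, ankle_index, eye_index, target_distance):
    """Uniformly scale a mesh so the ankle-to-eye y-distance equals target_distance."""
    extent = abs(vertices[eye_index][1] - vertices[ankle_index][1])
    if extent == 0.0:
        raise ValueError("ankle and eye vertices must differ along the y-axis")
    scale = target_distance / extent
    # The same factor is applied to every coordinate, so proportions are preserved.
    return [(x * scale, y * scale, z * scale) for x, y, z in vertices]


# Toy "mesh": vertex 0 stands in for an ankle, vertex 1 for an eye.
mesh = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.4, 1.0, 0.2)]
scaled = normalize_height(mesh, ankle_index=0, eye_index=1, target_distance=1.5)
print(scaled[1][1] - scaled[0][1])
```

With subject size and focal length fixed in this way, the only remaining free factor is the distance to the camera, which can then vary per image.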
In the following we describe details of the convolutional neural network. An overview is given in Fig. 3. The input to the network is a set of semantically segmented images and corresponding 2D joint locations. The network encodes each image with a set of five convolutions with ReLU activations followed by max-pooling operations into a pose-invariant latent code of fixed size. The pose branch maps both the joint detections and the output of the last convolutional layer to an intermediate vector and finally to the pose-dependent latent code via fully connected layers. The shape branch aggregates pose-invariant information across images by computing the mean of the per-image latent codes. Note that this formulation allows us to aggregate pose-dependent and pose-invariant information across an arbitrary and varying number of views. The shape branch goes on to predict SMPL shape parameters and free-form deformations on the SMPL mesh. The shape parameters are directly calculated from the fused latent code with a linear layer. In order to predict per-vertex offsets from the latent code, we use a four-step graph convolutional network with Chebyshev filters and mesh upsampling layers, similar to prior work. Each convolution is followed by a ReLU activation. We prefer a graph convolutional network over a fully connected decoder due to memory constraints and in order to get structured predictions.
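The mean-based fusion in the shape branch is what lets the model accept any number of views. A minimal sketch, using plain lists as stand-ins for latent codes (the function name `fuse_latent_codes` and the toy values are illustrative only):

```python
def fuse_latent_codes(codes):
    """Average per-view latent codes into a single shape code."""
    if not codes:
        raise ValueError("at least one view is required")
    size = len(codes[0])
    if any(len(code) != size for code in codes):
        raise ValueError("all latent codes must have the same length")
    # Element-wise mean: the output length does not depend on the number of views.
    return [sum(code[i] for code in codes) / len(codes) for i in range(size)]


one_view = fuse_latent_codes([[1.0, 2.0, 3.0]])
three_views = fuse_latent_codes([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]])
print(one_view, three_views)
```

Since the mean is invariant to the order of the views and always returns a code of the same length, the downstream shape and offset decoders need no changes when the number of input frames varies.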
The proposed method, including rendering, is fully differentiable and end-to-end trainable. Empirically we found it better to train the pose branch before training the shape branch. Thereafter, we optimize the network end-to-end. We use a training schedule for our pose branch similar to prior work, where we first train the network using losses on the joints and pose parameters, followed by training using losses on the vertices and pose parameters. We also experiment with various training schemes, and show that weakly supervised training can significantly reduce the dependence on 3D annotated data (see Sec. 4.4). For that experiment, we train the model with alternating full and weak supervision. During instance-specific optimization we keep most layers fixed and only optimize the latent pose, the latent shape, and the last graph convolutional layer, which outputs the free-form displacements.
4.2 Numerical evaluation
We quantitatively evaluate our method on a separate test set of the LifeScans dataset containing 25 subjects. We use semantic segmentation images and 2D poses as input and optimize the results for a maximum budget of 10 seconds. All results have been computed without intensive hyper-parameter tuning. To quantify shape reconstruction accuracy, we adjust the pose of the estimation to match the ground truth, following [83, 9]. This disentangles errors in pose from errors in shape and allows us to quantify shape accuracy. Finally, we compute the bi-directional vertex-to-surface distance between scans and reconstructions. We report mean errors in millimeters (mm) across the test set in Tab. 1. We differentiate between the full method and ground truth (GT) poses. Full method refers to our method as described in Sec. 4.1. The latter is a variant of our method that uses ground truth poses, which allows us to study the effect of pose errors. In Fig. 4 we display subjects in the test set for both variants along with per-vertex error heatmaps. Visually the results look almost indistinguishable, which is corroborated by the fact that the numerical error differs only slightly between the GT and predicted pose models (Tab. 1). This demonstrates the robustness of our approach. We show more examples with the corresponding texture for qualitative assessment in the teaser figure. The textures have been computed using graph cut-based optimization on semantic labels.
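Table 1: Reconstruction error on the LifeScans test set (mean ± standard deviation, in mm).
| | Before optimization | After optimization |
| Full Method | 5.82 ± 5.54 | 5.22 ± 5.07 |
| GT Poses | 5.88 ± 5.55 | 4.90 ± 5.05 |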
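The bi-directional distance used for the numbers above can be approximated as sketched below. Note the swapped-in simplification: the sketch measures point-to-nearest-point distances between two point sets instead of true vertex-to-surface distances, and it averages the two directions with equal weight, which is an assumption. All names and coordinates are illustrative.

```python
import math


def mean_nearest_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    if not source or not target:
        raise ValueError("both point sets must be non-empty")
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def bidirectional_distance(points_a, points_b):
    """Average of the two directed mean nearest-point distances."""
    return 0.5 * (
        mean_nearest_distance(points_a, points_b)
        + mean_nearest_distance(points_b, points_a)
    )


# Toy point sets: the scan has one extra point that the reconstruction misses.
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
error = bidirectional_distance(reconstruction, scan)
print(error)
```

Measuring in both directions matters: the reconstruction-to-scan direction alone is zero in this toy case, and only the scan-to-reconstruction direction reveals the missing geometry. The brute-force search is quadratic, so real meshes would call for a spatial index.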
4.3 Analysis of key parameters
Our method comes with two key hyper-parameters, namely the number of input images and the number of optimization steps. In the following section, we study these parameters and how they affect the performance of our approach. We also justify our design choices.
Fig. 6 illustrates the performance of our method with a growing number of optimization steps. The performance gain saturates as the number of steps grows; in the following experiments we fix the number of steps as a compromise between accuracy and speed. On a single Volta V100 GPU, the resulting optimization fits a time budget of 10 s. We believe this is a practical waiting time and a good compromise for many applications. Therefore we fix the time budget to this value for the following experiments.
Including more input views at test time can potentially improve the performance of the method. However, in practice, this means more data pre-processing and longer inference times. Fig. 7 illustrates the performance with different numbers of input images. Perhaps surprisingly, the performance saturates already at around 5 images before optimization. After optimization, a different picture emerges. The error in the full method slightly increases for larger numbers of views, while it still goes down for GT poses. This can be explained by the fixed time budget in this experiment: small errors in pose cannot be corrected and are erroneously compensated by the shape. This is not a major problem, since our method produces good results even given only a single input image. While we could potentially use fewer images, we settle on a handful of input views as a practical choice, for the following reason: a calculated avatar should not only be numerically accurate but also visually appealing. Results based on a larger number of views show more fine details and, most importantly, allow accurate texture calculation.
4.4 Type of supervision
Since videos are easier to obtain than 3D annotations, we evaluate to what extent they can substitute full 3D supervision to train our network. To this end, we split the LifeScans dataset. One part is used for full supervision, the other part is used for weak supervision in the form of image masks and 2D keypoints. All forms of supervision can be synthetically generated from the LifeScans dataset. We train with 10%, 20%, 50%, and 100% full supervision and compare the performance on the test set in Tab. 2. In order to factor out the effect of problematic poses during training, we used ground truth poses in this experiment. The results suggest that the network can be trained with only a minimal amount of full supervision, given strong pose predictions. The performance of the network decreases only slightly for less than 100% full supervision. Most interestingly, the results are almost identical for 10%, 20%, and 50% full supervision. This experiment suggests that we could potentially improve performance by supervising our model with additionally recorded videos. We leave this for future work.
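Table 2: Reconstruction error for different fractions of full supervision, using GT poses (mean ± standard deviation, in mm).
| Full supervision | Before optimization | After optimization |
| 100% | 5.88 ± 5.55 | 4.90 ± 5.05 |
| 50% | 6.42 ± 6.10 | 5.65 ± 5.71 |
| 20% | 6.51 ± 6.04 | 5.27 ± 5.30 |
| 10% | 6.57 ± 6.06 | 5.56 ± 5.45 |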
4.5 Qualitative results and comparisons
We qualitatively compare our method against the most relevant work on their PeopleSnapshot dataset. While their method leverages 120 frames, we use only a few frames for our reconstructions. For a fairer comparison, we adjust the optimization time budget in this experiment. This is still several orders of magnitude faster than their method, which needs minutes for shape optimization plus additional time per frame for the pose. In Fig. 5 we show a side-by-side comparison. Our results are visually still on par while requiring a fraction of the data.
We also compare our method against an RGB-D based optimization method. Their dataset displays subjects in minimal clothing rotating in front of the camera in T-pose. Unfortunately, the semantic segmentation network is not able to successfully segment subjects in minimal clothing. Therefore we slightly change the set-up for this experiment. We segment their dataset using a semi-automatic approach and re-train our predictor to be able to process binary segmentation masks. Additionally, we augment the LifeScans dataset with T-poses. We show side-by-side comparisons in Fig. 8. Again our results are visually similar, despite the use of less data and only monocular input.
5 Discussion and Conclusion
We have proposed a novel method for fully automatic 3D body shape estimation from only a few frames of a monocular video of a person moving. Our novel predictor Octopus allows computing mesh-based pose-invariant shape predictions and per-image pose estimations from a flexible number of views. Further, Octopus can refine these predictions to an accuracy of 5 mm relying only on the 2D input data. In summary, we improve over the state of the art in the following aspects: Our method allows, for the first time, to estimate detailed full body reconstructions of people in clothing in a fully automatic manner. We significantly reduce the number of images needed at test time and compute the final result several orders of magnitude faster than the state of the art. Extensive experiments on the LifeScans dataset demonstrate the performance and the influence of the key parameters of the predictor. While our model is independent of the number of input images and can be refined for different numbers of optimization steps, we have shown that using a handful of views and refining for 10 seconds are good compromises between accuracy and practicability. Qualitative results on two real-world datasets demonstrate the applicability and robustness of our method to real data, despite being trained on synthetic data alone.
Future work should enable the proposed method for scenarios where the subject is not cooperating. This would e.g. allow reconstructing virtual actors from legacy movie material, or from YouTube videos. Furthermore, the current model formulation is limiting possible use cases. Clothing with topologies different than the undressed body shape, such as skirts and coats or hairstyles like ponytails cannot be modeled as displacements.
By enabling fully automatic 3D body shape reconstruction from a handful of images in only a few seconds, we prepare the ground for widespread acquisition of personalized 3D avatars. People are now able to quickly digitize themselves using only a webcam and can use their model for various VR and AR applications.
The authors gratefully acknowledge funding by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) from projects MA2555/12-1 and 409792180. We would like to thank Twindom for providing us with the scan data. Further thanks go to Verica Lazova for great help in data processing.
- [1] https://github.com/cmu-perceptual-computing-lab/openpose.
- [2] http://virtualhumans.mpi-inf.mpg.de/octopus/.
- [3] Naveed Ahmed, Edilson de Aguiar, Christian Theobalt, Marcus Magnor, and Hans-Peter Seidel. Automatic generation of personalized human avatars from multi-view video. In Proc. of the ACM Symposium on Virtual Reality Software and Technology, VRST ’05, pages 257–260, New York, NY, USA, 2005. ACM.
- [4] Benjamin Allain, Jean-Sébastien Franco, and Edmond Boyer. An Efficient Volumetric Framework for Shape Tracking. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 268–276, Boston, United States, 2015. IEEE.
- [5] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Detailed human avatars from monocular video. In International Conf. on 3D Vision, sep 2018.
- [6] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video based reconstruction of 3D people models. In IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
- [7] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: shape completion and animation of people. In ACM Transactions on Graphics, volume 24, pages 408–416. ACM, 2005.
- [8] Alexandru O Bălan and Michael J Black. The naked truth: Estimating body shape under clothing. In European Conf. on Computer Vision, pages 15–29. Springer, 2008.
- [9] Federica Bogo, Michael J. Black, Matthew Loper, and Javier Romero. Detailed full-body reconstructions of moving people from monocular RGB-D sequences. In IEEE International Conf. on Computer Vision, pages 2300–2308, 2015.
- [10] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conf. on Computer Vision. Springer International Publishing, 2016.
- [11] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
- [12] Cedric Cagniart, Edmond Boyer, and Slobodan Ilic. Probabilistic deformable surface tracking from multiple videos. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, European Conf. on Computer Vision, volume 6314 of Lecture Notes in Computer Science, pages 326–339, Heraklion, Greece, 2010. Springer.
- [13] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
- [14] Joel Carranza, Christian Theobalt, Marcus A Magnor, and Hans-Peter Seidel. Free-viewpoint video of human actors. In ACM Transactions on Graphics, volume 22, pages 569–577. ACM, 2003.
-  Xiaowu Chen, Yu Guo, Bin Zhou, and Qinping Zhao. Deformable model for estimating clothed and naked human shapes from a single image. The Visual Computer, 29(11):1187–1196, 2013.
-  Xiaowu Chen, Bin Zhou, Feixiang Lu, Lin Wang, Lang Bi, and Ping Tan. Garment modeling with a depth camera. ACM Transactions on Graphics, 34(6):203, 2015.
-  German KM Cheung, Simon Baker, and Takeo Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages I–I. IEEE, 2003.
-  German KM Cheung, Simon Baker, and Takeo Kanade. Visual hull alignment and refinement across time: A 3d reconstruction algorithm combining shape-from-silhouette with stereo. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages II–375. IEEE, 2003.
-  Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics, 34(4):69, 2015.
-  Yan Cui, Will Chang, Tobias Nöll, and Didier Stricker. Kinectavatar: fully automatic body capture using a single kinect. In Asian Conf. on Computer Vision, pages 133–147, 2012.
-  R Daněřek, Endri Dibra, C Öztireli, Remo Ziegler, and Markus Gross. Deepgarment: 3d garment shape estimation from a single image. In Computer Graphics Forum, volume 36, pages 269–280. Wiley Online Library, 2017.
-  Edilson De Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. In ACM Transactions on Graphics, page 98, 2008.
-  Endri Dibra, Himanshu Jain, Cengiz Öztireli, Remo Ziegler, and Markus Gross. Hs-nets: Estimating human body shape from silhouettes with convolutional neural networks. In International Conf. on 3D Vision, pages 108–117, 2016.
-  Endri Dibra, Himanshu Jain, Cengiz Oztireli, Remo Ziegler, and Markus Gross. Human shape from silhouettes using generative hks descriptors and cross-modal neural networks. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
-  Endri Dibra, Cengiz Öztireli, Remo Ziegler, and Markus Gross. Shape from selfies: Human body shape estimation using cca regression forests. In European Conf. on Computer Vision, pages 88–104. Springer, 2016.
-  Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics, 35(4):114, 2016.
-  Juergen Gall, Carsten Stoll, Edilson De Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1746–1753. IEEE, 2009.
-  Andrew Gilbert, Marco Volino, John Collomosse, and Adrian Hilton. Volumetric performance capture from minimal camera viewpoints. In European Conf. on Computer Vision, 2018.
-  Ke Gong, Xiaodan Liang, Yicheng Li, Yimin Chen, Ming Yang, and Liang Lin. Instance-level human parsing via part grouping network. In European Conf. on Computer Vision, 2018.
-  Peng Guan, Alexander Weiss, Alexandru O Bălan, and Michael J Black. Estimating human shape and pose from a single image. In IEEE International Conf. on Computer Vision, pages 1381–1388. IEEE, 2009.
-  Yu Guo, Xiaowu Chen, Bin Zhou, and Qinping Zhao. Clothed and naked human shapes estimation from a single image. Computational Visual Media, pages 43–50, 2012.
-  Nils Hasler, Hanno Ackermann, Bodo Rosenhahn, Thorsten Thormahlen, and Hans-Peter Seidel. Multilinear pose and body shape estismation of dressed subjects from image sets. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1823–1830. IEEE, 2010.
-  Nils Hasler, Carsten Stoll, Martin Sunkel, Bodo Rosenhahn, and H-P Seidel. A statistical model of human pose and body shape. In Computer Graphics Forum, volume 28, pages 337–346, 2009.
-  Thomas Helten, Andreas Baak, Gaurav Bharaj, Meinard Muller, Hans-Peter Seidel, and Christian Theobalt. Personalization and evaluation of a real-time depth-based full body tracker. In International Conf. on 3D Vision, pages 279–286, Washington, DC, USA, 2013.
-  Paul Henderson and Vittorio Ferrari. Learning to generate and reconstruct 3d meshes with only 2d supervision. In British Machine Vision Conference, 2018.
-  Chun-Hao Huang, Benjamin Allain, Jean-Sébastien Franco, Nassir Navab, Slobodan Ilic, and Edmond Boyer. Volumetric 3d tracking by detection. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 3862–3870, 2016.
-  Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc Stamminger. Volumedeform: Real-time volumetric non-rigid reconstruction. In European Conf. on Computer Vision, 2016.
-  Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In ACM symposium on User interface software and technology, pages 559–568. ACM, 2011.
-  Arjun Jain, Thorsten Thormählen, Hans-Peter Seidel, and Christian Theobalt. Moviereshape: Tracking and reshaping of humans in videos. In ACM Transactions on Graphics, volume 29, page 148. ACM, 2010.
-  Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 8320–8329, 2018.
-  Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In IEEE Conf. on Computer Vision and Pattern Recognition. IEEE Computer Society, 2018.
-  Reinhard Koch, Marc Pollefeys, and Luc Van Gool. Multi viewpoint stereo from uncalibrated video sequences. In European conf. on computer vision, pages 55–71. Springer, 1998.
-  Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
-  Vincent Leroy, Jean-Sébastien Franco, and Edmond Boyer. Multi-View Dynamic Shape Refinement Using Local Temporal Integration. In IEEE International Conf. on Computer Vision, Venice, Italy, 2017.
-  Vincent Leroy, Jean-Sébastien Franco, and Edmond Boyer. Shape reconstruction using volume sweeping and learned photoconsistency. In European Conf. on Computer Vision, pages 796–811. Springer, Cham, 2018.
-  Guannan Li, Chenglei Wu, Carsten Stoll, Yebin Liu, Kiran Varanasi, Qionghai Dai, and Christian Theobalt. Capturing relightable human performances under general uncontrolled illumination. In Computer Graphics Forum (Proc. Eurographics), volume 32, pages 1–8, 2013.
-  Hao Li, Etienne Vouga, Anton Gudym, Linjie Luo, Jonathan T Barron, and Gleb Gusev. 3d self-portraits. ACM Transactions on Graphics, 32(6):187, 2013.
-  Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics, 34(6):248:1–248:16, 2015.
-  Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Srinath Sridhar, Gerard Pons-Moll, and Christian Theobalt. Single-shot multi-person 3d pose estimation from monocular rgb. In International Conf. on 3D Vision, sep 2018.
-  Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics, 36(4):44, 2017.
-  Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 343–352, 2015.
-  Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality, pages 127–136, 2011.
-  Mohamed Omran, Christop Lassner, Gerard Pons-Moll, Peter Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In International Conf. on 3D Vision, 2018.
-  Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L Davidson, Sameh Khamis, Mingsong Dou, et al. Holoportation: Virtual 3d teleportation in real-time. In Symposium on User Interface Software and Technology, pages 741–754. ACM, 2016.
-  Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
-  Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael Black. ClothCap: Seamless 4D clothing capture and retargeting. ACM Transactions on Graphics, 36(4), 2017.
-  Gerard Pons-Moll, Javier Romero, Naureen Mahmood, and Michael J Black. Dyna: a model of dynamic human shape in motion. ACM Transactions on Graphics, 34:120, 2015.
-  Alin-Ionut Popa, Mihai Zanfir, and Cristian Sminchisescu. Deep multitask architecture for integrated 2d and 3d human sensing. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
-  Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black. Generating 3D faces using convolutional mesh autoencoders. In European Conf. on Computer Vision, pages 725–741, 2018.
-  Helge Rhodin, Nadia Robertini, Dan Casas, Christian Richardt, Hans-Peter Seidel, and Christian Theobalt. General automatic human shape and motion capture using volumetric contour cues. In European Conf. on Computer Vision, pages 509–526. Springer, 2016.
-  Nadia Robertini, Dan Casas, Helge Rhodin, Hans-Peter Seidel, and Christian Theobalt. Model-based outdoor performance capture. In International Conf. on 3D Vision, 2016.
-  Gregory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. Lcr-net: Localization-classification-regression for human pose. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
-  Lorenz Rogge, Felix Klose, Michael Stengel, Martin Eisemann, and Marcus Magnor. Garment replacement in monocular video sequences. ACM Transactions on Graphics, 34(1):6, 2014.
-  Ari Shapiro, Andrew Feng, Ruizhe Wang, Hao Li, Mark Bolas, Gerard Medioni, and Evan Suma. Rapid avatar capture and simulation using commodity depth sensors. Computer Animation and Virtual Worlds, 25(3-4):201–211, 2014.
-  Miroslava Slavcheva, Maximilian Baust, Daniel Cremers, and Slobodan Ilic. Killingfusion: Non-rigid 3d reconstruction without correspondences. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 3, page 7, 2017.
-  Cristian Sminchisescu, Atul Kanaujia, and Dimitris Metaxas. Learning joint top-down and bottom-up processes for 3d visual inference. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1743–1752. IEEE, 2006.
-  Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applications, 27(3), 2007.
-  Carsten Stoll, Nils Hasler, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. Fast articulated motion tracking using a sums of gaussians body model. In IEEE International Conf. on Computer Vision, pages 951–958. IEEE, 2011.
-  Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In IEEE International Conf. on Computer Vision, volume 2, 2017.
-  Yu Tao, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Dai Quionhai, Hao Li, G. Pons-Moll, and Yebin Liu. Doublefusion: Real-time capture of human performance with inner body shape from a depth sensor. In IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
-  Denis Tome, Chris Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
-  Matthew Trumble, Andrew Gilbert, Adrian Hilton, and John Collomosse. Deep autoencoder for combined human pose estimation and body model upscaling. In European Conf. on Computer Vision, sep 2018.
-  Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. Self-supervised learning of motion capture. In Advances in Neural Information Processing Systems, pages 5236–5246, 2017.
-  Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3d human body shapes. In European Conf. on Computer Vision, 2018.
-  Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
-  Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. In ACM Transactions on Graphics, volume 27, page 97. ACM, 2008.
-  Alexander Weiss, David Hirshberg, and Michael J Black. Home 3d body scans from noisy image and range data. In IEEE International Conf. on Computer Vision, pages 1951–1958. IEEE, 2011.
-  Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer. Analyzing clothing layer deformation statistics of 3d human motions. In European Conf. on Computer Vision, pages 237–253, 2018.
-  Mao Ye and Ruigang Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 2345–2352, 2014.
-  Rui Yu, Chris Russell, Neill D. F. Campbell, and Lourdes Agapito. Direct, dense, and deformable: Template-based non-rigid 3d reconstruction from rgb video. In IEEE International Conf. on Computer Vision, 2015.
-  Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. Monocular 3d pose and shape estimation of multiple people in natural scenes–the importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2148–2157, 2018.
-  Ming Zeng, Jiaxiang Zheng, Xuan Cheng, and Xinguo Liu. Templateless quasi-rigid shape modeling with implicit loop-closure. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 145–152, 2013.
-  Chao Zhang, Sergi Pujades, Michael Black, and Gerard Pons-Moll. Detailed, accurate, human shape estimation from clothed 3D scan sequences. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
-  Qing Zhang, Bo Fu, Mao Ye, and Ruigang Yang. Quality dynamic human body modeling using a single low-cost depth camera. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 676–683. IEEE, 2014.
-  Qian-Yi Zhou and Vladlen Koltun. Color map optimization for 3d reconstruction with consumer depth cameras. ACM Transactions on Graphics, 33(4):155, 2014.
-  Shizhe Zhou, Hongbo Fu, Ligang Liu, Daniel Cohen-Or, and Xiaoguang Han. Parametric reshaping of human bodies in images. In ACM Transactions on Graphics, volume 29, page 126. ACM, 2010.
-  Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3d human pose estimation in the wild: A weakly-supervised approach. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 398–407, 2017.
-  Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, et al. Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics, 33(4):156, 2014.
-  Silvia Zuffi and Michael J Black. The stitched puppet: A graphical model of 3d human shape and pose. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 3537–3546. IEEE, 2015.