Neural Re-Rendering of Humans from a Single Image


Human re-rendering from a single image is a starkly underconstrained problem, and state-of-the-art algorithms often exhibit undesired artefacts, such as over-smoothing, unrealistic distortions of the body parts and garments, or implausible changes of the texture. To address these challenges, we propose a new method for neural re-rendering of a human under a novel user-defined pose and viewpoint, given one input image. Our algorithm represents body pose and shape as a parametric mesh which can be reconstructed from a single image and easily reposed. Instead of a colour-based UV texture map, our approach further employs a learned high-dimensional UV feature map to encode appearance. This rich implicit representation captures detailed appearance variation across poses, viewpoints, person identities and clothing styles better than learned colour texture maps. The body model with the rendered feature maps is fed through a neural image translation network that creates the final rendered colour image. The above components are combined in an end-to-end-trained neural network architecture that takes as input a source person image and images of the parametric body model in the source pose and desired target pose. Experimental evaluation demonstrates that our approach produces higher-quality single image re-rendering results than existing methods.

Neural Rendering, Pose Transfer, Novel View Synthesis.

1 Introduction


Algorithms to realistically render dressed humans under controllable poses and viewpoints are essential for character animation, 3D video, or virtual and augmented reality, to name a few. Over the past decades, computer graphics and vision have developed impressive methods for high-fidelity artist-driven and reconstruction-based human modelling, high-quality animation, and photo-realistic rendering. However, these often require sophisticated multi-camera setups, and deep expertise in animation and rendering, and are thus costly, time-consuming and difficult to use.

Figure 1: Given an image of a person, our neural re-rendering approach allows synthesis of images of the person in different poses, or with different clothing obtained from another reference image.

Recent advances in monocular human reconstruction and neural network-based image synthesis open up a radically different approach to the problem, neural re-rendering of humans from a single image. Given a single reference image of a person, the goal is to synthesise a photo-real image of this person in, for instance, a user-controlled new pose, modified body proportions, the same or different garments, or a combination of these.

There has been tremendous progress in monocular human capture and re-rendering [33, 23, 44, 2, 21, 12, 19, 6, 25] towards this goal. However, owing to the starkly underconstrained nature of the problem, true photo-realism under all possible conditions has not yet been achieved. Methods frequently exhibit unwanted over-smoothing and a lack of details in the rendered image, unrealistic distortions of body parts and garments, or implausible texture alterations.

We, therefore, propose a new algorithm for monocular neural re-rendering of a dressed human under a novel user-defined pose and viewpoint, which has starkly improved visual quality, see Figs. 1, 3, 6, 7. We take inspiration from recent work on neural rendering of general scenes with a continuous [48] or a multi-dimensional feature representation with implicit [50] or explicit [47] occlusion handling that are learned from multi-view images or videos.

Our algorithm represents body pose and shape with the SMPL parametric human surface model [29], which can be easily reposed. Instead of modelling appearance as explicit colour maps, e.g., learned colour-based UV texture maps on the body surface [33, 12], we employ a learned high-dimensional UV feature map to encode appearance. This rich implicit representation learns the detailed appearance variation across poses, viewpoints, person identities and clothing styles. Given a single image of a person, we predict pixel correspondences to the SMPL [29] mesh using DensePose [37]. We then extract a partial UV texture map based on the observed body regions and use a neural network to convert it to a complete UV feature map, with a $d$-dimensional feature per texel.

The UV feature map is then rendered in the desired target pose and passed through a neural image translation network that creates the final rendered image. These components are combined in an end-to-end trained neural architecture. In quantitative experiments and a user study to judge the qualitative results, we show that the visual quality of our results improves over the current state of the art.

Contributions. To summarise, our contributions are as follows:

  • A new end-to-end trainable method that combines monocular parametric 3D body modelling, a learned detail-preserving neural-feature based body appearance representation, and a neural network based image-synthesis network to enable highly realistic human re-rendering from a single image;

  • state-of-the-art results on the DeepFashion dataset [27] which are confirmed with quantitative metrics, and qualitatively with a user study.

2 Related Work

While our proposed approach relates to many sub-fields of visual computing, for brevity we only elaborate on the immediately relevant work on human body re-enactment and neural rendering methods for object and scene rendering.

2.1 Classical Methods for Novel View Synthesis

Earlier methods for image-based 3D reconstruction and novel view synthesis rely on traditional concepts of multi-view geometry, explicit 3D shape and appearance reconstruction, and classical computer graphics or image-based rendering. Methods based on light fields use ray space representations or coarse multi-view geometry models for novel view synthesis [22, 11, 4]. To achieve high quality, dense camera arrays are required, which is impractical. Other algorithms capture and operate on dense depth maps [60], layered depth images [41], 3D point clouds [1, 26, 40], meshes [32, 52], or surfels [36, 5, 55] for dynamic scenes. Multi-view stereo can be combined with fusion algorithms operating with implicit geometry and achieving more temporally consistent reconstructions over short time windows [9, 34, 13]. Dynamic scene capture and novel view synthesis were also shown with a low number of RGB or RGB-D cameras [57, 49, 15, 58]. While reconstruction is fast and feasible with fewer cameras, the coarse approximate geometry often leads to rendering artefacts.

2.2 Neural Rendering of Scenes and Objects

Recently, neural rendering approaches have shown promising results for scenes and objects. Image-based rendering (IBR) methods reconstruct scene geometry with classical techniques and use it to render novel views [8, 7]. Lack of observations can cause high uncertainty in novel views. On the other hand, neural rendering approaches [48, 47, 51, 66] can generate higher-quality results by leveraging collections of training data. Many applications of neural rendering have been recently shown, ranging from synthesising view-dependent effects [66, 51] to learning the shape and appearance priors from sparse data [39, 56].

Only a few works on neural scene representation and rendering can handle dynamic scenes [19, 28]. Some methods combine explicit dynamic scene reconstruction and traditional graphics rendering with neural re-rendering [31, 19, 18, 51]. Thies et al. [50] combine neural textures with the classical graphics pipeline for novel view synthesis of static objects and monocular video re-rendering. Their technique requires a scene-specific geometric proxy which has to be reconstructed before training. Neural rendering approaches such as that of Sitzmann et al. [48] require the intermediate representation to reason jointly about geometry and appearance. In our human-specific application scenario, the coarse geometry is instead handled by the posable SMPL mesh, with a feature map similar to that of Thies et al. [50] capturing clothing appearance, which includes fine-scale geometry and clothing textures.

Several approaches address related problems such as generating images of humans in new poses [62, 3, 30, 33, 35], or body re-enactment from monocular videos [6], which are discussed next.

2.3 Human Re-enactment and Novel View Rendering

Recent work on photo-realistic human body re-enactment and novel view rendering can be sub-classified along various dimensions.

Object-agnostic approaches [44, 43] model deformable objects directly in the image space. Siarohin et al. [44] learn keypoints in a self-supervised manner and capture deformations in the vicinity of the keypoints using affine transforms. Features extracted from the source image are deformed to the target using the predicted transformations and passed on to a generator. Additional predictions of dis-occluded regions indicate to the generator the regions which have to be rendered based on the context. Zhu et al. [65] leverage geometric constraints and optical flow for synthesising novel views of humans from a single image.

Object-specific techniques have the same core components as above, i.e., colour or feature transformation from source to target, occlusion reasoning or inpainting, and photo-realistic image generation from the warped feature or colour image. The key difference is that the feature transformation, occlusion reasoning, and inpainting are guided by an underlying object model, which, in our case, is a parametric human body mesh. Kim et al. [19] achieve full control over the head pose and facial expressions in photo-realistic renderings of a target actor through adversarial training on a performance of the target actor. DensePose Transfer [33] uses direct texture transfer from the input image to the SMPL model, inpaints the occluded regions of the texture and renders it in a new pose. This image is blended with the image resulting from direct conditional generation from the input image, input DensePose, and target DensePose. Zablotskaia et al. [59] generate subsequent video frames of human motion and use a direct warping guided by the reference frame, previously generated frame, and DensePose representations of the past and future frames. Their method does not rely on an explicit UV texture map. ClothFlow [14] implicitly captures the geometric transformation between the source and target image by estimating dense flow. Chan et al. [6] learn a subject-specific puppeteering system using a video of the subject such that all parts of the subject’s body are seen in advance. The GAN-based rendering is driven by 2D pose extracted from the target subject. Zhou et al. [64] also learn a personalised model using piecewise affine transforms of the part-segmented source image for modelling pose changes, generating the person image in front of a clean background plate, with a second stage fusing a given background image with the generated person’s image. In contrast to Liu et al. [25], we transfer appearance from source to target image using a UV feature map.
Instead of directly predicting missing regions of the UV texture map, coordinate-based inpainting [12] predicts correspondence between all regions on the UV texture map and the input image pixels. This results in more texture details in body regions that become dis-occluded when re-posing. As shown in Sec. 4, our UV feature map based approach yields results of much higher quality in comparisons. Shysheya et al. [42] explicitly model the body texture and implicitly handle the shape. In contrast, while we explicitly handle the coarse shape, we use a UV feature map to model the fine-scale shape and clothing texture implicitly. Lazova et al. [21] propose an approach for reconstruction of textured 3D human models from a single image. Similar to our approach, it extracts a partial UV texture map using DensePose but inpaints the UV texture map using GAN-based supervision applied directly to the texture map. Additionally, and similar to Alldieck et al. [2], it predicts a displacement map on top of the SMPL mesh to capture clothing details not present in the SMPL model. Our approach does not explicitly model clothing details.

In contrast to existing methods, we propose a new end-to-end trainable method that combines monocular parametric 3D body modelling [33, 12], a learned neural detail-preserving surface feature representation [50], and a neural image-synthesis network for highly realistic human re-rendering from a single image.

3 Method

Given an image of a person, we synthesise a new image of the person in a different target body pose. Our approach comprises four distinct steps. The first step uses DensePose [37] to predict dense correspondences between the input image and the SMPL model. This allows a UV texture map to be extracted for the visible regions. The second step uses a U-Net [38] based network, which we term FeatureNet, to construct the full UV feature map $F_s$ from the partial RGB UV texture map $T_s$. $F_s$ contains a $d$-dimensional feature representation for all texels, both visible and occluded in the source image. The third step takes a target pose $P_t$ as input, and ‘renders’ the UV feature map to produce a $d$-dimensional feature image $R_t$.

The fourth step uses a generator network based on Pix2PixHD [54], which we term RenderNet, to generate a photorealistic image of the reposed person from the input feature image. An overview of our pipeline is shown in Fig. 2.

3.1 Extracting a Partial UV Texture Map from the Input Image

The pixels of the input image are transformed into UV space through correspondences predicted with DensePose. We use the ResNet-101 based variant of DensePose for predicting the correspondences for the body regions visible in the image. The network is pre-trained on the COCO-DensePose dataset and provides body segments and their part-specific U,V coordinates on the SMPL model. For easier mapping, the 24 part-specific UV maps are combined to form a single UV texture map in the format provided in the SURREAL dataset [53] through a pre-computed lookup table. Note that one could, in principle, use monocular 3D pose estimation methods (e.g., [17]) to compute SMPL parameters and, subsequently, the DensePose of the input image. However, frequent misalignments of the predictions with the end-effector positions in the image lead to significant artefacts in the UV texture map for hands and feet in that case, and thus such an approach is not advised [2].
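The extraction step can be illustrated with a minimal NumPy sketch. It assumes that the 24 part-specific UV maps have already been merged into a single atlas, so that every body pixel carries one (u, v) coordinate in [0, 1]. The function name `extract_partial_texture`, its arguments and the nearest-texel scatter are illustrative assumptions, not the implementation used in our experiments.

```python
import numpy as np

def extract_partial_texture(image, uv, visible, tex_size=256):
    """Scatter visible image pixels into a UV texture (hypothetical helper).

    image:   (H, W, 3) float array, the source image.
    uv:      (H, W, 2) float array of per-pixel (u, v) in [0, 1].
    visible: (H, W) bool array, True where a body pixel was detected.
    Returns the (tex_size, tex_size, 3) partial texture and a
    (tex_size, tex_size) bool mask of texels that received a colour.
    """
    texture = np.zeros((tex_size, tex_size, 3), dtype=np.float32)
    filled = np.zeros((tex_size, tex_size), dtype=bool)

    # Convert continuous UV coordinates to integer texel indices.
    cols = np.clip(np.rint(uv[..., 0] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    rows = np.clip(np.rint(uv[..., 1] * (tex_size - 1)).astype(int), 0, tex_size - 1)

    # Only body pixels contribute; texels never hit stay empty (occluded).
    texture[rows[visible], cols[visible]] = image[visible]
    filled[rows[visible], cols[visible]] = True
    return texture, filled
```

Texels that remain unfilled correspond to body regions that are occluded in the source image; these are the regions that the subsequent network has to complete.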

Figure 2: Pipeline Overview: Given a source image $I_s$, we extract the UV texture map of an underlying parametric body mesh model for the body regions visible in the image. FeatureNet converts the partial UV texture map to a full UV feature map, which encodes a richer 16D representation at each texel. Given a new pose $P_t$, the parametric body mesh can be re-posed and textured with the UV feature map to produce an intermediate feature image $R_t$. RenderNet converts the intermediate 16-channel feature image to a realistic image.

3.2 Generating the Full UV Feature Map

The partial (on account of occlusion) texture map $T_s$ is converted to a full UV feature map $F_s$ using a U-Net-like convolutional network $f$, which we term FeatureNet. That is,

$$F_s = f(T_s).$$

FeatureNet comprises four down-sampling blocks followed by four up-sampling blocks. Therefore, a partial input texture of spatial dimension 256 × 256 is transformed into a spatial dimension of 16 × 16 in the middle-most layer. Each down-sampling block consists of two convolutions followed by a max-pool operation. For up-sampling blocks, we use bilinear upsampling followed by two convolutions. The final convolutional layer produces a $d$-dimensional (channel) UV feature map, which is subsequently used for rendering a feature image. The first three channels of the UV feature map can be supervised to in-paint the input partial UV texture map, so that a small subset of the feature channels resembles the classical colour texture map. Our experiments use 16 feature channels.
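The relation between the input resolution and the middle-most layer can be checked with a short calculation. The sketch assumes that every down-sampling block halves the spatial resolution and that the convolutions preserve it; `unet_resolutions` is a hypothetical helper introduced only for this illustration.

```python
def unet_resolutions(input_size=256, num_down=4):
    """Spatial sizes along a U-Net with `num_down` 2x down- and up-sampling blocks."""
    # Encoder: the input resolution followed by one halving per block.
    down = [input_size // (2 ** i) for i in range(num_down + 1)]
    # Decoder: mirror of the encoder, excluding the middle-most layer.
    up = down[-2::-1]
    return down, up

down, up = unet_resolutions()
print(down)  # [256, 128, 64, 32, 16]
print(up)    # [32, 64, 128, 256]
```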

3.3 Intermediate Feature Image Rendering

The SMPL mesh can be reposed using a target pose $P_t$, which can be extracted from a target image $I_t$, or obtained from a different source. In our case, when given a target image $I_t$, we directly obtain the DensePose output, which is equivalent to the reposed SMPL model. Given the source feature map $F_s$, we render the SMPL mesh through the DensePose output to produce a $d$-dimensional feature image $R_t$. That is,

$$R_t = \mathrm{render}(F_s, P_t).$$

Note that this feature rendering operation can be conveniently implemented by differentiable sampling. In our experiments, we use bilinear sampling for this operation. The feature image $R_t$, which captures the target pose and the source appearance, is then used as input to the subsequent translation network.
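A minimal NumPy sketch of the sampling operation is given below. It assumes a square UV feature map and a single (u, v) coordinate in [0, 1] per target pixel; the name `render_feature_image` and the zero-valued background are assumptions of this illustration. In a training pipeline the same operation would be expressed with a differentiable sampling primitive of the deep learning framework.

```python
import numpy as np

def render_feature_image(feature_map, uv, body_mask):
    """Bilinearly sample a UV feature map at per-pixel UV coordinates.

    feature_map: (S, S, d) array, the full UV feature map.
    uv:          (H, W, 2) array of (u, v) in [0, 1] for the target pose.
    body_mask:   (H, W) bool array; background pixels are set to zero.
    Returns an (H, W, d) feature image.
    """
    size = feature_map.shape[0]
    x = uv[..., 0] * (size - 1)  # continuous column coordinate
    y = uv[..., 1] * (size - 1)  # continuous row coordinate

    # Indices of the four neighbouring texels, clamped to the map.
    x0 = np.clip(np.floor(x).astype(int), 0, size - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, size - 1)
    x1 = np.clip(x0 + 1, 0, size - 1)
    y1 = np.clip(y0 + 1, 0, size - 1)

    # Interpolation weights, broadcast over the feature channels.
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]

    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    out = (1 - wy) * top + wy * bottom
    return out * body_mask[..., None]
```

Because the output is a weighted sum of texel values, gradients flow back to the feature map, which is what allows the feature extraction and the translation network to be trained jointly.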

3.4 Creating a Photo-Realistic Rendering

In the final step, the feature image $R_t$ is translated to a realistic image $\hat{I}_t$ using a translation network similar to Pix2PixHD [54], which we term RenderNet and denote by $g$:

$$\hat{I}_t = g(R_t).$$

RenderNet comprises (a) 3 down-sampling blocks, (b) 6 residual blocks, (c) 3 up-sampling blocks and finally (d) a convolution layer with Tanh activation that gives the final output. The discriminator for adversarial training of RenderNet also uses the multiscale design of Pix2PixHD [54]. In our experiments, we use a three-scale discriminator network for adversarial training.

3.5 Training Details and Loss Functions

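During training, our system takes pairs of images ($I_s$, $I_t$) of the same person (but in different poses) as input. The partial texture $T_s$ extracted from the source image $I_s$ is passed through the above-mentioned operations to produce the generated output $\hat{I}_t$. That is,

$$\hat{I}_t = g(\mathrm{render}(f(T_s), P_t)),$$

where $f$ denotes FeatureNet, $g$ denotes RenderNet, and $P_t$ is the target pose obtained from $I_t$.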

Note that all operations, i.e., FeatureNet $f$, the feature rendering and RenderNet $g$, are differentiable. We train the entire system end-to-end and optimise the parameters of FeatureNet ($\theta_f$) and RenderNet ($\theta_g$). We use the combination of the following loss functions in our system:

  • Perceptual Loss: We use a perceptual loss based on the VGG network [16], i.e., the difference between the activations at different layers of the pre-trained VGG network [46] applied to the generated image $\hat{I}_t$ and the ground-truth target image $I_t$:

    $$L_{p} = \sum_j \frac{1}{N_j} \left| p_j(\hat{I}_t) - p_j(I_t) \right|.$$

    Here, $p_j$ is the activation and $N_j$ the number of elements of the $j$-th layer in the VGG network pre-trained on ImageNet.

  • Adversarial Loss: We use a multiscale discriminator of Pix2PixHD [54] for enforcing adversarial loss in our system. The multiscale discriminator is conditioned on both the generated and rendered feature images.

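  • Face Identity Loss: We use a pre-trained network to ensure that the extracted UV feature map and RenderNet preserve the face identity on the cropped face of the generated image $\hat{I}_t$ and the ground-truth image $I_t$:

    $$L_{face} = \left| N_{face}(\mathrm{crop}(\hat{I}_t)) - N_{face}(\mathrm{crop}(I_t)) \right|.$$

    Here, $N_{face}$ is the pre-trained SphereFaceNet [24].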

  • Intermediate In-painting Loss: To mimic a classical colour texture map, we enforce a loss on the first three channels of the output of the in-painting network (FeatureNet). This loss is set to the sum of 1) the distance between the visible part of the partial source texture and the generated texture and 2) the distance between the visible part of the partial target texture and the generated texture.
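The intermediate in-painting term can be sketched as a masked distance on the first three feature channels. The sketch below uses the mean absolute difference as the distance, which is an assumption made for illustration; the function names `masked_distance` and `inpainting_loss` are likewise hypothetical.

```python
import numpy as np

def masked_distance(generated_rgb, partial_texture, visible):
    """Mean absolute difference over the texels observed in `partial_texture`."""
    count = visible.sum()
    if count == 0:
        return 0.0  # nothing observed, so nothing to supervise
    diff = np.abs(generated_rgb - partial_texture)[visible]  # (K, 3)
    return float(diff.sum() / (count * generated_rgb.shape[-1]))

def inpainting_loss(feature_map, src_tex, src_vis, tgt_tex, tgt_vis):
    """Supervise the first three feature channels with both partial textures.

    feature_map:      (S, S, d) full UV feature map, d >= 3.
    src_tex, tgt_tex: (S, S, 3) partial textures of source and target image.
    src_vis, tgt_vis: (S, S) bool masks of the texels visible in each image.
    """
    rgb = feature_map[..., :3]
    return masked_distance(rgb, src_tex, src_vis) + masked_distance(rgb, tgt_tex, tgt_vis)
```

Using the target texture in addition to the source texture supervises texels that are occluded in the source image but visible in the target image.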

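The final loss on the generator is then

$$L_G = \lambda_{GAN} L_{GAN} + \lambda_{p} L_{p} + \lambda_{face} L_{face} + \lambda_{inp} L_{inp},$$

where $L_{GAN}$, $L_{p}$, $L_{face}$ and $L_{inp}$ denote the adversarial, perceptual, face identity and intermediate in-painting losses, respectively, and the $\lambda$ terms are scalar loss weights.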

The conditional discriminator is updated at every step by enforcing a binary cross-entropy loss on real and fake images. We train the networks end-to-end using the Adam optimiser [20] with an initial learning rate of 2, $\beta_1$ as 0.5 and no weight decay. The loss weights are set empirically to . For speed, we pre-compute DensePose on the images and directly read them as input.

During testing, the system takes as input a single image of a person and a target DensePose. The target pose can be extracted by DensePose RCNN from an image of the source person in a different pose (as in the experiments on the DeepFashion dataset), or alternatively it can be obtained by reposing the SMPL mesh of the source body. In many cases, the actor providing the target pose can be a completely different person (see Figs. 5 and 7). The neural texture is then rendered using the given target DensePose and passed through the translation network to generate a realistic image of the source person in the target pose.

4 Experimental Results

4.1 Experimental Setup


We use the In-shop Clothes Retrieval Benchmark of the DeepFashion dataset [27] for our main experiments. The dataset comprises 52,712 images of fashion models with 13,029 different clothing items in different poses. For training and testing, we consider the split provided by Siarohin et al. [45], which is also used by other related works [33, 12]. We also show qualitative results of our method on the Fashion dataset [59]. The Fashion dataset has 500 training and 100 test videos, each containing roughly 350 frames. The videos are single-person sequences, containing different people catwalking in different clothes.

Figure 3: Results of our method, CBI [12], DSC [45], VUnet [10] and DPT [33]. Our approach produces higher-quality renderings than the competing methods.

4.2 Comparison with the State of the Art

We compare our results with four state-of-the-art methods, i.e., Coordinate Based Inpainting (CBI) [12], Deformable GAN (DSC) [45], Variational U-Net (VUnet) [10] and Dense Pose Transfer (DPT) [33]. The qualitative results are shown in Fig. 3. It can be observed that our results show higher realism and better preserve identity and garment details compared to the other methods.

The quantitative results are provided in Table 1. Due to inconsistent reporting (or unavailability) of the metrics for the existing approaches, we computed them ourselves. To this end, we collected the results of 176 testing pairs for each state-of-the-art method (the testing pairs and results were kindly provided by the authors of CBI [12]) and used them for this report. We use the following two metrics for comparison: 1) Structural Similarity Index (SSIM) [63] and 2) Learned Perceptual Image Patch Similarity (LPIPS) [61]. SSIM is a structure preservation metric widely used in the existing literature. Though it is an excellent metric for assessing image degradation, it often does not reflect human perception [61]. On the other hand, the recently introduced LPIPS claims to capture human judgment better than existing hand-designed metrics. In terms of SSIM, we perform as well as the existing methods, whereas we significantly outperform them on the LPIPS metric. Please note that, similar to other learning-based methods, our approach will struggle with poses that are far from those seen in the training set. Nevertheless, our method performs well in many such cases. Qualitative results on some target poses outside of the training dataset distribution are shown in Fig. 5.
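For reference, the SSIM formula can be illustrated with a simplified NumPy sketch that evaluates it once over the whole image. This is an assumption made for brevity: the standard metric [63] computes the same expression over local windows and averages the result, so the numbers produced by `global_ssim` (a hypothetical name) are not comparable to those in Table 1. The constants follow the common defaults $k_1 = 0.01$ and $k_2 = 0.03$.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated on global image statistics (single window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2  # stabilises the luminance term
    c2 = (k2 * data_range) ** 2  # stabilises the contrast/structure term

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

The score equals 1 for identical images and decreases as mean, variance or structure diverge, which explains why SSIM mainly rewards structural agreement rather than perceptual realism.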

Method        SSIM ↑    LPIPS ↓
CBI [12]      0.766     0.178
DSC [45]      0.750     0.214
VUnet [10]    0.739     0.202
DPT [33]      0.759     0.206
Ours          0.768     0.164
GT            1.0       0.0
Table 1: Comparison with state-of-the-art methods using various perceptual metrics, Structural Similarity Index (SSIM) [63] and Learned Perceptual Image Patch Similarity (LPIPS) [61]. ↑ (↓) means higher (lower) is better.
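The margins in Table 1 can be made explicit with a short calculation. The snippet only re-uses the numbers reported in the table; the variable names are illustrative.

```python
# (SSIM, LPIPS) per method, copied from Table 1.
results = {
    "CBI": (0.766, 0.178),
    "DSC": (0.750, 0.214),
    "VUnet": (0.739, 0.202),
    "DPT": (0.759, 0.206),
    "Ours": (0.768, 0.164),
}

best_ssim = max(results, key=lambda m: results[m][0])   # higher is better
best_lpips = min(results, key=lambda m: results[m][1])  # lower is better

# Relative LPIPS reduction with respect to the strongest baseline.
baseline_lpips = min(v[1] for k, v in results.items() if k != "Ours")
rel = (baseline_lpips - results["Ours"][1]) / baseline_lpips

print(best_ssim, best_lpips)  # Ours Ours
print(f"{rel:.1%}")           # 7.9%
```

The SSIM difference to the strongest baseline is 0.002, whereas the LPIPS error is reduced by about 7.9% relative to it.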

4.3 User Study

To assess the qualitative impact of our method, we perform an extensive user study which compares it with two other state-of-the-art pose transfer methods: Coordinate Based Inpainting (CBI) [12] and DensePose Transfer (DPT) [33]. We train on the DeepFashion dataset [27] and generate renderings on the test split. The user study follows several criteria. First, it covers as large a variety of source and target poses as possible. Second, the ratio between the male and female samples reflects that of the dataset. It also contains failure cases, such as those shown in Fig. 8, with difficult decisions. In total, we prepare samples containing the source image (explicitly marked as such) and three novel views generated by CBI, DPT and our method (labelled as view A, B or C in randomised order). For each sample, two questions are asked: 1) Which view looks the most like the person in the source image? and 2) Which view looks the most realistic?

The user study was performed with a browser interface, the order of questions is randomised, and anonymous participants submitted their answers. The results are as follows. The first question has been answered in of the cases in favour of CBI, and in of the cases in favour of our method. In all cases, DPT has always been the last choice. The second question has been answered by of the participants in favour of CBI, and by of the participants in favour of our approach. Again, DPT was preferred in no case.

The user study shows that our method achieves state-of-the-art quality in preserving the identity, and significantly outperforms the baselines in the realism of the generated images. In of the overall cases, the participants have preferred CBI as the best identity-preserving method and, at the same time, our method as the one producing the most realistic renderings. In contrast, there was only one case () in which our method was voted the best identity-preserving method and, at the same time, CBI was chosen as the approach producing the most realistic renderings.

Figure 4: Results of different baselines and our full model. No-Int has no intermediate loss, Warp and WarpCond perform translation on warped partial texture and IP inpaints full colour-texture followed by translation. Under extreme poses and strong occlusions, our method outperforms all the baselines (see Sec. 4.4).
Figure 5: Generalisation of our method to new body poses. The images of the target pose are obtained from the internet.

4.4 Ablation Study

To study the advantage of the learned neural texture over other natural choices of texture-based human re-rendering, we created the following three baselines.

IP. This baseline involves two stages. First, we train an inpainting network to generate the full UV texture map from the partial UV texture map extracted from the input image. We use the same in-painting loss function as described in Section 3.5 for training this network. After convergence, we fix this network and use it to generate the full colour texture from a partial input texture. This full 3-channel UV texture map is then rendered into an intermediate image, and translated through a trained RenderNet.

Warp. In this experiment, we warp the incomplete partial UV texture map to the target pose. The reposed incomplete 3-channel intermediate image is then fed to a trained RenderNet to produce realistic output.

WarpCond. In this experiment, we warp the partial source texture to the target pose similar to the previous experiment. In addition to the reposed incomplete texture, we also condition the generator network with the target DensePose image. The target DensePose image acts as a cue to the generator when the texture information is missing.

Figure 6: Garment Transfer: Our approach can also be used to render garments from a source image onto the person in a target image.

In all these baselines, the architecture of RenderNet and the losses on the generated image are the same as ours. The only difference is in the number of input channels to RenderNet. Besides, we perform an ablation experiment with a pipeline identical to ours, except that we do not enforce any intermediate texture loss, denoted as No-Int. The qualitative results of all the networks are shown in Fig. 4. It can be seen that our methods using a richer intermediate representation (full and No-Int) produce more realistic images than the other baselines. Baseline-IP performs well but produces smoother output than the other methods. Because of the lack of details, Baseline-Warp often produces non-realistic output in both face and garment regions. When the incomplete texture information is complemented with an additional DensePose image (as in Baseline-WarpCond), the output is of higher quality. In the presence of strong occlusions, however, this baseline fails, as the translation network is incapable of performing both inpainting and realistic rendering at the same time. In contrast, our methods performed well in all the scenarios. Adding the intermediate texture loss (to mimic real texture) to a part of the neural texture helps our network to converge faster. However, over a large number of iterations, the quality of the final result without such intermediate loss (No-Int) is similar to that with intermediate loss.

4.5 Garment Transfer

Our method can be naturally extended to perform garment transfer without any further training. Given an image of a person with the source body, we extract the partial texture of the body regions that remain unchanged (e.g., face, hands and regions without garments). We use part indices provided by DensePose to extract the partial texture of the required body parts. Next, we extract the partial texture of the ‘garment regions’ of an image with the desired target garments. We make a union of the extracted partial textures based on their texel regions and feed it to our pipeline with the pose of the body image. Note that the texel occupancies of the two partial textures are mutually exclusive as they are extracted from different body parts. See Fig. 6 for the qualitative results.
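The union of the two partial textures can be sketched as follows. The sketch assumes a per-texel part index map and a user-chosen set of part indices that count as garment regions; both, as well as the name `combine_textures`, are assumptions of this illustration.

```python
import numpy as np

def combine_textures(body_tex, body_vis, garment_tex, garment_vis,
                     part_ids, garment_parts):
    """Merge body texels from one image with garment texels from another.

    body_tex, garment_tex: (S, S, 3) partial textures of the two images.
    body_vis, garment_vis: (S, S) bool masks of the texels visible in each.
    part_ids:              (S, S) int array, body-part index of every texel.
    garment_parts:         iterable of part indices treated as garment regions.
    """
    is_garment = np.isin(part_ids, list(garment_parts))

    # The two selections cannot overlap: one lies inside, one outside `is_garment`.
    keep_body = body_vis & ~is_garment
    keep_garment = garment_vis & is_garment

    texture = np.zeros_like(body_tex)
    texture[keep_body] = body_tex[keep_body]
    texture[keep_garment] = garment_tex[keep_garment]
    return texture, keep_body | keep_garment
```

The combined partial texture is then processed exactly like a texture extracted from a single image, which is why no additional training is needed.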

4.6 Motion Transfer

Even though we did not train specifically for generating videos, our method can be applied to each frame of a driving video to create motion transfer. To this end, we keep the source image of the imitator fixed and use the pose from the actor of the driving video (for each frame) in our system to create image animation. We perform the experiment on Fashion Dataset [59] and show our results in Fig. 7.
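The per-frame procedure can be summarised in a few lines. The three callables stand in for FeatureNet, the feature rendering and RenderNet and are placeholders, not actual implementations. Computing the feature map only once is a design choice of this sketch, which is possible because the source image stays fixed.

```python
def motion_transfer(source_texture, driving_poses, feature_net, render, render_net):
    """Apply the pipeline to every pose of a driving sequence (hypothetical helper).

    source_texture: partial UV texture of the fixed source image.
    driving_poses:  iterable of target poses, one per driving frame.
    feature_net, render, render_net: callables for the three pipeline stages.
    """
    # The source appearance does not change, so it is encoded a single time.
    feature_map = feature_net(source_texture)
    # Each frame only needs re-rendering and translation.
    return [render_net(render(feature_map, pose)) for pose in driving_poses]
```

Since every frame is generated independently, no temporal consistency is enforced in this formulation.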

Figure 7: Motion transfer on the Fashion dataset [59]. Our approach also can generate realistic renderings for a sequence of poses given a single source image.
Figure 8: Limitations: Even though our method produces better quality results than all competing approaches, it nevertheless has some limitations, which are also shared by all competing methods. The top row highlights failures arising out of biases in the training set, while the bottom row highlights failures owing to fine scaled textures which are not effectively captured by any approach.

Please refer to Appendix A and the accompanying video for more results.

5 Discussion

Limitations. Even though we produce high-quality novel views which preserve the identity and look very realistic, there remain certain limitations for future work to address. Fig. 8 visualises two representative examples which are difficult for our method as well as for competing methods. In the first row, the head in the source image is only partially visible, so that the methods have high uncertainty in the frontal facial view and change the gender to female (e.g., hallucinate long hair). In the second case, the source texture is too fine-grained for the methods, so that some hallucinate repeating patterns and others generate patterns reminiscent of noise. In this case, our method generates a texture which is neither repetitive nor noise-like, but which is still far from the reference.

Future Extension of Our UV Feature Maps. Instead of sampling RGB textures from the input image to construct a partial UV texture map, learned CNN based features could be used to construct a more informative partial UV feature map, which putatively captures off-geometry details not modelled by the SMPL mesh. Then FeatureNet would convert this partial UV feature map to a full UV feature map. Another alternative would be to use displacement map prediction similar to prior work [21, 2] to capture off-geometry details.

6 Conclusion

In this work, we present an approach for human image synthesis, which allows us to change the camera view and the pose and garments of the subject in the source image. Our approach uses a high-dimensional UV feature map to encode appearance as an intermediate representation, which is then re-posed and translated using a generator network to achieve realistic rendering of the input subject in a novel pose. We qualitatively and quantitatively demonstrate the efficacy of our proposed approach at better preserving identity and garment details compared to the other competing methods. Our system, trained once for pose-guided image synthesis, can be directly used for other tasks such as garment and motion transfer.


This work was supported by the ERC Consolidator Grant 4DReply (770784).

Appendix A Appendix

This appendix provides details of the network architectures employed, as well as snippets from the user study.

A.1 Network Architecture


FeatureNet is a U-Net-based network that constructs the full UV feature map from the partial RGB UV texture map. The network architecture is shown in Fig. 9. FeatureNet comprises four down-sampling blocks followed by four up-sampling blocks. We use a texture resolution of 256 × 256. At the middle-most layer, the input is transformed to an activation volume with a spatial dimension of 16 × 16. In all figures, the 16-dimensional feature image is visualised by projecting it to three dimensions with a fixed random matrix.
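The resolution arithmetic and the visualisation step described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work. It assumes that each down-sampling block halves the spatial size (2 × 2 max pooling) and each up-sampling block doubles it, that the random matrix has Gaussian entries drawn with a fixed seed, and that the projected values are min-max normalised for display. The function names `trace_resolutions`, `make_projection` and `project_feature_image` are hypothetical.

```python
import random


def trace_resolutions(size=256, num_down=4, num_up=4):
    """Spatial size after each block of a U-Net-style network.

    Assumption: every down-sampling block halves the size and every
    up-sampling block doubles it; convolutions preserve the size.
    """
    sizes = [size]
    for _ in range(num_down):
        sizes.append(sizes[-1] // 2)
    for _ in range(num_up):
        sizes.append(sizes[-1] * 2)
    return sizes


def make_projection(in_dim=16, out_dim=3, seed=0):
    """Fixed random matrix (out_dim x in_dim); the seed keeps it identical across figures."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project_feature_image(feature_image, projection):
    """Project an H x W x C feature image (nested lists) to H x W x 3.

    The min-max normalisation to [0, 1] is an assumption made for display.
    """
    projected = [
        [[sum(w * f for w, f in zip(row, pixel)) for row in projection] for pixel in line]
        for line in feature_image
    ]
    flat = [v for line in projected for pixel in line for v in pixel]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant images
    return [[[(v - lo) / span for v in pixel] for pixel in line] for line in projected]


if __name__ == "__main__":
    # 256 -> 16 after four halvings, then back to 256 after four doublings.
    print(trace_resolutions())
    rng = random.Random(1)
    toy = [[[rng.random() for _ in range(16)] for _ in range(4)] for _ in range(4)]
    rgb = project_feature_image(toy, make_projection())
    print(len(rgb), len(rgb[0]), len(rgb[0][0]))
```

Under these assumptions, four halvings of a 256 × 256 texture give the 16 × 16 bottleneck stated above, and reusing the same seed yields the same projection matrix for every figure.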


RenderNet translates the rendered feature image (from the source feature map and target pose) to a photorealistic image. The network architecture is shown in Fig. 10. RenderNet and FeatureNet are trained together end-to-end following the full pipeline (Fig. 2).

A.2 Further Results

In Figs. 11 and 12 we show our results and compare them to Coordinate Based Inpainting (CBI) [12] and DensePose Transfer (DPT) [33]. Here, we present the figures that were used in the user study.

Figure 9: FeatureNet: Network 1 of our full pipeline (Fig. 2). FeatureNet converts the partial UV texture map to a full UV feature map, which encodes a richer high-dimensional representation at each texel. DS-<M> denotes a down-sampling block containing MaxPool2D and a double convolution with M output features. The US and Conv blocks are labelled analogously.
Figure 10: RenderNet: Network 2 of our full pipeline (Fig. 2). RenderNet is a translation network that translates the rendered 16-dimensional feature image to a photorealistic image. The network comprises (a) 3 down-sampling blocks, (b) 6 residual blocks, (c) 3 up-sampling blocks and finally (d) a convolution layer with tanh activation that gives the final output.
Figure 11: The first samples from the user study. We show the source image and three views generated by CBI [12], DPT [33] and our method, in a randomised order. The keys, which were not exposed during the user study, are shown in orange. See Fig. 12 for the remaining samples.
Figure 12: Further samples from the user study.


  1. Project webpage:


  1. S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz and R. Szeliski (2011) Building Rome in a day. Communications of the ACM 54 (10), pp. 105–112. Cited by: §2.1.
  2. T. Alldieck, G. Pons-Moll, C. Theobalt and M. Magnor (2019) Tex2Shape: detailed full human body geometry from a single image. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.3, §3.1, §5.
  3. G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand and J. V. Guttag (2018) Synthesizing images of humans in unseen poses. Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2.
  4. C. Buehler, M. Bosse, L. McMillan, S. J. Gortler and M. F. Cohen (2001) Unstructured lumigraph rendering. In SIGGRAPH, Cited by: §2.1.
  5. R. L. Carceroni and K. N. Kutulakos (2002) Multi-view scene capture by surfel sampling: from video streams to non-rigid 3d motion, shape and reflectance. International Journal of Computer Vision (IJCV) 49 (2), pp. 175–214. Cited by: §2.1.
  6. C. Chan, S. Ginosar, T. Zhou and A. A. Efros (2019) Everybody dance now. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.2, §2.3.
  7. G. Chaurasia, S. Duchêne, O. Sorkine-Hornung and G. Drettakis (2013) Depth synthesis and local warps for plausible image-based navigation. ACM Transactions on Graphics 32. Cited by: §2.2.
  8. P. Debevec, Y. Yu and G. Borshukov (1998) Efficient view-dependent image-based rendering with projective texture-mapping. Eurographics Workshop on Rendering. Cited by: §2.2.
  9. M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello, A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor et al. (2016) Fusion4D: real-time performance capture of challenging scenes. ACM Trans. Graph. 35 (4). Cited by: §2.1.
  10. P. Esser, E. Sutter and B. Ommer (2018) A variational u-net for conditional appearance and shape generation. In Computer Vision and Pattern Recognition (CVPR), pp. 8857–8866. Cited by: Figure 3, §4.2, Table 1.
  11. S. J. Gortler, R. Grzeszczuk, R. Szeliski and M. F. Cohen (1996) The lumigraph. In SIGGRAPH, pp. 43–54. Cited by: §2.1.
  12. A. K. Grigor’ev, A. Sevastopolsky, A. Vakhitov and V. S. Lempitsky (2019) Coordinate-based texture inpainting for pose-guided human image generation. Computer Vision and Pattern Recognition (CVPR), pp. 12127–12136. Cited by: Figure 11, §A.2, §1, §1, §2.3, §2.3, Figure 3, §4.1.1, §4.2, §4.2, §4.3, Table 1.
  13. K. Guo, F. Xu, T. Yu, X. Liu, Q. Dai and Y. Liu (2017) Real-time geometry, albedo, and motion reconstruction using a single rgb-d camera. ACM Trans. Graph. 36 (4). Cited by: §2.1.
  14. X. Han, X. Hu, W. Huang and M. R. Scott (2019-10) ClothFlow: a flow-based model for clothed person generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §2.3.
  15. Z. Huang, T. Li, W. Chen, Y. Zhao, J. Xing, C. Legendre, L. Luo, C. Ma and H. Li (2018) Deep volumetric video from very sparse multi-view performance capture. In European Conference on Computer Vision (ECCV), pp. 351–369. Cited by: §2.1.
  16. J. Johnson, A. Alahi and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pp. 694–711. Cited by: 1st item.
  17. A. Kanazawa, M. J. Black, D. W. Jacobs and J. Malik (2018) End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  18. H. Kim, M. Elgharib, M. Zollhöfer, H.-P. Seidel, T. Beeler, C. Richardt and C. Theobalt (2019) Neural style-preserving visual dubbing. ACM Transactions on Graphics (TOG) 38 (6), pp. 178:1–13. Cited by: §2.2.
  19. H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer and C. Theobalt (2018) Deep video portraits. ACM Transactions on Graphics (TOG) 37. Cited by: §1, §2.2, §2.3.
  20. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §3.5.
  21. V. Lazova, E. Insafutdinov and G. Pons-Moll (2019) 360-degree textures of people in clothing from a single image. International Conference on 3D Vision (3DV), pp. 643–653. Cited by: §1, §2.3, §5.
  22. M. Levoy and P. Hanrahan (1996) Light field rendering. In SIGGRAPH, pp. 31–42. Cited by: §2.1.
  23. L. Liu, W. Xu, M. Zollhöfer, H. Kim, F. Bernard, M. Habermann, W. Wang and C. Theobalt (2019) Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (TOG). Cited by: §1.
  24. W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj and L. Song (2017) Sphereface: deep hypersphere embedding for face recognition. In Computer Vision and Pattern Recognition (CVPR), pp. 212–220. Cited by: 3rd item.
  25. W. Liu, Z. Piao, M. Jie, W. Luo, L. Ma and S. Gao (2019) Liquid warping gan: a unified framework for human motion imitation, appearance transfer and novel view synthesis. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.3.
  26. Y. Liu, Q. Dai and W. Xu (2010) A point-cloud-based multiview stereo algorithm for free-viewpoint video. IEEE Transactions on Visualization and Computer Graphics (TVCG) 16 (3), pp. 407–418. Cited by: §2.1.
  27. Z. Liu, P. Luo, S. Qiu, X. Wang and X. Tang (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In Computer Vision and Pattern Recognition (CVPR), pp. 1096–1104. Cited by: 2nd item, §4.1.1, §4.3.
  28. S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann and Y. Sheikh (2019) Neural volumes: learning dynamic renderable volumes from images. ACM Trans. Graph. (SIGGRAPH) 38 (4). Cited by: §2.2.
  29. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll and M. J. Black (2015-10) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §1.
  30. L. Ma, Q. Sun, S. Georgoulis, L. van Gool, B. Schiele and M. Fritz (2018) Disentangled person image generation. Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2.
  31. R. Martin Brualla, P. Lincoln, A. Kowdle, C. Rhemann, D. Goldman, C. Keskin, S. Seitz, S. Izadi, S. Fanello, R. Pandey, S. Yang, P. Pidlypenskyi, J. Taylor, J. Valentin, S. Khamis, P. Davidson and A. Tkach (2018) LookinGood: enhancing performance capture with real-time neural re-rendering. ACM Transactions on Graphics (TOG) 37. Cited by: §2.2.
  32. T. Matsuyama, X. Wu, T. Takai and T. Wada (2004) Real-time dynamic 3-d object shape reconstruction and high-fidelity texture mapping for 3-d video. IEEE Transactions on Circuits and Systems for Video Technology 14 (3), pp. 357–369. Cited by: §2.1.
  33. N. Neverova, R. A. Güler and I. Kokkinos (2018) Dense pose transfer. European Conference on Computer Vision (ECCV). Cited by: Figure 11, §A.2, §1, §1, §2.2, §2.3, §2.3, Figure 3, §4.1.1, §4.2, §4.3, Table 1.
  34. S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson, S. Khamis, M. Dou et al. (2016) Holoportation: virtual 3d teleportation in real-time. In Annual Symposium on User Interface Software and Technology, pp. 741–754. Cited by: §2.1.
  35. R. Pandey, A. Tkach, S. Yang, P. Pidlypenskyi, J. Taylor, R. Martin-Brualla, A. Tagliasacchi, G. Papandreou, P. Davidson, C. Keskin, S. Izadi and S. Fanello (2019) Volumetric capture of humans with a single rgbd camera via semi-parametric learning. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  36. H. Pfister, M. Zwicker, J. van Baar and M. Gross (2000) Surfels: surface elements as rendering primitives. In SIGGRAPH, pp. 335–342. Cited by: §2.1.
  37. R. A. Güler, N. Neverova and I. Kokkinos (2018) DensePose: dense human pose estimation in the wild. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §3.
  38. O. Ronneberger, P. Fischer and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241. Cited by: §3.
  39. S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa and H. Li (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. International Conference on Computer Vision (ICCV). Cited by: §2.2.
  40. J. L. Schonberger and J. Frahm (2016) Structure-from-motion revisited. In Computer Vision and Pattern Recognition (CVPR), pp. 4104–4113. Cited by: §2.1.
  41. J. Shade, S. Gortler, L. He and R. Szeliski (1998) Layered depth images. In SIGGRAPH, pp. 231–242. Cited by: §2.1.
  42. A. Shysheya, E. Zakharov, K. Aliev, R. Bashirov, E. Burkov, K. Iskakov, A. Ivakhnenko, Y. Malkov, I. Pasechnik, D. Ulyanov, A. Vakhitov and V. Lempitsky (2019) Textured neural avatars. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3.
  43. A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci and N. Sebe (2019) Animating arbitrary objects via deep motion transfer. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3.
  44. A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci and N. Sebe (2019) First order motion model for image animation. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.3.
  45. A. Siarohin, S. Lathuilière, E. Sangineto and N. Sebe (2019) Appearance and pose-conditioned human image generation using deformable gans. Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: Figure 3, §4.1.1, §4.2, Table 1.
  46. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: 1st item.
  47. V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein and M. Zollhöfer (2019) DeepVoxels: learning persistent 3d feature embeddings. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2.
  48. V. Sitzmann, M. Zollhöfer and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.2, §2.2.
  49. T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll and Y. Liu (2018) DoubleFusion: real-time capture of human performance with inner body shape from a depth sensor. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  50. J. Thies, M. Zollhöfer and M. Nießner (2019) Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics (TOG) 38. Cited by: §1, §2.2, §2.3.
  51. J. Thies, M. Zollhöfer, C. Theobalt, M. Stamminger and M. Nießner (2020) Image-guided neural object rendering. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §2.2.
  52. T. Tung, S. Nobuhara and T. Matsuyama (2009) Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In International Conference on Computer Vision (ICCV), pp. 1709–1716. Cited by: §2.1.
  53. G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev and C. Schmid (2017) Learning from synthetic humans. In Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  54. T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Computer Vision and Pattern Recognition (CVPR), Cited by: 2nd item, §3.4, §3.
  55. M. Waschbüsch, S. Würmlin, D. Cotting, F. Sadlo and M. Gross (2005) Scalable 3d video of dynamic scenes. The Visual Computer 21 (8), pp. 629–638. Cited by: §2.1.
  56. Z. Xu, S. Bi, K. Sunkavalli, S. Hadap, H. Su and R. Ramamoorthi (2019) Deep view synthesis from sparse photometric images. ACM Trans. Graph. 38 (4), pp. 76:1–76:13. Cited by: §2.2.
  57. T. Yu, K. Guo, F. Xu, Y. Dong, Z. Su, J. Zhao, J. Li, Q. Dai and Y. Liu (2017) BodyFusion: real-time capture of human motion and surface geometry using a single depth camera. In International Conference on Computer Vision (ICCV), pp. 910–919. Cited by: §2.1.
  58. T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll and Y. Liu (2019) SimulCap: single-view human performance capture with cloth simulation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  59. P. Zablotskaia, A. Siarohin, L. Sigal and B. Zhao (2019) DwNet: dense warp-based network for pose-guided human video generation. In British Machine Vision Conference (BMVC), Cited by: §2.3, Figure 7, §4.1.1, §4.6.
  60. L. Zhang, B. Curless and S. M. Seitz (2003) Spacetime stereo: shape recovery for dynamic scenes. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  61. R. Zhang, P. Isola, A. A. Efros, E. Shechtman and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2, Table 1.
  62. B. Zhao, X. Wu, Z. Cheng, H. Liu, Z. Jie and J. Feng (2018) Multi-view image generation from a single-view. In ACM International Conference on Multimedia, pp. 383–391. Cited by: §2.2.
  63. Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.2, Table 1.
  64. Y. Zhou, Z. Wang, C. Fang, T. Bui and T. L. Berg (2019) Dance dance generation: motion transfer for internet videos. In International Conference on Computer Vision Workshops (ICCVW), Cited by: §2.3.
  65. H. Zhu, H. Su, P. Wang, X. Cao and R. Yang (2018) View extrapolation of human body from a single image. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3.
  66. J. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum and B. Freeman (2018) Visual object networks: image generation with disentangled 3d representations. In Conference on Neural Information Processing Systems (NeurIPS), pp. 118–129. Cited by: §2.2.