Animating Arbitrary Objects via Deep Motion Transfer
This paper introduces a novel deep learning framework for image animation. Given an input image with a target object and a driving video sequence depicting a moving object, our framework generates a video in which the target object is animated according to the driving sequence. This is achieved through a deep architecture that decouples appearance and motion information. Our framework consists of three main modules: (i) a Keypoint Detector trained without supervision to extract object keypoints, (ii) a Dense Motion prediction network for generating dense heatmaps from sparse keypoints, in order to better encode motion information, and (iii) a Motion Transfer Network, which uses the motion heatmaps and appearance information extracted from the input image to synthesize the output frames. We demonstrate the effectiveness of our method on several benchmark datasets, spanning a wide variety of object appearances, and show that our approach outperforms state-of-the-art image animation and video generation methods. Our source code is publicly available at https://github.com/AliaksandrSiarohin/monkey-net.
This paper introduces a framework for motion-driven image animation to automatically generate videos by combining the appearance information derived from a source image (e.g. depicting the face or the body silhouette of a certain person) with motion patterns extracted from a driving video (e.g. encoding the facial expressions or the body movements of another person). Several examples are given in Fig. 1. Generating high-quality videos from static images is challenging, as it requires learning an appropriate representation of an object, such as a 3D model of a face or a human body. This task also requires accurately extracting the motion patterns from the driving video and mapping them on the object representation. Most approaches are object-specific, using techniques from computer graphics [7, 42]. These methods also use an explicit object representation, such as a 3D morphable model , to facilitate animation, and therefore only consider faces.
Over the past few years, researchers have developed approaches for automatic synthesis and enhancement of visual data. Several methods derived from Generative Adversarial Networks (GAN)  and Variational Autoencoders (VAE)  have been proposed to generate images and videos [22, 36, 34, 43, 33, 41, 40, 37]. These approaches use additional information such as conditioning labels (e.g. indicating a facial expression, a body pose) [49, 35, 16, 39]. More specifically, they are purely data-driven, leveraging a large collection of training data to learn a latent representation of the visual inputs for synthesis. Noting the significant progress of these techniques, recent research studies have started exploring the use of deep generative models for image animation and video retargeting [50, 9, 4, 47, 3]. These works demonstrate that deep models can effectively transfer motion patterns between human subjects in videos , or transfer a facial expression from one person to another . However, these approaches have limitations: for example, they rely on pre-trained models for extracting object representations that require costly ground-truth data annotations [9, 47, 3]. Furthermore, these works do not address the problem of animating arbitrary objects: instead, considering a single object category  or learning to translate videos from one specific domain to another [4, 25].
This paper addresses some of these limitations by introducing a novel deep learning framework for animating a static image using a driving video. Inspired by , we propose learning a latent representation of an object category in a self-supervised way, leveraging a large collection of video sequences. There are two key distinctions between our work and . Firstly, our approach is not designed for a specific object category, but rather is effective in animating arbitrary objects. Secondly, we introduce a novel strategy to model and transfer motion information, using a set of sparse motion-specific keypoints that are learned in an unsupervised way to describe relative pixel movements. Our intuition is that only relevant motion patterns (derived from the driving video) must be transferred for object animation, while other information should not be used. We call the proposed deep framework Monkey-Net, as it enables motion transfer by considering MOviNg KEYpoints.
We demonstrate the effectiveness of our framework by conducting an extensive experimental evaluation on three publicly available datasets, previously used for video generation: the Tai-Chi , the BAIR robot pushing  and the UvA-NEMO Smile  datasets. As shown in our experiments, our image animation method produces high quality videos for a wide range of objects. Furthermore, our quantitative results clearly show that our approach outperforms state-of-the-art methods for image-to-video translation tasks.
2 Related work
Deep Video Generation. Early deep learning-based approaches for video generation proposed synthesizing videos by using spatio-temporal networks. Vondrick et al.  introduced VGAN, a 3D convolutional GAN which simultaneously generates all the frames of the target video. Similarly, Saito et al.  proposed TGAN, a GAN-based model which is able to generate multiple frames at the same time. However, the visual quality of the outputs of these methods is typically poor.
More recent video generation approaches used recurrent neural networks within an adversarial training framework. For instance, Wang et al.  introduced a Conditional MultiMode Network (CMM-Net), a deep architecture which adopts a conditional Long-Short Term Memory (LSTM) network and a VAE to generate face videos. Tulyakov et al.  proposed MoCoGAN, a deep architecture based on a recurrent neural network trained with an adversarial learning scheme. These approaches can take as input conditional information comprising categorical labels or static images and, as a result, produce high-quality video frames of desired actions.
Video generation is closely related to the future frame prediction problem addressed in [38, 30, 14, 44, 52]. Given a video sequence, these methods aim to synthesize a sequence of images which represents a coherent continuation of the given video. Earlier methods [38, 30, 26] attempted to directly predict the raw pixel values in future frames. Other approaches [14, 44, 2] proposed learning the transformations which map the pixels in the given frames to the future frames. Recently, Villegas et al.  introduced a hierarchical video prediction model consisting of two stages: it first predicts the motion of a set of landmarks using an LSTM, then generates images from the landmarks.
Our approach is closely related to these previous works since we also aim to generate video sequences by using a deep learning architecture. However, we tackle a more challenging task: image animation requires decoupling and modeling motion and content information, as well as recombining them.
Object Animation. Over the years, the problems of image animation and video re-targeting have attracted attention from many researchers in the fields of computer vision, computer graphics and multimedia. Traditional approaches [7, 42] are designed for specific domains, as they operate only on faces, human silhouettes, etc. In this case, an explicit representation of the object of interest is required to generate an animated face corresponding to a certain person’s appearance, but with the facial expressions of another. For instance, 3D morphable models  have been traditionally used for face animation . While especially accurate, these methods are highly domain-specific and their performance drastically degrades in challenging situations, such as in the presence of occlusions.
Image animation from a driving video can be interpreted as the problem of transferring motion information from one domain to another. Bansal et al.  proposed Recycle-GAN, an approach which extends conditional GAN by incorporating spatio-temporal cues in order to generate a video in one domain given a video in another domain. However, their approach only learns the association between two specific domains, while we want to animate an image depicting one object without knowing at training time which object will be used in the driving video. Similarly, Chan et al.  addressed the problem of motion transfer, casting it within a per-frame image-to-image translation framework. They also proposed incorporating spatio-temporal constraints. The importance of considering temporal dynamics for video synthesis was also demonstrated in . Wiles et al.  introduced X2Face, a deep architecture which, given an input image of a face, modifies it according to the motion patterns derived from another face or another modality, such as audio. They demonstrated that a purely data-driven deep learning-based approach is effective in animating still images of faces without demanding explicit 3D representation. In this work, we design a self-supervised deep network for animating static images, which is effective for generating arbitrary objects.
The architecture of the Monkey-Net is given in Fig. 2. We now describe it in detail.
3.1 Overview and Motivation
The objective of this work is to animate an object based on the motion of a similar object in a driving video. Our framework is articulated into three main modules (Fig. 2). The first network, named Keypoint Detector, takes as input the source image and a frame from the driving video and automatically extracts sparse keypoints. The output of this module is then fed to a Dense Motion prediction network, which translates the sparse keypoints into motion heatmaps. The third module, the Motion Transfer network, receives as input the source image and the dense motion heatmap and recombines them producing a target frame.
The output video is generated frame-by-frame as illustrated in Fig. 2.a. At time $t$, the Monkey-Net uses the source image and the $t^{th}$ frame from the driving video. In order to train a Monkey-Net one just needs a dataset consisting of videos of objects of interest. No specific labels, such as keypoint annotations, are required. The learning process is fully self-supervised. Therefore, at test time, in order to generate a video sequence, the generator requires only a static input image and a motion descriptor from the driving sequence. Inspired by recent studies on unsupervised landmark discovery for learning image representations [23, 51], we formulate the problem of learning a motion representation as an unsupervised motion-specific keypoint detection task. Indeed, the differences in keypoint locations between two frames can be seen as a compact motion representation. In this way, our model generates a video by modifying the input image according to the landmarks extracted from the driving frames. The use of a Monkey-Net at inference time is detailed in Sec. 3.6.
The Monkey-Net architecture is illustrated in Fig. 2.b. Let $x$ and $x' \in \mathcal{X}$ be two frames of size $H \times W$ extracted from the same video. The $H \times W$ lattice is denoted by $\mathcal{U}$. Inspired by , we jointly learn a keypoint detector $\Delta$ together with a generator network $G$ according to the following objective: $G$ should be able to reconstruct $x'$ from the keypoint locations $\Delta(x) \in \mathcal{U}$, $\Delta(x') \in \mathcal{U}$, and $x$. In this formulation, the motion between $x$ and $x'$ is implicitly modeled. To deal with large motions, we aim to learn keypoints that describe motion as well as the object geometry. To this end, we add a third network $M$ that estimates the optical flow $\mathcal{F} \in \mathbb{R}^{H \times W \times 2}$ between $x'$ and $x$ from $\Delta(x)$, $\Delta(x')$ and $x$. The motivation for this is twofold. First, this forces the keypoint detector $\Delta$ to predict keypoint locations that capture not only the object structure but also its motion. To do so, the learned keypoints must be located especially on the object parts with high probability of motion. For instance, considering the human body, it is important to obtain keypoints on the extremities (as in feet or hands) in order to describe the body movements correctly, since these body parts tend to move the most. Second, following common practices in conditional image generation, the generator $G$ is implemented as an encoder-decoder composed of convolutional blocks . However, standard convolutional encoder-decoders are not designed to handle large pixel-to-pixel misalignment between the input and output images [35, 3, 15]. To address this, we introduce a deformation module within the generator $G$ that employs the estimated optical flow $\mathcal{F}$ in order to align the encoder features with $x'$.
3.2 Unsupervised Keypoint Detection
In this section, we detail the structure employed for unsupervised keypoint detection. First, we employ a standard U-Net architecture that, from the input image, estimates $K$ heatmaps $H_k \in [0,1]^{H \times W}$, one for each keypoint. We employ softmax activations for the last layer of the decoder in order to obtain heatmaps that can be interpreted as a detection confidence map for each keypoint. An encoder-decoder architecture is used here since it has shown good performance for keypoint localization [6, 31].
To model the keypoint location confidence, we fit a Gaussian on each detection confidence map. Modeling the landmark location by a Gaussian instead of using the complete heatmap directly acts as a bottleneck layer, and therefore allows the model to learn landmarks in an indirect way. The expected keypoint coordinates $h_k \in \mathbb{R}^2$ and their covariance $\Sigma_k$ are estimated according to:
$$h_k = \sum_{p \in \mathcal{U}} H_k[p]\, p; \qquad \Sigma_k = \sum_{p \in \mathcal{U}} H_k[p]\, (p - h_k)(p - h_k)^\top \qquad (1)$$
The intuition behind the use of keypoint covariances is that they can capture not only the location of a keypoint but also its orientation. Again considering the example of the human body: in the case of the legs, the covariance may capture their orientation. Finally, we encode the keypoint distributions as heatmaps $H_k \in [0,1]^{H \times W}$, such that they can be used as inputs to the generator and to the motion networks. Indeed, the advantage of using a heatmap representation, rather than considering directly the 2D coordinates $h_k$, is that heatmaps are compatible with the use of convolutional neural networks. Formally, we employ the following Gaussian-like function:
$$\forall p \in \mathcal{U}, \quad H_k(p) = \frac{1}{\alpha} \exp\left(-(p - h_k)^\top \Sigma_k^{-1} (p - h_k)\right) \qquad (2)$$
where $\alpha$ is a normalization constant. This process is applied independently on $x$ and $x'$, leading to two sets of keypoint heatmaps $H = \{H_k\}_{k=1..K}$ and $H' = \{H'_k\}_{k=1..K}$.
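The moment computation and Gaussian re-encoding described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists on a toy grid instead of a deep learning framework, the function names `spatial_softmax`, `keypoint_moments` and `gaussian_heatmap` are hypothetical, and the normalization constant is assumed here to make the heatmap sum to one.

```python
import math

def spatial_softmax(logits):
    """Normalise a 2D grid of scores (list of rows) so that it sums to 1."""
    peak = max(max(row) for row in logits)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def keypoint_moments(conf):
    """Return the mean (x, y) and covariance (sxx, sxy, syy) of a confidence map."""
    mx = sum(w * x for row in conf for x, w in enumerate(row))
    my = sum(w * y for y, row in enumerate(conf) for w in row)
    sxx = sxy = syy = 0.0
    for y, row in enumerate(conf):
        for x, w in enumerate(row):
            dx, dy = x - mx, y - my
            sxx += w * dx * dx
            sxy += w * dx * dy
            syy += w * dy * dy
    return (mx, my), (sxx, sxy, syy)

def gaussian_heatmap(mean, cov, height, width):
    """Re-encode (mean, cov) as a Gaussian-like map normalised to sum to 1."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed non-zero (non-degenerate confidence map)
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det  # inverse of the 2x2 covariance
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - mean[0], y - mean[1]
            quad = ixx * dx * dx + 2 * ixy * dx * dy + iyy * dy * dy
            row.append(math.exp(-quad))
        grid.append(row)
    alpha = sum(sum(row) for row in grid)  # assumption: normalise to unit sum
    return [[v / alpha for v in row] for row in grid]

# Toy 5x5 score map peaked at the centre pixel (x=2, y=2).
logits = [[-((x - 2) ** 2 + (y - 2) ** 2) for x in range(5)] for y in range(5)]
conf = spatial_softmax(logits)
mean, cov = keypoint_moments(conf)
heatmap = gaussian_heatmap(mean, cov, 5, 5)
```

Because only the five numbers in `mean` and `cov` pass through to `gaussian_heatmap`, the sketch also shows the bottleneck effect: any detail of the confidence map beyond its first two moments is discarded.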
3.3 Generator Network with Deformation Module
In this section, we detail how we reconstruct the target frame $x'$ from $x$, $\Delta(x)$ and $\Delta(x')$. First we employ a standard convolutional encoder composed of a sequence of convolutions and average pooling layers in order to encode the object appearance in $x$. Let $\xi_r \in \mathbb{R}^{H_r \times W_r \times C_r}$ denote the output of the $r^{th}$ block of the encoder network ($1 \le r \le R$). The architecture of this generator network is also based on the U-Net architecture  in order to obtain better details in the generated image. Motivated by , where it was shown that a standard U-Net cannot handle large pixel-to-pixel misalignment between the input and the output images, we propose using a deformation module to align the features of the encoder with the output images. Contrary to , which defines an affine transformation for each human body part in order to compute the feature deformation, we propose a deformation module that can be used on any object. In particular, we propose employing the optical flow $\mathcal{F}$ to align the features $\xi_r$ with $x'$. The deformation employs a warping function $f_w(\cdot, \cdot)$ that warps the feature maps according to $\mathcal{F}$:
$$\xi'_r = f_w(\xi_r, \mathcal{F}) \qquad (3)$$
This warping operation is implemented using a bilinear sampler, resulting in a fully differentiable model. Note that $\mathcal{F}$ is down-sampled to $H_r \times W_r$ via nearest neighbour interpolation when computing Eq. (3). Nevertheless, because of the small receptive field of the bilinear sampling layer, encoding the motion only via the deformation module leads to optimization problems. In order to facilitate network training, we propose feeding the decoder the difference of the keypoint locations encoded as heatmaps, $\dot{H} = H' - H$. Indeed, by providing $\dot{H}$ to the decoder, the reconstruction loss applied on the outputs (see Sec. 3.5) is directly propagated to the keypoint detector $\Delta$ without going through the motion network $M$. In addition, the advantage of the heatmap difference representation is that it encodes both the locations and the motions of the keypoints. Similarly to $\mathcal{F}$, we compute $R$ tensors $\dot{H}_r$ by down-sampling $\dot{H}$ to $H_r \times W_r$. The two tensors $\dot{H}_r$ and $\xi'_r$ are concatenated along the channel axis and are then treated as skip-connection tensors by the decoder.
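To make the warping step concrete, the following is a minimal sketch of backward warping with bilinear sampling on a single-channel map stored as nested Python lists. The function names, the convention that output pixel (x, y) reads the input at (x + dx, y + dy), and the clamping of out-of-range coordinates to the border are all assumptions made for illustration; the paper does not specify these details.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample feat (list of rows) at real-valued (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp(feat, flow):
    """Backward warp: flow[y][x] = (dx, dy) tells where output pixel (x, y) reads from."""
    return [
        [bilinear_sample(feat, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(len(feat[0]))]
        for y in range(len(feat))
    ]

feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
identity = [[(0.0, 0.0)] * 3 for _ in range(3)]
half_right = [[(0.5, 0.0)] * 3 for _ in range(3)]
one_right = [[(1.0, 0.0)] * 3 for _ in range(3)]

same = warp(feat, identity)       # unchanged feature map
blended = warp(feat, half_right)  # each pixel mixes itself with its right neighbour
shifted = warp(feat, one_right)   # each pixel takes the value of its right neighbour
```

Each output value depends on at most four neighbouring input values, which is the small receptive field mentioned above: the gradient with respect to the flow only sees the local intensity differences around the sampled location.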
3.4 From Sparse Keypoints to Dense Optical Flow
In this section, we detail how we estimate the optical flow $\mathcal{F}$. The task of predicting a dense optical flow only from the displacement of a few keypoints and the appearance of the first frame is challenging. In order to facilitate the task of the network, we adopt a part-based formulation. We make the assumption that each keypoint is located on an object part that is locally rigid. Thus, the task of computing the optical flow becomes simpler since, now, the problem consists in estimating masks $M_k \in \mathbb{R}^{H \times W}$ that segment the object into rigid parts corresponding to each keypoint. A first coarse estimation of the optical flow can be given by:
$$\mathcal{F}_{coarse} = \sum_{k=1}^{K+1} M_k \otimes \rho(\dot{h}_k) \qquad (4)$$
where $\dot{h}_k$ denotes the displacement of the $k^{th}$ keypoint between the two frames, $\otimes$ denotes the element-wise product and $\rho(\cdot) \in \mathbb{R}^{H \times W \times 2}$ is the operator that returns a tensor by repeating the input vector $H \times W$ times. Additionally, we employ one specific mask $M_{K+1}$ without deformation (which corresponds to $\rho([0, 0])$) to capture the static background. In addition to the masks $M_k$, the motion network $M$ also predicts the residual motion $\mathcal{F}_{residual}$. The purpose of this residual motion field is to refine the coarse estimation by predicting non-rigid motion that cannot be modeled by the part-based approach. The final estimated optical flow is: $\mathcal{F} = \mathcal{F}_{coarse} + \mathcal{F}_{residual}$.
Concerning the inputs of the motion network, $M$ takes two tensors, $\dot{H}$ and $x$, corresponding respectively to the sparse motion and the appearance. However, we can observe that, similarly to the generator network, $M$ may suffer from the misalignment between the input $x$ and the output $\mathcal{F}$. Indeed, $\mathcal{F}$ is aligned with $x'$. To handle this problem, we use the warping operator $f_w$ according to the motion field of each keypoint $\rho(\dot{h}_k)$, e.g. $x_k = f_w(x, \rho(\dot{h}_k))$. This solution provides images $x_k$ that are locally aligned with $x'$ in the neighborhood of $h'_k$. Finally, we concatenate $H' - H$, $\{x_k\}_{k=1..K}$ and $x$ along the channel axis and feed them into a standard U-Net network. Similarly to the keypoint and the generator networks, the use of the U-Net architecture is motivated by the need for fine-grained details.
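The part-based composition of the flow can be sketched as follows. This is an illustrative toy, not the authors' code: `dense_flow` is a hypothetical name, the masks and residual are hand-written rather than predicted by a network, and the sign convention of the keypoint displacements is left to the caller.

```python
def dense_flow(masks, displacements, residual):
    """Combine K+1 part masks, K keypoint displacements and a residual field.

    masks:         K+1 maps indexed [k][y][x]; the last one is the background mask
    displacements: K vectors (dx, dy), one per keypoint
    residual:      field indexed [y][x] holding (dx, dy) corrections
    """
    vectors = list(displacements) + [(0.0, 0.0)]  # the background part does not move
    if len(masks) != len(vectors):
        raise ValueError("expected one mask per keypoint plus one background mask")
    flow = []
    for y, res_row in enumerate(residual):
        row = []
        for x, (rx, ry) in enumerate(res_row):
            # Coarse flow: mask-weighted sum of the per-part displacement vectors.
            cx = sum(m[y][x] * v[0] for m, v in zip(masks, vectors))
            cy = sum(m[y][x] * v[1] for m, v in zip(masks, vectors))
            # Final flow: coarse estimate plus the non-rigid residual.
            row.append((cx + rx, cy + ry))
        flow.append(row)
    return flow

# One keypoint (K=1) on a 1x2 image: the left pixel belongs to the moving part,
# the right pixel to the background, which only receives the residual correction.
masks = [[[1.0, 0.0]], [[0.0, 1.0]]]
displacements = [(2.0, -1.0)]
residual = [[(0.0, 0.0), (0.5, 0.0)]]
flow = dense_flow(masks, displacements, residual)
```

With hard masks as in this toy, every pixel of a part moves rigidly with its keypoint; soft masks would blend the displacements of neighbouring parts.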
3.5 Network Training
We propose training the whole network in an end-to-end fashion. As formulated in Sec. 3.1, our loss ensures that $x'$ is correctly reconstructed from $\Delta(x)$, $\Delta(x')$ and $x$. Following the recent advances in image generation, we combine an adversarial loss and the feature matching loss proposed in  in order to learn to reconstruct $x'$. More precisely, we use a discriminator network $D$ that takes as input $H'$ concatenated with either the real image $x'$ or the generated image $\hat{x}'$. We employ the least-square GAN formulation , leading to the two following losses used to train the discriminator and the generator:
$$\mathcal{L}^{D}_{gan}(D) = \mathbb{E}_{x' \in \mathcal{X}}\left[(D(x' \oplus H') - 1)^2\right] + \mathbb{E}_{(x, x') \in \mathcal{X}^2}\left[D(\hat{x}' \oplus H')^2\right]$$
$$\mathcal{L}^{G}_{gan}(G) = \mathbb{E}_{(x, x') \in \mathcal{X}^2}\left[(D(\hat{x}' \oplus H') - 1)^2\right] \qquad (5)$$
where $\oplus$ denotes the concatenation along the channel axis. Note that in Eq. (5), the dependence on the trained parameters of $G$, $M$, and $\Delta$ appears implicitly via $\hat{x}'$. Note also that we provide the keypoint locations $H'$ to the discriminator to help it focus on moving parts and not on the background. However, when updating the generator, we do not propagate the discriminator loss gradient through $H'$, to prevent the generator from fooling the discriminator by generating meaningless keypoints.
The GAN loss is combined with a feature matching loss that encourages the output image $\hat{x}'$ and $x'$ to have similar feature representations. The feature representations employed to compute this loss are the intermediate layers of the discriminator $D$. The feature matching loss is given by:
$$\mathcal{L}_{rec} = \mathbb{E}_{(x, x')}\left[\sum_{i} \left\| D_i(\hat{x}' \oplus H') - D_i(x' \oplus H') \right\|_1\right] \qquad (6)$$
where $D_i$ denotes the $i^{th}$-layer feature extractor of the discriminator $D$, and $D_0$ denotes the discriminator input. The main advantage of the feature matching loss is that, unlike other perceptual losses [35, 24], it does not require the use of an external pre-trained network. Finally, the overall loss is obtained by combining Eqs. (6) and (5): $\mathcal{L}_{tot} = \lambda_{rec} \mathcal{L}_{rec} + \mathcal{L}^{G}_{gan}$. In all our experiments, we chose the value of $\lambda_{rec}$ following . Additional details of our implementation are given in the Supplementary Material A.
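The way the least-square GAN terms and the feature matching term combine can be sketched on plain numbers. This is a hedged illustration rather than the training code: discriminator scores and per-layer features are passed in as Python lists instead of tensors, the L1 norm is averaged over feature entries (an assumption), all function names are hypothetical, and `lambda_rec` is left as a caller-supplied weight.

```python
def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push scores of real inputs towards 1 and scores of generated inputs towards 0."""
    return mean((s - 1.0) ** 2 for s in real_scores) + mean(s ** 2 for s in fake_scores)

def lsgan_generator_loss(fake_scores):
    """Push scores of generated inputs towards 1."""
    return mean((s - 1.0) ** 2 for s in fake_scores)

def feature_matching_loss(fake_feats, real_feats):
    """Sum over layers of the mean absolute difference between flattened features."""
    return sum(
        mean(abs(a - b) for a, b in zip(fake_layer, real_layer))
        for fake_layer, real_layer in zip(fake_feats, real_feats)
    )

def generator_total_loss(fake_scores, fake_feats, real_feats, lambda_rec):
    """Weighted feature matching term plus the generator's adversarial term."""
    rec = feature_matching_loss(fake_feats, real_feats)
    return lambda_rec * rec + lsgan_generator_loss(fake_scores)

# A discriminator that separates real from generated perfectly incurs zero loss.
d_loss = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
# Two feature layers, flattened: the first with two entries, the second with one.
rec_loss = feature_matching_loss([[1.0, 2.0], [3.0]], [[0.0, 2.0], [1.0]])
total = generator_total_loss([1.0], [[1.0]], [[0.0]], lambda_rec=2.0)
```

Since both feature lists come from the discriminator itself, no external pre-trained network appears anywhere in the sketch, which mirrors the advantage stated above.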
3.6 Generation Procedure
At test time, our network receives a driving video and a source image. In order to generate the $t^{th}$ frame, $\Delta$ estimates the keypoint locations $h^s_k$ in the source image. Similarly, we estimate the keypoint locations $h^1_k$ and $h^t_k$ from the first and the $t^{th}$ frames of the driving video. Rather than generating a video from the absolute positions of the keypoints, the source image keypoints are transferred according to the relative difference between keypoints in the video. The keypoints in the generated frame are given by:
$$h^{s'}_k = h^s_k + (h^t_k - h^1_k) \qquad (7)$$
The keypoints $h^{s'}_k$ and $h^s_k$ are then encoded as heatmaps using the covariance matrices estimated from the driving video, as described in Sec. 3.2. Finally, the heatmaps are given to the dense motion and the generator networks together with the source image (see Secs. 3.3 and 3.4). Importantly, one limitation of transferring relative motion is that it cannot be applied to arbitrary source images. Indeed, if the driving video object is not roughly aligned with the source image object, Eq. (7) may lead to absolute keypoint positions that are physically impossible for the considered object, as illustrated in Supplementary Material C.1.
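The relative transfer of keypoints amounts to adding, for each keypoint, the displacement observed in the driving video to the corresponding source keypoint. A minimal sketch follows; the function name `transfer_keypoints` and the representation of keypoints as `(x, y)` tuples are assumptions made for illustration.

```python
def transfer_keypoints(source_kp, driving_first_kp, driving_t_kp):
    """Move each source keypoint by the displacement of its driving counterpart.

    All three arguments are lists of (x, y) tuples with one entry per keypoint,
    in the same keypoint order.
    """
    return [
        (sx + (tx - fx), sy + (ty - fy))
        for (sx, sy), (fx, fy), (tx, ty) in zip(source_kp, driving_first_kp, driving_t_kp)
    ]

source_kp = [(10.0, 20.0), (30.0, 40.0)]
driving_first_kp = [(5.0, 5.0), (8.0, 8.0)]
driving_t_kp = [(7.0, 4.0), (8.0, 8.0)]  # the second keypoint has not moved

animated_kp = transfer_keypoints(source_kp, driving_first_kp, driving_t_kp)
```

The second keypoint stays at its source position even though the driving object sits elsewhere in the frame, which is why only motion, and not the driving object's proportions, is transferred. The same example also hints at the limitation noted above: a large driving displacement is applied unchanged, whatever the pose of the source object.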
In this section, we present an in-depth evaluation on three problems, tested on three very different datasets and employing a large variety of metrics.
Datasets. The UvA-Nemo dataset  is a facial dynamics analysis dataset composed of 1240 videos. We follow the same pre-processing as in . Specifically, faces are aligned using the OpenFace library  before re-sizing each frame to pixels. Each video starts from a neutral expression and lasts 32 frames. As in , we use 1110 videos for training and 124 for evaluation.
The Tai-Chi dataset  is composed of 4500 tai-chi video clips downloaded from YouTube. We use the data as pre-processed in . In particular, the frames are resized to pixels. The videos are split into 3288 and 822 videos for training and testing respectively. The video length varies from 32 to 100 frames.
The BAIR robot pushing dataset  contains videos collected by a Sawyer robotic arm pushing a variety of objects over a table. It contains 40960 training and 256 test videos. Each video is pixels and has 30 frames.
Evaluation Protocol. Evaluating the results of image animation methods is a difficult task, since ground truth animations are not available. In addition, to the best of our knowledge, X2Face  is the only previous approach for data-driven model-free image animation. For these two reasons, we evaluate our method also on two closely related tasks. As proposed in , we first evaluate Monkey-Net on the task of video reconstruction. This consists in reconstructing the input video from a representation in which motion and content are decoupled. This task is a “proxy” task to image animation and it is only introduced for the purpose of quantitative comparison. In our case, we combine the extracted keypoints of each frame and the first frame of the video to re-generate the input video. Second, we evaluate our approach on the problem of Image-to-Video translation. Introduced in , this problem consists of generating a video from its first frame. Since our model is not directly designed for this task, we train a small recurrent neural network that predicts, from the keypoint coordinates in the first frame, the sequence of keypoint coordinates for the other 32 frames. Additional details can be found in the Supplementary Material A. Finally, we evaluate our model on image animation. In all experiments we use K=10.
Table: Video reconstruction comparisons.
Metrics. In our experiments, we adopt several metrics in order to provide an in-depth comparison with other methods. We employ the following metrics.
$\mathcal{L}_1$. In the case of the video reconstruction task where the ground truth video is available, we compare the average $\mathcal{L}_1$ distance between pixel values of the ground truth and the generated video frames.
AKD. For the Tai-Chi and Nemo datasets, we employ external keypoint detectors in order to evaluate whether the motion of the generated video matches the ground truth video motion. For the Tai-Chi dataset, we employ the human-pose estimator in . For the Nemo dataset we use the facial landmark detector of . We compute these keypoints for each frame of the ground truth and the generated videos. From these externally computed keypoints, we deduce the Average Keypoint Distance (AKD), i.e. the average distance between the detected keypoints of the ground truth and the generated video.
MKR. In the case of the Tai-Chi dataset, the human-pose estimator returns also a binary label for each keypoint indicating whether the keypoints were successfully detected. Therefore, we also report the Missing Keypoint Rate (MKR) that is the percentage of keypoints that are detected in the ground truth frame but not in the generated one. This metric evaluates the appearance quality of each video frame.
AED. We compute the feature-based metric employed in  that consists in computing the Average Euclidean Distance (AED) between a feature representation of the ground truth and the generated video frames. The feature embedding is chosen such that the metric evaluates how well the identity is preserved. More precisely, we use a network trained for facial identification  for Nemo and a network trained for person re-id  for Tai-Chi.
FID. When dealing with Image-to-video translation, we complete our evaluation with the Frechet Inception Distance  (FID) in order to evaluate the quality of individual frames.
Furthermore, we conduct a user study for both the Image-to-Video translation and the image animation tasks (see Sec. 4.3).
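As an illustration of the two keypoint-based metrics above, the following sketch computes AKD and MKR from the outputs of an external detector. It rests on assumptions not stated in the text: undetected keypoints are represented as `None`, AKD is averaged only over keypoints detected in both videos, MKR is returned as a fraction rather than a percentage, and the function name `akd_and_mkr` is hypothetical.

```python
import math

def akd_and_mkr(gt_frames, gen_frames):
    """Average Keypoint Distance and Missing Keypoint Rate over a video pair.

    Each argument is a list of frames; each frame is a list of keypoints given
    as (x, y) tuples, or None when the external detector found nothing.
    """
    dist_sum, n_pairs, n_missing, n_gt = 0.0, 0, 0, 0
    for gt_kps, gen_kps in zip(gt_frames, gen_frames):
        for p, q in zip(gt_kps, gen_kps):
            if p is None:
                continue  # not detected in the ground truth: ignored by both metrics
            n_gt += 1
            if q is None:
                n_missing += 1  # detected in the ground truth but not in the generated frame
                continue
            dist_sum += math.hypot(p[0] - q[0], p[1] - q[1])
            n_pairs += 1
    akd = dist_sum / n_pairs if n_pairs else float("nan")
    mkr = n_missing / n_gt if n_gt else float("nan")
    return akd, mkr

gt = [[(0.0, 0.0), (3.0, 4.0), None]]
gen = [[(3.0, 4.0), None, (1.0, 1.0)]]
akd, mkr = akd_and_mkr(gt, gen)
```

In this toy frame, the first keypoint contributes a distance of 5 pixels, the second counts as missing, and the third is ignored because the detector did not find it in the ground truth.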
4.1 Ablation Study
In this section, we present an ablation study to empirically measure the impact of each part of our proposal on the performance. First, we describe the methods obtained by “amputating” key parts of the model described in Sec. 3.1: (i) No $\mathcal{F}$ - the dense optical flow network $M$ is not used; (ii) No $\mathcal{F}_{coarse}$ - in the optical flow network $M$, we do not use the part-based approach; (iii) No $\mathcal{F}_{residual}$ - in the optical flow network $M$, we do not use $\mathcal{F}_{residual}$; (iv) No $\Sigma_k$ - we do not estimate the covariance matrices $\Sigma_k$ in the keypoint detector $\Delta$ and the variance is set to a fixed value as in ; (v) the source image is not given to the motion network $M$, so that $M$ estimates the dense optical flow only from the keypoint location differences; (vi) Full denotes the full model as described in Sec. 3.
Table: Video reconstruction ablation study on Tai-Chi.
In Tab. 4.1, we report the quantitative evaluation. We first observe that our full model outperforms the baseline method without deformation. This trend is observed according to all the metrics. This illustrates the benefit of deforming the feature maps according to the estimated motion. Moreover, we note that No $\mathcal{F}_{coarse}$ and No $\mathcal{F}_{residual}$ both perform worse than when using the full optical flow network. This illustrates that $\mathcal{F}_{coarse}$ and $\mathcal{F}_{residual}$ alone are not able to estimate dense motion accurately. A possible explanation is that $\mathcal{F}_{coarse}$ cannot estimate non-rigid motions and that $\mathcal{F}_{residual}$, on the other hand, fails in predicting the optical flow in the presence of large motion. The qualitative results shown in Fig. 4 confirm this analysis. Furthermore, we observe a drop in performance when covariance matrices are replaced with static diagonal matrices. This shows the benefit of encoding more information when dealing with videos with complex and large motion, as in the case of the Tai-Chi dataset. Finally, we observe that if the appearance is not provided to the deformation network $M$, the video reconstruction performance is slightly lower.
4.2 Comparison with Previous Works
Video Reconstruction. First, we compare our results with the X2Face model  that is closely related to our proposal. Note that this comparison can be done since we employ image and motion representations of similar dimension. In our case, each video frame is reconstructed from the source image and 10 landmarks, each one represented by 5 numbers (two for the location and three for the symmetric covariance matrix), leading to a motion representation of dimension 50. For X2Face, motion is encoded into a driving vector of dimension 128. The quantitative comparison is reported in Tab. 4. Our approach outperforms X2Face according to all the metrics and on all the evaluated datasets. This confirms that encoding motion via motion-specific keypoints leads to a compact but rich representation.
Image-to-Video Translation. In Tab. 1 we compare with the state-of-the-art Image-to-Video translation methods: two unsupervised methods, MoCoGAN  and SV2P , and CMM-Net, which is based on keypoints . CMM-Net is evaluated only on Nemo since it requires facial landmarks. We report results for SV2P on the Bair dataset as in . We can observe that our method clearly outperforms the three methods for all the metrics. This quantitative evaluation is confirmed by the qualitative evaluation presented in the Supplementary Material C.3. In the case of MoCoGAN, we observe that the AED score is much higher than for the two other methods. Since AED measures how well the identity is preserved, these results confirm that, despite the realism of the videos generated by MoCoGAN, the identity and the person-specific details are not well preserved. A possible explanation is that MoCoGAN is based on a feature embedding in a vector, which does not capture spatial information as well as the keypoints. The method in  initially produces a realistic video and preserves the identity, but the lower performance can be explained by the appearance of visual artifacts in the presence of large motion (see the Supplementary Material C.3 for visual examples). Conversely, our method both preserves the person identity and performs well even under large spatial deformations.
Image Animation. In Fig. 5, we compare our method with X2Face  on the Nemo dataset. We note that our method generates more realistic smiles on the three randomly selected samples despite the fact that the X2Face model is specifically designed for faces. Moreover, the benefit of transferring the relative motion over absolute locations can be clearly observed in Fig. 5 (column 2). When absolute locations are transferred, the source image inherits the face proportion from the driving video, resulting in a face with larger cheeks. In Fig. 6, we compare our method with X2Face on the Tai-Chi dataset. X2Face  fails to consider each body part independently and, consequently, warps the body in such a way that its center of mass matches the center of mass in the driving video. Conversely, our method successfully generates plausible motion sequences that match the driving videos. Concerning the Bair dataset, exemplar videos are shown in the Supplementary Material C.3. The results are well in line with those obtained on the two other datasets.
Table: User study results on image animation. Proportion of times our approach is preferred over X2Face .
4.3 User Evaluation
In order to further consolidate the quantitative and qualitative evaluations, we performed user studies for both the Image-to-Video translation (see the Supplementary Material C.3) and the image animation problems using Amazon Mechanical Turk.
For the image animation problem, our model is again compared with X2Face  according to the following protocol: we randomly select 50 pairs of videos where objects in the first frame have a similar pose. Three videos are shown to the user: one is the driving video (reference) and the other two are the videos generated by our method and by X2Face. The users are given the following instructions: Select the video that better corresponds to the animation in the reference video. We collected annotations for each video from 10 different users. The results are presented in Tab. 4.2. Our generated videos are preferred over X2Face videos in roughly 80% of cases or more for all the datasets. Again, we observe that the preference toward our approach is higher on the two datasets which correspond to large motion patterns.
We introduced a novel deep learning approach for image animation. Via the use of motion-specific keypoints, previously learned following a self-supervised approach, our model can animate images of arbitrary objects according to the motion given by a driving video. Our experiments, considering both automatically computed metrics and human judgments, demonstrate that the proposed method outperforms previous work on unsupervised image animation. Moreover, we show that with little adaptation our method can perform Image-to-Video translation. In future work, we plan to extend our framework to handle multiple objects and investigate other strategies for motion embedding.
This work was carried out under the “Vision and Learning joint Laboratory” between FBK and UNITN.
- (2016) OpenFace: a general-purpose face recognition.
- (2017) Stochastic variational video prediction. In ICLR.
- (2018) Synthesizing images of humans in unseen poses. In CVPR.
- (2018) Recycle-gan: unsupervised video retargeting. In ECCV.
- (1999) A morphable model for the synthesis of 3d faces. In SIGGRAPH.
- (2017) How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV.
- (2014) Displaced dynamic expression regression for real-time facial tracking and animation. TOG.
- (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.
- (2018) Everybody dance now. In ECCV.
- (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS.
- (2012) Are you really smiling at me? spontaneous versus posed enjoyment smiles. In ECCV.
- (2017) Self-supervised visual planning with temporal skip connections. In CoRL.
- (2018) A variational u-net for conditional appearance and shape generation. In CVPR.
- (2016) Unsupervised learning for physical interaction through video prediction. In NIPS.
- (2016) Deepwarp: photorealistic image resynthesis for gaze manipulation. In ECCV.
- (2019) 3D guided fine-grained face manipulation. In CVPR.
- (2014) Generative adversarial nets. In NIPS.
- (2016) Deep residual learning for image recognition. In CVPR.
- (2017) In defense of the triplet loss for person re-identification. arXiv:1703.07737.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML.
- (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
- (2018) Unsupervised learning of object landmarks through conditional image generation. In NIPS.
- (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV.
- (2018) Generating a fusion image: one’s identity and another’s shape. In CVPR.
- (2016) Video pixel networks. In ICML.
- (2014) Adam: a method for stochastic optimization. In ICLR.
- (2014) Auto-encoding variational bayes. In ICLR.
- Least squares generative adversarial networks. In ICCV.
- (2015) Action-conditional video prediction using deep networks in atari games. In NIPS.
- (2019) Laplace landmark localization. arXiv:1903.11633.
- (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI.
- (2018) Semantic-fusion gans for semi-supervised satellite image classification. In ICIP.
- (2017) Temporal generative adversarial nets with singular value clipping. In ICCV.
- (2018) Deformable gans for pose-based human image generation. In CVPR.
- (2019) Whitening and coloring transform for GANs. In ICLR.
- (2018) Enhancing perceptual attributes with bayesian style generation. In ACCV.
- (2015) Unsupervised learning of video representations using lstms. In ICML.
- (2018) GestureGAN for hand gesture-to-gesture translation in the wild. In ACM MM.
- (2019) Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In CVPR.
- (2019) Dual generator generative adversarial networks for multi-domain image-to-image translation. In ACCV.
- (2016) Face2face: real-time face capture and reenactment of rgb videos. In CVPR.
- (2018) Mocogan: decomposing motion and content for video generation. In CVPR.
- (2017) Transformation-based models of video sequences. arXiv:1701.08435.
- (2017) Learning to generate long-term future via hierarchical prediction. In ICML.
- (2016) Generating videos with scene dynamics. In NIPS.
- (2018) Video-to-video synthesis. In NIPS.
- (2017) High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR.
- (2018) Every smile is unique: landmark-guided diverse smile generation. In CVPR.
- (2018) X2Face: a network for controlling face generation using images, audio, and pose codes. In ECCV.
- (2018) Unsupervised discovery of object landmarks as structural representations. In CVPR.
- Learning to forecast and refine residual motion for image-to-video generation. In ECCV.
- State of the art on monocular 3d face reconstruction, tracking, and applications. In Computer Graphics Forum.
In this supplementary material, we provide implementation details (Sec. A), introduce a new dataset (Sec. B) and report additional experimental results (Sec. C). Additionally we provide a video file with further qualitative examples.
A Implementation details
As described in Sec. 3, each module employs a U-Net architecture. We use the exact same architecture for all the networks. More specifically, each block of each encoder consists of a convolution, batch normalization, ReLU and average pooling. The first convolution layer of each encoder has 32 filters and each subsequent convolution doubles the number of filters. Each encoder is composed of a total of 5 blocks. The decoder blocks have a similar structure: convolution, batch normalization and ReLU followed by nearest-neighbour up-sampling. The first block of the decoder has 512 filters. Each subsequent block reduces the number of filters by a factor of 2.
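The filter counts implied by this description can be written out explicitly. The sketch below is only an illustration of the arithmetic: the function names are hypothetical, and the number of decoder blocks (5, mirroring the encoder) is an assumption, since the text states only the first decoder width and the halving rule.

```python
def encoder_filters(num_blocks=5, first_filters=32):
    """Filters per encoder block: start at 32 and double at every block."""
    return [first_filters * 2 ** i for i in range(num_blocks)]


def decoder_filters(num_blocks=5, first_filters=512):
    """Filters per decoder block: start at 512 and halve at every block.

    The default of 5 blocks is an assumption (symmetry with the encoder).
    """
    return [first_filters // 2 ** i for i in range(num_blocks)]


# Layer order inside each block, as listed in the text.
ENCODER_BLOCK = ["convolution", "batch normalization", "ReLU", "average pooling"]
DECODER_BLOCK = ["convolution", "batch normalization", "ReLU",
                 "nearest-neighbour up-sampling"]

enc = encoder_filters()  # [32, 64, 128, 256, 512]
dec = decoder_filters()  # [512, 256, 128, 64, 32]
```

Under these assumptions the last encoder block and the first decoder block both have 512 filters, which is consistent with the figures quoted above.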
As described in Sec. 3.2, the keypoint detector produces heatmaps followed by a softmax. In particular, we employ softmax activations with a temperature of 0.1. The low temperature yields sharper heatmaps and avoids uniform heatmaps, which would lead to keypoints constantly located in the image center.
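The effect of the temperature can be checked on a toy heatmap. The following sketch is an illustration, not the released implementation: the helper names are hypothetical, and reading out a keypoint as the expected coordinate of the normalized heatmap is an assumption made here to show why a uniform heatmap collapses to the image center.

```python
import math


def spatial_softmax(scores, temperature=0.1):
    """Softmax over all cells of a 2D score map (list of rows)."""
    flat = [s / temperature for row in scores for s in row]
    peak = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    norm = sum(exps)
    width = len(scores[0])
    probs = [e / norm for e in exps]
    return [probs[r * width:(r + 1) * width] for r in range(len(scores))]


def expected_location(heatmap):
    """Mean (x, y) of a normalized heatmap, coordinates scaled to [-1, 1]."""
    h, w = len(heatmap), len(heatmap[0])
    x = sum(p * (2 * j / (w - 1) - 1) for row in heatmap for j, p in enumerate(row))
    y = sum(p * (2 * i / (h - 1) - 1) for i, row in enumerate(heatmap) for p in row)
    return x, y


# Toy 3x3 score map with a single peak in the top-right corner.
scores = [[0.0, 0.0, 1.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
sharp = spatial_softmax(scores, temperature=0.1)
soft = spatial_softmax(scores, temperature=1.0)
uniform = spatial_softmax([[0.0] * 3 for _ in range(3)])
```

With temperature 0.1 almost all the mass sits on the peak cell, so the expected location is close to the corner; with temperature 1.0 the mass is spread out and the expected location drifts toward the center, which is exactly where a uniform heatmap places it.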
For the Motion Transfer Network, we employ 4 additional Residual Blocks in order to remove possible warping artifacts. The output of this network is a 3-channel feature map followed by a sigmoid. We use the discriminator architecture described in .
The framework is trained for a dataset-dependent number of epochs: 250, 500 and 10 for Tai-Chi, Nemo and Bair respectively. Each epoch involves training the network on 2 randomly sampled frames from each training video. We use the Adam optimizer with learning rate 2e-4, and then continue training with learning rate 2e-5 for a further set of epochs.
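The schedule described here can be summarized in a few lines. This is a hedged sketch rather than the authors' training script: the helper names are hypothetical, sampling the two frames without replacement is an assumption, and the length of the second, lower-learning-rate phase is left as an argument because it is not given in this text.

```python
import random

# Epoch counts per dataset, as stated in the text.
EPOCHS = {"Tai-Chi": 250, "Nemo": 500, "Bair": 10}


def sample_epoch(video_lengths, frames_per_video=2, rng=random):
    """Pick frame indices for one epoch: 2 random frames from each video.

    Sampling without replacement is an assumption; every video must contain
    at least `frames_per_video` frames.
    """
    return [rng.sample(range(n), frames_per_video) for n in video_lengths]


def learning_rate(epoch, first_phase_epochs):
    """2e-4 during the first phase, 2e-5 afterwards."""
    return 2e-4 if epoch < first_phase_epochs else 2e-5


rng = random.Random(0)
frame_pairs = sample_epoch([5, 10, 100], rng=rng)
```

For example, `learning_rate(0, EPOCHS["Tai-Chi"])` returns 2e-4, while any epoch index at or beyond 250 returns 2e-5.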
As explained in Sec. 4.2, for Image-to-Video translation, we employ a single-layer GRU network in order to predict the keypoint sequence used to generate the video. This recurrent network  has 1024 hidden units and is trained via minimization.
B MGif dataset
We collected an additional dataset of videos containing movements of different cartoon animals. Each video is a moving gif file; we therefore call this new dataset MGif. The dataset consists of 1000 videos, of which we used 900 for training and 100 for evaluation. Each video contains from 5 to 100 frames. The dataset is particularly challenging because of the high appearance variation and motion diversity. Note that in the experiments on this dataset, we use absolute keypoint locations from the driving video instead of the relative keypoint motion detailed in Sec. 3.6.
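The difference between the two transfer modes mentioned above can be illustrated on toy keypoints. The sketch below is an assumption-laden illustration: the function names are hypothetical, and relative transfer is taken to mean adding the driving displacement with respect to the first driving frame to the source keypoints, which is our reading of "relative keypoint motion" rather than a formula quoted from this section.

```python
def transfer_absolute(source_kp, driving_kp_first, driving_kp_t):
    """Absolute mode: use the driving keypoints of frame t as they are."""
    return [tuple(kp) for kp in driving_kp_t]


def transfer_relative(source_kp, driving_kp_first, driving_kp_t):
    """Relative mode (assumed form): source + (driving_t - driving_first)."""
    return [
        (sx + (tx - fx), sy + (ty - fy))
        for (sx, sy), (fx, fy), (tx, ty)
        in zip(source_kp, driving_kp_first, driving_kp_t)
    ]


# Toy keypoints (x, y): the driving object has different proportions
# from the source object.
source = [(0.0, 0.0), (0.5, 0.5)]
driving_first = [(0.25, 0.0), (0.75, 0.5)]
driving_t = [(0.25, 0.25), (0.75, 0.25)]

relative = transfer_relative(source, driving_first, driving_t)
absolute = transfer_absolute(source, driving_first, driving_t)
```

In relative mode the first keypoint moves from (0.0, 0.0) to (0.0, 0.25), keeping the source geometry and borrowing only the displacement; in absolute mode it jumps to the driving location (0.25, 0.25), so the generated object inherits the driving object's proportions.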
C Additional experimental results
In this section, we report additional results. In Sec. C.1 we visually motivate our alignment assumption, in Sec. C.2 we complete the ablation study and, in Secs. C.3 and C.4, we report qualitative results for both the image-to-video and image animation problems. Finally, in Sec. C.5, we visualize the keypoints predicted by our self-supervised approach.
C.1 Explanation of alignment assumption
Our approach assumes that the object in the first frame of the driving video and the object in the source image are in similar poses. This assumption is made to avoid situations of meaningless motion transfer, as shown in Fig. 7. In the first row, the driving video shows the action of closing the mouth. Since the mouth of the subject in the source image is already closed, the mouth disappears in the generated video. Similarly, in the second row, the motion in the driving video shows a mouth-opening sequence. Since the mouth is already open in the source image, motion transfer leads to unnaturally large teeth. In the third row, the man is asked to raise a hand that has already been raised.
C.2 Additional ablation study
We perform experiments to measure the impact of the number of keypoints on video reconstruction quality. We report results on the Tai-Chi dataset in Fig. 8. We computed the reconstruction error and AKD metrics as described in the paper. As expected, increasing the number of keypoints leads to a lower reconstruction error, but additional keypoints introduce memory and computational overhead. We use 10 keypoints in all our experiments, since we consider this to be a good trade-off.
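For readers who want a concrete picture of a keypoint-distance metric such as AKD, a minimal sketch follows. The exact definition is given in the main paper and is not restated in this section, so the choices below (Euclidean distance, averaged over all frames and keypoints, with keypoints supplied by some external detector) are assumptions, and the function name is hypothetical.

```python
import math


def average_keypoint_distance(reference_kps, generated_kps):
    """Mean Euclidean distance between corresponding keypoints.

    Both arguments are lists of frames; each frame is a list of (x, y)
    keypoints in the same order. Keypoint extraction itself is out of scope.
    """
    distances = [
        math.hypot(rx - gx, ry - gy)
        for ref_frame, gen_frame in zip(reference_kps, generated_kps)
        for (rx, ry), (gx, gy) in zip(ref_frame, gen_frame)
    ]
    if not distances:
        raise ValueError("no keypoints to compare")
    return sum(distances) / len(distances)


# Toy example: one frame, two keypoints, distances 0 and 5.
reference = [[(0.0, 0.0), (3.0, 4.0)]]
generated = [[(0.0, 0.0), (0.0, 0.0)]]
akd = average_keypoint_distance(reference, generated)
```

A lower value means the keypoints detected in the generated video lie closer to those of the reference video.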
C.3 Image-to-Video translation
As explained in Sec. 4.2 of the main paper, we compare with three state-of-the-art methods for Image-to-Video translation: MoCoGAN, SV2P and CMM-Net. CMM-Net is evaluated only on Nemo and SV2P only on the Bair dataset. We report a user study and qualitative results.
User Study. We perform a user study for the image-to-video translation problem. As explained in Sec. 4.3, we perform pairwise comparisons between our method and the competing methods. We employ the following protocol: we randomly select 50 videos and use the first frame of each video as the reference frame to generate new videos. For each of the 50 videos, the initial frame and two generated videos, one from our method and one from a competing method, are shown to the user. We provide the following instruction: "Select a more realistic animation of the reference image". As in Sec. 4.2 of the main paper, our method is compared with MoCoGAN, SV2P and CMM-Net. The results of the user study are presented in Table 2. On average, users preferred the videos generated by our approach over those generated by other methods. The preference gap is especially evident for the Tai-Chi and Bair datasets, which contain a higher amount of large motion. This supports the ability of our approach to handle driving videos with large motion.
Qualitative results. We report additional qualitative results in Figs. 9, 10 and 11. These qualitative results further support the ability of our method to generate realistic videos from source images and driving sequences.
In particular, for the Nemo dataset (Fig. 10), MoCoGAN and CMM-Net suffer from more artifacts. In addition, the videos generated by MoCoGAN do not preserve the identity of the person. This issue is particularly visible when comparing the first and the last frames of the generated video. CMM-Net preserves the identity better but fails to generate realistic eyes and teeth. In contrast to these works, our method generates realistic smiles while preserving the person's identity.
For Tai-Chi (Fig. 9), MoCoGAN produces videos where some parts of the human body are not clearly visible (see rows 3, 4 and 6). This is again due to the fact that visual information is embedded in a vector. Conversely, our method generates a realistic human body with richer details.
For Bair (Fig. 11), SV2P completely fails to produce videos where the robotic arm is sharp. The generated videos are blurred. MoCoGAN generates videos with more details but containing many artifacts. In addition, the backgrounds generated by MoCoGAN are not temporally coherent. Our method generates a realistic robotic arm moving in front of detailed and temporally coherent backgrounds.
C.4 Image animation
When tested using the Nemo dataset (Fig. 12), our method generates more realistic smiles on most of the randomly selected samples, despite the fact that the X2Face model is specifically designed for faces. As in the main paper, the benefit of transferring relative motion rather than absolute locations can be clearly observed in the bottom right example, where the video generated by X2Face inherits the large cheeks of the young boy in the driving video.
For Tai-Chi (Fig. 13), X2Face is not able to handle the motion of the driving video and simply warps the human body in the source image as a single blob.
For Bair (Fig. 14), we observe a similar behavior. X2Face generates unrealistic videos where the robotic arm is generally not distinguishable. On the contrary, our model is able to generate a realistic robotic arm moving according to the driving video motion.
Finally, in Fig. 15, we report results on the MGif dataset. First, these examples illustrate the high diversity of the MGif dataset. Second, we observe that our model is able to transfer the motion of the driving video even if the appearance of the source frame is very different from the driving video. In particular, in all the generated sequences, we observe that the legs are correctly generated and follow the motion of the driving video. The model preserves the rigid parts of the animals, such as the abdomen. In the last row, we see that the model is also able to animate the fox tail according to the motion of the cheetah tail.
C.5 Keypoint visualization
Finally, we report visual examples of keypoints learned by our model in Figs. 16, 17, 18 and 19. On the Nemo dataset, we observe that the obtained keypoints are semantically consistent. For instance, the cyan and light green keypoints consistently correspond to the nose and the chin respectively. For Tai-Chi, the keypoints are also semantically consistent: for instance, light green for the chin and yellow for the left-side arm (right arm in frontal views and left arm in back views). For the Bair dataset, we observe that two keypoints (light green and dark blue) correspond to the robotic arm. The other keypoints are static and can correspond to the background. Finally, concerning the MGif dataset, we observe that each keypoint corresponds to two different animal parts depending on whether the animal is going left or right. In the case of animals going right (last three rows), the keypoints are semantically consistent (red for the tail, dark blue for the head, etc.). Similarly, the keypoints are semantically consistent among images of animals going left (red for the head, dark blue for the tail, etc.). Importantly, we observe that a keypoint is associated with each highly moving part, such as legs and tails.