Neural Voice Puppetry: Audio-driven Facial Reenactment
We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches. Neural Voice Puppetry has a variety of use-cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples. Our method is not only more general than existing works since we are generic to the input person, but we also show superior visual and lip sync quality compared to photo-realistic audio- and video-driven reenactment techniques.
We highly recommend watching the supplemental video.
In recent years, speech-based interaction with computers has made significant progress. Digital voice assistants are now ubiquitous due to their integration into many commodity devices such as smartphones, TVs, and cars; companies also increasingly use machine learning techniques to drive service bots that interact with their customers. These virtual agents aim for a user-friendly man-machine interface while keeping maintenance costs low. However, a significant challenge is to appeal to humans by delivering information through a medium that is most comfortable to them. While speech-based interaction is already very successful, as shown by virtual assistants like Siri, Alexa, and Google, the visual counterpart is largely missing. This comes as no surprise given that a user would also like to associate the visuals of a face with the generated audio, similar to the ideas behind video conferencing. In fact, the level of engagement for audio-visual interactions is higher than for purely audio ones [9, 25].
The aim of this work is to provide the missing visual channel by introducing Neural Voice Puppetry, a photo-realistic facial animation method that can be used in the scenario of a visual digital assistant. To this end, we build on recent advances in text-to-speech synthesis [32, 15], which can provide a synthetic audio stream from text generated by a digital agent. As visual basis, we leverage a short target video of a real person. The key component of our method is to estimate lip motions that fit the input audio and to render the appearance of the target person in a convincing way. This mapping from audio to visual output is trained using the ground truth information that we can gather from a target video (aligned real audio and image data). We designed Neural Voice Puppetry to be an easy-to-use audio-to-video translation tool that requires neither vast amounts of video footage of a single target nor any manual user input. In our experiments, the target videos are comparably short (2-3 min), thus allowing us to work on a large amount of video footage that can be downloaded from the Internet. To enable this easy applicability to new videos, we generalize specific parts of our pipeline. Specifically, we compute a latent expression space that is generalized among multiple persons (in our experiments ). This also ensures that we can handle different audio inputs. Besides generating the visual appearance of a digital agent, our method can also be used for audio-based facial reenactment. Facial reenactment is the process of re-animating a target video in a photo-realistic manner with the expressions of a source actor. In recent years, facial reenactment has witnessed a growing interest from the research community [29, 36].
This enables a variety of applications, ranging from consumer-level teleconferencing through photo-realistic virtual avatars [27, 28, 22] to movie production applications such as video dubbing [11, 18]. Recently, several authors started to exploit the audio signal for facial reenactment [24, 3, 33]. This has the potential of avoiding failures of visual-based approaches when the visual signal is not reliable, e.g., due to an occluded face, noise, or distorted views. Many of these approaches, however, lack video-realism [3, 33]. An exception is the work of Suwajanakorn et al., who have shown that photo-realistic videos of President Obama can be synthesized just from the audio signal. This approach, however, requires very large quantities of training data (17 hours of President Obama’s weekly speeches), which limits its application and generalization to other identities.
To summarize, we propose a generalized audio-driven facial animation approach that
can be trained on ’in-the-wild’ portrait videos (2-3 min per target video).
includes a representation of person-specific talking styles (i.e., we preserve the talking style of the target video).
can be driven by synthetic voices generated from text-to-speech approaches, thus, enabling the transfer from text to facial animations without the need of video-text annotations.
is able to render photo-realistic video content of a target actor that is in sync with the speech using a novel neural rendering technique.
2 Related Work
Neural Voice Puppetry is a facial reenactment approach based only on audio input. In the literature, there are many video-based facial reenactment systems that enable dubbing and other general facial expression manipulation. Our focus in this related work section lies on the audio-based methods. These methods can be organized in facial animation and facial reenactment. Facial animation concentrates on the prediction of expressions that can be applied to a predefined avatar. In contrast, audio-driven facial reenactment aims to generate photo-realistic videos of an existing person including all idiosyncrasies from an audio signal. In the following, we will discuss these and related fields.
Video-Driven Facial Reenactment
Recently, several works have been proposed for video-driven facial reenactment, which are also covered by the state-of-the-art report of Zollhöfer et al. A source and a target face are first reconstructed using a parametric face model. The target face is reenacted by replacing its expression parameters with those of the source face. Thies et al. use a static skin texture and a data-driven approach to synthesize the mouth interior. In Deep Video Portraits, a generative adversarial network is used to produce photo-realistic skin texture that can handle skin deformations, with synthetic renderings as input. Recently, Thies et al. proposed neural textures, high-dimensional feature maps that are learned during scene capture and accessed through a UV lookup. A deferred neural renderer refines the reconstruction. Results show that neural textures can generate high-quality facial reenactments. For instance, they produce higher-fidelity mouth interiors with fewer artifacts. Kim et al. analyzed the notion of style for facial expressions and showed its importance for dubbing. They define the style as a person-specific temporal global signature that can be related to the speaker’s facial geometry and personality, e.g., how a person speaks or smiles. An audio-visual reenactment technique with a focus on dubbing has been proposed by Garrido et al. The dubbing language track is used to force lip closure by detecting the bilabial consonants /m/, /p/, and /b/. The synthesized face is rendered using the estimated target lighting and skin reflectance.
Audio-Driven Facial Animation
Audio-driven facial animation is the field of generating animations for predefined 3D facial avatars from audio inputs. These methods do not focus on photo-realistic results, but on the prediction of facial motions. There is a variety of proposed techniques in the literature; we focus on the most relevant publications.
Karras et al. drive a 3D facial animation using an LSTM that maps input waveforms to the 3D vertex coordinates of a face mesh, also considering the emotional state of the person. In contrast to our method, it needs high-quality 3D reconstructions for supervised training and does not render photo-realistic output. Taylor et al. presented a technique to animate different avatars by any input speaker. To handle different input speakers, the audio signal is first converted into a phoneme transcript using off-the-shelf speech recognition. A deep neural network then maps the phonemes to the parameters of a reference face model. The network is trained on data collected for only one person speaking for 8 hours. They show animations of different synthetic avatars using deformation retargeting. VOCA is an end-to-end deep neural network for speech-to-animation translation trained on multiple subjects. Similar to our approach, a low-dimensional audio embedding based on features of the DeepSpeech network is used. From this embedding, VOCA regresses 3D vertices on a FLAME face model, conditioned on a subject label. In contrast to our method, it requires high-quality 4D scans recorded in a studio setup. Our approach works on ’in-the-wild’ videos, with a focus on temporally coherent predictions and photo-realistic renderings. Tzirakis et al. presented Deep Canonical Attentional Warping (DCAW), which is trained to map the audio signal to expression blendshape parameters. Trained on the Lip Reading Words (LRW) dataset, the DCAW network learns to warp the words of an input video to the words of the LRW dataset. Results show the ability of the approach to generalize to different speakers.
Audio-Driven Facial Reenactment
The most relevant literature is in the field of audio-driven facial reenactment, which has the goal of generating photo-realistic videos that are in sync with the input audio stream. A number of techniques are available for audio-driven facial reenactment [2, 8, 24, 3, 33]. Suwajanakorn et al. use an audio stream from President Barack Obama to synthesize a high-quality video of him speaking. A recurrent neural network is trained on many hours of his speech to learn the mouth shape from the audio. The mouth is then composited with proper 3D matching to reanimate an original video in a photo-realistic manner. Because of the large amount of training data required (17 h), it is not readily applicable to other target actors. In contrast, our approach only needs a 2-3 min long video of a target sequence. Chung et al. present a technique that animates the mouth of a still, normalized image to follow an audio speech. First, the image and the audio are projected into a latent space through a deep encoder. A decoder then utilizes the joint embedding of the face and audio to synthesize the talking head. The technique is trained on tens of hours of data in an unsupervised manner. In contrast to our method, it is a pure 2D image-based method. Another 2D image-based method has been presented by Vougioukas et al. They use a temporal GAN to produce a video of a talking face given a still image and an audio signal as input. The generator feeds the still image and the audio to an encoder-decoder architecture with an RNN to better capture temporal relations. It uses discriminators that work on a per-frame and on a sequence level to improve temporal coherence. As conditioning, it also takes the audio signal as input to enforce that the synthesized mouth is in sync with the audio. In  a dedicated mouth-audio sync discriminator is used to improve the results.
Text-Based Video Editing
Fried et al.  presented a technique for text-based editing of videos. Their approach allows overwriting existing video segments with new texts in a seamless manner. A face model  is registered to the examined video and a viseme search finds video segments with similar mouth movements to the editing text. The corresponding face parameters of the matching video segment are blended with the original sequence parameters based on a heuristic, followed by a deep renderer to synthesize photo-realistic results. The method is person specific and requires a one hour long training sequence of the target actor and, thus, is not applicable to short videos from the Internet. The viseme search is slow (min for three words) and does not allow for interactive results.
Neural Voice Puppetry consists of two main parts (see Fig. 1). The first is a generalized network that predicts a latent expression vector, thus spanning an ’audio-expression’ space. This audio-expression space is shared among all persons and allows for reenactment, i.e., transferring the predicted motions from one person to another. The audio expressions are interpreted as blendshape coefficients of a 3D face model rig. This face model rig is person-specific and is optimized in the second part of our pipeline. This specialized stage captures the idiosyncrasies of a target person, including the facial motion and appearance. It is trained on a short video sequence of minutes (in comparison to the hours that are required by state-of-the-art methods). The facial motions are represented as delta-blendshapes, which we constrain to be in the subspace of a generic face template [1, 29]. A neural texture in conjunction with a deferred neural renderer is used to store the appearance of the face of an individual person. In the following, we first focus on the data required to train the generalized and the specialized components. An advantage of our method is that we do not need a studio setup to retrieve the data, but can use short video clips downloaded from the Internet. Based on this data, we train a generalized, temporally coherent audio-expression estimation network that is used to drive person-specific video avatars.
Learning-based approaches heavily rely on the data they are trained on. In contrast to previous model-based methods, Neural Voice Puppetry is based on ’in-the-wild’ videos that can be downloaded from the Internet. In particular, we only require RGB videos, without the need for complex capturing setups or specific lighting. The videos have to be synced with the audio stream, such that we can extract ground truth pairs of audio features and image content. In our experiments the videos have a resolution of with .
Training Corpus for the Audio2ExpressionNet
Fig. 2 shows an overview of our video training corpus that is used for the training of the small network that predicts the ’audio expressions’ from the input audio features (see Sec. 5.1). The dataset consists of videos with an average length of (in total frames). We selected the training corpus, such that the persons are in a neutral mood (commentators of the German public TV).
For a new target sequence, we extract the person-specific talking style in the sequence. I.e., we compute a mapping from the generalized audio expression space to the actual facial movements of the target actor (see Sec. 5.3). The target sequences have a length of and, thus, are easy to obtain from the Internet. The video data is also used to train a person-specific rendering network.
We preprocess the input data to extract face tracking information as well as audio features. The preprocessing is done automatically; no manual interaction is required.
3D Face Tracking:
Our method uses a statistical face model and delta-blendshapes [1, 29] to represent a 3D latent space for modelling facial animation. The 3D face model space reduces the face space to only a few hundred parameters ( for shape, for albedo and for expressions) and stays fixed in this work. Using the dense face tracking method of Thies et al., we estimate the model parameters for every frame of a sequence. Note that the shape and albedo parameters are shared between all frames of one sequence. During tracking, we extract the per-frame expression parameters that are used to train the audio-to-expression network. To train the neural renderer, we also store the rasterized texture coordinates of the reconstructed face mesh.
The video contains a synced audio stream. We use the recurrent feature extractor of the pre-trained speech-to-text model DeepSpeech (v0.1.0). Similar to VOCA, we extract a window of character logits per video frame. Each window consists of time intervals of ms each, resulting in an audio feature of . The DeepSpeech model generalizes across thousands of different voices, as it is trained on Mozilla’s CommonVoice dataset.
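The per-frame windowing of audio features can be sketched as follows. This is a minimal illustration rather than the authors' code: the window length (16), the feature rate (50 logit vectors per second), the video frame rate (25 fps), the alphabet size (29), the edge padding by repetition, and the name `frame_windows` are all assumptions made for this example, and the DeepSpeech logits are replaced by dummy values.

```python
def frame_windows(logits, video_fps=25.0, logit_rate=50.0, window=16):
    """Return one window of logit rows per video frame, centered on the frame.

    logits: list of per-timestep logit vectors (one per audio feature step).
    Windows that reach past the sequence ends are padded by repeating the
    first or last row (an assumed padding choice).
    """
    if not logits:
        return []
    n_steps = len(logits)
    n_frames = int(n_steps / logit_rate * video_fps)
    half = window // 2
    windows = []
    for f in range(n_frames):
        # Index of the audio feature step that coincides with video frame f.
        center = int(round(f * logit_rate / video_fps))
        rows = []
        for i in range(center - half, center - half + window):
            clamped = min(max(i, 0), n_steps - 1)
            rows.append(logits[clamped])
        windows.append(rows)
    return windows


# Dummy "logits": 100 steps (2 s of audio at 50 steps/s), 29 values each.
dummy = [[float(t)] * 29 for t in range(100)]
wins = frame_windows(dummy)
print(len(wins), len(wins[0]), len(wins[0][0]))
```

Each entry of `wins` is a window-by-alphabet array that would serve as the input for one video frame.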
To enable photo-realistic facial reenactment based on audio signals, we employ a 3D face model as intermediate representation of facial motion. A key component of our pipeline is the audio-based expression estimation. Since every person has their own talking style and, thus, different expressions, we establish person-specific expression spaces that can be computed for every target sequence. To ensure generalization among multiple persons, we created a latent audio-expression space that is shared by all persons. From this audio-expression space, one can map to the person-specific expression space, enabling reenactment. Given the estimated expression and the extracted audio features, we apply a novel deferred neural rendering technique that generates the final output image.
Our method is designed to generate temporally smooth predictions of facial motions. To this end, we employ a deep neural network with two stages. First, we predict per-frame facial expressions. These expressions are potentially noisy; thus, we use an expression-aware temporal filtering network. Given the noisy per-frame predictions as input, the neural network predicts filter weights to compute smooth audio-expressions for a single frame. The per-frame network and the filtering network can be trained jointly and output audio-expression coefficients. This audio-expression space is shared among multiple persons and is interpreted as blendshape coefficients. Per person, we compute a blendshape basis which is in the subspace of our generic face model. The networks are trained with a loss that works on the vertex level of this face model.
Per-frame Audio-Expression Estimation Network
Since our goal is a generalized audio-based expression estimation, we rely on generalized audio features. We use the RNN part of the speech-to-text approach DeepSpeech to extract these features. These features represent the logits of the DeepSpeech alphabet for the audio signal. For each video frame, we extract a time window of features around the frame that consist of logits (length of the DeepSpeech alphabet is ). This tensor is the input to our per-frame estimation network. To map from this feature space to the unfiltered audio-expression space, we apply convolutional layers and fully connected layers. Specifically, we apply 2D convolutions with kernel dimensions and stride , thus filtering in the time dimension. The convolutional layers have a bias and are followed by a leaky ReLU (slope ). The feature dimensions are reduced successively from ,,, to . This reduced feature is the input to the fully connected layers, which have a bias and are also followed by a leaky ReLU (), except for the last layer. The fully connected layers map the features from the convolutional network to , then to and, finally, to the audio-expression space of dimension , where a TanH activation is applied.
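To illustrate how strided convolutions along the time axis collapse a feature window step by step, the sketch below applies the standard convolution output-length formula. The concrete numbers (a window of 16 steps, kernel size 3, stride 2, padding 1, four layers) are assumptions chosen for illustration and are not taken from the text above.

```python
def conv_out_len(n, kernel, stride, padding=0, dilation=1):
    """Output length of a convolution along one axis (standard formula)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


def time_lengths(n, num_layers, kernel=3, stride=2, padding=1):
    """Length of the time axis after each of num_layers strided convolutions."""
    lengths = [n]
    for _ in range(num_layers):
        n = conv_out_len(n, kernel, stride, padding)
        lengths.append(n)
    return lengths


# Assumed window of 16 time steps and four stride-2 layers.
lengths = time_lengths(16, num_layers=4)
print(lengths)
```

Under these assumptions the time axis is halved in every layer until a single feature vector per frame remains, which can then be passed to fully connected layers.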
Temporally Stable Audio-Expression Estimation
To generate temporally stable audio-expression predictions, we jointly learn a filtering network that gets per-frame estimates as input (see Fig. 3). Specifically, we estimate the audio-expressions for frame using a linear combination of the per-frame predictions of the timesteps to . The weights for the linear combination are computed using a neural network that gets the audio-expressions as input (which results in an expression-aware filtering). The filter weight prediction network consists of five 1D convolutions followed by a linear layer with softmax activation (see supplemental material for detailed description). This content aware temporal filtering is also inspired by the self-attention mechanism .
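The filtering step itself, a softmax-normalized linear combination of neighboring per-frame predictions, can be sketched in plain Python. The scores are passed in directly here; in the setting described above they would be produced by the filter weight prediction network, which is not modeled. The function names and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def filter_expressions(per_frame, scores):
    """Blend the per-frame expression vectors of a temporal neighborhood.

    per_frame: list of T expression vectors (noisy per-frame predictions).
    scores: list of T unnormalized filter scores, one per timestep.
    Returns the weighted sum, with weights given by softmax(scores).
    """
    if len(per_frame) != len(scores):
        raise ValueError("need exactly one score per timestep")
    weights = softmax(scores)
    dim = len(per_frame[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, per_frame))
        for d in range(dim)
    ]


# Toy neighborhood of three timesteps with two coefficients each.
neighborhood = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
uniform = filter_expressions(neighborhood, [0.0, 0.0, 0.0])
peaked = filter_expressions(neighborhood, [0.0, 50.0, 0.0])
```

Equal scores reduce to a plain moving average, while a dominant score lets the filter pass the corresponding frame through almost unchanged; because the weights are non-negative and sum to one, the output always stays within the range of the inputs.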
To retrieve the 3D model from this audio-expression space, we learn a person-specific audio-expression blendshape basis, which we constrain by the generic blendshape basis of our statistical face model. I.e., the audio-expression blendshapes of a person are a linear combination of the generic blendshapes. This linear relation results in a linear mapping from the audio-expression space, which is the output of the generalized network, to the generic blendshape basis. This linear mapping is person-specific, resulting in matrices with dimension during training ( being the number of training sequences and being the number of generic blendshapes).
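The chain from audio-expression coefficients to mesh vertices is two linear maps, which the following sketch makes explicit. All names (`mapping`, `generic_basis`, `neutral`) and the tiny dimensions are hypothetical: a real face model has thousands of vertices and the dimensions stated in the paper, and the neutral geometry would also depend on the shape parameters.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) with a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def audio_expr_to_vertices(audio_expr, mapping, generic_basis, neutral):
    """Map audio-expression coefficients to flattened vertex coordinates.

    audio_expr: K audio-expression coefficients (network output).
    mapping: N x K person-specific matrix into the generic blendshape space.
    generic_basis: 3V x N generic delta-blendshape basis.
    neutral: 3V coordinates of the person's neutral face.
    """
    generic_coeffs = matvec(mapping, audio_expr)
    offsets = matvec(generic_basis, generic_coeffs)
    return [n + o for n, o in zip(neutral, offsets)]


# Toy setup: K = 2 audio expressions, N = 3 generic blendshapes, V = 1 vertex.
audio_expr = [1.0, 0.5]
mapping = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
generic_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
neutral = [0.0, 0.0, 10.0]
vertices = audio_expr_to_vertices(audio_expr, mapping, generic_basis, neutral)
```

Because both steps are linear, the product of `generic_basis` and `mapping` acts as the person-specific audio-expression blendshape basis, and each of its columns is a linear combination of generic blendshapes.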
The network and the mapping matrices are learned end-to-end using the visually tracked training corpus. We employ a vertex-based loss function, with a higher weight (10x) on the mouth region of the face model. Specifically, we compute a vertex-to-vertex distance between the audio-based predicted and the visually tracked face model in terms of a root mean squared (RMS) distance:
$$E_{expression} = \mathrm{RMS}(v_t - v^*_t) + \lambda \cdot E_{temporal},$$
with $v_t$, the vertices based on the filtered expression estimation of frame $t$, and $v^*_t$ being the visually tracked face vertices. In addition to the absolute loss between predictions and the visually tracked face geometry, we also use a temporal loss that considers the vertex displacements of consecutive frames:
$$E_{temporal} = \mathrm{RMS}\big((v_{t+1} - v_t) - (v^*_{t+1} - v^*_t)\big) + \mathrm{RMS}\big((v_t - v_{t-1}) - (v^*_t - v^*_{t-1})\big) + \mathrm{RMS}\big((v_{t+1} - v_{t-1}) - (v^*_{t+1} - v^*_{t-1})\big).$$
These forward, backward and central differences are weighted with $\lambda$ (in our experiments ). The losses are measured in millimeters.
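The loss described above, an RMS vertex distance plus weighted forward, backward and central displacement terms, can be sketched in plain Python. How exactly the mouth weight enters the RMS is an assumption here (a weighted mean of squared distances), as are the function names and the value passed for the temporal weight `lam`.

```python
import math


def rms(pred, target, weights):
    """Weighted root mean squared distance between two lists of 3D points."""
    total = 0.0
    for p, q, w in zip(pred, target, weights):
        total += w * sum((a - b) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / sum(weights))


def displacement(frame_a, frame_b):
    """Per-vertex displacement frame_a - frame_b."""
    return [[x - y for x, y in zip(p, q)] for p, q in zip(frame_a, frame_b)]


def expression_loss(pred, tracked, weights, lam):
    """pred, tracked: frames (t-1, t, t+1), each a list of 3D vertices."""
    absolute = rms(pred[1], tracked[1], weights)
    pairs = [(2, 1), (1, 0), (2, 0)]  # forward, backward, central differences
    temporal = sum(
        rms(displacement(pred[i], pred[j]),
            displacement(tracked[i], tracked[j]), weights)
        for i, j in pairs
    )
    return absolute + lam * temporal


# Two vertices; the first is treated as a mouth vertex (assumed weight 10).
weights = [10.0, 1.0]
tracked = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]] for _ in range(3)]
shifted = [[[x + 3.0, y + 4.0, z] for x, y, z in frame] for frame in tracked]
loss_same = expression_loss(tracked, tracked, weights, lam=1.0)
loss_shifted = expression_loss(shifted, tracked, weights, lam=1.0)
```

A constant offset between prediction and tracking is penalized only by the absolute term, since all frame-to-frame displacements agree; the temporal terms react when the predicted motion differs from the tracked motion, which is what suppresses jitter.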
5.2 Neural Face Rendering
Based on recent advances in neural rendering, we employ a deferred neural rendering technique that is based on neural textures to store the appearance of a face. Our rendering pipeline synthesizes the lower face in the target video based on the audio-driven expression estimations. Specifically, we use two networks: one that focuses on the face interior, and another that embeds this rendering into the original image. The estimated 3D face model is rendered using the rigid pose observed from the original target image. We render a neural texture with a resolution of . The network for the face interior translates these rendered feature descriptors to RGB colors. The network uses a structure similar to a classical U-Net with 5 layers. But instead of strided convolutions that result in a downsampling in each layer, we use dilated convolutions with increasing dilation factor and a stride of one. Instead of transposed convolutions, we use standard convolutions. All convolutions have kernel size . Note that using dilated instead of strided convolutions does not increase the number of learnable parameters, but it increases the memory load during training and testing. Dilated convolutions help to reduce visual artifacts and lead to smoother results (also temporally). The second network, which blends the face interior with the ’background image’, has the same structure. To remove potential movements of the chin in the background image, we erode the background image around the rendered face. Thus, the task of this second network is to inpaint the region between the face and the background.
We use a per-frame loss function that is based on an $\ell_1$ loss to measure absolute errors and a VGG style loss.
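The trade-off of dilated convolutions can be checked with a few lines of arithmetic. The sketch below assumes a kernel size of 3, doubling dilation factors (1, 2, 4, 8, 16) and arbitrary channel counts; none of these values are taken from the text above, and the helper names are hypothetical.

```python
def receptive_field(kernel, dilations):
    """Receptive field of stacked stride-1 convolutions with given dilations."""
    field = 1
    for d in dilations:
        field += d * (kernel - 1)
    return field


def same_padding(kernel, dilation):
    """Padding that keeps the spatial size unchanged at stride 1 (odd kernel)."""
    return dilation * (kernel - 1) // 2


def conv_params(in_ch, out_ch, kernel, bias=True):
    """Learnable parameters of a 2D convolution.

    Neither dilation nor stride appears in this count.
    """
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


dilated_field = receptive_field(3, [1, 2, 4, 8, 16])
plain_field = receptive_field(3, [1, 1, 1, 1, 1])
params = conv_params(16, 32, 3)
print(dilated_field, plain_field, params)
```

With these assumed values, five dilated layers cover a 63-pixel context compared to 11 pixels without dilation, using the same number of weights. Since the stride is one, every layer keeps the full spatial resolution, which is where the additional memory load comes from.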
with being the final synthetic image, the ground truth image and the intermediate result of the first network that focuses on the face interior (loss is masked to this region).
Our training procedure has two stages – the generalization and the specialization phase. In the first phase, we optimize for the shared network parameters that enable a generalization among different source actors. Specifically, we train the audio-based expression estimation among all sequences from our dataset (see Sec. 4) in a supervised fashion. Given the face tracking information acquired in an automatic data preprocessing step, we know the 3D face model of a specific person for every frame. In the training process, we reproduce these 3D reconstructions based on the audio input by optimizing the network parameters and the person-specific mapping from the audio expression space to the 3D space.
In the second phase, the rendering network for a specific target sequence is trained. Given the ground truth images and the visual tracking information, we train the deferred neural renderer end-to-end including the neural texture.
Our pipeline is implemented in PyTorch. For both stages, we use the Adam optimizer with default settings (, , ) and a learning rate of . The Audio2ExpressionNet is trained for epochs (resulting in a training time of hours on an Nvidia 1080Ti) with a learning rate decay for the last epochs, a batch size of and Xavier initialization. The rendering networks are also trained for epochs for each target person individually with a batch size of ( hours training time, hours in case of strided convolutions).
New target video
Since the audio-based expression estimation network is generalized among multiple persons, we can apply it to unseen actors. The person specific mapping between the predicted audio expression space coefficients and the expression space of the new person can be obtained by solving a linear system of equations. Specifically, we extract the audio-expression for all training images and compute the linear mapping to the expressions that are visually estimated. In addition to this step, the person-specific rendering network for the new target video is trained from scratch.
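Fitting the person-specific mapping for a new target amounts to a linear least-squares problem. The sketch below uses NumPy's `numpy.linalg.lstsq` on synthetic data; the dimensions, the random data, and the function name `fit_person_mapping` are assumptions for illustration, and the authors' actual solver is not specified here.

```python
import numpy as np


def fit_person_mapping(audio_expr, tracked_expr):
    """Least-squares fit of M such that tracked_expr ~ audio_expr @ M.T.

    audio_expr: (num_frames, K) audio-expression coefficients predicted by
        the generalized network for the frames of the target video.
    tracked_expr: (num_frames, N) visually tracked expression coefficients.
    Returns M with shape (N, K).
    """
    solution, _, _, _ = np.linalg.lstsq(audio_expr, tracked_expr, rcond=None)
    return solution.T


# Synthetic check: recover a known mapping from noise-free data.
rng = np.random.default_rng(0)
num_frames, K, N = 200, 4, 6  # assumed toy dimensions
audio_expr = rng.standard_normal((num_frames, K))
true_mapping = rng.standard_normal((N, K))
tracked_expr = audio_expr @ true_mapping.T
mapping = fit_person_mapping(audio_expr, tracked_expr)
```

With far more frames than audio-expression dimensions, the system is heavily overdetermined, so on real, noisy tracking data the fit averages over per-frame errors rather than reproducing them.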
At test time, we only require a source audio sequence. Based on the target actor selection, we use the corresponding person-specific mapping. The mapping from the audio features to the person specific expression space takes less than on an Nvidia 1080Ti. Generation of the 3D model and the rasterization using these predictions takes another . The deferred neural rendering takes which results in a real-time capable pipeline.
Our pipeline is trained on real video sequences, where the audio is in sync with the visual content. Thus, we learned a mapping directly from audio to video that ensures synchronicity. Instead of going directly from text to video, where such a natural training corpus is not available, we synthesize voice from the text and feed this into our pipeline. For our experiments, we used samples from the DNN-based text-to-speech demo of IBM Watson.
Neural Voice Puppetry has several important use cases, i.e., audio-driven video avatars, video dubbing and text-driven video synthesis of a talking head, see supplemental video. In the following sections, we discuss these results including comparisons to state-of-the-art approaches.
6.1 Audio-driven model-based video avatars:
First, we discuss model-based video avatars that can be controlled by audio input. In Fig. 4, we show a representative image from a comparison to Taylor et al., Karras et al. and Suwajanakorn et al. All three methods were published at SIGGRAPH 2017, where this sequence was shown as a direct comparison; thus, all results are generated by the original implementations of the authors. Only the method of Suwajanakorn et al. is able to produce photo-realistic output. The method is tailored to the scenario where a large video dataset of the target person is available and is, thus, limited in its applicability. They demonstrate it on sequences of Obama, using hours of training data and hours for validation. In contrast, our method works on short min target video clips. In our supplemental video, we show multiple comparisons to VOCA. Fig. 5 shows an image of a legacy Winston Churchill sequence. In contrast to VOCA, our aim is to generate photo-realistic output videos that are in sync with the audio. VOCA focuses on the 3D geometry and requires a 4D training corpus, while our approach uses a 3D proxy only as an intermediate step and works on videos from the Internet. Our 3D proxy is based on a generic face model and, thus, lacks the details of a person-specific modeled mesh. Nevertheless, using our neural rendering approach, we are able to generate photo-realistic results.
6.2 Audio-driven 2D-based image avatars:
’You said that?’  is a GAN-based method that works without an explicit 3D model. It is operating in a normalized space of facial imagery (cropped, frontal faces) and needs a single image of the target person. In contrast, our method employs a 3D model to ensure 3D consistent movements in the output video. Instead of a normalized image, we generate an output that is embedded in a real video (see Fig. 6). Similar to ’You said that?’, Vougioukas et al.  generate talking head animations from a still image, including movements of eyebrows and eye blinks. Fig. 6 also shows a comparison to this method on a sequence of President Trump that is driven by the audio of an impersonator.
6.3 Video dubbing:
State-of-the-art video dubbing is based on video-driven facial reenactment [11, 29, 19, 30, 18]. In contrast, our method relies only on the voice of the dubber. ’Deferred Neural Rendering’ is a generic neural rendering approach, but the authors also show its usage in the scenario of facial reenactment. It builds upon the Face2Face pipeline and directly transfers the deformations from the source to the target actor. Thus, tracking errors that occur in the source video (e.g., due to occlusions or fast motions) are transferred to the target video. In a dubbing scenario, the goal is to keep the talking style of the target actor, which is not the case for [11, 29, 19, 30]. To compensate for the influence of the source actor’s talking style, Kim et al. proposed a method to map from the source style to the target actor style. Our approach directly operates in the target actor expression space; thus, no mapping is needed (we also do not capture the source actor style). This enables us to also work on strong expressions, as shown in Fig. 7.
6.4 Text-driven video synthesis
Fried et al. presented ’Text-based Editing of Talking-head Video’  which provides a video editing tool that is based on the transcript of the video. The method reassembles captured expression snippets from the target video, requiring blending heuristics. To achieve their results they rely on more than one hour of training data. We show a direct comparison to this method in the supplemental video. Note, our method only uses the synthetic audio sequence as input, while the method of Fried et al. uses both the transcript and the audio. In the comparison our method generates the entire video, while the text-based editing method only synthesizes the frames of the new three words.
6.5 Ablation studies
We use self-reenactment to evaluate our pipeline (Fig. 8), since it gives us access to a ground truth video sequence where we can also retrieve visual face tracking. As a distance measurement, we use an distance in color space (colors in [0,1]). Using this measure, we evaluate the rendering network (assuming good visual face tracking) and the entire pipeline. Specifically, we compare the results using visual tracked mouth movements to the results using audio-based predictions (see video). The mean color difference of the re-rendering on a test sequence of frame is for the visual and for the audio-based expressions.
In the supplemental video we also show a side-by-side comparison of our rendering network using dilated convolutions and our network with strided convolutions (and a kernel size of to reduce block artifacts in the upsampling). Both networks are trained with the same number of epochs (). As can be seen, dilated convolutions lead to visually more pleasing results (smoother in spatial and temporal domain).
Our results cover different target persons, which demonstrates the wide applicability of our method and shows that we are able to map the generalized audio-expression space to different person-specific talking styles and appearances. As can be seen in the supplemental video, the expression estimation network that is trained on multiple target sequences ( frames) results in more coherent predictions than the network solely trained on a sequence of Obama ( frames). Using more target videos increases the training corpus size and the variety of input voices and, thus, leads to more robustness.
Our training corpus is based on German news speakers. Nevertheless, most of our results are in English and show good transferability. In the video, we also show a comparison of the transfer from different source languages to different target videos that are originally also in different languages.
We further quantify the output quality of our approach in a user study. To this end, we show sequences of the competing methods, as well as results of our method. The attendees, who have a computer science background, rated synchronicity and visual quality (’very bad’, ’bad’, ’neither bad nor good’, ’good’, ’very good’). The study consists of videos which are presented to the user in randomized order. See the supplemental material for the collection of videos we used and the statistics. In Fig. 9, we show the percentage of attendees that rated the specific approach good or very good. As can be seen, our approach gives the best visual quality and also state-of-the-art audio-visual sync among photo-realistic methods based on audio input, similar to the video-based approach of Thies et al. [30]. The method of Vougioukas [34] achieves higher audio-visual sync but lacks visual quality and is not able to synthesize natural videos.
As can be seen in the supplemental video, our approach works robustly on different audio sources and target videos, but it still has limitations. In particular, our method fails when the audio stream contains multiple voices. Recent work addresses this ’cocktail party’ issue using visual cues [7]. As with all other reenactment approaches, the target videos have to be occlusion-free to allow good visual tracking. Another limitation is the fixed talking style: we assume that the target actor has a constant talking style during a target sequence. In follow-up work, we plan to estimate the talking style from the audio signal to control the expressiveness of the facial motions.
In this work, we presented a novel audio-driven facial reenactment approach that generalizes across different audio sources. This allows us not only to synthesize videos of a talking head from an audio sequence of another person, but also to generate a photo-realistic video based on a synthesized voice; i.e., we can achieve text-driven video synthesis that is in sync with an artificial voice. We hope that our work is a stepping stone toward audio-visual assistants.
We gratefully acknowledge the support by the AI Foundation, Google, Sony, a TUM-IAS Rudolf Mößbauer Fellowship, the ERC Starting Grant Scan2CAD (804724), the ERC Consolidator Grant 4DRepLy (770784), and a Google Faculty Award.
Appendix A Network Architectures
A core component of Neural Voice Puppetry is the estimation of facial expressions based on audio. To retrieve temporally coherent estimations, we employ a two-stage process. In the first stage, we estimate per-frame expressions based on DeepSpeech features. The network is depicted in Fig. 10. The output of this network is an audio-expression vector of length . This audio-expression is temporally noisy and is filtered using an expression-aware filtering network, which can be trained in conjunction with the per-frame expression estimation network. The temporal filtering mechanism is also depicted in the main paper. The underlying network that predicts the filter weights gets as input per-frame predicted audio expressions. We apply 1D-convolutional filters with kernel size that reduce the feature space successively from over , , , to . Each of these convolutions has a bias and is followed by a leaky ReLU activation (negative slope of ). The output of the convolutional network is input to a fully connected layer with bias that maps the input to the filter weights, which are normalized using a softmax function.
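The final step of the filtering stage, softmax-normalized weights applied to a window of per-frame audio expressions, can be illustrated with a short sketch. It is only an illustration under stated assumptions: in the method described above the raw scores come from the convolutional network and the fully connected layer, whereas here they are passed in directly, and the names `softmax` and `filter_window` as well as the window length are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def filter_window(expressions, scores):
    """Blend a temporal window of expression vectors into one vector.

    expressions: list of T per-frame expression vectors (equal length).
    scores: T raw scores; in the paper these would be predicted by the
    filter network, here they are plain inputs (assumption).
    """
    if len(expressions) != len(scores):
        raise ValueError("need one score per frame in the window")
    weights = softmax(scores)
    dim = len(expressions[0])
    # Weighted sum over time, computed independently per coefficient
    return [sum(w * e[i] for w, e in zip(weights, expressions))
            for i in range(dim)]
```

Because the weights are non-negative and sum to one, the filtered vector is a convex combination of the per-frame predictions; equal scores reduce the filter to a plain temporal average.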
To train the network we apply a vertex-based loss as described in the main paper. The vertices that refer to the mouth region are weighted with a higher loss. We use the mask that is depicted in Fig. 11. For generalization we used a dataset composed of commentators from the German public TV (e.g., https://www.tagesschau.de/multimedia/video/video-587039.html). In total the dataset contained videos.
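As a rough illustration of a mouth-weighted vertex loss, the sketch below computes a weighted mean of per-vertex squared distances. The exact loss is defined in the main paper and is not reproduced here; the squared Euclidean distance, the normalization by the sum of weights, the boolean mask format, and the value of `mouth_weight` are all assumptions of this example.

```python
def weighted_vertex_loss(predicted, target, mouth_mask, mouth_weight=10.0):
    """Weighted mean of per-vertex squared distances.

    predicted, target: lists of (x, y, z) vertex positions.
    mouth_mask: list of booleans, True for vertices in the mouth region.
    mouth_weight: hypothetical weight for mouth vertices (others get 1.0).
    """
    if not (len(predicted) == len(target) == len(mouth_mask)):
        raise ValueError("inputs must have one entry per vertex")
    weighted_sum, weight_total = 0.0, 0.0
    for p, t, is_mouth in zip(predicted, target, mouth_mask):
        w = mouth_weight if is_mouth else 1.0
        # Squared Euclidean distance between predicted and target vertex
        sq_dist = sum((a - b) ** 2 for a, b in zip(p, t))
        weighted_sum += w * sq_dist
        weight_total += w
    return weighted_sum / weight_total
```

With such a weighting, an error of the same magnitude costs more at a lip vertex than at a cheek vertex, which pushes the optimization toward accurate mouth shapes.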
In Fig. 12, we show an overview of our neural rendering approach. Based on the expression predictions, which drive a person-specific 3D face model, we render a neural texture to the image space of the target video. A first network converts the neural descriptors sampled from the neural texture to RGB color values. A second network embeds this image into the target video frame. We erode the target image around the synthetic image to remove motions of the target actor such as chin movements. Using this eroded target image as background and the output of the first network, the second network outputs the final image. Both networks have the same structure; only the input dimensions differ. The first network gets an image with feature channels as input (dimension of the neural descriptors that are sampled from a neural texture with dimensions ), while the second network composites the background and the output of the first network, resulting in an channel input. The networks are implemented in the Pix2Pix framework [14]. Instead of a classical U-Net with strided convolutions, we build on dilated convolutions. Specifically, we replace the strided convolutions in a U-Net of depth . Instead of transposed convolutions we use standard convolutions, since we do not downsample the image and always keep the same image dimensions. Note that we also keep the skip connections of the classical U-Net. The number of features per layer is in our experiments, resulting in networks with parameters (which is low in comparison to the network used in Deferred Neural Rendering [30] with parameters). We employ the structure that is depicted in Fig. 14. Each convolution layer has a kernel size of and is followed by a leaky ReLU with a negative slope of . All layers have stride , which means that the intermediate feature maps of all layers have the same spatial size as the input (). The first convolutional layer maps to a feature space of dimension and has a dilation of . With increasing layer depth, the feature space dimension as well as the dilation increases by a factor of . After layer depth , we use standard convolutions.
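To make the size-preserving property of dilated convolutions concrete, here is a small 1D sketch. It is not the rendering network itself, which operates on 2D images: the kernel values, the zero padding, and the defaults `factor=2` and `start=1` in the dilation schedule are hypothetical choices, since the corresponding values are not recoverable from the text above.

```python
def same_padding(kernel_size, dilation):
    """Zero padding per side that keeps the output length equal to the
    input length for stride 1 and an odd kernel size."""
    if kernel_size % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    return dilation * (kernel_size - 1) // 2


def dilated_conv1d(signal, kernel, dilation):
    """Stride-1 dilated cross-correlation with zero padding (1D)."""
    k = len(kernel)
    pad = same_padding(k, dilation)
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    # Kernel taps are spaced `dilation` samples apart
    return [sum(kernel[j] * padded[i + j * dilation] for j in range(k))
            for i in range(len(signal))]


def dilation_schedule(depth, factor=2, start=1):
    """Dilation per layer when it grows by a constant factor with depth."""
    return [start * factor ** d for d in range(depth)]
```

Since every layer keeps the input length, the receptive field grows through the spacing of the kernel taps rather than through downsampling, which is why no transposed convolutions are needed afterwards.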
Appendix B User Study
In this section, we present the statistics of our user study. Fig. 15 shows a collection of videos that we used for the user study. The clips are from the official videos of the corresponding methods and are similar to the clips that we show in our supplemental video. Fig. 15 shows the average answers to our questions, including the variance.
In the user study we asked the following questions:
How would you rate the audio-visual alignment (lip sync) in this video?
How would you rate the visual quality of the video?
The answer options were ’very good’, ’good’, ’neither good nor bad’, ’bad’, and ’very bad’.
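The statistics reported for the study (average answer, variance, and the share of ’good’ or ’very good’ ratings) could be computed along the following lines. This is a sketch under assumptions: the mapping of the five answers to the numbers 1 to 5, the use of the population variance, and the function name `summarize` are choices made for the example, and any answers passed in are placeholders rather than data from the study.

```python
from statistics import mean, pvariance

# Hypothetical numeric coding of the five answer options
SCALE = {
    "very bad": 1,
    "bad": 2,
    "neither good nor bad": 3,
    "good": 4,
    "very good": 5,
}


def summarize(answers):
    """Summarize the answers given to one question for one method."""
    if not answers:
        raise ValueError("need at least one answer")
    scores = [SCALE[a.strip().lower()] for a in answers]
    # Fraction of participants who answered 'good' or 'very good'
    share_good = sum(s >= 4 for s in scores) / len(scores)
    return {
        "mean": mean(scores),
        "variance": pvariance(scores),
        "share_good": share_good,
    }
```

Calling `summarize` once per method and question yields the values needed for a bar chart of the share of positive ratings and for a table of averages with variances.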
Appendix C Ethical Considerations
In conjunction with person-specific audio generators like Jia et al. [15], a pipeline can be established that creates video-realistic (temporal voice- and photo-realistic) content of a person. This is perfect for creative people in movie and content production, to edit and create new videos. On the other hand, it can be misused. To this end, the field of digital media forensics is getting more attention. Recent publications [23] show that humans have a hard time detecting fakes, especially in the case of compressed video content. Learned detectors show promising results, but lack generalizability to manipulation methods that are not in the training corpus. Few-shot learning methods like ForensicTransfer [5] try to solve this issue. As part of our responsibility, we are happy to share generated videos of our method with the forensics community. Nevertheless, our approach enables several practical use-cases, ranging from movie dubbing to text-driven photo-realistic video avatars. We hope that our work is a stepping stone in the direction of audio-based reenactment and inspires more follow-up projects in this field.
- (1999) A morphable model for the synthesis of 3D faces. In ACM Transactions on Graphics (Proceedings of SIGGRAPH), pp. 187–194. Cited by: §3, §4.1.
- (1997) Video rewrite: driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’97, pp. 353–360. Cited by: §2.
- (2017) You said that?. In British Machine Vision Conference (BMVC), Cited by: §1, §2, Figure 6, §6.2.
- (2016) Lip reading in the wild. In Asian Conference on Computer Vision (ACCV), Cited by: §2.
- (2018) ForensicTransfer: weakly-supervised domain adaptation for forgery detection. arXiv. Cited by: Appendix C.
- (2019) Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR). Cited by: §2, §4.1, Figure 5, §6.1.
- (2018-07) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans. on Graph. 37 (4), pp. 112:1–112:11. Cited by: §7.
- (2002) Trainable videorealistic speech animation. In ACM Trans. on Graph., pp. 388–398. Cited by: §2.
- A. J. Sellen and S. B. Wilbur (Eds.) (1997) Video-mediated communication. L. Erlbaum Associates Inc.. Cited by: §1.
- (2019-07) Text-based editing of talking-head video. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 38 (4), pp. 68:1–68:14. Cited by: §2, §6.4.
- (2015) VDub - modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum (Proceedings of EUROGRAPHICS), Cited by: §1, §2, §6.3.
- (2016) Reconstruction of personalized 3D face rigs from monocular video. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 35 (3), pp. 28. Cited by: §2.
- (2014-12) DeepSpeech: scaling up end-to-end speech recognition. Cited by: §2, §4.1, §5.1.
- (2016) Image-to-image translation with conditional adversarial networks. arXiv. Cited by: Appendix A.
- (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In International Conference on Neural Information Processing Systems (NIPS), pp. 4485–4495. Cited by: Appendix C, §1.
- (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §5.2.
- (2017-07) Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 36 (4). Cited by: §2, §6.1.
- (2019) Neural style-preserving visual dubbing. ACM Trans. on Graph. (Proceedings of SIGGRAPH-Asia). Cited by: §1, §2, §6.3.
- (2018-07) Deep video portraits. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 37 (4), pp. 163:1–163:14. Cited by: §2, §6.3.
- (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. Cited by: §5.3.
- (2017-11) Learning a model of facial shape and expression from 4D scans. ACM Trans. on Graph. 36 (6). Note: Two first authors contributed equally Cited by: §2.
- (2018-07) Deep appearance models for face rendering. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 37 (4), pp. 68:1–68:13. Cited by: §1.
- (2019) FaceForensics++: learning to detect manipulated facial images. arXiv. Cited by: Appendix C.
- (2017-07) Synthesizing obama: learning lip sync from audio. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 36 (4). Cited by: §1, §2, §6.1.
- (2013) Seeing is believing but is hearing? comparing audio and video communication for young children. Frontiers in Psychology 4, pp. 64. Cited by: §1.
- (2017-07) A deep learning approach for generalized speech animation. ACM Trans. on Graph. 36 (4). Cited by: §2, §6.1.
- (2018) FaceVR: real-time gaze-aware facial reenactment in virtual reality. ACM Trans. on Graph.. Cited by: §1.
- (2018) HeadOn: real-time reenactment of human portrait videos. ACM Trans. on Graph. (Proceedings of SIGGRAPH). Cited by: §1.
- (2016) Face2Face: Real-time Face Capture and Reenactment of RGB Videos. In CVPR, Cited by: §1, §2, §3, §4.1, §5.1, §6.3.
- (2019) Deferred neural rendering: image synthesis using neural textures. ACM Trans. on Graph. (Proceedings of SIGGRAPH). Cited by: Appendix A, §2, §5.2, §6.3, §6.5.
- (2019) Synthesising 3D facial motion from ”in-the-wild” speech. CoRR abs/1904.07002. Cited by: §2.
- (2016) WaveNet: a generative model for raw audio. arXiv. Cited by: §1.
- (2018) End-to-end speech-driven facial animation with temporal gans. In BMVC, Cited by: §1, §2.
- (2019-10-13) Realistic speech-driven facial animation with gans. International Journal of Computer Vision (IJCV). Cited by: §2, Figure 6, §6.2, §6.5.
- (2018) Self-attention generative adversarial networks. arXiv:1805.08318. Cited by: §5.1.
- (2018) State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications. Computer Graphics Forum (Eurographics State of the Art Reports) 37 (2). Cited by: §1, §2.