Learning to Have an Ear for Face Super-Resolution
We propose a novel method to perform extreme (16×) face super-resolution by exploiting audio. Super-resolution is the task of recovering a high-resolution image from a low-resolution one. When the resolution of the input image is too low, the loss of information is so severe that the details of the original identity are lost. However, when the low-resolution image is extracted from a video, the audio track is also available. Because the audio carries information about the face identity, we propose to exploit it in the face reconstruction process. Towards this goal, we propose a model and a training procedure to extract information about the identity of a person from their audio track and to combine it with the information extracted from the low-resolution input image, which relates more to the pose and colors of the face. We demonstrate that the combination of these two inputs yields high-resolution images that better capture the correct identity of the face. In particular, we show that audio can assist in recovering attributes such as the gender and the identity, and thus improve the correctness of the image reconstruction process. Our procedure does not make use of human annotation and thus can be easily trained with existing video datasets. Moreover, we show that our model allows one to mix low-resolution images and audio from different videos and to generate realistic faces with semantically meaningful combinations.
Image super-resolution is the task of recovering details of an image that has been degraded through a spatial quantization process. A formulation of this problem is to define a (low) source and a (high) target resolution, which are related by a fixed scaling factor. Typical scaling factors correspond to a modest increase of resolution from the input to the output. For these factors the restoration task is close to a sharpening problem and the focus is more on recovering small structures such as edges and corners or even small texture patches. In the more extreme case where the scaling factor is 16× or above, the loss of detail is so considerable that important semantic information might be lost. This is the case, for example, of images of faces at very low resolution, where information about the original identity of the person is no longer available (see, for example, the low-resolution input in Fig. 1). What is left is the viewpoint and colors of the face, the clothing and the background. While it is possible to hallucinate plausible high-resolution images from such limited information, it would be difficult to correctly recover useful attributes such as the gender or even the identity.
If the low-resolution image of a face is extracted from a video, we might also have access to the audio of that person. Despite the very different nature of aural and visual signals, they both capture some shared attributes of a person and, in particular, their identity. In fact, when we hear the voice of an iconic actor we can often picture his or her face in our mind. What is more remarkable is that we can even match the looks of a person we have never met with the sound of their voice (see Smith et al. (2016)). While this aural-visual redundancy may be a benefit to social communication, it can also be a benefit to image processing and, in particular, image super-resolution. Therefore, we propose to build a model for face super-resolution by exploiting both a low-resolution image and its audio. To the best of our knowledge, this has never been explored before.
A natural way to solve this task is to build a model with two encoding networks, one for the low-resolution image and one for the audio, and a decoding network mapping the concatenation of the encoders' outputs to a high-resolution image. Unfortunately, we found experimentally that the joint training of these networks leads to a degenerate solution, where the audio encoder learns to ignore its input. We conjecture that this is due to the different nature of the aural and visual signals, which corresponds to a different level of difficulty in disentangling their information to a common latent space. Thus, it might be easier for the training algorithm to overfit on the training set with the low-resolution images than to exploit the audio input. Motivated by this conjecture, we propose to train the low-resolution image encoder and the audio encoder separately, so that their disentanglement accuracy can be equalized.
To this aim, we first train a generator that starts from a Gaussian latent space and outputs high-resolution images (see Fig. 1). The generator is trained as in the recent StyleGAN of Karras et al. (2018), which produces very high quality samples and a hierarchical structure of the latent space. Then, we train a reference encoder to invert the generator by using an autoencoding constraint. The reference encoder maps a high-resolution image to the latent space of the generator, which then outputs an approximation of the input image. Next, given a matching high/low-resolution image pair, we pre-train a low-resolution image encoder to map its input to the same latent representation that the reference encoder produces for the high-resolution image. As a second step, we train an audio encoder and a fusion network to improve the latent representation of the (fixed) low-resolution image encoder. To speed up the training of the audio encoder, we also pre-train it by using as target latent representation the average of the outputs of the reference encoder on a high-resolution image and its horizontally mirrored version (this averaging removes information, such as the viewpoint, that the audio cannot carry). In Section 3, we describe in detail the training of each of the above models. Finally, in Section 4 we demonstrate experimentally that the proposed architecture and training procedure successfully fuse aural and visual data. We show that the fusion yields high-resolution images with more accurate identities and gender attributes than the reconstruction based on the low-resolution image alone. We also show that the fusion is semantically meaningful by mixing low-resolution images and audio from different videos (see an example in Fig. 1 (b)).
2 Related Work
General Super-Resolution. Advances in general super-resolution have been largely driven by the introduction of task-specific network architectures and components. Examples are: Residual Dense Networks (Zhang et al. (2018)), Multi-scale Residual Networks (Li et al. (2018)), lightweight Cascading Residual Networks (Ahn et al. (2018)), Deep Recursive Residual Networks (Tai et al. (2017)), Deep Residual Channel Attention Networks (Zhang et al. (2018b)), Information Distillation Networks (Hui et al. (2018)), Spatial Feature Transformer layers (Wang et al. (2018)), Deep Back-Projection Networks (Haris et al. (2018)), Dual-State Recurrent Networks (Han et al. (2018)), and the Laplacian Pyramid Super-Resolution Networks (Lai et al. (2017)). In our method, we do not rely on task-specific architectures, although we leverage the design of a state-of-the-art generative model (Karras et al. (2018)). Many general super-resolution methods also make use of adversarial training. Ledig et al. (2017) presented SRGAN, a generative adversarial network for image super-resolution. Park et al. (2018) suggested a GAN based super-resolution method with an additional feature discriminator. Bulat et al. (2018) trained a GAN to learn the true image degradation model (high-to-low) and used it to train another low-to-high GAN. While we also make use of a GAN, our approach is quite different from those works since we pre-train and fix the generator of the GAN.
Face Super-Resolution. The face super-resolution problem has attracted a lot of attention in recent years. Several methods were proposed that tackle the problem with additional supervision and multi-task learning. Bulat and Tzimiropoulos (2018) proposed Super-FAN, a method that addresses both face super-resolution and landmark detection in a multi-task fashion. Yu et al. (2018) tied super-resolution to a facial component heatmap estimation. Chen et al. (2018) incorporated facial landmark heatmaps and parsing maps as a facial geometry prior. Zhang et al. (2018a) improved the recognizability of ultra-low-resolution faces with an identity loss that measures the identity difference between a hallucinated face and its high-resolution counterpart. Yu et al. (2018) proposed an attribute-guided upsampling CNN to reduce the ambiguity of one-to-many mappings in the case of super-resolution. In contrast, by using videos with corresponding audio tracks, our method does not rely on additional human annotation. Several face super-resolution models rely on the use of GANs to produce realistic high-resolution outputs (Bulat and Tzimiropoulos (2018); Chen et al. (2018); Yu et al. (2018)). Xu et al. (2017) adopted a GAN to learn a category-specific prior for face super-resolution. Our work also relies on the use of a GAN to learn a face-specific prior. Our model is more general, however, since it provides a one-to-many mapping through the conditioning on the audio.
Usage of Audio in Vision Tasks. The usage of audio in combination with video has received a lot of attention recently. Zhou et al. (2018) trained a CNN to directly predict the raw audio waveform from a sequence of video frames. Shlizerman et al. (2018) showed that class specific natural body dynamics can be predicted from an audio signal. Arandjelovic and Zisserman (2018) tackled the audio-video correspondence task and demonstrated the capability to localize semantic objects within an image based on audio. Owens and Efros (2018) additionally achieved on/off-screen audio source separation. Sterling et al. (2018) presented a multimodal neural network optimized for estimating an object’s geometry and material. Zhao et al. (2018) introduced PixelPlayer, a system that learns to separate the input sounds into a set of components that represent the sound from each pixel. Two recent works proposed visually-aided audio source separation frameworks (Gao et al. (2018); Ephrat et al. (2018)). Tian et al. (2018) addressed the problem of Audio-Visual Event localization. Some works tackled the problem of generating talking faces based on an input face image and an audio signal (Song et al. (2018); Zhu et al. (2018)). To the best of our knowledge we are the first to use audio in an image restoration task.
3 Using Audio for Face Super-Resolution
Our goal is to design a model that is able to generate high-resolution images based on a low-resolution input image and an additional audio signal. Each sample of the dataset therefore consists of a high-resolution image, the corresponding low-resolution image and a corresponding audio signal. Our model consists of several components: a low-resolution encoder, an audio encoder, a fusion network and a face generator. An overview of the complete architecture is given in Fig. 1.
3.1 Disentangling Representations of Aural and Visual Signals
As mentioned in the introduction, a natural choice to solve our task is to train a feedforward network to match the ground truth high-resolution image given its low-resolution image and audio signal. Experimentally, we found that such a system tends to ignore the audio signal and to yield a one-to-one mapping from a low-resolution to a single high-resolution image. We believe that this problem is due to the different nature of the aural and visual signals, and the choice of the structure of the latent space. The fusion of both signals requires disentangling their information to a common latent space through the encoders. However, the audio signal seems to require longer processing and more network capacity to convert it to the latent space. The difficulty of this conversion can also be aggravated by the structure of the latent space, which might be biased more towards images than audio. Ideally, the low-resolution image should only condition the feedforward network to produce the most likely corresponding high-resolution output and the audio signal should introduce some local variation (e.g., modifying the gender or the age of the output). Therefore, for the fusion to be effective it would be useful if the audio could act on some fixed intermediate representation from the low-resolution image, where face attributes present in the audio are disentangled.
For these reasons we opted to pre-train and fix the decoder network as the generator of a StyleGAN (Karras et al. (2018)). This model has been shown to produce realistic high-resolution images along with a good disentanglement of the factors of variation in the intermediate representations. Such a model should therefore act as a good prior for generating high-resolution face images and the disentangled intermediate representations should allow better editing based on the audio signal. Formally, we learn a generative model of face images, which maps samples from a Gaussian latent space to images, by optimizing the default non-saturating loss function of the StyleGAN (see Karras et al. (2018) for details).
3.2 Inverting the Generator
Instead of regressing the high-resolution images directly, we can regress the latent representations that, when fed to the generator network, reproduce the original high-resolution images. Our goal is that the fusion of the information provided by the low-resolution image and audio track will result in a reconstruction that is close to the original high-resolution image. We therefore want to find suitable targets for the training of our model. To this end, we train a high-resolution image encoder by minimizing an autoencoding loss between each high-resolution image and the generator's reconstruction from the encoding of that image.
The loss includes a perceptual term based on VGG16 features (see the supplementary material for more details). We found that regressing a single latent code is not sufficient to recover a good approximation of the high-resolution image. In the style-based generator each latent code is mapped to an intermediate style vector, which is then replicated and inserted at different layers of the generator (each corresponding to a different image scale). To improve the high-resolution reconstruction, we instead let the encoder generate a different latent code for each of these layers and feed the resulting style vectors to the corresponding layers of the generator. The output of the encoder is therefore a set of latent codes, one per layer, rather than a single vector. Note that this is not too dissimilar from the training of the style-based generator, where the style vectors of different images are randomly mixed at different scales.
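The objective of this section did not survive in the text. As a sketch only, and under notation that is assumed here rather than taken from the source ($G$ for the fixed generator, $E_h$ for the high-resolution encoder, $x_i^h$ for the $i$-th of $n$ high-resolution images, $\ell_{\text{feat}}$ for the VGG16 perceptual loss, $\lambda_f$ for its weight, and an assumed $\ell_1$ pixel term), an autoencoding objective of this kind could be written as:

```latex
\min_{E_h} \; \sum_{i=1}^{n}
  \left| G\big(E_h(x_i^h)\big) - x_i^h \right|_1
  + \lambda_f \, \ell_{\text{feat}}\big( G(E_h(x_i^h)),\; x_i^h \big)
```

Only the presence of the perceptual term is stated in the text; the pixel term and the weighting are illustrative choices.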
3.3 Pre-Training Low-Resolution and Audio Encoders
Given the high-resolution image encoder, we now have target encodings for the low-resolution and audio fusion. However, the training of a fusion model directly on these targets runs into some difficulties. As mentioned before, we find experimentally that, given enough capacity, a fusion model learns to predict the target encodings well, while ignoring the audio signal almost completely. To address this degenerate behavior, we train the low-resolution encoder and the audio encoder separately to extract as much information from the two modalities as possible and only later fuse them. To ensure that neither of the two encoders can overfit the whole training set, we partition it into two sets: one for the encoder pre-training and one for the later fusion training.
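The low-resolution encoder is trained to regress the high-resolution encodings from the low-resolution images, while also preserving the super-resolution constraint, i.e., that the downsampled generated image matches the low-resolution input image. Combining these constraints yields an objective with two terms: a regression term between the predicted and the target encodings, and a consistency term between the downsampled generated image and the low-resolution input.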
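In this objective, the image generated from the predicted encoding is downsampled before it is compared with the low-resolution input. In the case of the audio encoding, regressing all the information in the high-resolution encoding from the audio is not possible, as many of the factors of variation in that encoding, e.g., the pose of the face, are not present in the audio. To remove the pose from the encoding, we generate the targets for the audio encoder as the average of the encodings of the high-resolution image and of its horizontally flipped version. As it turns out, due to the disentangled representations of the generator, the reconstruction from this averaged encoding produces a neutral frontal-facing version of the face (see Fig. 2). The audio encoder is finally trained to regress these averaged targets from the audio input.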
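The two pre-training ingredients described above, the downsampling-consistency term and the mirrored-average target, can be illustrated with a small sketch. Everything named here is an assumption made for illustration: the toy encoder, average pooling as the downsampling operator, the L1 distance, and the weight `lam` are not taken from the source.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image stored as rows of pixel values
Code = List[float]         # latent encoding stored as a flat vector


def hflip(img: Image) -> Image:
    """Mirror an image horizontally by reversing every row."""
    return [list(reversed(row)) for row in img]


def mirrored_average_target(encode: Callable[[Image], Code], img: Image) -> Code:
    """Average the encodings of an image and of its mirrored version."""
    z, z_flip = encode(img), encode(hflip(img))
    return [0.5 * (a + b) for a, b in zip(z, z_flip)]


def downsample(img: Image, factor: int) -> Image:
    """Average-pool over non-overlapping factor x factor blocks (assumed operator)."""
    rows, cols = len(img) // factor, len(img[0]) // factor
    return [
        [
            sum(img[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / factor ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def flatten(img: Image) -> List[float]:
    """Concatenate the rows of an image into one vector."""
    return [p for row in img for p in row]


def l1(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def low_res_objective(z_pred: Code, z_target: Code, generated: Image,
                      x_low: Image, factor: int, lam: float) -> float:
    """Encoding regression term plus a weighted downsampling-consistency term."""
    consistency = l1(flatten(downsample(generated, factor)), flatten(x_low))
    return l1(z_pred, z_target) + lam * consistency


def toy_encode(img: Image) -> Code:
    """Hypothetical encoder: mean intensity of the left and right image halves."""
    half = len(img[0]) // 2
    left = [p for row in img for p in row[:half]]
    right = [p for row in img for p in row[half:]]
    return [sum(left) / len(left), sum(right) / len(right)]


face = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
target = mirrored_average_target(toy_encode, face)  # left/right asymmetry cancels
loss = low_res_objective([1.0, 2.0], [1.5, 2.0],
                         [[1.0, 3.0], [5.0, 7.0]], [[3.0]], factor=2, lam=0.5)
```

With the toy encoder, the left/right asymmetry of `face` plays the role of the pose: it is present in the encoding of the image and of its mirror, and it cancels in their average.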
3.4 Fusing Audio and Low-Resolution Encodings
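Given the pre-trained low-resolution and audio encoders, we now want to fuse the information provided by the two signals. Since the low-resolution encoder already provides a good approximation to the target encoding, it is reasonable to use it as a starting point for the final prediction. Conceptually, we can think of the low-resolution encoder as providing an encoding that results in a canonical face corresponding to the low-resolution image. Ambiguities in this encoding could then possibly be resolved via the use of the audio, which would provide an estimate of the residual between the target encoding and the low-resolution encoding. We therefore model the fusion mechanism as the sum of the low-resolution encoding and a residual predicted by the fusion network, which is a simple fully-connected network acting on the concatenation of the low-resolution and audio encodings. Since the audio encoding might be suboptimal for the fusion, we continue training the audio encoder along with the fusion network. The limited complexity of the fusion network prevents overfitting to the low-resolution encoding, but provides the necessary context for the computation of the residual. To summarize, we train the fusion network and fine-tune the audio encoder so that the fused encoding approximates the encoding of the high-resolution image, while the low-resolution encoder remains fixed.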
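The residual fusion described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the vector sizes, the single linear layer without a nonlinearity, and the hand-picked weights are all assumptions.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Single fully-connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse(z_low: Vector, z_audio: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Add a residual, predicted from both encodings, to the low-resolution encoding."""
    residual = linear(weights, bias, z_low + z_audio)  # list '+' concatenates
    return [zl + r for zl, r in zip(z_low, residual)]


z_low, z_audio = [1.0, 2.0], [0.5]
# Hypothetical weights that only read the audio entry of the concatenation.
weights = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
fused = fuse(z_low, z_audio, weights, [0.0, 0.0])
# With all-zero weights the residual vanishes and the low-resolution
# encoding passes through unchanged.
unchanged = fuse(z_low, z_audio, [[0.0] * 3, [0.0] * 3], [0.0, 0.0])
```

A consequence of the residual form is that the fusion starts from the low-resolution prediction: a fusion network that outputs zero reproduces the low-resolution encoding exactly, so the audio only has to explain what the image leaves ambiguous.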
We performed all our experiments on a subset of the VoxCeleb2 dataset (Chung et al. (2018)). The dataset contains over one million audio tracks extracted from 145K videos of people speaking. For the full training set we selected 104K videos with 545K audio tracks and extracted around 2M frames such that each speaker has at least 500 associated frames. We then split this dataset in half to create the encoder pre-training set and the fusion training set in such a way that both sets contain the same speakers but different videos. For the test set we selected 39K frames and 37K utterances from 25K videos not contained in the training set (again from the same speakers). In the end we selected around 4K speakers out of the 6K speakers in the full dataset (filtering out speakers with very few videos and audio tracks). Note that this selection is done purely to allow the evaluation via a speaker identity classifier.
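A split in which both halves contain the same speakers but different videos can be obtained by dividing each speaker's videos separately. The sketch below is one possible way to do this and is not taken from the source; the speaker and video identifiers and the alternating assignment are hypothetical.

```python
from typing import Dict, List, Tuple

VideoIndex = Dict[str, List[str]]  # speaker id -> video ids


def split_by_video(videos_by_speaker: VideoIndex) -> Tuple[VideoIndex, VideoIndex]:
    """Split every speaker's videos into two disjoint halves."""
    first: VideoIndex = {}
    second: VideoIndex = {}
    for speaker, videos in videos_by_speaker.items():
        ordered = sorted(videos)         # deterministic order
        first[speaker] = ordered[0::2]   # videos at even positions
        second[speaker] = ordered[1::2]  # videos at odd positions
    return first, second


index = {"spk_a": ["v3", "v1", "v2", "v4"], "spk_b": ["v7", "v9"]}
pretrain_set, fusion_set = split_by_video(index)
```

A speaker with a single video would end up with an empty list in the second set, so such speakers would have to be filtered out beforehand.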
The style-based generator was pre-trained on the full training set with all hyper-parameters set to their default values (see Karras et al. (2018) for details). During this training it saw a total of 31 million images. The high-resolution encoder was trained for 125K iterations with a batch size of 256 on the high-resolution images. The low-resolution encoder and the audio encoder were trained on the encoder pre-training set, for 45K iterations with a batch size of 256 and for 95K iterations with a batch size of 128, respectively. The inputs to the low-resolution encoder are the low-resolution images and the inputs to the audio encoder are audio log-spectrograms. The fine-tuning of the audio encoder and the training of the fusion layer were performed for 115K iterations on the fusion training set. We used the Adam optimizer (Kingma and Ba (2014)) with a fixed learning rate for the training of all the networks. A detailed description of the network architectures can be found in the supplementary material.
4.3 Gender and Identity Classification Accuracy as a Performance Measure
To evaluate the capability of our model to recover gender and other identity attributes based on the low-resolution and audio inputs, we propose to use the accuracy of a pre-trained identity classifier and a pre-trained gender classifier. To this end, we fine-tune two VGG-Face CNNs of Parkhi et al. (2015) on the training set for 10 epochs, one for each of the two face attributes. As shown in Table 1, these classifiers perform well on the test set on both face attributes.
Ablations. We performed ablation experiments to understand the information retained in the encoders and to justify the design of our final model. The accuracies of the identity and gender classifiers are reported in Table 1 for the following ablation experiments:
- (a)-(c) Individual components: the performance of the individual encoders on their own. Results for the high-resolution encoder, the low-resolution encoder and the audio encoder are reported.
- (d)-(f) Fusion strategies: the performance of different fusion strategies. We report results of our fusion model with a single fully-connected layer and fine-tuning of the audio encoder. We compare this to a more complex fusion network with three fully-connected layers and to a version of the fusion model without fine-tuning of the audio encoder.
We can observe that the audio encoder is able to predict the correct gender more often than the low-resolution encoder. All the fusion approaches lead to an improvement in terms of identity prediction over the low-resolution encoder and the audio encoder alone, thus showing that the information from both inputs is successfully integrated. Note that the limited performance of the high-resolution encoder might be due to some degree of mode-collapse in the generator and the resulting inability of the encoder and generator to exactly reconstruct all the high-resolution images. See Fig. 3 for qualitative results.
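Table 1: Accuracy of the identity and gender classifiers for the ablation experiments.

| Ablation | Identity accuracy | Gender accuracy |
|---|---|---|
| High-resolution test images | 95.25% | 99.53% |
| (d) + fine-tuned | 16.99% | 93.92% |
| (e) + fine-tuned | 20.96% | 95.03% |
| (f) + fixed | 11.54% | 92.44% |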
Comparisons to Other Super-Resolution Methods. We compare to state-of-the-art super-resolution methods in Table 2 and Fig. 4. The standard metrics PSNR and SSIM along with the accuracies of the identity and gender classifiers are reported for super-resolved images of our test set. Note that most other methods were not trained on the extreme super-resolution factor of 16×, but rather on smaller factors. Unsurprisingly, the methods using a smaller factor perform much better than our model in terms of PSNR and SSIM. Notice that although LapSRN trained on 16× super-resolution performs better in terms of PSNR and SSIM than our method, the quality of the recovered image is clearly worse (see Fig. 4). This difference in quality is instead revealed by evaluating the gender and identity classification accuracies of the restored images. This suggests that while PSNR and SSIM may be suitable metrics to evaluate reconstructions with small super-resolution factors, they may not be suitable to assess the reconstructions in more extreme cases such as with a factor of 16×.
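Table 2: Comparison to other super-resolution methods on our test set.

| Method | PSNR | SSIM | Identity accuracy | Gender accuracy |
|---|---|---|---|---|
| SRGAN (Ledig et al. (2017)) | 26.21 | 0.85 | 52.95% | 97.01% |
| VDSR (Kim et al. (2016)) | 25.47 | 0.85 | 89.61% | 99.02% |
| SRFeat (Park et al. (2018)) | 27.02 | 0.83 | 93.40% | 99.27% |
| LapSRN (Lai et al. (2017)) | 31.99 | 0.91 | 93.83% | 99.38% |
| LapSRN (Lai et al. (2017)) | 22.75 | 0.64 | 5.27% | 83.27% |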
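For reference, the PSNR values in Table 2 follow the standard definition, 10·log10(MAX² / MSE). The sketch below implements that definition on plain nested lists; it is not the evaluation code used for the table, and the 8-bit peak value of 255 and the toy images are assumptions.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two images of identical size."""
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def psnr(a: Image, b: Image, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)


reference = [[0.0, 0.0], [0.0, 0.0]]
restored = [[25.5, 25.5], [25.5, 25.5]]  # every pixel off by 10% of the peak
value = psnr(reference, restored)
```

Because PSNR only measures per-pixel error, a blurry but conservative reconstruction can score well even when the identity is lost, which is consistent with the discrepancy discussed above.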
Editing by Mixing Audio Sources. Our model allows us to influence the high-resolution output by interchanging the audio track used in the fusion. To demonstrate this capability we show examples where we mix a fixed low-resolution input with several different audio sources in Fig. 5. We also use this mixing modality to show quantitatively that audio is used in the generation of the high-resolution image. We generate high-resolution images by taking low-resolution images and audios from videos of persons of different gender. We then classify the gender of these high-resolution images and find that in % of the cases the predicted gender matched that of the audio source.
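The quantitative check described above reduces to counting how often the gender predicted for a mixed output agrees with the gender of the audio source. A minimal sketch follows, with hypothetical labels standing in for the output of the gender classifier:

```python
from typing import List


def audio_match_rate(predicted: List[str], audio_gender: List[str]) -> float:
    """Fraction of mixed samples whose predicted gender equals the audio source's."""
    if len(predicted) != len(audio_gender):
        raise ValueError("one prediction is needed per audio label")
    if not predicted:
        raise ValueError("at least one sample is required")
    matches = sum(p == a for p, a in zip(predicted, audio_gender))
    return matches / len(predicted)


# Hypothetical classifier outputs for four mixed image/audio pairs.
rate = audio_match_rate(["f", "m", "m", "f"], ["f", "f", "m", "f"])
```

Since each pair combines an image and an audio track from persons of different gender, a match with the audio gender implies that the output disagrees with the gender of the low-resolution source.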
We have introduced a new paradigm for face super-resolution, where audio also contributes to the restoration of missing details in the low-resolution input image. We have described the design of a neural network and the corresponding training procedure to successfully make use of the audio signal despite the difficulty of extracting visual information from it. We have also shown quantitatively that audio can contribute to improving the accuracy of the identity as well as the gender of the restored face. Moreover, we have shown that it is possible to mix low-resolution images and audio tracks from different videos and obtain semantically meaningful high-resolution images.
- Ahn et al. (2018). Fast, accurate, and lightweight super-resolution with cascading residual network. In The European Conference on Computer Vision (ECCV).
- Arandjelovic and Zisserman (2018). Objects that sound. In The European Conference on Computer Vision (ECCV).
- Bulat and Tzimiropoulos (2018). Super-FAN: integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Bulat et al. (2018). To learn image super-resolution, use a GAN to learn how to do image degradation first. In The European Conference on Computer Vision (ECCV).
- Chen et al. (2018). FSRNet: end-to-end learning face super-resolution with facial priors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Chung et al. (2018). VoxCeleb2: deep speaker recognition. In INTERSPEECH.
- Ephrat et al. (2018). Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619.
- Gao et al. (2018). Learning to separate object sounds by watching unlabeled video. In The European Conference on Computer Vision (ECCV).
- Han et al. (2018). Image super-resolution via dual-state recurrent networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Haris et al. (2018). Deep back-projection networks for super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Hui et al. (2018). Fast and accurate single image super-resolution via information distillation network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Karras et al. (2018). A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948.
- Kim et al. (2016). Accurate image super-resolution using very deep convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lai et al. (2017). Deep Laplacian pyramid networks for fast and accurate super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Ledig et al. (2017). Photo-realistic single image super-resolution using a generative adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Li et al. (2018). Multi-scale residual network for image super-resolution. In The European Conference on Computer Vision (ECCV).
- Owens and Efros (2018). Audio-visual scene analysis with self-supervised multisensory features. In The European Conference on Computer Vision (ECCV).
- Park et al. (2018). SRFeat: single image super-resolution with feature discrimination. In The European Conference on Computer Vision (ECCV).
- Parkhi et al. (2015). Deep face recognition. In British Machine Vision Conference.
- Shlizerman et al. (2018). Audio to body dynamics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Smith et al. (2016). Matching novel face and voice identity using static and dynamic facial images. Attention, Perception, & Psychophysics 78 (3), pp. 868–879.
- Song et al. (2018). Talking face generation by conditional recurrent adversarial network. arXiv preprint arXiv:1804.04786.
- Sterling et al. (2018). ISNN: impact sound neural network for audio-visual object classification. In The European Conference on Computer Vision (ECCV).
- Tai et al. (2017). Image super-resolution via deep recursive residual network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Tian et al. (2018). Audio-visual event localization in unconstrained videos. In The European Conference on Computer Vision (ECCV).
- Wang et al. (2018). Recovering realistic texture in image super-resolution by deep spatial feature transform. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Xu et al. (2017). Learning to super-resolve blurry face and text images. In The IEEE International Conference on Computer Vision (ICCV).
- Yu et al. (2018). Face super-resolution guided by facial component heatmaps. In The European Conference on Computer Vision (ECCV).
- Yu et al. (2018). Super-resolving very low-resolution face images with supplementary attributes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Zhang et al. (2018a). Super-identity convolutional neural network for face hallucination. In The European Conference on Computer Vision (ECCV).
- Zhang et al. (2018b). Image super-resolution using very deep residual channel attention networks. In The European Conference on Computer Vision (ECCV).
- Zhang et al. (2018). Residual dense network for image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Zhao et al. (2018). The sound of pixels. In The European Conference on Computer Vision (ECCV).
- Zhou et al. (2018). Visual to sound: generating natural sound for videos in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Zhu et al. (2018). High-resolution talking face generation via mutual information approximation. arXiv preprint arXiv:1812.06589.