Towards Real-World Blind Face Restoration with Generative Facial Prior

Abstract

Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric prior while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN that leverages rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via novel channel-split spatial feature transform layers, which allow our method to achieve a good balance of realness and fidelity. Thanks to the powerful generative facial prior and delicate designs, our GFP-GAN could jointly restore facial details and enhance colors with just a single forward pass, while GAN inversion methods require expensive image-specific optimization at inference. Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.

1 Introduction

Blind face restoration aims at recovering high-quality faces from low-quality counterparts suffering from unknown degradation, such as low resolution [14, 50, 8], noise [72], blur [40, 60], compression artifacts [13], \etc. When applied to real-world scenarios, it becomes more challenging due to more complicated degradation and diverse poses and expressions. Previous works [8, 70, 6] typically exploit face-specific priors in face restoration, such as facial landmarks [8], parsing maps [6, 8] and facial component heatmaps [70], and show that these geometric facial priors are pivotal to recovering accurate face shape and details. However, those priors are usually estimated from input images and inevitably degrade with very low-quality inputs in the real world. In addition, despite their semantic guidance, the above priors contain limited texture information for restoring facial details (\eg, eye pupils).

Another category of approaches investigates reference priors, \ie, high-quality guided faces [48, 47, 12] or facial component dictionaries [46], to generate realistic results and alleviate the dependency on degraded inputs. However, the inaccessibility of high-resolution references limits their practical applicability, while the fixed capacity of dictionaries restricts the diversity and richness of facial details.

In this study, we leverage the Generative Facial Prior (GFP) for real-world blind face restoration, \ie, the prior implicitly encapsulated in pretrained face Generative Adversarial Network (GAN) [19] models such as StyleGAN [36, 37]. These face GANs are capable of generating faithful faces with a high degree of variability, thereby providing rich and diverse priors such as geometry, facial textures and colors, making it possible to jointly restore facial details and enhance colors. However, it is challenging to incorporate such generative priors into the restoration process. Previous attempts typically use GAN inversion [20, 56, 54]. They first ‘invert’ the degraded image back to a latent code of the pretrained GAN, and then conduct expensive image-specific optimization to reconstruct the image. Despite visually realistic outputs, they usually produce images with low fidelity, as the low-dimensional latent codes are insufficient to guide accurate restoration.

To address these challenges, we propose a novel GFP-GAN with delicate designs to achieve a good balance of realness and fidelity in a single forward pass. Specifically, GFP-GAN consists of a degradation removal module and a pretrained face GAN as the facial prior. They are connected by a direct latent code mapping and several Channel-Split Spatial Feature Transform (CS-SFT) layers in a coarse-to-fine manner. The proposed CS-SFT layers perform spatial modulation on a split of the features and let the remaining features pass through directly for better information preservation, allowing our method to effectively incorporate the generative prior while retaining high fidelity. Besides, we introduce a facial component loss with local discriminators to further enhance perceptual facial details, and employ an identity preserving loss to further improve fidelity.

We summarize the contributions as follows. (1) We leverage rich and diverse generative facial priors for blind face restoration. These priors contain sufficient facial textures and vivid color information, allowing us to jointly perform face restoration and color enhancement. (2) We propose a novel GFP-GAN framework with delicate designs of architectures and losses to incorporate the generative facial prior. Our GFP-GAN with CS-SFT layers achieves a good balance of realness and fidelity in a single forward pass. (3) Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.

2 Related Work

Image Restoration typically includes super-resolution [14, 50, 62, 51, 75, 10, 23, 52], denoising [72, 44, 27], deblurring [67, 40, 60] and compression artifact removal [13, 22]. To achieve visually pleasing results, generative adversarial networks [19] are usually employed as loss supervision to push the solutions closer to the natural manifold [43, 59, 66, 7, 15], while our work attempts to leverage pretrained GANs as generative facial priors (GFP).

Face Restoration. Based on general face hallucination [5, 31, 68, 71], two typical kinds of face-specific priors, geometry priors and reference priors, are incorporated to further improve the performance. The geometry priors include facial landmarks [8, 38, 78], face parsing maps [60, 6, 8] and facial component heatmaps [70]. However, 1) those priors require estimation from low-quality inputs and inevitably degrade in real-world scenarios; 2) they mainly focus on geometry constraints and may not contain adequate details for restoration. In contrast, our employed GFP does not involve an explicit geometry estimation from degraded images, and it contains adequate textures inside its pretrained network.

Reference priors [48, 47, 12] usually rely on reference images of the same identity. To overcome this issue, DFDNet [46] constructs a face dictionary for each component (\eg, eyes, mouth) with CNN features to guide the restoration. However, DFDNet mainly focuses on components in the dictionary and thus degrades in regions beyond its dictionary scope (\eg, hair, ears and face contour); in contrast, our GFP-GAN treats the face as a whole during restoration. Moreover, the limited size of the dictionary restricts its diversity and richness, while the GFP provides rich and diverse priors including geometry, textures and colors.

Generative Priors of pretrained GANs [35, 36, 37, 3] were previously exploited by GAN inversion [1, 77, 56, 20], whose primary aim is to find the closest latent codes given an input image. PULSE [54] iteratively optimizes the latent code of StyleGAN [36] until the distance between outputs and inputs falls below a threshold. mGANprior [20] attempts to optimize multiple codes to improve the reconstruction quality. However, these methods usually produce images with low fidelity, as the low-dimensional latent codes are insufficient to guide the restoration. In contrast, our proposed CS-SFT modulation layers incorporate the prior on multi-resolution spatial features to achieve high fidelity. Besides, GFP-GAN requires no expensive iterative optimization during inference.

Figure 1: Overview of our GFP-GAN framework. It consists of a degradation removal module and a pretrained face GAN as facial prior. They are bridged by a latent code mapping and several Channel-Split Spatial Feature Transform (CS-SFT) layers. A good balance of realness and fidelity is realized by the proposed CS-SFT modulations. During training, we employ 1) Pyramid restoration guidance to remove complex degradation in the real world, 2) Facial component loss with discriminators to enhance facial details, and 3) identity preserving loss to retain face identity.

Channel Split Operation is usually explored to design compact models and improve model representation ability. MobileNet [29] proposes depthwise convolutions, and GhostNet [24] splits the convolutional layer into two parts and uses fewer filters to generate intrinsic feature maps. The dual-path architecture in DPN [9] enables feature re-usage and new feature exploration for each path, thus improving its representation ability. A similar idea is also employed in super-resolution [76]. Our CS-SFT layers share a similar spirit, but with different operations and purposes. We adopt spatial feature transform [65, 57] on one split and leave the other split as an identity branch to achieve a good balance of realness and fidelity.

Local Component Discriminators. Local discriminators are proposed to focus on local patch distributions [33, 49, 64]. When applied to faces, such discriminative losses are imposed on separate semantic facial regions [45, 21]. Our facial component loss also adopts such designs, but with a further style supervision based on the learned discriminative features, which proves effective in restoring facial details.

3 Methodology

3.1 Overview of GFP-GAN

In this section, we describe our proposed GFP-GAN framework that leverages the Generative Facial Prior (GFP) encapsulated in pretrained generative adversarial networks for blind face restoration. Given an input facial image $x$ suffering from unknown degradation, the aim of blind face restoration is to estimate a high-quality image $\hat{y}$ that is as similar as possible to the ground-truth image $y$, in terms of both realness and fidelity.

The overall framework of GFP-GAN is depicted in Fig. 1. GFP-GAN is comprised of a degradation removal module and a pretrained face GAN (such as StyleGAN2 [37]) as the facial prior. They are bridged by a latent code mapping and several Channel-Split Spatial Feature Transform (CS-SFT) layers. Specifically, the degradation removal module is designed to remove the complicated degradation in real-world inputs and to extract a ‘clean’ latent feature $F_{latent}$ as well as multi-resolution spatial features $F_{spatial}$ for subsequent operations (see Sec. 3.2). After that, $F_{latent}$ is mapped to intermediate latent codes $\mathcal{W}$, coarsely ‘retrieving’ the features of a close face, denoted by $F_{GAN}$, in the learned face GAN distribution (see Sec. 3.3). The multi-resolution features $F_{spatial}$ are then used to spatially modulate the face GAN features $F_{GAN}$ with the proposed CS-SFT layers in a coarse-to-fine manner, achieving realistic results while preserving high fidelity (see Sec. 3.4).
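To make the data flow concrete, the sketch below traces a single forward pass in PyTorch-style code. The module interfaces (a degradation removal network returning both $F_{latent}$ and a pyramid of $F_{spatial}$, and a face GAN exposing per-resolution features) are illustrative assumptions, not the exact implementation.

```python
import torch

def gfp_gan_forward(x, degradation_removal, latent_mlp, face_gan, cs_sft_layers, to_image):
    """Illustrative single-pass restoration; all module interfaces are assumptions.

    x: degraded input image tensor of shape (B, 3, H, W).
    """
    # 1) Degradation removal: clean latent feature + multi-resolution spatial features.
    f_latent, f_spatial_pyramid = degradation_removal(x)

    # 2) Latent code mapping: coarsely 'retrieve' a close face in the GAN's latent space.
    w_codes = latent_mlp(f_latent.flatten(1))

    # 3) Prior features from the pretrained face GAN, fused coarse-to-fine via CS-SFT.
    fused = None
    for gan_feat, spatial_feat, cs_sft in zip(face_gan(w_codes),   # assumed: per-resolution features
                                              f_spatial_pyramid,
                                              cs_sft_layers):
        fused = cs_sft(gan_feat, spatial_feat)   # modulate one split, pass the other through

    # 4) The finest fused feature is converted to the restored RGB image.
    return to_image(fused)
```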

During training, in addition to the global discriminative loss, we introduce a facial component loss with discriminators to enhance the perceptually significant face components, \ie, eyes and mouth. In order to retain identity, we also employ an identity preserving guidance. More details are provided in Sec. 3.5.

3.2 Degradation Removal Module

In real-world scenarios, blind face restoration becomes more challenging due to more complicated and more severe degradation, which is typically a mixture of low resolution, blur, noise and JPEG compression artifacts. Different from previous approaches [70, 6, 47, 46] that usually learn to incorporate facial priors and eliminate degradation simultaneously, our degradation removal module is designed to explicitly remove troublesome corruptions and extract ‘clean’ features with intermediate guidance, thus alleviating the burden of subsequent modules.

The degradation removal module adopts a UNet [58] structure to 1) increase the receptive field for large-blur elimination, and 2) generate multi-resolution features for subsequent operations. We also employ pyramid restoration guidance [42] as intermediate supervision [62]. The pyramid guidance at different levels not only strengthens the restoration ability, but also provides ‘clean’ features at each resolution, which are required by the subsequent coarse-to-fine prior incorporation, as sketched below.
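For illustration, the following is a minimal PyTorch sketch of such a UNet-style module with pyramid ‘toRGB’ outputs for intermediate supervision. The depth (two scales instead of seven), channel widths, and the exact form of the guidance loss are simplifying assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDegradationRemoval(nn.Module):
    """A minimal UNet-style sketch with pyramid 'toRGB' outputs for intermediate guidance."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 3, stride=2, padding=1)        # 1/2 resolution
        self.enc2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)   # 1/4 resolution
        self.dec2 = nn.Conv2d(ch * 2, ch, 3, padding=1)
        self.dec1 = nn.Conv2d(ch * 2, ch, 3, padding=1)
        # One toRGB head per resolution provides the pyramid restoration guidance.
        self.to_rgb = nn.ModuleList([nn.Conv2d(ch, 3, 1), nn.Conv2d(ch, 3, 1)])

    def forward(self, x):
        e1 = F.leaky_relu(self.enc1(x), 0.2)
        e2 = F.leaky_relu(self.enc2(e1), 0.2)
        d2 = F.leaky_relu(self.dec2(e2), 0.2)                       # coarse 'clean' feature
        d2_up = F.interpolate(d2, scale_factor=2, mode='bilinear', align_corners=False)
        d1 = F.leaky_relu(self.dec1(torch.cat([d2_up, e1], dim=1)), 0.2)
        spatial_feats = [d2, d1]                                    # multi-resolution F_spatial
        pyramid_rgbs = [head(f) for head, f in zip(self.to_rgb, spatial_feats)]
        return spatial_feats, pyramid_rgbs

def pyramid_guidance_loss(pyramid_rgbs, gt):
    """L1 between each pyramid output and the ground truth resized to the same scale."""
    loss = 0.0
    for rgb in pyramid_rgbs:
        gt_small = F.interpolate(gt, size=rgb.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + F.l1_loss(rgb, gt_small)
    return loss
```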

3.3 Generative Facial Prior and Latent Code Mapping

Current face GANs [35, 36, 37] are able to generate realistic and vivid human faces with a high variability of identity, pose and expression. We leverage such pretrained face GANs that capture real-life face distributions to provide diverse and rich facial prior, namely generative facial prior, for our task. However, such generative priors could not be directly integrated into the restoration process since the priors are implicitly encapsulated in GANs by mapping random latent codes to real faces.

A typical way of deploying generative priors is to map the input image to its closest latent codes [1, 77, 56, 20]. These methods require time-consuming iterative optimization to preserve fidelity. By contrast, we adopt latent code mapping merely as a coarse mapping with one feed-forward pass, and leave the fidelity burden to the subsequent channel-split spatial feature transform layers. We follow the practice in [77] that inverts input images to intermediate latent codes $\mathcal{W}$ (\ie, the intermediate space transformed from $\mathcal{Z}$ with several multi-layer perceptron layers), which better preserves semantic properties. The latent codes are used to ‘retrieve’ the features of a close face in the learned face GAN distribution. After this operation, we obtain the GAN features $F_{GAN}$ capturing generative facial priors.
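A minimal sketch of this coarse mapping is given below; the feature shape, style dimension, and number of latent codes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class LatentMapping(nn.Module):
    """Maps the deepest 'clean' feature to one intermediate latent code per GAN layer."""
    def __init__(self, in_dim=512 * 4 * 4, style_dim=512, num_latents=14):
        super().__init__()
        self.style_dim, self.num_latents = style_dim, num_latents
        self.fc = nn.Linear(in_dim, style_dim * num_latents)

    def forward(self, f_latent):                    # f_latent: (B, 512, 4, 4), assumed shape
        w = self.fc(f_latent.flatten(1))
        return w.view(-1, self.num_latents, self.style_dim)

codes = LatentMapping()(torch.randn(2, 512, 4, 4))
print(codes.shape)  # torch.Size([2, 14, 512])
```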

Discussion: Joint Restoration and Color Enhancement. Generative models capture diverse and rich priors beyond realistic details and vivid textures. For instance, they also encapsulate color priors, which can be employed in our task for joint face restoration and color enhancement. Specifically, real-world face images, \eg, old photos, usually have black-and-white, vintage yellow or dim colors. The lively color priors within the generative facial prior allow us to perform color enhancement, including colorization [74]. We believe the generative facial prior also incorporates conventional geometric priors [8, 70] and even 3D priors [17] for restoration and manipulation, which we leave as future work.

3.4 Channel-Split Spatial Feature Transform

Given the prior features $F_{GAN}$ from the pretrained GAN and the extracted ‘clean’ features $F_{spatial}$ from the input image, our aim is to effectively incorporate these two kinds of features while preserving realness and fidelity.

One effective approach is spatial feature transform [65], which generates affine transformation parameters for spatial-wise feature modulation, and has shown its effectiveness in incorporating other conditions in image restoration [65, 46] and image generation [57]. In our task, a pair of affine transformation parameters $(\alpha, \beta)$ is first learned from the input features $F_{spatial}$ by several convolutional layers. After that, the modulation is carried out by scaling and shifting the prior features $F_{GAN}$:

$F_{output} = \text{SFT}(F_{GAN} \mid \alpha, \beta) = \alpha \odot F_{GAN} + \beta,$   (1)

Despite its effectiveness in incorporating the input face information $F_{spatial}$, SFT could not achieve a good balance of realness and fidelity, since all the prior features $F_{GAN}$ (contributing to realness) are affected by the input features (contributing to fidelity). In particular, with extremely low-quality images as inputs, the ‘blurry’ input features impose their influence on the modulations, and the outputs inevitably bias toward ‘blurry’ results without realistic and rich facial details.

To address this problem, we propose channel-split spatial feature transform (CS-SFT) layers, which perform spatial modulation on part of the features and let the remaining features pass through directly for better information preservation, as shown in Fig. 1. Mathematically, we have:

$F_{output} = \text{CS-SFT}(F_{GAN} \mid \alpha, \beta) = \text{Concat}\left[\text{Identity}(F^{split_0}_{GAN}),\; \alpha \odot F^{split_1}_{GAN} + \beta\right],$   (2)

where $F^{split_0}_{GAN}$ and $F^{split_1}_{GAN}$ are features split from $F_{GAN}$ in the channel dimension, and $\text{Concat}[\cdot]$ denotes the concatenation operation.

As a result, CS-SFT enjoys the benefits of directly incorporating prior information and of effective modulation by the input images, thereby achieving a good balance between texture faithfulness and fidelity. CS-SFT shares a similar spirit with DPN [9], whose dual-path architecture enables feature re-usage and new feature exploration for each path, thus improving its representation ability.

Besides the performance gains, CS-SFT also reduces complexity, as it requires fewer channels for modulation, similar to GhostNet [24]. A minimal sketch of a CS-SFT layer is given below.
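The sketch assumes an even 50/50 channel split and a two-convolution condition branch that predicts $\alpha$ and $\beta$; these are illustrative choices for Eq. (2), not the exact architecture.

```python
import torch
import torch.nn as nn

class CSSFT(nn.Module):
    """A minimal sketch of a Channel-Split Spatial Feature Transform layer.

    Half of the GAN prior channels pass through unchanged (identity branch); the other
    half is spatially modulated by (alpha, beta) predicted from the 'clean' input features.
    """
    def __init__(self, gan_channels, cond_channels):
        super().__init__()
        mod_channels = gan_channels // 2
        self.condition = nn.Sequential(
            nn.Conv2d(cond_channels, mod_channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(mod_channels, mod_channels * 2, 3, padding=1),  # predicts alpha and beta
        )

    def forward(self, f_gan, f_spatial):
        identity, to_modulate = torch.chunk(f_gan, 2, dim=1)   # channel split
        alpha, beta = torch.chunk(self.condition(f_spatial), 2, dim=1)
        modulated = alpha * to_modulate + beta                 # SFT: scale and shift
        return torch.cat([identity, modulated], dim=1)         # preserve + incorporate

# Usage: fuse a 64-channel GAN feature with a 32-channel clean feature at the same resolution.
layer = CSSFT(gan_channels=64, cond_channels=32)
fused = layer(torch.randn(1, 64, 32, 32), torch.randn(1, 32, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```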

3.5 Model Objectives

The learning objective for training our GFP-GAN consists of: 1) a reconstruction loss that constrains the output $\hat{y}$ to be close to the ground truth $y$, 2) an adversarial loss for restoring realistic textures, 3) the proposed facial component loss to further enhance facial details, and 4) an identity preserving loss.

Reconstruction Loss. We adopt the widely-used L1 loss and perceptual loss [34, 43] as our reconstruction loss $\mathcal{L}_{rec}$, defined as follows:

$\mathcal{L}_{rec} = \lambda_{l1} \lVert \hat{y} - y \rVert_1 + \lambda_{per} \lVert \phi(\hat{y}) - \phi(y) \rVert_1,$   (3)

where $\phi$ is the pretrained VGG-19 network [61] and we use the feature maps before activation [66]. $\lambda_{l1}$ and $\lambda_{per}$ denote the loss weights of the L1 and perceptual loss, respectively.
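A minimal sketch of Eq. (3) is shown below; the single truncated VGG-19 feature extractor and the unit loss weights are illustrative assumptions rather than the paper's exact settings, and inputs are assumed to be normalized RGB tensors.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 feature extractor truncated before the last activation (torchvision >= 0.13).
_vgg = vgg19(weights="IMAGENET1K_V1").features[:35].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def reconstruction_loss(restored, gt, lambda_l1=1.0, lambda_per=1.0):
    """L1 pixel loss plus an L1 penalty on pretrained VGG-19 features (weights illustrative)."""
    pixel = F.l1_loss(restored, gt)
    perceptual = F.l1_loss(_vgg(restored), _vgg(gt))
    return lambda_l1 * pixel + lambda_per * perceptual
```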

Adversarial Loss. We employ an adversarial loss $\mathcal{L}_{adv}$ to encourage the GFP-GAN to favor solutions in the natural image manifold and generate realistic textures. Similar to StyleGAN2 [37], the logistic loss [19] is adopted:

$\mathcal{L}_{adv} = -\lambda_{adv}\, \mathbb{E}_{\hat{y}} \operatorname{softplus}\big(D(\hat{y})\big),$   (4)

where $D$ denotes the discriminator and $\lambda_{adv}$ represents the adversarial loss weight.
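A direct sketch of Eq. (4) as a generator-side term is given below; the weight is a placeholder.

```python
import torch.nn.functional as F

def adversarial_loss(disc_logits_on_restored, lambda_adv=1.0):
    """Logistic (softplus) adversarial term on the global discriminator's logits for
    restored images, as in Eq. (4); the weight lambda_adv is illustrative."""
    return -lambda_adv * F.softplus(disc_logits_on_restored).mean()
```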

Facial Component Loss. In order to further enhance the perceptually significant face components, we introduce a facial component loss with local discriminators for the left eye, right eye and mouth. As shown in Fig. 1, we first crop the regions of interest with ROI align [25]. For each region, we train a separate, small local discriminator to distinguish whether the restored patches are real, pushing the patches closer to the natural facial component distributions.

Inspired by [64], we further incorporate a feature style loss based on the learned discriminators. Different from the previous feature matching loss with spatial-wise constraints [64], our feature style loss attempts to match the Gram matrix statistics [16] of real and restored patches. The Gram matrix calculates feature correlations and usually captures texture information effectively [18]. We extract features from multiple layers of the learned local discriminators and learn to match the Gram statistics of these intermediate representations between the real and restored patches. Empirically, we find that the feature style loss performs better than the previous feature matching loss in generating realistic facial details and reducing unpleasant artifacts.

The facial component loss is defined as follows. The first term is the discriminative loss [19] and the second term is the feature style loss:

$\mathcal{L}_{comp} = \sum_{ROI} \lambda_{local}\, \mathbb{E}_{\hat{y}_{ROI}} \left[\log\big(1 - D_{ROI}(\hat{y}_{ROI})\big)\right] + \lambda_{fs} \left\lVert \text{Gram}\big(\psi(\hat{y}_{ROI})\big) - \text{Gram}\big(\psi(y_{ROI})\big)\right\rVert_1,$   (5)

where $ROI$ is a region of interest from the component collection {left eye, right eye, mouth}. $D_{ROI}$ is the local discriminator for each region. $\psi$ denotes the multi-resolution features extracted from the learned discriminators. $\lambda_{local}$ and $\lambda_{fs}$ represent the loss weights of the local discriminative loss and the feature style loss, respectively.
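The sketch below illustrates Eq. (5) for a single region. It assumes the local discriminator returns both logits and a list of intermediate features, and uses placeholder loss weights; these interface details are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Channel-wise feature correlations, normalized by channels x spatial positions."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def facial_component_loss(restored_patch, real_patch, local_disc,
                          lambda_local=1.0, lambda_fs=1.0):
    """A sketch of Eq. (5) for one ROI (e.g., the mouth); weights are placeholders."""
    fake_logits, fake_feats = local_disc(restored_patch)
    _, real_feats = local_disc(real_patch)

    # Generator-side discriminative term: log(1 - D(restored)) = -softplus(logits)
    # when the discriminator outputs a probability through a sigmoid.
    adv = -F.softplus(fake_logits).mean()

    # Feature style term: match Gram statistics of multi-level discriminator features.
    style = sum(F.l1_loss(gram_matrix(ff), gram_matrix(rf.detach()))
                for ff, rf in zip(fake_feats, real_feats))

    return lambda_local * adv + lambda_fs * style
```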

Identity Preserving Loss. We draw inspiration from [32] and apply an identity preserving loss in our model. Similar to the perceptual loss [34], we define the loss based on the feature embedding of an input face. Specifically, we adopt the pretrained face recognition model ArcFace [11], which captures the most prominent features for identity discrimination. The identity preserving loss enforces the restored result to have a small distance from the ground truth in the compact deep feature space:

$\mathcal{L}_{id} = \lambda_{id} \lVert \eta(\hat{y}) - \eta(y) \rVert_1,$   (6)

where $\eta$ represents the face feature extractor, \ie, ArcFace [11] in our implementation. $\lambda_{id}$ denotes the weight of the identity preserving loss.
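A small sketch of Eq. (6) follows; the `arcface` argument is assumed to be a frozen, pretrained ArcFace embedding model, and the weight is a placeholder.

```python
import torch
import torch.nn.functional as F

def identity_preserving_loss(restored, gt, arcface, lambda_id=1.0):
    """Distance between identity embeddings of the restored image and the ground truth."""
    with torch.no_grad():
        gt_emb = arcface(gt)
    restored_emb = arcface(restored)   # gradients flow through the restored image only
    return lambda_id * F.l1_loss(restored_emb, gt_emb)
```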

The overall model objective is a combination of the above losses:

$\mathcal{L}_{total} = \mathcal{L}_{rec} + \mathcal{L}_{adv} + \mathcal{L}_{comp} + \mathcal{L}_{id}.$   (7)

The loss hyper-parameters $\lambda_{l1}$, $\lambda_{per}$, $\lambda_{adv}$, $\lambda_{local}$, $\lambda_{fs}$ and $\lambda_{id}$ are chosen to balance the contributions of the corresponding terms.

4 Experiments

4.1 Datasets and Implementation

Training Datasets. We train our GFP-GAN on the FFHQ dataset [36], which consists of 70,000 high-quality images. We resize all images to 512×512 during training.

Our GFP-GAN is trained on synthetic data that approximates real low-quality images, and it generalizes to real-world images during inference. We follow the practice in [48, 46] and adopt the following degradation model to synthesize training pairs from a high-quality image $y$:

$x = \big[(y \circledast k_{\sigma})\downarrow_{r} + n_{\delta}\big]_{\text{JPEG}_{q}},$   (8)

The high-quality image $y$ is first convolved with a Gaussian blur kernel $k_{\sigma}$, followed by a downsampling operation with scale factor $r$. After that, additive white Gaussian noise $n_{\delta}$ is added, and finally the image is compressed by JPEG with quality factor $q$. Similar to [46], for each training pair we randomly sample $\sigma$, $r$, $\delta$ and $q$ from their respective ranges.
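The following is a minimal sketch of the degradation pipeline in Eq. (8). Isotropic Gaussian blur, simple decimation for downsampling, and the example parameter values are simplifying assumptions for illustration.

```python
import io
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter

def synthesize_lq(hq, sigma, r, delta, q):
    """Gaussian blur -> r-times downsampling -> additive white Gaussian noise -> JPEG.

    hq: uint8 RGB array of shape (H, W, 3).
    """
    blurred = gaussian_filter(hq.astype(np.float32), sigma=(sigma, sigma, 0))
    small = blurred[::r, ::r]                                     # downsample by factor r
    noisy = np.clip(small + np.random.normal(0.0, delta, small.shape), 0, 255).astype(np.uint8)
    buf = io.BytesIO()
    Image.fromarray(noisy).save(buf, format='JPEG', quality=q)    # JPEG compression artifacts
    buf.seek(0)
    return np.asarray(Image.open(buf))

# Example: degrade a random 512x512 'face' with mid-range parameters (values illustrative).
lq = synthesize_lq(np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8),
                   sigma=3.0, r=4, delta=10.0, q=70)
```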

Figure 2: Qualitative comparison on the CelebA-Test for blind face restoration. Our GFP-GAN produces faithful details in eyes, mouth and hair. Zoom in for best view.

Testing Datasets. We construct one synthetic dataset and three different real datasets with distinct sources. All these datasets have no overlap with our training dataset. We provide a brief introduction here.

CelebA-Test is a synthetic dataset with 3,000 CelebA-HQ images from its testing partition [53]. The degradation process is the same as that used during training.

LFW-Test. LFW [30] contains low-quality images in the wild. We collect the first image of each identity in the validation partition, forming 1,711 testing images.

CelebChild-Test contains 180 child faces of celebrities collected from the Internet. They are low-quality and many of them are black-and-white old photos.

WebPhoto-Test. We crawled 188 low-quality photos in real life from the Internet and extracted 407 faces to construct the WebPhoto testing dataset. These photos have diverse and complicated degradation. Some of them are old photos with very severe degradation on both details and color.

Figure 3: Comparison on the CelebA-Test for face super-resolution. Our GFP-GAN restores realistic teeth and faithful eye gaze direction. Zoom in for best view.

Implementation. We adopt the pretrained StyleGAN2 [37] with 512×512 outputs as our generative facial prior. The channel multiplier of StyleGAN2 is set to one for a compact model size. The UNet for degradation removal consists of seven downsampling and seven upsampling stages, each with a residual block [26]. For each CS-SFT layer, we use two convolutional layers to generate the affine parameters $\alpha$ and $\beta$, respectively.

The training mini-batch size is set to 12. We augment the training data with horizontal flips and color jittering. We consider three components, left eye, right eye and mouth, for the facial component loss, as they are perceptually significant. Each component is cropped by ROI align [25] using the face landmarks provided in the original training dataset. We train our model with the Adam optimizer [39], decaying the learning rate by a factor of 2 twice in the late stage of training. We implement our models with the PyTorch framework and train them using four NVIDIA Tesla P40 GPUs.

Methods LPIPS FID NIQE Deg. PSNR SSIM
Input 0.4866 143.98 13.440 47.94 25.35 0.6848
DeblurGANv2* [41] 0.4001 52.69 4.917 39.64 25.91 0.6952
Wan \etal[63] 0.4826 67.58 5.356 43.00 24.71 0.6320
HiFaceGAN [69] 0.4770 66.09 4.916 42.18 24.92 0.6195
DFDNet [46] 0.4341 59.08 4.341 40.31 23.68 0.6622
PSFRGAN [6] 0.4240 47.59 5.123 39.69 24.71 0.6557
mGANprior [20] 0.4584 82.27 6.422 55.45 24.30 0.6758
PULSE [54] 0.4851 67.56 5.305 69.55 21.61 0.6200
GFP-GAN (ours) 0.3646 42.62 4.077 34.60 25.08 0.6777
GT 0 43.43 4.292 0 ∞ 1
Table 1: Quantitative comparison on CelebA-Test for blind face restoration. Red and blue indicate the best and the second best performance. ‘*’ denotes finetuning on our training set. Deg. represents the identity distance.

4.2 Comparisons with State-of-the-art Methods

We compare our GFP-GAN with several state-of-the-art face restoration methods: HiFaceGAN [69], DFDNet [46], PSFRGAN [6], Super-FAN [4] and Wan \etal [63]. GAN inversion methods for face restoration: PULSE [54] and mGANprior [20] are also included for comparison. We also compare our GFP-GAN with image restoration methods: RCAN [75], ESRGAN [66] and DeblurGANv2 [41], and we finetune them on our face training set for fair comparisons. We adopt their official codes except for Super-FAN, for which we use a re-implementation.

For the evaluation, we employ the widely-used non-reference perceptual metrics FID [28] and NIQE [55]. We also adopt pixel-wise metrics (PSNR and SSIM) and the perceptual metric LPIPS [73] for the CelebA-Test with Ground Truth (GT). We measure the identity distance (Deg.) as the angle in the ArcFace [11] feature embedding, where smaller values indicate an identity closer to the GT.
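As a concrete illustration of the identity distance, the sketch below computes the angle between two embeddings; the 512-dimensional ArcFace embeddings are assumed inputs.

```python
import torch
import torch.nn.functional as F

def identity_distance_deg(emb_restored, emb_gt):
    """Angle (in degrees) between identity embeddings; smaller means closer identity."""
    cos = F.cosine_similarity(emb_restored, emb_gt, dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))

# Example with random 512-d embeddings standing in for ArcFace features.
print(identity_distance_deg(torch.randn(4, 512), torch.randn(4, 512)))
```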

Figure 4: Qualitative comparisons on three real-world datasets. Zoom in for best view.

Synthetic CelebA-Test. The comparisons are conducted under two settings: 1) blind face restoration whose inputs and outputs have the same resolution. 2) face super-resolution. Note that our method could take upsampled images as inputs for face super-resolution.

The quantitative results for each setting are shown in Table 1 and Table 2. In both settings, GFP-GAN achieves the lowest LPIPS, indicating that our results are perceptually close to the ground truth. GFP-GAN also obtains the lowest FID and NIQE, showing that the outputs are close to the real face distribution and the natural image distribution, respectively. Besides the perceptual performance, our method also retains identity better, as indicated by the smallest identity distance (Deg.) in the face feature embedding. Note that the pixel-wise metrics PSNR and SSIM correlate poorly with the subjective evaluation of human observers [2, 43], and our model does not perform best on these two metrics.

Methods LPIPS FID NIQE Deg. PSNR SSIM
Bicubic 0.4834 148.87 10.767 49.60 25.377 0.6985
RCAN* [75] 0.4159 93.66 9.907 38.45 27.24 0.7533
ESRGAN* [66] 0.4127 49.20 4.099 51.21 23.74 0.6319
Super-FAN [4] 0.4791 139.49 10.828 49.14 25.28 0.7033
GFP-GAN (ours) 0.3653 42.36 4.078 34.67 25.04 0.6744
GT 0 43.43 4.292 0 ∞ 1
Table 2: Quantitative comparison on CelebA-Test for face super-resolution. Red and blue indicate the best and the second best performance. ‘*’ denotes finetuning on our training set. Deg. represents the identity distance.

Qualitative results are presented in Fig. 2 and Fig. 3. 1) Thanks to the powerful generative facial prior, our GFP-GAN recovers faithful details in the eyes (pupils and eyelashes), teeth, \etc. 2) Our method treats faces as a whole during restoration and can also generate realistic hair, while previous methods that rely on component dictionaries (DFDNet) or parsing maps (PSFRGAN) fail to produce faithful hair textures (2nd row, Fig. 2). 3) GFP-GAN is capable of retaining fidelity, \eg, it produces a naturally closed mouth without the forced addition of teeth as PSFRGAN does (2nd row, Fig. 2). In Fig. 3, GFP-GAN also restores a reasonable eye gaze direction.

Dataset LFW-Test CelebChild WebPhoto
Methods FID NIQE FID NIQE FID NIQE
Input 137.56 11.214 144.42 9.170 170.11 12.755
DeblurGANv2* [41] 57.28 4.309 110.51 4.453 100.58 4.666
Wan \etal[63] 73.19 5.034 115.70 4.849 100.40 5.705
HiFaceGAN [69] 64.50 4.510 113.00 4.855 116.12 4.885
DFDNet [46] 62.57 4.026 111.55 4.414 100.68 5.293
PSFRGAN [6] 51.89 5.096 107.40 4.804 88.45 5.582
mGANprior [20] 73.00 6.051 126.54 6.841 120.75 7.226
PULSE [54] 64.86 5.097 102.74 5.225 86.45 5.146
GFP-GAN (ours) 49.96 3.882 111.78 4.349 87.35 4.144
Table 3: Quantitative comparison on the real-world LFW, CelebChild and WebPhoto datasets. Red and blue indicate the best and the second best performance. ‘*’ denotes finetuning on our training set.

Real-World LFW, CelebChild and WebPhoto-Test. To test the generalization ability, we evaluate our model on three different real-world datasets. The quantitative results are shown in Table 3. Our GFP-GAN achieves superior performance on all three real-world datasets, showing its remarkable generalization capability. Although PULSE [54] can also obtain high perceptual quality (lower FID scores), it cannot retain the face identity, as shown in Fig. 4.

The qualitative comparisons are shown in Fig. 4. GFP-GAN jointly conducts face restoration and color enhancement for real-life photos with the powerful generative prior. Our method produces plausible and realistic faces under complicated real-world degradation, while other methods fail to recover faithful facial details or produce artifacts (especially in WebPhoto-Test in Fig. 4). Besides common facial components like eyes and teeth, GFP-GAN also performs better on hair and ears, as the GFP prior takes the whole face into consideration rather than separate parts. With CS-SFT layers, our model is capable of achieving high fidelity. As shown in the last row of Fig. 4, most previous methods fail to recover the closed eyes, while ours successfully restores them with fewer artifacts.

4.3 Ablation Studies

CS-SFT layers. As shown in Table 4 [configuration a)] and Fig. 5, when we remove the spatial modulation layers, \ie, only keep the latent code mapping without spatial information, the restored faces cannot retain face identity even with the identity-preserving loss (high LPIPS score and large Deg.). Thus, the multi-resolution spatial features used in CS-SFT layers are vital to preserving fidelity. When we switch CS-SFT layers to simple SFT layers [configuration b) in Table 4], we observe that 1) the perceptual quality degrades on all metrics, and 2) it preserves stronger identity (smaller Deg.), as the input image features impose their influence on all the modulated features and the outputs bias toward the degraded inputs, thus leading to lower perceptual quality. By contrast, CS-SFT layers provide a good balance of realness and fidelity by modulating only a split of the features.

Pretrained GAN as GFP. The pretrained GAN provides rich and diverse features for restoration. A performance drop is observed if we do not use the generative facial prior, as shown in Table 4 [configuration c)] and Fig. 5.

Pyramid Restoration Loss. The pyramid restoration loss is employed in the degradation removal module and strengthens the restoration ability for the complicated degradation in the real world. Without this intermediate supervision, the multi-resolution spatial features for subsequent modulations may still contain degradation, resulting in inferior performance, as shown in Table 4 [configuration d)] and Fig. 5.

Facial Component Loss. We compare the results of 1) removing all the facial component losses, 2) only keeping the component discriminators, 3) adding an extra feature matching loss as in [64], and 4) adopting an extra feature style loss based on Gram statistics [16]. As shown in Fig. 6, component discriminators with the feature style loss better capture the eye distribution and restore plausible details.

Configuration LPIPS FID NIQE Deg.
Our GFP-GAN with CS-SFT 0.3646 42.62 4.077 34.60
a) No spatial modulation 0.550 60.44 4.183 74.76
b) Use SFT 0.387 47.65 4.146 34.38
c) w/o GFP 0.379 48.47 4.153 35.04
d) w/o Pyramid Restoration Loss 0.369 45.17 4.284 35.50
Table 4: Ablation study results on CelebA-Test under blind face restoration.
Figure 5: Ablation studies on CS-SFT layers, GFP prior and pyramid restoration loss. Zoom in for best view.
Figure 6: Ablation studies on facial component loss. In the figure, D, fm and fs denote the component discriminator, feature matching loss and feature style loss, respectively. Zoom in for best view.
Figure 7: Limitations of our model. The results of PSFRGAN [6] are also presented.

4.4 Limitations

Though our model trained on synthetic data demonstrates good generalization to real-world images, it still has some limitations. As shown in Fig. 7, when the degradation of real images is extremely severe, the facial details restored by GFP-GAN are twisted with artifacts. Our method also produces unnatural results for very large poses. This is because the synthetic degradation and the training data distribution differ from those of real-world images. One possible remedy is to learn these distributions from real data instead of merely using synthetic data, which we leave as future work.

5 Conclusion

We have proposed the GFP-GAN framework that leverages the rich and diverse generative facial prior for the challenging blind face restoration task. This prior is incorporated into the restoration process with novel channel-split spatial feature transform layers, allowing us to achieve a good balance of realness and fidelity. We also introduce delicate designs such as facial component loss, identity preserving loss and pyramid restoration guidance. Extensive comparisons demonstrate the superior capability of GFP-GAN in joint face restoration and color enhancement for real-world images, outperforming prior art.

References

  1. R. Abdal, Y. Qin and P. Wonka (2019) Image2stylegan: how to embed images into the stylegan latent space?. In ICCV, Cited by: §2, §3.3.
  2. Y. Blau, R. Mechrez, R. Timofte, T. Michaeli and L. Zelnik-Manor (2018) The 2018 pirm challenge on perceptual image super-resolution. In ECCVW, Cited by: §4.2.
  3. A. Brock, J. Donahue and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §2.
  4. A. Bulat and G. Tzimiropoulos (2018) Super-fan: integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans. In CVPR, Cited by: §4.2, Table 2.
  5. Q. Cao, L. Lin, Y. Shi, X. Liang and G. Li (2017) Attention-aware face hallucination via deep reinforcement learning. In CVPR, Cited by: §2.
  6. C. Chen, X. Li, L. Yang, X. Lin, L. Zhang and K. K. Wong (2020) Progressive semantic-aware style transformation for blind face restoration. arXiv:2009.08709. Cited by: §1, §2, §3.2, Figure 7, §4.2, Table 1, Table 3.
  7. J. Chen, J. Chen, H. Chao and M. Yang (2018) Image blind denoising with generative adversarial network based noise modeling. In CVPR, Cited by: §2.
  8. Y. Chen, Y. Tai, X. Liu, C. Shen and J. Yang (2018) Fsrnet: end-to-end learning face super-resolution with facial priors. In CVPR, Cited by: §1, §2, §3.3.
  9. Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan and J. Feng (2017) Dual path networks. In NeurIPS, Cited by: §2, §3.4.
  10. T. Dai, J. Cai, Y. Zhang, S. Xia and L. Zhang (2019) Second-order attention network for single image super-resolution. In CVPR, Cited by: §2.
  11. J. Deng, J. Guo, N. Xue and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In CVPR, Cited by: §3.5, §4.2.
  12. B. Dogan, S. Gu and R. Timofte (2019) Exemplar guided face image super-resolution without facial landmarks. In CVPRW, Cited by: §1, §2.
  13. C. Dong, Y. Deng, C. Change Loy and X. Tang (2015) Compression artifacts reduction by a deep convolutional network. In ICCV, Cited by: §1, §2.
  14. C. Dong, C. C. Loy, K. He and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In ECCV, Cited by: §1, §2.
  15. L. Galteri, L. Seidenari, M. Bertini and A. Del Bimbo (2017) Deep generative adversarial compression artifact removal. In ICCV, Cited by: §2.
  16. L. A. Gatys, A. S. Ecker and M. Bethge (2016) Image style transfer using convolutional neural networks. In CVPR, Cited by: §3.5, §4.3.
  17. B. Gecer, S. Ploumpis, I. Kotsia and S. Zafeiriou (2019) Ganfit: generative adversarial network fitting for high fidelity 3d face reconstruction. In CVPR, Cited by: §3.3.
  18. M. W. Gondal, B. Schölkopf and M. Hirsch (2018) The unreasonable effectiveness of texture transfer for single image super-resolution. In ECCV, Cited by: §3.5.
  19. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §1, §2, §3.5, §3.5.
  20. J. Gu, Y. Shen and B. Zhou (2020) Image processing using multi-code gan prior. In CVPR, Cited by: §1, §2, §3.3, §4.2, Table 1, Table 3.
  21. Q. Gu, G. Wang, M. T. Chiu, Y. Tai and C. Tang (2019) Ladn: local adversarial disentangling network for facial makeup and de-makeup. In ICCV, Cited by: §2.
  22. J. Guo and H. Chao (2016) Building dual-domain representations for compression artifacts reduction. In ECCV, Cited by: §2.
  23. Y. Guo, J. Chen, J. Wang, Q. Chen, J. Cao, Z. Deng, Y. Xu and M. Tan (2020) Closed-loop matters: dual regression networks for single image super-resolution. In CVPR, Cited by: §2.
  24. K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu and C. Xu (2020) GhostNet: more features from cheap operations. In CVPR, Cited by: §2, §3.4.
  25. K. He, G. Gkioxari, P. Dollár and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §3.5, §4.1.
  26. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
  27. M. E. Helou, R. Zhou and S. Süsstrunk (2020) Stochastic frequency masking to improve super-resolution and denoising networks. In ECCV, Cited by: §2.
  28. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, Cited by: §4.2.
  29. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Cited by: §2.
  30. G. B. Huang, M. Ramesh, T. Berg and E. Learned-Miller (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report University of Massachusetts, Amherst. Cited by: §4.1.
  31. H. Huang, R. He, Z. Sun and T. Tan (2017) Wavelet-srnet: a wavelet-based cnn for multi-scale face super resolution. In ICCV, Cited by: §2.
  32. R. Huang, S. Zhang, T. Li and R. He (2017) Beyond face rotation: global and local perception gan for photorealistic and identity preserving frontal view synthesis. In CVPR, Cited by: §3.5.
  33. S. Iizuka, E. Simo-Serra and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–14. Cited by: §2.
  34. J. Johnson, A. Alahi and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, Cited by: §3.5, §3.5.
  35. T. Karras, T. Aila, S. Laine and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR, Cited by: §2, §3.3.
  36. T. Karras, S. Laine and T. Aila (2018) A style-based generator architecture for generative adversarial networks. In CVPR, Cited by: §1, §2, §3.3, §4.1.
  37. T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen and T. Aila (2020) Analyzing and improving the image quality of stylegan. In CVPR, Cited by: §1, §2, §3.1, §3.3, §3.5, §4.1.
  38. D. Kim, M. Kim, G. Kwon and D. Kim (2019) Progressive face super-resolution via attention to facial landmark. In BMVC, Cited by: §2.
  39. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  40. O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin and J. Matas (2018) Deblurgan: blind motion deblurring using conditional adversarial networks. In CVPR, Cited by: §1, §2.
  41. O. Kupyn, T. Martyniuk, J. Wu and Z. Wang (2019) Deblurgan-v2: deblurring (orders-of-magnitude) faster and better. In ICCV, Cited by: §4.2, Table 1, Table 3.
  42. W. Lai, J. Huang, N. Ahuja and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, Cited by: §3.2.
  43. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz and Z. Wang (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, Cited by: §2, §3.5, §4.2.
  44. S. Lefkimmiatis (2017) Non-local color image denoising with convolutional neural networks. In CVPR, Cited by: §2.
  45. T. Li, R. Qian, C. Dong, S. Liu, Q. Yan, W. Zhu and L. Lin (2018) Beautygan: instance-level facial makeup transfer with deep generative adversarial network. In ACM MM, Cited by: §2.
  46. X. Li, C. Chen, S. Zhou, X. Lin, W. Zuo and L. Zhang (2020) Blind face restoration via deep multi-scale component dictionaries. In ECCV, Cited by: §1, §2, §3.2, §3.4, §4.1, §4.1, §4.2, Table 1, Table 3.
  47. X. Li, W. Li, D. Ren, H. Zhang, M. Wang and W. Zuo (2020) Enhanced blind face restoration with multi-exemplar images and adaptive spatial feature fusion. In CVPR, Cited by: §1, §2, §3.2.
  48. X. Li, M. Liu, Y. Ye, W. Zuo, L. Lin and R. Yang (2018) Learning warped guidance for blind face restoration. In ECCV, Cited by: §1, §2, §4.1.
  49. Y. Li, S. Liu, J. Yang and M. Yang (2017) Generative face completion. In CVPR, Cited by: §2.
  50. B. Lim, S. Son, H. Kim, S. Nah and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In CVPRW, Cited by: §1, §2.
  51. D. Liu, B. Wen, Y. Fan, C. C. Loy and T. S. Huang (2018) Non-local recurrent network for image restoration. In NeurIPS, Cited by: §2.
  52. J. Liu, W. Zhang, Y. Tang, J. Tang and G. Wu (2020) Residual feature aggregation network for image super-resolution. In CVPR, Cited by: §2.
  53. Z. Liu, P. Luo, X. Wang and X. Tang (2015) Deep learning face attributes in the wild. In ICCV, Cited by: §4.1.
  54. S. Menon, A. Damian, S. Hu, N. Ravi and C. Rudin (2020) PULSE: self-supervised photo upsampling via latent space exploration of generative models. In CVPR, Cited by: §1, §2, §4.2, §4.2, Table 1, Table 3.
  55. A. Mittal, R. Soundararajan and A. C. Bovik (2012) Making a “completely blind” image quality analyzer. IEEE Signal processing letters 20 (3), pp. 209–212. Cited by: §4.2.
  56. X. Pan, X. Zhan, B. Dai, D. Lin, C. C. Loy and P. Luo (2020) Exploiting deep generative prior for versatile image restoration and manipulation. In ECCV, Cited by: §1, §2, §3.3.
  57. T. Park, M. Liu, T. Wang and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR, Cited by: §2, §3.4.
  58. O. Ronneberger, P. Fischer and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Cited by: §3.2.
  59. M. S. Sajjadi, B. Scholkopf and M. Hirsch (2017) Enhancenet: single image super-resolution through automated texture synthesis. In ECCV, Cited by: §2.
  60. Z. Shen, W. Lai, T. Xu, J. Kautz and M. Yang (2018) Deep semantic face deblurring. In CVPR, Cited by: §1, §2, §2.
  61. K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §3.5.
  62. R. Timofte, E. Agustsson, L. Van Gool, M. Yang and L. Zhang (2017) Ntire 2017 challenge on single image super-resolution: methods and results. In CVPRW, Cited by: §2, §3.2.
  63. Z. Wan, B. Zhang, D. Chen, P. Zhang, D. Chen, J. Liao and F. Wen (2020) Bringing old photos back to life. In CVPR, Cited by: §4.2, Table 1, Table 3.
  64. T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, Cited by: §2, §3.5, §4.3.
  65. X. Wang, K. Yu, C. Dong and C. C. Loy (2018) Recovering realistic texture in image super-resolution by deep spatial feature transform. In CVPR, Cited by: §2, §3.4.
  66. X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao and C. C. Loy (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In ECCVW, Cited by: §2, §3.5, §4.2, Table 2.
  67. L. Xu, J. S. Ren, C. Liu and J. Jia (2014) Deep convolutional neural network for image deconvolution. In NeurIPS, Cited by: §2.
  68. X. Xu, D. Sun, J. Pan, Y. Zhang, H. Pfister and M. Yang (2017) Learning to super-resolve blurry face and text images. In ICCV, Cited by: §2.
  69. L. Yang, C. Liu, P. Wang, S. Wang, P. Ren, S. Ma and W. Gao (2020) HiFaceGAN: face renovation via collaborative suppression and replenishment. In ACM MM, Cited by: §4.2, Table 1, Table 3.
  70. X. Yu, B. Fernando, B. Ghanem, F. Porikli and R. Hartley (2018) Face super-resolution guided by facial component heatmaps. In ECCV, pp. 217–233. Cited by: §1, §2, §3.2, §3.3.
  71. X. Yu, B. Fernando, R. Hartley and F. Porikli (2018) Super-resolving very low-resolution face images with supplementary attributes. In CVPR, Cited by: §2.
  72. K. Zhang, W. Zuo, Y. Chen, D. Meng and L. Zhang (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE TIP 26 (7), pp. 3142–3155. Cited by: §1, §2.
  73. R. Zhang, P. Isola, A. A. Efros, E. Shechtman and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §4.2.
  74. R. Zhang, P. Isola and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §3.3.
  75. Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, Cited by: §2, §4.2, Table 2.
  76. X. Zhao, Y. Zhang, T. Zhang and X. Zou (2019) Channel splitting network for single mr image super-resolution. IEEE Transactions on Image Processing 28 (11), pp. 5649–5662. Cited by: §2.
  77. J. Zhu, Y. Shen, D. Zhao and B. Zhou (2020) In-domain gan inversion for real image editing. In ECCV, Cited by: §2, §3.3.
  78. S. Zhu, S. Liu, C. C. Loy and X. Tang (2016) Deep cascaded bi-network for face hallucination. In ECCV, Cited by: §2.