Identity-preserving Face Recovery from Stylized Portraits

Fatemeh Shiri1, Xin Yu1, Fatih Porikli1, Richard Hartley1,2, Piotr Koniusz2,1
1 Australian National University, email: name.surname@anu.edu.au
2 Data61/CSIRO, email: name.surname@data61.csiro.au
Received: 23.02.2018 / Accepted: 29.01.2019
Abstract

Given an artistic portrait, recovering the latent photorealistic face that preserves the subject’s identity is challenging because the facial details are often distorted or fully lost in artistic portraits. We develop an Identity-preserving Face Recovery from Portraits (IFRP) method that utilizes a Style Removal Network (SRN) and a Discriminative Network (DN). Our SRN, composed of an autoencoder with residual block-embedded skip connections, is designed to transfer feature maps of stylized images to the feature maps of the corresponding photorealistic faces. Owing to the Spatial Transformer Network (STN), the SRN automatically compensates for misalignments of stylized portraits and outputs aligned realistic face images. To ensure identity preservation, we promote the recovered and ground truth faces to share similar visual features via a distance measure that compares their features extracted from a pre-trained FaceNet network. The DN comprises multiple convolutional and fully-connected layers, and its role is to enforce that recovered faces are similar to authentic faces. Thus, we can recover high-quality photorealistic faces from unaligned portraits while preserving the identity of the face in an image. By conducting extensive evaluations on a large-scale synthesized dataset and a hand-drawn sketch dataset, we demonstrate that our method achieves superior face recovery and attains state-of-the-art results. In addition, our method can recover photorealistic faces from unseen stylized portraits, artistic paintings, and hand-drawn sketches.

Keywords:
Face Synthesis · Image Stylization · Face Recovery · Destylization · Generative Models

(a) Original
(b) Portrait
(c) Landmarks
(d) johnson2016perceptual ()
(e) zhu2017unpaired ()
(f) isola2016image ()
(g)  Shiri2017FaceD ()
(h) Ours
Figure 1: Comparisons to the state-of-the-art methods. (a) The ground truth face image (from the test dataset, not available in the training dataset). (b) Unaligned stylized portraits of (a) in the Scream style. (c) Landmarks detected by zhang2014facial (). (d) Results obtained by johnson2016perceptual (). (e) Results obtained by zhu2017unpaired () (CycleGAN). (f) Results obtained by isola2016image () (pix2pix). (g) Results obtained by Shiri2017FaceD (). (h) Our results.

1 Introduction

Style transfer methods are powerful tools that can generate portraits in various artistic styles from photorealistic images. Unlike prior research on image stylization, we address a challenging inverse problem of photorealistic face recovery from stylized portraits, which aims at recovering a photorealistic face image from a given stylized portrait. Latent photorealistic face images recovered from their artistic portraits are interpretable for humans and may be useful in facial analysis. Since facial details and expressions in stylized portraits often undergo severe distortions and become corrupted by artifacts such as profile edges and color changes, e.g., as in Figure 1(b), recovering a photorealistic face image from its stylized counterpart is very challenging. In general, stylized face images contain a variety of facial expressions, facial feature distortions and misalignments. Therefore, landmark detectors often fail to localize facial landmarks accurately, as shown in Figure 1(c).

While recovering photorealistic images from portraits is still uncommon in the literature, image stylization methods have been widely studied. With the use of Convolutional Neural Networks (CNN), Gatys et al. gatys2016controlling () achieve promising results by transferring different styles of artworks to images via the semantic contents space. Since their method generates the stylized images by iteratively updating the feature maps of CNNs, it is computationally costly. In order to reduce the computational complexity, several feed-forward CNN-based methods have been proposed ulyanov2016texture (); ulyanov2016instance (); johnson2016perceptual (); dumoulin2016 (); li2017diversified (); chen2016fast (); zhang2017multi (); huang2017arbitrary (). However, these methods work only with a single style applied during training. Moreover, such methods are insufficient for generating photorealistic face images as they only capture the correlations of feature maps via Gram matrices thus discarding spatial relations pk_tensor (); me_museum (); power_look_cvpr ().

In order to capture spatial/localized statistics of a style image, several patch-based methods li2016precomputed (); isola2016image () have been developed. However, such methods cannot capture the global appearance of faces, thus failing to generate authentic face images. For instance, patch-based methods li2016precomputed (); isola2016image () fail to attain consistency of face colors, as shown in Figure 8(e). Moreover, the state-of-the-art style transfer methods gatys2016controlling (); li2016precomputed (); ulyanov2016texture (); johnson2016perceptual () transfer desired styles to images without considering the task of identity preservation. Thus, these methods cannot generate realistically looking faces with preserved identities.

Our first face destylization architecture Shiri2017FaceD () uses only a pixel-wise loss in the generative part of the network. Despite being trained on a large-scale dataset, this method fails to recover faces from unaligned portraits under a variety of scales, rotations and viewpoint variations. This journal manuscript is an extension of our second model Shiri2018wacv (), which introduces the identity-preserving loss into destylization. Our latest model Shiri2019wacv () performs identity-preserving face destylization with the use of attributes, which allow manipulating appearance details such as hair color, facial expressions, etc.

In this paper, we develop a novel end-to-end trainable identity-preserving approach to face recovery that automatically maps unaligned stylized portraits to aligned photorealistic face images. Our network employs two subnetworks: a generative subnetwork, dubbed Style Removal Network (SRN), and a Discriminative Network (DN). The SRN consists of an autoencoder (a downsampling encoder and an upsampling decoder) and Spatial Transformer Networks (STNs) jaderberg2015spatial (). The encoder extracts facial components from unaligned stylized face images and transfers the extracted feature maps to the domain of photorealistic images. Subsequently, our decoder forms face images. STN layers are used by the encoder and decoder to align stylized faces. Since faces may appear at different orientations, scales and in various poses, the network may not fully capture all this variability if the training data does not account for it. As a result, we would need heavy data augmentation and more training instances with a variety of poses in the training dataset to cope with the recovery of faces from authentic portraits that may be captured at arbitrary angles or viewpoints. In contrast to such costly training, by exploiting STN layers, we require less data to train our network to cope well with images containing face rotations, translations and scale changes. Nonetheless, with or without STN layers, we expose our network during training to images of faces at different scales and rotations to teach it to recover the frontal view. We aim to recover faces in frontal view for visualization purposes (frontal views are easy for humans to interpret, face retrieval software works better with frontal views, etc.). The discriminative network, inspired by approaches Goodfellow2014 (); denton2015deep (); yu2016ultra (); yu2017face (), forces the SRN to generate destylized faces that are similar to authentic ground truth faces.

As we aim to preserve the information about facial identities, we force the CNN feature representations of recovered faces to be as close to the features of ground truth real faces as possible. For this purpose, we employ pixel-level Euclidean and identity-preserving losses. We also use an adversarial loss to achieve high-quality visual results.

To train our network, pairs of Stylized Face (SF) and ground truth Real Face (RF) images are required. Thus, we synthesize a large-scale dataset of SF/RF pairs. As there exist numerous styles to choose from, we cannot generate faces in all possible styles for training. We note that a Gram matrix formed from features of pre-trained VGG network can capture style details of input images gatys2016image (). Thus, we measure the similarity of various styles via the Log-Euclidean distance jayasumana2013kernel () between Gram matrices of style images and the average Gram matrix of real faces. Based on such a style-distance metric, we select three distinct styles for training.

Moreover, we have observed that CNN filters learned on images of seen styles (used for training) tend to extract meaningful features from images in both seen and unseen styles. Thus, our method can also extract facial information from unseen stylized portraits and generate photorealistic faces, as demonstrated in the experimental section.

Below we list our contributions:

  1. We design a new framework to automatically remove styles from unaligned stylized portraits. Our approach generates facial identities and expressions that match the ground truth face images well (identity preservation).

  2. We propose an autoencoder with skip connections between top convolutional and deconvolutional layers; each skip connection being composed of three residual blocks. These skip connections pass high-level visual features of portraits from convolutional to deconvolutional layers, which leads to an improved restoration performance.

  3. We add an identity-preserving loss to remove seen/unseen styles from portraits while preserving the underlying identities.

  4. We use STNs as intermediate layers to learn to align non-aligned input portraits. Thus, our method does not use any facial landmarks or 3D models of faces (typically used for face alignment) and requires somewhat fewer augmentations than a network without STNs.

  5. We propose a style-distance metric to capture the most distinct styles for training. Thus, our network achieves a good generalization when tested on unseen styles.

Our large dataset of pairs of stylized and photorealistic faces, and the code will be available on https://github.com/fatimashiri and/or http://claret.wikidot.com.

Figure 2: Our identity-preserving face destylization framework consists of two parts: a style removal network (blue frame) and a discriminative network (red frame). The face recovery network takes portraits as inputs. The discriminative network takes real or recovered face images as inputs.

2 Related Work

In this section, we briefly review neural generative models and deep style transfer methods for image generation.

2.1 Neural Generative Models

There exist many generative models for the problem of image generation oord2016pixel (); kingma2013auto (); oord2016pixel (); Goodfellow2014 (); denton2015deep (); zhang2017image (); Shiri2017FaceD (). Among them, GANs are conceptually closely related to our problem as they employ an adversarial loss that forces the generated images to be as photorealistic as the ground truth images.

Several methods for super-resolution ledig2016photo (); yu2017face (); huang2017beyond (); yu2017hallucinating (); yu2016ultra () and inpainting pathak2016context () adopt an adversarial training to learn a parametric translating function from a large-scale dataset of input-output pairs. These approaches often use the $\ell_1$ or $\ell_2$ norm and adversarial losses to compare the generated image to the corresponding ground truth image. Although these methods produce impressive photorealistic images, they fail to preserve identities of subjects.

Conditional GANs have been used for the task of generating photographs from semantic layout/scene attributes karacan2016learning () and sketches sangkloy2016scribbler (). Li and Wand li2016precomputed () train a Markovian GAN for the style transfer – a discriminative training is applied on Markovian neural patches to capture local style statistics. Isola et al. isola2016image () develop “pix2pix” framework which uses so-called “Unet” architecture and the patch-GAN to transfer low-level features from the input to the output domain. For faces, this approach produces visual artifacts and fails to capture the global appearance of faces.

Patch-based methods fail to capture the global appearance of faces and, as a result, they generate poorly destylized images. In contrast, we propose an identity-preserving loss to faithfully recover the most prominent details of faces.

Moreover, there exist several deep learning methods that synthesize sketches from photographs (and vice versa) nejati2011study (); wang2018back (); wang2018high (); sharma2011bypassing (). Wang et al. wang2018back () use the vanilla conditional GAN (cGAN) to generate sketches. However, the cGAN produces sketch-like artifacts in the synthesized faces as well as facial deformations. Wang et al. wang2018high () use the CycleGAN CycleGAN2017 (), and employ multi-scale discriminators to generate high resolution sketches/photos. Their method demonstrates a greatly improved performance. However, it still produces slight blur and/or color degraded artifacts. Kazemi et al. kazemi2018facial () employ Cycle-GAN conditioned on facial attributes in order to enforce desired facial attributes over the images synthesized from sketches. While sketch-to-face synthesis is a related problem, our unified framework works well with a variety of styles more complex than sketches.

2.2 Deep Style Transfer

Style transfer is a technique which can render a given content image (input) according to a specific painting style while preserving the visual contents of the input. We distinguish image optimization and feed-forward style transfer methods. The seminal optimization-based work gatys2016image () transfers the style of an artistic image to a given photograph. It uses iterative optimization to generate a target image from a random initialization (following the Normal distribution). During the optimization step, the statistics of the feature maps of the target, the content and style images are matched.

Gatys et al. gatys2016image () inspired many follow-up studies. Yin yin2016content () presents a content-aware style transfer method which initializes the optimization step with a content image instead of a random noise. Li and Wand li2016combining () propose a patch-based style transfer method which combines Markov Random Field (MRF) and CNN techniques. Gatys et al. gatys2016preserving () transfer the style via linear models and preserve colors of content images by matching color histograms.

Gatys et al. gatys2016controlling () decompose styles into perceptual factors and then manipulate them for the style transfer. Selim et al. selim2016painting () modify the content loss through a gain map for the transfer of head portrait paintings. Wilmot et al. wilmot2017stable () use histogram-based losses in their objective and build on Gatys et al.’s algorithm gatys2016image (). Although the above optimization-based methods further improve the quality of style transfer, they are computationally expensive due to the iterative optimization procedure, thus limiting their practical use.

To address the poor computational speed, feed-forward methods replace the original on-line iterative optimization step with training a feed-forward neural network off-line and generating stylized images on-line ulyanov2016texture (); johnson2016perceptual (); li2016precomputed ().

Johnson et al. johnson2016perceptual () train a generative network for a fast style transfer using perceptual loss functions. The architecture of their generator network follows the work of radford2015unsupervised () and also uses residual blocks. Texture Network  ulyanov2016texture () employs a multi-resolution architecture in the generator network. Ulyanov et al. ulyanov2016instance (); ulyanov2017improved () replace the spatial batch normalization with the instance normalization to achieve a faster convergence. Wang et al. wang2016multimodal () enhance the granularity of the feed-forward style transfer with a multimodal CNN, which performs stylization hierarchically using multiple losses deployed across multiple scales.

These feed-forward methods perform stylization around 1000× faster than the optimization-based methods. However, they cannot adapt to arbitrary styles not used during training. In order to synthesize an image according to a new style, the entire network needs retraining. To deal with such a restriction, a number of recent approaches encode multiple styles within a single feed-forward network dumoulin2016 (); chen2016fast (); chen2017stylebank (); li2017diversified ().

Dumoulin et al. dumoulin2016 () use a so-called conditional instance normalization that learns normalization parameters for each style. Given feature maps of the content and style images, method chen2016fast () replaces content features with the closest matching style features patch-by-patch. Chen et al. chen2017stylebank () present a network that learns a set of new filters for every new style. Li et al. li2017diversified () propose a texture controller which forces the network to synthesize the desired style. We note that the existing feed-forward approaches have to compromise between the generalization li2017diversified (); huang2017arbitrary (); zhang2017multi () and quality ulyanov2017improved (); ulyanov2016instance (); gupta2017characterizing ().

Figure 3: The contribution of each loss function to IFRP network. (a) Unaligned input portraits from the test dataset. (b) Ground truth face images. (c) Recovered faces; only the pixel-wise loss is used (no DN or identity-preserving losses). (d) Recovered faces; the pixel-wise loss and discriminative loss are used (no identity-preserving loss). (e) Our final results with the pixel-wise, discriminative and identity-preserving losses. The use of all three losses produces visually the best results.

3 Proposed Method

Below we present an identity-preserving framework that infers a photorealistic face image from an unaligned stylized face image.

3.1 Network Architecture

Our network consists of two parts: a Style Removal Network (SRN) and a Discriminative Network (DN). The SRN is composed of an autoencoder as well as skip connections with residual blocks. The SRN module extracts residual feature maps from input portraits and then upsamples them. To attain high-quality visual performance, we pass visual information from the last few layers of the encoder to the corresponding layers of the decoder. The role of the DN is to promote the recovered face images to be similar to their real counterparts. The general architecture of our IFRP framework is depicted in Figure 2.

Style Removal Network. As the goal of face recovery is to generate a photorealistic destylized image, a generative network should be able to remove various styles of portraits without losing the identity information. To this end, we propose the SRN block, which employs a fully convolutional autoencoder (a downsampling encoder and an upsampling decoder) with skip connections and STN layers. Figure 2 shows the architecture of our SRN block (the blue frame).

The autoencoder learns a deterministic mapping to transform images from the space of portraits into some latent space (via an encoder), and a mapping from the latent space to the space of real faces (via a decoder). In this manner, the encoder extracts high-level features of unaligned stylized faces and transforms them into feature vectors of some latent real-face domain, while the decoder synthesizes photorealistic faces from these feature vectors.

Moreover, we symmetrically link convolutional and deconvolutional layers via skip-layer connections long2015fully (). These skip connections pass high-resolution visual details of portraits from convolutional to deconvolutional layers, leading to a good quality recovery. In detail, each skip connection comprises three residual blocks. Due to the usage of residual blocks, our network can remove the styles of input portraits and increase the visual quality as shown in Figure 4. In contrast, the same network but without skip connections tends to produce blurry/fuzzy face images as shown in Figure 4. Figure 4 shows that the visual quality improves as components of our architecture are enabled one-by-one.
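The residual wiring of each skip connection can be sketched framework-agnostically; in the sketch below, toy callables stand in for the learned convolutional residual blocks, and all names are ours rather than the paper's:

```python
import numpy as np

def residual_block(x, f):
    # A residual block computes x + f(x), so the block only has to
    # learn a residual correction to its input feature map.
    return x + f(x)

def skip_connection(x, residual_fns):
    # Each skip connection in the SRN passes an encoder feature map
    # through three residual blocks on its way to the decoder.
    for f in residual_fns:
        x = residual_block(x, f)
    return x

# Toy residual functions standing in for learned conv layers.
blocks = [lambda x: 0.1 * x, lambda x: 0.1 * x, lambda x: 0.1 * x]
```

For example, `skip_connection(np.ones(4), blocks)` applies the 1.1× residual update three times, illustrating how the identity path carries high-resolution detail through unchanged while the blocks add corrections.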

As input stylized faces are often misaligned due to in-plane rotations, translations and scale changes, we incorporate Spatial Transformer Networks (STNs) jaderberg2015spatial () (green blocks in Figure 2) into the SRN. The STN layer can estimate the motion parameters of face images and warp them to the so-called canonical view. Thus, our method does not require the use of facial landmarks or 3D face models (often used for face alignment). Figure 4 shows that these intermediate STN layers help compensate for misalignment of the input portraits (however, their use is discretionary). The architecture of our STN layers is given in the Appendix A.

For appearance similarity between the recovered faces and their RF ground truth counterparts, we exploit a pixel-wise loss and an identity-preserving loss. The pixel-wise loss enforces intensity-based similarity between images of recovered faces and their ground truth images. However, the autoencoder supervised by the pixel-wise loss alone tends to produce oversmooth results, as shown in Figure 3. For the identity-preserving loss, we use FaceNet schroff2015facenet () to extract features from images (see Section 3.2 for more details), and then we compute the Euclidean distance between the feature maps of the two images. In this way, we encourage the feature similarity between recovered faces and their ground truth counterparts. Without the identity-preserving loss, the network produces random artifacts that resemble facial details, such as wrinkles, as shown in Figure 3.

Discriminative Network. Using only the pixel-wise distance between the recovered faces and their ground truth real counterparts leads to oversmooth results, as shown in Figure 3. To obtain appealing visual results, we introduce a discriminator, which forces recovered faces to reside in the same latent space as real faces. Our proposed DN is composed of convolutional layers and fully connected layers, as illustrated in Figure 2 (the red frame). The discriminative loss, also known as the adversarial loss, penalizes the discrepancy between the distributions of recovered and real faces. This loss is also used to update the parameters of the SRN block (we alternate over updates of the parameters of SRN and DN). Figure 3 shows the impact of the adversarial loss on the final results.

Identity Preservation. With the adversarial loss, the SRN is able to generate high-frequency facial content. However, the results often lack details of identities such as the beard or wrinkles, as illustrated in Figure 3. A possible way to address this issue is to constrain the recovered face images and the ground truth face images to share the same face-related visual features e.g., FaceNet features schroff2015facenet ().

Figure 4: The impact of various components of our network on the performance. (a) Ground truth face images. (b) Unaligned input portraits. (c) Results without the use of skip connections/residual blocks in the SRN block similar to the autoencoder in Shiri2017FaceD (). (d) Results with the use of U-net autoencoder. The SRN block similar to the autoencoder in ronneberger2015u () is used. (e) Results with skip connections but without residual blocks in the SRN unit. (f) Results without STN layers in the SRN block. (g) Our final results with skip connections/residual blocks in the SRN.

3.2 Training Details

To train our IFRP network in an end-to-end fashion, we require a large number of SF/RF training image pairs. For each RF, we synthesize different unaligned SF images according to chosen artistic styles to obtain SF/RF training pairs. As described in Section 4, we only use stylized faces from three distinct styles in the training stage.

Motivated by the ideas of Gatys et al. gatys2016image () and Johnson et al. johnson2016perceptual (), we construct a so-called identity-preserving loss. Specifically, we compute the Euclidean distance between the feature maps of the recovered and ground truth images. These feature maps are obtained from the ReLU activations of FaceNet schroff2015facenet ().

Our previous work Shiri2017FaceD () uses only the Euclidean loss to compare the generated and ground truth images, which results in blurry images. In this work, we use the FaceNet network for the identity-preservation loss and compare it against VGG-19, which is pre-trained on the large-scale ImageNet dataset of generic objects. In contrast, FaceNet is pre-trained on a large dataset of 200 million face identities and 800 million pairs of face images. Therefore, FaceNet can capture visually meaningful facial features. As shown in Figure 5, with the help of FaceNet, our results achieve higher fidelity and better consistency with respect to the ground truth face images. Figure 5 shows the results for VGG-19.

With FaceNet, we can preserve the identity information by encouraging the feature similarity between the generated and ground truth faces. We combine the pixel-wise loss, the adversarial loss and the identity-preserving loss together as our final loss function to train our network. Figure 3 illustrates that, with the help of the identity-preserving loss, our IFRP network can recover satisfying identity-preserving images. Below we explain each loss individually.

Pixel-wise Intensity Similarity Loss. Our goal is to train our feed-forward SRN to produce an aligned photorealistic face image from any given stylized unaligned portrait. To achieve this, we force the recovered face image to be similar to its ground truth counterpart $R$. We denote the output of our SRN for an input portrait $S$ as $\hat{R} = G_{\Theta}(S)$. Since the STN layers are interwoven with the layers of our autoencoder, we optimize the parameters of the autoencoder and the STN layers simultaneously. The pixel-wise loss function between $\hat{R}$ and $R$ is expressed as:

$$\mathcal{L}_{pix}(\Theta) = \mathbb{E}_{(S,R)\sim p(S,R)} \big\| G_{\Theta}(S) - R \big\|_{F}^{2}, \qquad (1)$$

where $p(S,R)$ represents the joint distribution of the SF and RF images in the training dataset, and $\Theta$ denotes the parameters of the SRN block.

Identity-preserving Loss. To obtain convincing identity-preserving results, we propose an identity-preserving loss in the form of the Euclidean distance between the features of the recovered face image $\hat{R} = G_{\Theta}(S)$ and the ground truth face image $R$. The identity-preserving loss is given as:

$$\mathcal{L}_{id}(\Theta) = \mathbb{E}_{(S,R)\sim p(S,R)} \big\| \psi(G_{\Theta}(S)) - \psi(R) \big\|_{F}^{2}, \qquad (2)$$

where $\psi(\cdot)$ denotes the feature maps extracted from the layer ReLU3-2 of the FaceNet model with respect to some input image.

Discriminative Loss. Motivated by the idea of Goodfellow2014 (); denton2015deep (); radford2015unsupervised (), we aim to make the discriminative network fail to distinguish recovered face images from ground truth face images. Therefore, the parameters $\Phi$ of the discriminator $D_{\Phi}$ are updated by minimizing $\mathcal{L}_{dis}$, expressed as:

$$\mathcal{L}_{dis}(\Phi) = -\mathbb{E}_{R\sim p(R)} \log D_{\Phi}(R) - \mathbb{E}_{\hat{R}\sim p(\hat{R})} \log\big(1 - D_{\Phi}(\hat{R})\big), \qquad (3)$$

where $p(R)$ and $p(\hat{R})$ indicate the distributions of real and recovered face images, respectively, and $D_{\Phi}(R)$ and $D_{\Phi}(\hat{R})$ are the outputs of $D_{\Phi}$ for real and recovered face images. The loss is also backpropagated with respect to the parameters of the SRN block.

Our SRN loss is a weighted sum of three terms: the pixel-wise loss, the adversarial loss, and the identity-preserving loss. The parameters $\Theta$ are obtained by minimizing the final objective function of the SRN loss given below:

$$\mathcal{L}_{SRN}(\Theta) = \mathbb{E}_{(S,R)\sim p(S,R)} \big\| G_{\Theta}(S) - R \big\|_{F}^{2} + \lambda\, \mathbb{E}_{S\sim p(S)} \log\big(1 - D_{\Phi}(G_{\Theta}(S))\big) + \eta\, \mathbb{E}_{(S,R)\sim p(S,R)} \big\| \psi(G_{\Theta}(S)) - \psi(R) \big\|_{F}^{2}, \qquad (4)$$

where $\lambda$ and $\eta$ are trade-off parameters for the discriminator and the identity-preserving losses, respectively, and $p(S)$ is the distribution of stylized face images.
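As an illustrative NumPy sketch (our own simplification on array stand-ins, not the actual tensor implementation), the weighted sum of the three loss terms described above can be written as:

```python
import numpy as np

def pixel_loss(rec, gt):
    # squared Euclidean/Frobenius distance between images
    return np.sum((rec - gt) ** 2)

def identity_loss(feat_rec, feat_gt):
    # same distance, but between FaceNet-style feature maps
    return np.sum((feat_rec - feat_gt) ** 2)

def adversarial_loss(d_rec, eps=1e-8):
    # generator-side GAN term: minimized when the discriminator
    # scores recovered faces as real (d_rec -> 1)
    return np.mean(np.log(1.0 - d_rec + eps))

def srn_loss(rec, gt, feat_rec, feat_gt, d_rec, lam, eta):
    # weighted sum of pixel-wise, adversarial and identity terms
    return (pixel_loss(rec, gt)
            + lam * adversarial_loss(d_rec)
            + eta * identity_loss(feat_rec, feat_gt))
```

The trade-off weights `lam` and `eta` play the role of the discriminator and identity-preserving coefficients in the objective above.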

Since both the SRN and the discriminator are differentiable functions, the error can be backpropagated with respect to their respective parameters by the use of Stochastic Gradient Descent (SGD) combined with Root Mean Square Propagation (RMSprop) Hinton (), which helps our algorithm converge faster.
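RMSprop normalizes each gradient by a running root-mean-square of past gradients. A minimal NumPy sketch of a single parameter update follows; the hyperparameter values are generic defaults, not the paper's settings:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    # Keep an exponential moving average of squared gradients,
    # then scale the step by its square root.
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```

Running this update on a simple quadratic, e.g. minimizing w**2 with gradient 2*w, steadily shrinks |w|, which is the behavior that speeds up convergence here.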

3.3 Implementation Details

The discriminative network is only required in the training phase. In the testing phase, we take SF portraits as inputs and feed them to the SRN. The outputs of the SRN are the recovered photorealistic face images. We employ convolutional layers with kernels of size and stride in the encoder and deconvolutional layers with kernels of size and stride in the decoder. The feature maps in our encoder are passed to the decoder by skip connections. The batch normalization procedure is applied after our convolutional and deconvolutional layers except for the last deconvolutional layer, similar to the models described in Goodfellow2014 (); radford2015unsupervised (). For the non-linear activation function, we use the leaky rectifier with piecewise linear units (leakyReLU maas2013rectifier ()), for which the weight of negative slope is set to .

Our network is trained with a mini-batch size of 64 and with fixed learning and decay rates across all experiments. As the iterations progress, the images of output faces become more similar to the ground truth. Hence, we gradually reduce the effect of the discriminative network by decreasing its trade-off weight as a function of the epoch index. This decreasing strategy not only enriches the impact of the pixel-level similarity but also helps preserve the discriminative information in the SRN during training. We also decrease the weight of the identity-preserving constraint after each iteration.
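The per-epoch reduction of the trade-off weights can be sketched as a generic multiplicative schedule; the decay rule and the rate below are our assumptions for illustration, since the exact formulas are not reproduced here:

```python
def decayed_weight(w0, epoch, gamma=0.99):
    # Hypothetical multiplicative schedule: the trade-off weight
    # shrinks every epoch; the paper's exact decay rule may differ.
    return w0 * gamma ** epoch
```

With such a schedule, the adversarial and identity-preserving terms start with their full initial weights and contribute progressively less as training converges.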

Figure 5: The identity preservation loss: comparison of VGG-19 vs. FaceNet. (a) Ground truth. (b) Unaligned input portraits from the test dataset. (c) Recovered faces using VGG-19 simonyan2014very (). (d) Our final results using FaceNet schroff2015facenet ().
Figure 6: Samples from the synthesized dataset. (a) The ground truth aligned real face image. (b)-(k) The synthesized unaligned portraits from Scream, Wave, Candy, Feathers, Sketch, Composition VII, Starry night, Udnie, Mosaic and la Muse styles, which have been used for training and testing our network.

As our method is of feed-forward nature (no optimization is required at test time), it takes 8 ms to destylize a 128×128 image.

4 Synthesized Dataset and Preprocessing

To train our IFRP network and avoid overfitting, a large number of SF/RF image pairs are required. To generate a dataset of such pairs, similar to Shiri2017FaceD (), we use the Celebrity dataset (CelebA) Liu2015faceattributes (). Firstly, we randomly select 110K faces from the CelebA dataset for training and 2K face images for testing. Subsequently, we crop the center of each original image and resize it to 128×128 pixels; we use these cropped images as our RF ground truth face images. Lastly, we apply affine transformations to the aligned ground truth face images to generate in-plane unaligned face images.
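The in-plane perturbations (rotation, scaling, translation) can be expressed as 2×3 affine matrices; a small NumPy sketch follows, with function names of our own choosing (the actual pipeline warps full images rather than coordinate points):

```python
import numpy as np

def affine_matrix(angle_deg, scale=1.0, tx=0.0, ty=0.0):
    # 2x3 matrix for an in-plane rotation/scale/translation
    t = np.deg2rad(angle_deg)
    c, s = np.cos(t) * scale, np.sin(t) * scale
    return np.array([[c, -s, tx],
                     [s,  c, ty]])

def warp_points(m, pts):
    # pts: (N, 2) array of (x, y) coordinates; applies the 2x3 affine
    return pts @ m[:, :2].T + m[:, 2]
```

For instance, a 90-degree matrix maps the point (1, 0) to (0, 1); sampling random angles, scales and shifts per image yields the unaligned training inputs.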

Moreover, to synthesize our training dataset, we retrain the real-time style transfer network johnson2016perceptual () for different artworks. We use only three distinct styles, Scream, Candy and Mosaic for synthesizing our training dataset. The procedure detailing how we selected these styles is explained in Section 5.1. We also use 2K unaligned ground truth face images to synthesize 20K SF images from ten diverse styles (Scream, Wave, Candy, Feathers, Sketch, Composition VII, Starry night, Udnie, Mosaic and la Muse) as our testing dataset. Note that we also include artistic sketches as an unseen style into our test dataset. Some stylized face images used for training and testing are shown in Figure 6. Lastly, we emphasize that there is no overlap between the training and testing datasets.

Number of Training Styles | Training Time per Epoch | Seen Styles (SSIM / FSIM) | Unseen Styles (SSIM / FSIM)
1 Style  | 1:49' | 0.69 / 0.72 | 0.54 / 0.66
2 Styles | 3:54' | 0.70 / 0.77 | 0.60 / 0.78
3 Styles | 5:20' | 0.72 / 0.88 | 0.68 / 0.84
4 Styles | 7:05' | 0.72 / 0.88 | 0.68 / 0.85
5 Styles | 9:47' | 0.73 / 0.88 | 0.69 / 0.85
Table 1: The number of training styles and the corresponding training times.
Number of Training Styles | Epochs (Without STNs) | Epochs (With STNs)
1 Style  | 203 | 180
2 Styles | 181 | 159
3 Styles | 165 | 143
4 Styles | 150 | 135
5 Styles | 139 | 121
Table 2: The number of training epochs vs. the number of styles (for the same number of augmentations).

5 Experiments

Below we compare the performance of our approach qualitatively and quantitatively to the state-of-the-art methods. To conduct a fair comparison, we retrain approaches gatys2016image (); johnson2016perceptual (); li2016precomputed (); isola2016image (); zhu2017unpaired (); Shiri2017FaceD () on our training dataset for the task of photorealistic face recovery from portraits.

5.1 Style-Distance Metric

Generating and training on a large number of styles is impractical. Thus, we propose a style-distance metric to select the most difficult styles for the process of face recovery. For this purpose, we compute Gram matrices for various styles from feature maps of a pre-trained VGG model simonyan2014very (). Then, we measure the similarity of styles based on the Log-Euclidean distance jayasumana2013kernel () between the Gram matrices of style images and the average Gram matrix of all ground truth face images in our training dataset. As these Gram matrices reflect the style differences between input images gatys2016image (), we choose the three styles with the largest distances from the average Gram matrix of ground truth face images. According to this criterion, we select the Candy, Mosaic and Scream styles for training.
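As a sketch of this metric (the VGG feature extraction is omitted; `gram` simply expects a (C, H, W) feature map), the Gram matrix and the Log-Euclidean distance between two Gram matrices can be computed as follows; the `eps` regularization is an assumption we add to keep the matrix logarithm well defined for near-singular Grams:

```python
import numpy as np

def gram(features):
    """Gram matrix of a (C, H, W) feature map, normalized by the spatial size."""
    c = features.shape[0]
    f = features.reshape(c, -1)
    return f @ f.T / f.shape[1]

def log_euclidean(g1, g2, eps=1e-6):
    """Log-Euclidean distance ||logm(G1) - logm(G2)||_F between SPD matrices."""
    def logm_spd(g):
        # Matrix logarithm via eigendecomposition: V diag(log w) V^T.
        w, v = np.linalg.eigh(g + eps * np.eye(g.shape[0]))
        return (v * np.log(w)) @ v.T
    return np.linalg.norm(logm_spd(g1) - logm_spd(g2), 'fro')
```

In the selection procedure above, `g2` would be the average Gram matrix over the ground truth training faces, and styles are ranked by their distance to it.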

Utilizing additional training styles can improve the quality of recovered images, especially for unseen styles, at the cost of extra training time. We have observed that using three training styles is optimal (using more styles does not improve the accuracy significantly). Table 1 summarizes the average SSIM and FSIM scores on the test dataset for different numbers of training styles. As the number of training styles increases, our network learns a better mapping between different genres of stylized portraits and ground truth face images. To help the network learn a mapping between unaligned and aligned data, we use STN layers, which reduce the number of epochs needed for convergence. Table 2 shows that for training with 3 styles, our network converges after 165 epochs (without STNs) and 143 epochs (with STNs). We note that when the network is trained without STN layers, its visual performance is somewhat worse than the results which rely on STNs.

(a) RF
(b) SF (c) gatys2016image () (d) johnson2016perceptual () (e) li2016precomputed () (f) isola2016image () (g) zhu2017unpaired () (h) Shiri2017FaceD () (i) Ours
Figure 7: Qualitative comparisons of the state-of-the-art methods. (a) The ground truth face image. (b) Input portraits (from the test dataset) including the seen styles Scream, Mosaic and Candy as well as the unseen styles Sketch, Composition VII, Feathers, Udnie and La Muse. (c) Gatys et al.’s method gatys2016image (). (d) Johnson et al.’s method johnson2016perceptual (). (e) Li and Wand’s method li2016precomputed () (MGAN). (f) Isola et al.’s method isola2016image () (pix2pix). (g) Zhu et al.’s method zhu2017unpaired () (CycleGAN). (h) Shiri et al.’s method Shiri2017FaceD () (i) Our method.

5.2 Qualitative Evaluation

We visually compare our approach against six methods detailed below. To help these methods achieve their best performance, we align SF images from the test dataset via a simple STN-based network prior to testing.

Gatys et al. gatys2016image () is an image optimization-based style transfer method which does not involve any training stage. This method captures the correlation between feature maps of the portrait and the synthesized face via Gram matrices constructed from features extracted across several layers of a CNN pipeline. Thus, the spatial structure of face images cannot be preserved by this approach. As shown in Figures 7(c) and 8(c), the network fails to remove various aspects of artistic styles and thus produces visually unconvincing results.

We also retrain the approach of Johnson et al. johnson2016perceptual () for destylization. Due to the use of correlation statistics captured via the Gram matrix, their network also generates distorted facial details and produces unnatural artifacts. As Figures 7(d) and 8(d) show, the facial details are blurred and the skin colors are not homogeneous. Moreover, Figure 8(d) shows many images containing unnatural-looking eyes due to the poor destylization ability of this approach johnson2016perceptual ().

MGAN li2016precomputed () is a patch-based style transfer method. We retrain this network for the purpose of the face recovery. As this method is trained on RF/SF patches, it cannot capture the global appearance of faces. As shown in Figures 7(e) and 8(e), this method produces distorted results and the facial colors are inconsistent. In contrast, our method successfully captures the global appearance of faces and generates consistent facial colors.

Isola et al. isola2016image () train a “U-net” generator augmented with a PatchGAN discriminator in an adversarial framework, known as “pix2pix”. Since the patch-based discriminator is trained to classify whether an image patch is sampled from the distribution of real faces or not, this network does not take the global appearance of faces into account. In addition, U-net concatenates low-level features from the bottom layers of the encoder with the features in the decoder to generate face images. As the low-level features of input images are passed to the output, U-net fails to eliminate artistic styles in face images. As shown in Figures 7(f) and 8(f), pix2pix can generate acceptable results for the seen styles but fails to remove the unseen styles and thus produces obvious artifacts.

(a) RF
(b) SF (c) gatys2016image () (d) johnson2016perceptual () (e) li2016precomputed () (f) isola2016image () (g) zhu2017unpaired () (h) Shiri2017FaceD () (i) Ours
Figure 8: Qualitative comparisons of the state-of-the-art methods. (a) The ground truth real face image. (b) Input portraits (from the test dataset) including the seen styles Candy, Mosaic and Scream as well as the unseen styles Udnie, La Muse, Starry, Feathers and Composition VII. (c) Gatys et al.’s method gatys2016image (). (d) Johnson et al.’s method johnson2016perceptual (). (e) Li and Wand’s method li2016precomputed () (MGAN). (f) Isola et al.’s method isola2016image () (pix2pix). (g) Zhu et al.’s method zhu2017unpaired () (CycleGAN). (h) Shiri et al.’s method Shiri2017FaceD () (i) Our method.

CycleGAN zhu2017unpaired () is an image-to-image translation method that uses unpaired datasets. This network provides a mapping between two different domains by the use of a cycle-consistency loss. Since CycleGAN also employs a patch-based discriminator, this network cannot capture the global appearance of faces. As CycleGAN employs unpaired face datasets for RF and SF images, the low-level features of the stylized and recovered faces are uncorrelated. Thus, CycleGAN is not suitable for transferring stylized portraits to photorealistic images. As shown in Figures 7(g) and  8(g), this method produces distorted face images and it does not preserve the identities of faces in the input images.

Our first destylization approach Shiri2017FaceD () does not exploit an identity-preserving loss, as it employs only a simple autoencoder to recover photorealistic face images. In contrast, in this paper we study an identity-preserving loss that helps recover photorealistic face images which preserve the underlying identities. We utilize 330K pairs of SF/RF face images. Our IFRP method is robust in terms of the recovery of realistic faces. As shown in Figure 1(g), our previous method suffers, for instance, from poor recovery of hair color. As shown in the fourth row of Figures 7(c)-7(h), all methods, except for ours in Figure 7(i), fail to recover the correct facial complexion. As shown in the fourth row of Figures 8(e)-8(h), these methods cannot recover the male subject's beard. In contrast, Figure 8(i) shows that our method recovers this important facial feature well.

Compared to our previous approach and other methods, our new method attains a higher fidelity and better consistency with regards to facial expressions and skin tones. Our network can preserve the identity of a subject given either seen or unseen styles, as shown in Figures 7(i) and  8(i).

(a) SF
(b) RF
(c) gatys2016image ()
(d) johnson2016perceptual ()
(e) isola2016image ()
(f) zhu2017unpaired ()
(g) Shiri2017FaceD ()
(h) Ours
Figure 9: Qualitative comparisons of the state-of-the-art methods. (a) Input portraits (from the test dataset) including seen and unseen styles. (b) Ground truth face images. (c) Gatys et al.’s method gatys2016image (). (d) Johnson et al.’s method johnson2016perceptual (). (e) Isola et al.’s method isola2016image () (pix2pix). (f) Zhu et al.’s method zhu2017unpaired () (CycleGAN). (g) Shiri et al.’s method Shiri2017FaceD () (h) Our method.

5.3 Quantitative Evaluation

Face Reconstruction Analysis. To evaluate the reconstruction performance, we measure the average Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM) wang2004image () and Feature Similarity (FSIM) zhang2011fsim () on the entire test dataset. Table 3 indicates that our IFRP method achieves superior quantitative results in comparison to other methods on both seen and unseen styles. Moreover, we also evaluate different methods on sketch images from the CUFSF dataset as an unseen style without fine-tuning or retraining our network.
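For reference, PSNR between a ground truth face and a recovered face reduces to a few lines; SSIM wang2004image () and FSIM zhang2011fsim () are more involved and are typically computed with existing implementations:

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a ground truth RF image
    and a recovered face, both given as arrays with values in [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

The reported scores are averages of such per-image values over the whole test dataset.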

Methods | Seen Styles (PSNR / SSIM / FSIM) | Unseen Styles (PSNR / SSIM / FSIM) | Unseen Sketches (PSNR / SSIM / FSIM)
Gatys gatys2016image () | 20.18 / 0.57 / 0.73 | 20.25 / 0.57 / 0.66 | 19.93 / 0.55 / 0.67
Johnson johnson2016perceptual () | 15.65 / 0.34 / 0.68 | 15.81 / 0.33 / 0.70 | 16.27 / 0.35 / 0.68
MGAN li2016precomputed () | 16.22 / 0.44 / 0.64 | 16.17 / 0.47 / 0.60 | 16.01 / 0.46 / 0.61
pix2pix isola2016image () | 20.82 / 0.59 / 0.80 | 18.90 / 0.54 / 0.67 | 19.01 / 0.55 / 0.66
CycleGAN zhu2017unpaired () | 18.58 / 0.32 / 0.69 | 15.89 / 0.27 / 0.64 | 15.65 / 0.31 / 0.65
Shiri Shiri2017FaceD () | 21.57 / 0.58 / 0.79 | 20.21 / 0.56 / 0.70 | 21.35 / 0.57 / 0.71
IFRP | 26.08 / 0.72 / 0.88 | 24.83 / 0.68 / 0.84 | 24.89 / 0.68 / 0.83
Table 3: Comparisons of PSNR, SSIM and FSIM on the entire test dataset.
Methods | FRR (Seen Styles) | FRR (Unseen Styles) | FRR (Unseen Sketch) | FCR
Gatys gatys2016image () | 64.67% | 62.28% | 68.36% | 72.89%
Johnson johnson2016perceptual () | 50.54% | 38.87% | 40.27% | 44.99%
MGAN li2016precomputed () | 26.97% | 22.52% | 24.99% | 38.24%
pix2pix isola2016image () | 75.13% | 59.98% | 66.63% | 87.73%
CycleGAN zhu2017unpaired () | 25.07% | 25.68% | 26.70% | 24.97%
Shiri Shiri2017FaceD () | 84.51% | 75.32% | 76.44% | 89.09%
IFRP | 90.93% | 84.92% | 89.05% | 92.06%
Table 4: Comparisons of FRR and FCR on the entire test dataset.

Face Retrieval Analysis. Below we demonstrate that the faces recovered by our method are highly consistent with their ground truth counterparts. To this end, we run a face recognition algorithm parkhi2015deep () on our test dataset for both seen and unseen styles. For each investigated method, we set 1K recovered faces from one style as a query dataset and 1K ground truth faces as a search dataset. We apply parkhi2015deep () to check whether the correct person is retrieved within the top-5 matched images, and obtain an average retrieval score. We repeat this procedure for every style and then compute the average Face Retrieval Ratio (FRR) by averaging all scores over the seen and unseen styles, respectively. As indicated in Table 4, our IFRP network outperforms the other methods across all the styles. Even for the unseen styles, our method can still retain most identity-preserving features, making the destylized results similar to the ground truth faces. Moreover, we also run an experiment on hand-drawn sketches of the CUFSF dataset used as an unseen style. The FRR scores are higher compared to the results on other styles, as facial components are easier to extract from sketches and their contours. Although our method is not dedicated to face retrieval, we compare it to zhang2011coupled (). To challenge our method, we retrain our network on sketches. To this end, we recovered faces from sketches (the CUFSF dataset) and performed face identification, which yielded 91% Verification Rate (VR) at FAR=0.1%. This outperforms the photo-synthesizing approach MRF+LE zhang2011coupled (), which uses sketches for training and yields 43.66% VR at FAR=0.1%.
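A minimal sketch of the FRR computation, assuming each face is represented by an embedding vector from a recognizer such as parkhi2015deep () and that row i of the query and gallery sets shares the identity labels[i] (a hypothetical layout); the same routine also yields the FCR when the gallery holds faces destylized from the remaining styles:

```python
import numpy as np

def frr(query_emb, gallery_emb, labels, k=5):
    """Fraction of queries whose true identity appears among the top-k
    gallery matches by cosine similarity.
    query_emb, gallery_emb: (N, D) face embeddings; labels: (N,) identities."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                              # (N, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of top-k gallery matches
    hits = [labels[i] in labels[topk[i]] for i in range(len(labels))]
    return float(np.mean(hits))
```

Averaging this score over all seen (resp. unseen) styles gives the per-column FRR entries of Table 4.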

Consistency Analysis w.r.t. Various Styles. As shown in Figures 7(i) and 8(i), our network recovers photorealistic face images from various stylized portraits of the same person. Note that recovered faces resemble each other. It indicates that our network is robust to different styles.

In order to demonstrate the robustness of our network to different styles quantitatively, we study the consistency of faces recovered from different styles. First, we choose 1K face images destylized from a single style. For each destylized face, we search for its top-5 most similar face images in a group of face images destylized from portraits in the remaining styles. If the same person is retrieved within the top-5 candidates, we record this as a hit. Then an average hit number for a given style is obtained. We repeat the same procedure for each of the other 7 styles, and then we calculate the average hit number, denoted as the Face Consistency Ratio (FCR). Note that the probability of a hit by chance is 0.5% (top-5 out of 1K images). Table 4 shows the average FCR scores on the test dataset for each method. The FCR scores indicate that our IFRP method produces the most consistent destylized face images across different styles. This also implies that our SRN can extract facial features irrespective of image styles.

Loss Function | Seen Styles (SSIM / FSIM) | Unseen Styles (SSIM / FSIM)
Pixel-wise loss only | 0.60 / 0.72 | 0.54 / 0.65
Pixel-wise + discriminative | 0.62 / 0.75 | 0.58 / 0.72
IFRP (pixel-wise + discriminative + identity-preserving) | 0.72 / 0.88 | 0.68 / 0.84
Table 5: Quantitative comparisons of the impact of each of our losses.

5.4 Impact of Different Losses on Performance.

Below we discuss the impact of our losses on the visual results shown in Figure 3, and we present the corresponding quantitative evaluations in Table 5. Figure 3 shows that employing only the pixel-wise loss leads to recovery that suffers from severe blur, as this loss acts on intensity similarity only. To avoid generating overly smooth results, we employ the discriminative loss in our network. In line with the findings of our previous work Shiri2017FaceD (), the discriminative loss encourages the generated faces to be realistic and thus improves the final results both qualitatively and quantitatively. The weight of the discriminative loss is chosen experimentally, with the selected value being a good compromise between excessively smooth and sharp results. However, due to the lack of guidance from high-level semantic information (parts of such information are locally lost in stylized portraits), the network with only the pixel-wise and discriminative losses still generates artifacts when it mimics facial details. As shown in Figure 3 (top), the network still generates ambiguous results such as gender reversal or mismatched hair color. By employing all three losses together, that is, the identity-preserving loss, the discriminative loss and the pixel-wise loss, our network attains the best visual recovery. The same findings are confirmed by the quantitative results in Table 5.
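The interplay of the three losses can be summarized by the following sketch of the total generator objective; the non-saturating adversarial term and the weights lam_d and lam_id are illustrative assumptions, not the paper's exact formulation or values:

```python
import numpy as np

def ifrp_loss(rec, gt, d_rec, feat_rec, feat_gt, lam_d=0.01, lam_id=0.1):
    """Total training objective for the generator (SRN):
    - pixel-wise MSE between recovered and ground truth faces,
    - adversarial term pushing the discriminator score d_rec toward 'real',
    - identity term comparing FaceNet-style embeddings of both faces.
    lam_d and lam_id are hypothetical weights for this sketch."""
    l_pix = np.mean((rec - gt) ** 2)             # intensity similarity
    l_adv = -np.log(d_rec + 1e-12)               # non-saturating GAN loss (assumed form)
    l_id = np.mean((feat_rec - feat_gt) ** 2)    # identity preservation
    return l_pix + lam_d * l_adv + lam_id * l_id
```

Dropping `l_id` (and then `l_adv`) from this sum corresponds to the ablated variants compared in Table 5.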

5.5 Ablation Study of the Proposed Architecture.

We perform an ablation study of the different components of the proposed IFRP architecture and present visual results in Figure 4. To demonstrate the contribution of each component to the quantitative results, we also report the corresponding scores in Table 6. When employing only a standard autoencoder with a stack of convolutional layers followed by a series of deconvolutional layers, the visual results suffer from blurriness and artifacts, as shown in Figure 4. The network generates misguided results such as wrong hair texture, missing lipstick, wrong lip expression, etc. To avoid generating overly smooth results, we apply skip connections between the top two layers of our network. By embedding residual blocks into the skip connections of the top two layers, our network achieves the best results both qualitatively and quantitatively. In this manner, we put emphasis on high-level semantic information.

SRN Architecture | Seen Styles (SSIM / FSIM) | Unseen Styles (SSIM / FSIM)
Standard Autoencoder | 0.65 / 0.84 | 0.62 / 0.80
U-net Autoencoder | 0.65 / 0.87 | 0.61 / 0.78
Top 2-layer skip conn. | 0.66 / 0.86 | 0.63 / 0.82
IFRP: 2-layer skip conn. + Res. blocks | 0.72 / 0.88 | 0.68 / 0.84
Table 6: Quantitative comparisons of the impact of various IFRP network components.
Rotation Angles (degrees) | SSIM (Without STNs) | SSIM (With STN)
-30, -20, -15, -10, -5, 0, 5, 10, 15, 20, 30 | 0.64 | 0.66
-30, -15, 0, 15, 30 | 0.64 | 0.65
Table 7: SSIM as a function of the number of in-plane rotation-based augmentations of SF images used during training.
Figure 10: Examples of images recovered from portraits of different ethnicity and age groups. First row: RF ground truth faces. Second row: unaligned input portraits. Third row: our recovery results. (a-d) Faces of very old or young subjects. (e-f) Faces of dark skinned subjects. (g-i) Faces of Asian subjects.

5.6 Robustness of IFRP w.r.t. Ethnicity and Age

We note that the numbers of images of children, old people and young adults in the CelebA dataset are unbalanced; e.g., there are more images of young adults than of children and old people. Moreover, the number of images of people with a white complexion is larger than that of people with dark skin tones. The number of Asian faces is also limited in the CelebA dataset. Unfortunately, these factors make our synthesized dataset unbalanced. However, due to the identity-preserving loss we use, our network copes with faces of different ethnicities, skin tones and ages reasonably well. Figure 10 shows the visual results obtained by our network for faces of various ethnicity and age. Our results are consistent with the ground truth face images. However, some age-related facial features, such as children's missing teeth in Figure 10 (bottom), are especially hard to recover faithfully, as CelebA does not feature celebrities with missing teeth.

5.7 Robustness of IFRP w.r.t. Misalignments

Below we conduct qualitative and quantitative experiments to show the robustness of our network to misalignments. Figure 11 shows the visual results of our network on faces rotated within the range of [-45, 45] degrees. Thanks to the STN layers, our network is able to recover photorealistic faces even from portraits rotated by -45 or +45 degrees. Figure 12 shows the PSNR of our network as a function of the rotation angle. Moreover, our network is also robust to scaling of portraits. Figure 13 shows faces successfully recovered from portraits captured at different scales. Figure 14 shows the PSNR of our network as a function of the scale factor.

Moreover, Table 7 shows SSIM scores for single-style training with only in-plane rotations of SF images used during training. The table shows that using STN layers improves the results. However, using STNs remains an optional design choice.

(a) -45
(b) -30
(c) -15
(d) 0
(e) +15
(f) +30
(g) +45
Figure 11: The effect of STN layers on recovery from unaligned rotated portraits ([-45; 45] degrees range). First row: the ground truth face image. Second row: unaligned rotated portraits using Candy style. Last row: our aligned results.
Figure 12: Performance of our IFRP w.r.t. the rotation angle of faces.
(a) 0.7x
(b) 1x
(c) 1.3x
Figure 13: The effect of STN layers on portrait scaling. First row: the ground truth face image. Second row: unaligned stylized faces using Mosaic style. Last row: our aligned results.
Figure 14: Performance of our IFRP w.r.t. the scale of faces.
Figure 15: Recovering photorealistic faces from hand-drawn sketches from the FERET dataset. First row: ground truth face images. Second row: sketches. Third row: results of Wang et al.’s method wang2018high (). Bottom row: our results.

5.8 User Experience Study

As human perception is sensitive to the slightest imperfections and artifacts in faces, we conducted a user study to verify whether subjects find our recovered results convincing.

Our evaluation dataset contains faces recovered from 20 stylized portraits by the state-of-the-art methods as well as our IFRP method (see an example in Figure 9). We chose a diverse subset of portraits in terms of race, gender, age, hair style, skin color, make-up, etc. Our study included 25 subjects (graduate students). For each portrait, the ground truth face and seven images (the faces recovered by gatys2016image (); johnson2016perceptual (); li2016precomputed (); isola2016image (); zhu2017unpaired (); Shiri2017FaceD () as well as our method) were shown in random order side-by-side on high-quality color printouts. The subjects were asked to rate the printouts according to the visual quality and the perceived fidelity of identity with respect to the corresponding ground truth images. Figure 16 summarizes the average scores of this study. For all portraits, our results are rated higher than those of the other state-of-the-art methods. The subjects rated higher the printouts that preserve the subjects' identities better and contain no visible artifacts. As this simple user study shows, our results are favored by the users, who find the faces recovered by our algorithm to be the closest to the original images. This study is consistent with our numerical evaluations.

        gatys2016image () johnson2016perceptual () li2016precomputed () isola2016image () zhu2017unpaired () Shiri2017FaceD () Ours
Figure 16: User study. Results comparing our IFRP and other state-of-the-art methods. Vertical axis is the percentages of favorable user votes.

5.9 Destylizing Authentic Paintings and Sketches

Below we demonstrate that our method is not restricted to the recovery of faces from computer-generated stylized portraits; it can also handle real paintings, sketches and unknown styles. To verify this assertion, we choose a few paintings from art galleries such as the Archibald Prize archibald (). Next, we crop face regions from the scanned images and use them as our test images. Figure 18 shows that our method can efficiently recover photorealistic face images. This indicates that our method is not limited to synthesized data and does not require a prior alignment procedure.

We also conduct an experiment on hand-drawn sketches from the FERET dataset phillips1998feret (). We compare our results with one of the most recent sketch-to-face methods. The method of wang2018high () works with sketches only (cf. complex stylized faces) and requires landmarks to perform face alignment (cf. our method, which does not need any face alignment due to its STN layers). Note that wang2018high () uses CycleGAN with multi-patch-based discriminators to generate sketches/photos. Figure 15 compares our method with wang2018high (), which is not fully supervised and tends to produce artifacts. In contrast, our method efficiently recovers photorealistic face images from sketches and produces fewer artifacts due to the identity-preserving loss.

(a) Seen
(b) Unseen
(c) Seen
(d) Unseen
(e) Unseen
Figure 17: Limitations. Top row: ground truth face images. Middle row: unaligned stylized face images. Bottom row: our results.
Figure 18: Recovery results for the authentic unaligned paintings. Top row: the original portraits from art galleries. Bottom row: our results.
Figure 19: Results of our IFRP approach on the Church Outdoor dataset yu15lsun (). Top row: ground truth images. Middle row: stylized images. Bottom row: our results.

5.10 Image Recovery from Generic Artworks

Below we conduct an experiment on the Church Outdoor dataset to show that our network can recover photorealistic images from generic artworks (cf. portraits). Figure 19 shows the ground truth images, the stylized images and the images recovered by our IFRP network, respectively. We note that the diversity of outdoor images is much larger than that of faces. Therefore, as expected, reliable training and recovery of generic scenes require large datasets. Nonetheless, our recovered results are visually convincing.

5.11 Limitations on Unseen Styles

We have noted that our network is able to recover peripheral non-facial details only for styles seen during training. Figure 17 shows that the background color and texture for the seen styles (Mosaic and Scream) are fully recovered, as the background information is encoded in the stylized images. For styles unseen during training (Composition VII and Sketch), our network hallucinates backgrounds inconsistent with the ground truth backgrounds. As expected in this sanity check, the recovered background colors and textures for unseen stylized portraits do not match the ground truth.

6 Conclusions

We have introduced a novel neural network for face recovery from stylized portraits. Our method extracts features from a given unaligned stylized portrait and then recovers a photorealistic face image from these features. The SRN successfully learns a mapping from unaligned stylized faces to aligned photorealistic faces. Our identity-preserving loss further encourages our network to generate faces with trustworthy identities. This makes our algorithm readily applicable to face hallucination, recovery and recognition. We have shown that our approach can recover faces from portraits of unseen styles, real paintings and sketches. Lastly, our approach can also recover some generic scenes and objects. In the future, we intend to embed semantic information into our network to generate face images that are more consistent in terms of semantic details.

Acknowledgements.
This work is supported by the Australian Research Council (ARC) grant DP150104645.

Appendix A Face Alignment: Spatial Transfer Networks (STN).

As described in Section 3.1, we incorporate multiple STNs jaderberg2015spatial () as intermediate layers to compensate for misalignments and in-plane rotations. The STN layers estimate the motion parameters of face images and warp them to a canonical view. Each STN contains a localization module, a grid generator module and a sampler module. The localization module consists of several hidden layers that estimate the transformation parameters with respect to the canonical view. The grid generator module creates a sampling grid according to the estimated parameters. Finally, the sampler module maps the input feature maps onto the generated grids using bilinear interpolation. The architecture of our STN layers is detailed in Tables 8, 9, 10 and 11.
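A minimal numpy sketch of the grid generator and sampler modules (the localization network, which regresses the 2×3 affine parameters theta, is omitted); coordinates are normalized to [-1, 1], as is conventional for STNs:

```python
import numpy as np

def affine_grid(theta, h, w):
    """Grid generator: map a regular h x w output grid through the 2x3
    affine parameters theta, in normalized [-1, 1] coordinates."""
    ys = np.linspace(-1, 1, h)
    xs = np.linspace(-1, 1, w)
    gx, gy = np.meshgrid(xs, ys)
    pts = np.stack([gx, gy, np.ones_like(gx)], axis=-1)  # (h, w, 3) homogeneous
    return pts @ theta.T                                  # (h, w, 2) sample coords

def bilinear_sample(img, grid):
    """Sampler: bilinearly interpolate img (H x W) at normalized grid coords."""
    h, w = img.shape
    x = (grid[..., 0] + 1) * (w - 1) / 2
    y = (grid[..., 1] + 1) * (h - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    dx, dy = x - x0, y - y0
    return (img[y0, x0] * (1 - dx) * (1 - dy)
            + img[y0, x0 + 1] * dx * (1 - dy)
            + img[y0 + 1, x0] * (1 - dx) * dy
            + img[y0 + 1, x0 + 1] * dx * dy)

# The identity parameters leave the feature map unchanged.
theta_id = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
```

Both operations are differentiable in x, y and theta, which is what lets the localization module be trained end-to-end with the rest of the SRN.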

Appendix B Contributions of each Component in the IFRP Network.

In Section 3, we described the impact of the pixel-wise loss, the adversarial loss and the identity-preserving loss on face recovery from portraits. Figure 20 further shows the contribution of each loss function to the final results.

Appendix C Visual Comparison with the State of the Art.

Below, we provide several additional results demonstrating the performance of our IFRP network compared to the state-of-the-art approaches (Figure 21).

STN1
Input: 64 x 64 x 32
3 x 3 x 64 conv, relu, Max-pooling(2,2)
3 x 3 x 128 conv, relu, Max-pooling(2,2)
3 x 3 x 256 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu
fully connected (80,20), relu
fully connected (20,4)
Table 8: The STN1 architecture.
STN2
Input: 32 x 32 x 64
3 x 3 x 128 conv, relu, Max-pooling(2,2)
3 x 3 x 256 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu
fully connected (80,20), relu
fully connected (20,4)
Table 9: The STN2 architecture.
STN3
Input: 16 x 16 x 128
3 x 3 x 256 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu
fully connected (80,20), relu
fully connected (20,4)
Table 10: The STN3 architecture.
STN4
Input: 32 x 32 x 64
3 x 3 x 64 conv, relu, Max-pooling(2,2)
3 x 3 x 128 conv, relu, Max-pooling(2,2)
3 x 3 x 256 conv, relu, Max-pooling(2,2)
3 x 3 x 20 conv, relu
fully connected (80,20), relu
fully connected (20,4)
Table 11: The STN4 architecture.
Figure 20: More results showing the contribution of each loss function in our IFRP network. (a) Ground truth face images. (b) Input unaligned portraits from test dataset. (c) Recovered face images; the pixel-wise loss was used in training (no DN or identity-preserving losses). (d) Recovered face images; the pixel-wise loss and discriminative loss were used (no identity-preserving loss). (e) Our final results with the pixel-wise loss, discriminative loss and identity-preserving loss used during training.
(a) RF (b) SF (c) gatys2016image () (d) johnson2016perceptual () (e) li2016precomputed () (f) isola2016image () (g) zhu2017unpaired () (h) Shiri2017FaceD () (i) Ours
Figure 21: Additional qualitative comparisons of the state-of-the-art methods. (a) Ground truth face images. (b) Input portraits (from the test dataset) including the seen styles Scream, Mosaic and Candy as well as the unseen styles Sketch, Composition VII, Feathers, Udnie and La Muse. (c) Gatys et al.’s method gatys2016image (). (d) Johnson et al.’s method johnson2016perceptual (). (e) Li and Wand’s method li2016precomputed () (MGAN). (f) Isola et al.’s method isola2016image () (pix2pix). (g) Zhu et al.’s method zhu2017unpaired () (CycleGAN). (h) Shiri et al.’s method Shiri2017FaceD () (i) Our method.

References

  • (1) Archibald prize; Art Gallery of NSW. https://www.artgallery.nsw.gov.au/prizes/archibald/ (2017)
  • (2) Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: Stylebank: An explicit representation for neural image style transfer. arXiv preprint arXiv:1703.09210 (2017)
  • (3) Chen, T.Q., Schmidt, M.: Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337 (2016)
  • (4) Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a laplacian pyramid of adversarial networks. In: NIPS (2015)
  • (5) Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. arXiv preprint arXiv:1610.07629 (2016)
  • (6) Gatys, L.A., Bethge, M., Hertzmann, A., Shechtman, E.: Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897 (2016)
  • (7) Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: CVPR (2016)
  • (8) Gatys, L.A., Ecker, A.S., Bethge, M., Hertzmann, A., Shechtman, E.: Controlling perceptual factors in neural style transfer. arXiv preprint arXiv:1611.07865 (2016)
  • (9) Goodfellow, I., Pouget-Abadie, J., Mirza, M.: Generative Adversarial Networks. In: NIPS (2014)
  • (10) Gupta, A., Johnson, J., Alahi, A., Fei-Fei, L.: Characterizing and improving stability in neural style transfer. arXiv preprint arXiv:1705.02092 (2017)
  • (11) Hinton, G.: Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent
  • (12) Huang, R., Zhang, S., Li, T., He, R.: Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. arXiv preprint arXiv:1704.04086 (2017)
  • (13) Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. arXiv preprint arXiv:1703.06868 (2017)
  • (14) Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004 (2016)
  • (15) Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: NIPS, pp. 2017–2025 (2015)
  • (16) Jayasumana, S., Hartley, R., Salzmann, M., Li, H., Harandi, M.: Kernel methods on the riemannian manifold of symmetric positive definite matrices. In: CVPR (2013)
  • (17) Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV. Springer (2016)
  • (18) Karacan, L., Akata, Z., Erdem, A., Erdem, E.: Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215 (2016)
  • (19) Kazemi, H., Iranmanesh, M., Dabouei, A., Soleymani, S., Nasrabadi, N.M.: Facial attributes guided deep sketch-to-photo synthesis. In: Computer Vision Workshops (WACVW), 2018 IEEE Winter Applications of, pp. 1–8. IEEE (2018)
  • (20) Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  • (21) Koniusz, P., Tas, Y., Zhang, H., Harandi, M., Porikli, F., Zhang, R.: Museum exhibit identification challenge for the supervised domain adaptation and beyond. ECCV pp. 788–804 (2018)
  • (22) Koniusz, P., Yan, F., Gosselin, P., Mikolajczyk, K.: Higher-order occurrence pooling for bags-of-words: Visual concept detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(2), 313–326 (2016)
  • (23) Koniusz, P., Zhang, H., Porikli, F.: A deeper look at power normalizations. IEEE Conference on Computer Vision and Pattern Recognition pp. 5774–5783 (2018)
  • (24) Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802 (2016)
  • (25) Li, C., Wand, M.: Combining markov random fields and convolutional neural networks for image synthesis. In: CVPR (2016)
  • (26) Li, C., Wand, M.: Precomputed real-time texture synthesis with markovian generative adversarial networks. In: ECCV, pp. 702–716. Springer (2016)
  • (27) Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Diversified texture synthesis with feed-forward networks. arXiv preprint arXiv:1703.01664 (2017)
  • (28) Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: ICCV (2015)
  • (29) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
  • (30) Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: ICML (2013)
  • (31) Nejati, H., Sim, T.: A study on recognizing non-artistic face sketches. In: WACV. IEEE (2011)
  • (32) Oord, A.v.d., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 (2016)
  • (33) Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: BMVC (2015)
  • (34) Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: CVPR (2016)
  • (35) Phillips, P.J., Wechsler, H., Huang, J., Rauss, P.J.: The feret database and evaluation procedure for face-recognition algorithms. Image and vision computing 16(5), 295–306 (1998)
  • (36) Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
  • (37) Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. Springer (2015)
  • (38) Sangkloy, P., Lu, J., Fang, C., Yu, F., Hays, J.: Scribbler: Controlling deep image synthesis with sketch and color. arXiv preprint arXiv:1612.00835 (2016)
  • (39) Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: CVPR (2015)
  • (40) Selim, A., Elgharib, M., Doyle, L.: Painting style transfer for head portraits using convolutional neural networks. ACM (TOG) 35(4), 129 (2016)
  • (41) Sharma, A., Jacobs, D.W.: Bypassing synthesis: Pls for face recognition with pose, low-resolution and sketch. In: CVPR. IEEE (2011)
  • (42) Shiri, F., Yu, X., Koniusz, P., Porikli, F.: Face destylization. In: International Conference on Digital Image Computing: Techniques and Applications (DICTA) (2017). DOI 10.1109/DICTA.2017.8227432
  • (43) Shiri, F., Yu, X., Porikli, F., Hartley, R., Koniusz, P.: Identity-preserving face recovery from portraits. In: WACV, pp. 102–111 (2018). DOI 10.1109/WACV.2018.00018
  • (44) Shiri, F., Yu, X., Porikli, F., Hartley, R., Koniusz, P.: Recovering faces from portraits with auxiliary facial attributes. In: WACV (2019)
  • (45) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • (46) Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.S.: Texture networks: Feed-forward synthesis of textures and stylized images. In: ICML (2016)
  • (47) Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)
  • (48) Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. arXiv preprint arXiv:1701.02096 (2017)
  • (49) Wang, L., Sindagi, V., Patel, V.: High-quality facial photo-sketch synthesis using multi-adversarial networks. In: Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on, pp. 83–90. IEEE (2018)
  • (50) Wang, N., Zha, W., Li, J., Gao, X.: Back projection: an effective postprocessing method for gan-based face sketch synthesis. Pattern Recognition Letters 107, 59–65 (2018)
  • (51) Wang, X., Oxholm, G., Zhang, D., Wang, Y.F.: Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. arXiv preprint arXiv:1612.01895 (2016)
  • (52) Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)
  • (53) Wilmot, P., Risser, E., Barnes, C.: Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893 (2017)
  • (54) Yin, R.: Content aware neural style transfer. arXiv preprint arXiv:1601.04568 (2016)
  • (55) Yu, F., Zhang, Y., Song, S., Seff, A., Xiao, J.: Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
  • (56) Yu, X., Porikli, F.: Ultra-resolving face images by discriminative generative networks. In: ECCV (2016)
  • (57) Yu, X., Porikli, F.: Face hallucination with tiny unaligned images by transformative discriminative neural networks. In: AAAI (2017)
  • (58) Yu, X., Porikli, F.: Hallucinating very low-resolution unaligned and noisy face images by transformative discriminative autoencoders. In: CVPR (2017)
  • (59) Zhang, H., Dana, K.: Multi-style generative network for real-time transfer. arXiv preprint arXiv:1703.06953 (2017)
  • (60) Zhang, H., Sindagi, V., Patel, V.M.: Image de-raining using a conditional generative adversarial network. arXiv preprint arXiv:1701.05957 (2017)
  • (61) Zhang, L., Zhang, L., Mou, X., Zhang, D., et al.: Fsim: a feature similarity index for image quality assessment. IEEE transactions on Image Processing 20(8), 2378–2386 (2011)
  • (62) Zhang, W., Wang, X., Tang, X.: Coupled information-theoretic encoding for face photo-sketch recognition. In: CVPR. IEEE (2011)
  • (63) Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: ECCV. Springer (2014)
  • (64) Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593 (2017)
  • (65) Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593 (2017)