UGAN: Untraceable GAN for Multi-Domain Face Translation
Multi-domain image-to-image translation is a challenging task in which the goal is to translate an image into multiple different domains. The translated images should exhibit the target-only characteristics, while the source-only characteristics should be erased. However, recent methods often retain characteristics of the source domain that are incompatible with the target domain. To address this issue, we propose a method called Untraceable GAN, which has a novel source classifier that identifies which domain an image is translated from and thereby determines whether the translated image still retains the characteristics of the source domain. Furthermore, we take the prototype of the target domain as guidance for the translator to effectively synthesize the target-only characteristics. The translator is trained to synthesize the target-only characteristics and to make the source domain untraceable for the discriminator, so that the source-only characteristics are erased. Finally, extensive experiments on three face editing tasks, including face aging, makeup, and expression editing, show that the proposed UGAN produces superior results over state-of-the-art models. The source code will be released.
Multi-domain image-to-image translation refers to image translation among multiple domains, where each domain is characterized by different attributes. For example, the face aging task, with age groups as domains, aims to translate a given face into other age groups using a single translator. As shown in Figure 1 Row 1, the input face image is translated into different age groups.
Although prior works [2, 33, 10] have made significant progress, the translated results still retain characteristics of the source domain that are incompatible with the target domain, which we call the phenomenon of source retaining. As illustrated in Figure 1 Row 1, when StarGAN translates a female face from the age group 31–40 to 0–10, the translated image still looks like an adult. In the makeup editing shown in Figure 1 Row 2, StarGAN fails to eliminate the eye shadows in makeup removal. For expression editing, as shown in Figure 1 Row 3, the results of StarGAN show visible teeth shadows around the mouth region.
The reason for the phenomenon of source retaining is that explicit and effective mechanisms to erase the characteristics of the source domain have not been explored in prior works. Most of them simply apply a domain classifier, which is only trained to recognize the domain class of real data, to guide the image translation. However, the domain classifier is not sensitive to unqualified synthesized images that contain incompatible characteristics. As shown in Figure 2 Row 1, the discriminator correctly judges an adult face to be within the 31–40 age group. When the adult face is translated into a child face (0–10), the translated face heavily retains adult characteristics, e.g., beard and expression wrinkles (Figure 2 Row 2). However, the discriminator still judges it to be within the 0–10 age group with high confidence. That is, a synthesized image containing incompatible characteristics receives almost no punishment from the domain classifier, which results in the phenomenon of source retaining.
To tackle the problem of source retaining, we propose a new method, Untraceable GAN (UGAN), which introduces an untraceable constraint and prototype injection. The untraceable constraint encourages the translator to erase all the source-only characteristics and synthesize certain target-only ones. As shown in Figure 2, when an image is translated from 31–40 years old (source domain) to 0–10 years old (target domain), the beard and wrinkles (source-only characteristics) need to be erased, while smooth skin and a round face (target-only characteristics) should be synthesized. To endow the proposed UGAN with these capabilities, a discriminator is trained to track which domain the synthesized image is translated from, while the translator is trained to make the source domain of the synthesized image untraceable for the discriminator. Furthermore, to effectively synthesize the target-only characteristics, we take the prototype of the target domain as guidance for the translator. The prototype is a statistic of the target domain that provides its essential characteristics, like the round face of the 0–10 years old domain.
Our contributions include:
To the best of our knowledge, this is the first work to present the phenomenon of source retaining in multi-domain image-to-image translation, and to propose a novel UGAN that explicitly erases the characteristics of the source domain to improve the image translation.
A novel source classifier is introduced to identify which domain an image is translated from and thereby determine whether the translated image still retains the characteristics of the source domain.
The proposed UGAN is the first work to inject the target prototype into the translator for synthesizing the target domain characteristics.
Extensive qualitative and quantitative experiments on three face editing tasks demonstrate the superiority of our proposed UGAN.
2 Related Work
In this section, we give a brief review of three aspects related to our work: Generative Adversarial Networks, Conditional GANs and Image-to-Image Translation.
Generative Adversarial Networks (GANs) are popular generative models that employ adversarial learning between a generator and a discriminator to synthesize realistic data. They have achieved remarkable success in many computer vision tasks, such as image-to-image translation, domain adaptation and super-resolution. In this work, the proposed UGAN adopts the adversarial learning of [1, 8], which approximately minimizes the Wasserstein distance between the synthesized distribution and the real distribution.
Conditional GANs are variants of GANs that aim to controllably synthesize examples under a given condition. Many prior works focus on generating samples under different forms of conditions, such as a category label in the form of a one-hot code or learnable parameters, and text with word embeddings. Different from these works, to synthesize the required characteristics, we introduce the prototype of the condition to provide prior information, where the prototype is one of the statistics of the target domain.
Image-to-Image Translation is first defined in pix2pix, which has been improved in various aspects, such as skip connections for maintaining useful original information [30, 18, 24], cascade training from coarse to fine [27, 3], extra relevant data, a buffer of history fake images, multiple discriminators, 3D technology, and variational sampling [35, 6]. If a translator only models directed translation between two domains, $K(K-1)$ translators are required among $K$ domains. A single conditional translator for multi-domain translation is therefore highly desirable, and we focus on multi-domain translation with such a single translator. The current multi-domain image translation methods [2, 33, 10] use the vanilla one-hot condition for the translator, without considering the information contained in each domain. We are the first to adopt the statistics of each domain as a condition of the translator to efficiently inject the essential characteristics. Furthermore, the prior methods apply a domain classifier for condition constraints. However, limited by this classifier, they often suffer from the phenomenon of source retaining. Thus, we change the role of this auxiliary classifier in UGAN and make it classify which source domain the given datum is translated from, instead of classifying which domain the given datum is sampled from.
3 Our Approach
The framework of UGAN is shown in Figure 3. The input image $x_s$ and the target condition $c_t$ are fed into the translator $G$. The discriminator $D$ has two heads: one head, the authenticity classifier, distinguishes whether the input sample is real or fake; the other, the source classifier, determines which domain the sample is translated from, where real data are regarded as translated from their own domain. To erase the source-only characteristics and synthesize the target-only characteristics, the translator $G$ is trained to fool the source classifier of $D$ into believing that the synthesized image is translated from the target domain. Moreover, to effectively synthesize the target characteristics, we introduce the “prototype” of the target domain and inject it into the translated image.
For convenience, we first introduce the mathematical notation. The discriminator $D$ contains two heads, the authenticity classifier $A$ and the source classifier $S$, which share the same feature extraction module $F$. $A(F(\cdot))$ and $S(F(\cdot))$ are abbreviated as $D_a(\cdot)$ and $D_s(\cdot)$, respectively. $(x_s, c_s)$ is a sample pair from the source domain, where $x_s$ represents the image and $c_s$ is its label. By feeding the image $x_s$ and the target label $c_t$ into the translator $G$, it produces $\hat{x}_t = G(x_s, c_t)$. We use $P(x, c)$ to denote the joint distribution of image $x$ and domain label $c$. $P(x)$ and $P(c)$ are the marginal distributions of images and labels, respectively.
3.1 Untraceable Constraint
To tackle the problem of source retaining, the source classifier $D_s$ is trained to classify which domain an image is translated from. For a real image-label pair $(x_s, c_s)$, we regard $x_s$ as translated from domain $c_s$ to domain $c_s$. Since $D_s$ aims to classify where an image is translated from, the real datum $x_s$ should be classified into $c_s$, meaning that $x_s$ is translated from domain $c_s$. The synthesized image $\hat{x}_t = G(x_s, c_t)$ should also be classified into $c_s$, meaning that $\hat{x}_t$ is translated from $c_s$. The translator $G$ is trained to fool $D_s$ into classifying $\hat{x}_t$ into $c_t$. In this way, $G$ is trained to make the source domain of $\hat{x}_t$ untraceable and to inject the target domain characteristics into $\hat{x}_t$.
The adversarial training is formulated as Eq. (1) for the discriminator and Eq. (2) for the translator, where $\lambda$ is the penalty coefficient of source retaining. To save space, $\mathbb{E}_{(x_s, c_s) \sim P(x, c)}$ and $\mathbb{E}_{c_t \sim P(c)}$ are abbreviated as $\mathbb{E}_{x_s, c_s}$ and $\mathbb{E}_{c_t}$, respectively.
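The two objectives described above can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the exact form used in the experiments: it assumes a cross-entropy loss on the source-classifier head (written $D_s$ here), writes the translated image as $G(x_s, c_t)$, and places the coefficient $\lambda$ on the synthesized-image term of the discriminator objective.

```latex
\begin{align}
\min_{D}\; \mathcal{L}_{D_s} &=
  \mathbb{E}_{x_s, c_s}\big[-\log D_s(c_s \mid x_s)\big]
  + \lambda\, \mathbb{E}_{x_s, c_s, c_t}\big[-\log D_s\big(c_s \mid G(x_s, c_t)\big)\big], \\
\min_{G}\; \mathcal{L}_{G_s} &=
  \mathbb{E}_{x_s, c_s, c_t}\big[-\log D_s\big(c_t \mid G(x_s, c_t)\big)\big].
\end{align}
```

The first line asks the source classifier to assign both the real image and its translation to the source domain $c_s$; the second asks the translator to make the same translation look as if it came from $c_t$.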
Note that the translated image $\hat{x}_t$ should be injected with certain target-only characteristics. Recall that in Eq. (2), $G$ is trained to fool $D_s$ into classifying $\hat{x}_t$ into $c_t$. However, the class $c_t$ here is not pure: it mixes the characteristics of real data with those of synthesized data. Referring to Eq. (1), the source classifier treats a real sample drawn from $c_s$ and a fake sample translated from $c_s$ as the same class $c_s$. To accurately synthesize the characteristics of the target domain, the number of categories of $D_s$ is augmented to $2K$, where $K$ is the number of domains. The first $K$ categories are for real data, which are regarded as sampled (translated) from the corresponding domain. The latter $K$ categories are for fake data translated from the corresponding domain: category $c_s + K$ means that the input datum is fake and translated from $c_s$. In addition, the translator is trained to fool $D_s$ into classifying $\hat{x}_t$ into category $c_t$. The untraceable constraint is imposed by optimizing Eq. (3) and Eq. (4).
In this process, $D_s$ is trained to identify whether its input is a fake image and which source domain it comes from, while $G$ is trained to approximate the true untraceable translator.
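The label assignment of the augmented source classifier can be illustrated with a small sketch. It assumes a cross-entropy loss over $2K$ logits, 0-indexed domain labels, and a weight `lam` on the fake term; the function names and the weight placement are hypothetical choices made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def discriminator_targets(c_s, num_domains):
    """Real image from domain c_s -> class c_s.
    Fake image translated from c_s -> class c_s + K."""
    return c_s, c_s + num_domains


def untraceable_losses(logits_real, logits_fake, c_s, c_t, num_domains, lam=1.0):
    """Return (discriminator loss, translator loss) for one sample.

    logits_real / logits_fake: 2K source-classifier logits for the real
    image and for its translation (assumed inputs).
    """
    real_cls, fake_cls = discriminator_targets(c_s, num_domains)
    d_loss = cross_entropy(logits_real, real_cls) + lam * cross_entropy(logits_fake, fake_cls)
    # The translator wants the fake to be taken for a real image of domain c_t.
    g_loss = cross_entropy(logits_fake, c_t)
    return d_loss, g_loss
```

With $K = 3$ domains, a fake translated from domain 2 is labelled 5 for the discriminator, while the translator targets the real class of the target domain. The two objectives pull the same logits toward different classes, which is what makes the source domain untraceable once the translator succeeds.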
3.2 Prototype Injection
The statistics of the target domain can provide guidance for image translation. Referring to Figure 3 (b), the average image of each age group shows its essential characteristics, like the round face and flat nose of age group 1 (0–10). Thus, we leverage the statistics of the target domain to further inject the essential characteristics of the target domain into the translated image. We call the statistics that contain these essential characteristics “prototypes”, following the classic aging method. However, the posture of the source image and that of the target prototype may be misaligned, so simply concatenating or summing the image feature and the prototype feature would hurt the performance. To naturally inject these essential characteristics, we design an adaptive prototype injection (API) module inspired by the non-local operation [28, 26].
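As an illustration of the average image mentioned above, the following sketch computes one prototype per domain as the pixel-wise mean of that domain's images. It assumes aligned images that have been flattened into equal-length lists of floats; the function name is hypothetical, and the prototypes used in the experiments may be computed with additional steps.

```python
from collections import defaultdict


def domain_prototypes(images, labels):
    """Pixel-wise mean image for each domain label.

    images: list of equal-length lists of floats (flattened, aligned images)
    labels: list of domain labels, one per image
    """
    sums = {}
    counts = defaultdict(int)
    for img, lab in zip(images, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(img)
        acc = sums[lab]
        for i, value in enumerate(img):
            acc[i] += value
        counts[lab] += 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}
```

For example, `domain_prototypes([[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]], [0, 0, 1])` averages the first two images into the prototype of domain 0 and keeps the third as the prototype of domain 1.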
Referring to Figure 3 (c), the API module computes the correlation between every position of the source feature map and every position of the prototype feature map, and uses it to inject the prototype features into the source features. Here $x$ and $p$ denote the feature maps of the source image and the target prototype, respectively, $i$ is the index of a feature map position, and a linear mapping is used to reduce the dimension. Since the correlation matrix is computationally expensive, we apply the API module on low-resolution feature maps. To simultaneously maintain resolution and inject the prototype, the translator is designed with two parallel networks, with parameter sharing at both ends (gray color, Figure 3). To maintain the resolution, one network is a common architecture in image translation. The other one applies the API module on the low-resolution feature maps. Finally, the outputs of these two networks are fused by an element-wise sum to generate the translated image.
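The injection step can be sketched in the spirit of the non-local operation that inspires the module: each position of the source feature map attends over all positions of the prototype feature map, and the attended prototype feature is added back. The sketch below makes several assumptions that are not taken from the text: dot-product similarity with softmax normalization, identity mappings in place of the learned linear mappings, and a residual sum. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prototype_injection(x, p):
    """Inject prototype features into source features.

    x: list of N feature vectors, one per source feature-map position
    p: list of M feature vectors, one per prototype feature-map position
    Returns N vectors: x_i + sum_j w_ij * p_j, where w_i = softmax_j(x_i . p_j).
    """
    out = []
    for x_i in x:
        # One row of the N x M correlation matrix, normalized over prototype positions.
        weights = softmax([dot(x_i, p_j) for p_j in p])
        attended = [
            sum(w * p_j[d] for w, p_j in zip(weights, p))
            for d in range(len(x_i))
        ]
        out.append([a + b for a, b in zip(x_i, attended)])
    return out
```

Because every source position selects its own mixture of prototype positions, the source image and the prototype do not need to be spatially aligned. The N x M correlation matrix also shows why the cost grows quickly with resolution.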
3.3 Objective Function
Authenticity constraint: The adversarial loss of WGAN-GP is adopted to constrain the synthetic joint distribution to approximate the real distribution.
The third term in Eq. (6) is a gradient penalty term that constrains the discriminator to be a 1-Lipschitz function.
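For reference, the standard WGAN-GP discriminator objective has three terms, and the gradient penalty is the third. The form below is the generic one, assuming the authenticity head is written $D_a$, the translated image $\hat{x}$, and a penalty weight $\lambda_{gp}$; any conditioning on domain labels used in the experiments is omitted here.

```latex
\begin{equation}
\mathcal{L}_{adv} =
  \mathbb{E}_{\hat{x}}\big[D_a(\hat{x})\big]
  - \mathbb{E}_{x}\big[D_a(x)\big]
  + \lambda_{gp}\, \mathbb{E}_{\tilde{x}}\Big[\big(\lVert \nabla_{\tilde{x}} D_a(\tilde{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\tilde{x} = \epsilon x + (1 - \epsilon)\hat{x},\;\; \epsilon \sim U[0, 1].
\end{equation}
```

The discriminator minimizes this quantity, while the translator minimizes $-\mathbb{E}_{\hat{x}}[D_a(\hat{x})]$. The interpolated sample $\tilde{x}$ lies on a straight line between a real image and a translated one, which is where the gradient norm is pushed toward 1.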
Cycle Consistency: The input and output are regularized to satisfy the correspondence, i.e., translating the output back to the source domain should recover the input.
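A common way to write this constraint is sketched below, assuming an $\ell_1$ reconstruction penalty as in CycleGAN-style training; the choice of norm is an assumption.

```latex
\begin{equation}
\mathcal{L}_{cyc} =
  \mathbb{E}_{x_s, c_s, c_t}\Big[\big\lVert x_s - G\big(G(x_s, c_t),\, c_s\big) \big\rVert_1\Big].
\end{equation}
```

The inner call translates $x_s$ to the target domain $c_t$, and the outer call translates the result back to the source domain $c_s$.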
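Overall loss function: The discriminator $D$ and the translator $G$ are trained by jointly optimizing the authenticity constraint, the untraceable constraint and the cycle consistency term.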
The face aging dataset is the one collected for C-GAN. Ages are divided into seven age groups, such as 0–10 and 31–40. A randomly selected portion of the dataset is used as the test set, and the rest is the training set.
MAKEUP-A5 is a makeup-labeled dataset containing aligned Asian woman faces with five makeup categories: retro, Korean, Japanese, non-makeup and smoky. The dataset is split into a training set and a test set.
CFEE is an expression dataset with 22 categories of facial expressions: (A) neutral, (B) happy, (C) sad, (D) fearful, (E) angry, (F) surprised, (G) disgusted, (H) happily surprised, (I) happily disgusted, (J) sadly fearful, (K) sadly angry, (L) sadly surprised, (M) sadly disgusted, (N) fearfully angry, (O) fearfully surprised, (P) fearfully disgusted, (Q) angrily surprised, (R) angrily disgusted, (S) disgustedly surprised, (T) appalled, (U) hatred and (V) awed. We randomly select a subset of identities as the test set and use the other images for training. All images are aligned and resized to the same resolution.
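Intra FIDs [11, 22, 4] on each domain and their mean are used for evaluation. FID is a common quantitative measure for generative models, which measures the 2-Wasserstein distance between two distributions $P_r$ and $P_g$ on the features extracted from the InceptionV3 model. It is defined as
$$\mathrm{FID}(P_r, P_g) = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),$$
where $P_r$ and $P_g$ are the feature distributions of real data and synthesized data, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the means and covariances of $P_r$ and $P_g$. The mean intra FID is calculated by
$$\mathrm{mFID} = \frac{1}{K} \sum_{c=1}^{K} \mathrm{FID}\big(P_r(\cdot \mid c),\, P_g(\cdot \mid c)\big),$$
where $c$ is the domain label over the total $K$ domains.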
User studies by Amazon Mechanical Turk (AMT): Given an input image, target domain images translated by different methods are displayed to the Turkers who are asked to choose the best one.
Cosine similarity: For the face aging task, cosine similarity between the features of real images and the corresponding translated images is used to measure the degree of source retaining. Features are extracted by a ResNet-18 model trained on the same training set.
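The source-retaining measure can be sketched as the average cosine similarity over paired feature vectors. The sketch assumes the features have already been extracted and are given as plain lists of floats; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_source_similarity(source_feats, translated_feats):
    """Average cosine similarity over (source, translated) feature pairs."""
    sims = [cosine_similarity(s, t) for s, t in zip(source_feats, translated_feats)]
    return sum(sims) / len(sims)
```

A lower average indicates that the translated images share fewer features with their sources, i.e., that the source characteristics are erased more thoroughly.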
4.3 Implementation Details
We perform experiments with three versions of our method: two superscripted variants of UGAN and the final UGAN. The two superscripted variants adopt the same translator as StarGAN (without prototype) and differ in the form of the untraceable constraint they adopt. The final “UGAN” adopts the untraceable constraint together with the proposed translator with an API module. For a fair comparison, our learning rate is kept fixed, while the other hyper-parameters are kept the same as StarGAN. All experiments are optimized by Adam. The discriminator is updated several times per update of the translator. All baselines and our methods are trained for the same number of epochs with the same mini-batch size. All images are randomly flipped horizontally as data augmentation.
Baselines: StarGAN has shown better performance than DIAT, CycleGAN and IcGAN. We therefore select StarGAN as our baseline to verify the superiority of our method. For the face aging task, we additionally compare with two classic GAN-based face aging methods, CAAE and C-GAN (without the transition pattern network).
4.4 Quantitative Experiments
Given the domain label $c$, we traverse all images in the test set to generate fake images. All the synthetic images of each domain are used to calculate the intra FID, while a random sample of the synthetic images of each domain is evaluated by AMT.
Face aging: The comparison of results on the face aging dataset is shown in Table 2. Face aging involves deformations and texture synthesis. For example, deformations, such as changes in face shape and eye size, are the main differences between babies and adults. Texture synthesis, like adding wrinkles, is also essential when translating a middle-aged man to a senior man. In Table 2, both superscripted UGAN variants are significantly better than StarGAN on all age groups, and one variant further improves on the other. The mean intra FID drops considerably from StarGAN to our variant. Furthermore, the final UGAN achieves the best mean intra FID.
Makeup editing: The comparison of results on the MAKEUP-A5 dataset is shown in Table 3. Both texture and color need to be altered in makeup editing. UGAN has the best performance in all categories, and the mean intra FID declines from StarGAN to UGAN.
Expression editing: The comparisons on the CFEE dataset are shown in Table 1. The expression editing task aims to change the emotion of a face by deformation. The CFEE dataset contains 22 kinds of fine-grained expressions, which makes the expression editing problem very challenging. From the results, we can conclude that UGAN again achieves the best performance. Table 1 reports the mean intra FID of StarGAN, the two UGAN variants and the final UGAN; the reduction relative to StarGAN is significant.
AMT user studies: For further evaluation, user studies are conducted on AMT (https://www.mturk.com/) to compare StarGAN and our method. Since the final UGAN outperforms its two variants in mean intra FID, only the final UGAN is compared. With the datasets mentioned above, we synthesize pairs of images for each domain by UGAN and StarGAN. All image pairs are shown to Turkers, who are asked to choose the better one considering image realism and satisfaction of the target characteristics. Tables 4, 5 and 6 show the percentage of cases in which our method beats StarGAN. For example, in Table 5, when changing a face to 0–10 years old, our method wins in more cases than StarGAN. It again shows the advantage of our method when transforming a face into childhood. Generally, our method is better than StarGAN in every category of each dataset.
Tackling the phenomenon of source retaining: The effect of erasing source characteristics on face aging is shown in Table 7. A well-trained ResNet-18 (for age recognition) is adopted to extract features from its second-to-last layer. We calculate the average cosine similarity between the features of all source images and their translated images from the test set. Intuitively, the smaller the similarity, the more thoroughly the source characteristics are erased. Since the images of adjacent age groups are similar, we only consider translation across a large age gap, e.g., across three age groups. In Table 7, we perform the experiments on multiple age group gaps, and the similarities of UGAN are smaller on all age group gaps.
4.5 Qualitative Experiments
Face aging: Results on the face aging dataset are shown in Figure 4. In the first example, the input image is a woman. Comparing the results for 0–10 years old (second column), our result has obvious childish characteristics, e.g., a round face, big eyes, and a small nose, while the result of StarGAN does not look like a child. Another example is the oldest age group (last column). Our result has white hair and wrinkles, while StarGAN produces a middle-aged face. These results show that UGAN can explicitly erase the characteristics of the source image by means of the source classifier in the discriminator.
Makeup editing: Two exemplary results on MAKEUP-A5 dataset are displayed in Figure 5. For the first woman, by comparing the results of the second (retro) and last (smoky) columns, we find that blusher and eye shadows of UGAN are more natural, while StarGAN draws asymmetrical blusher and strange eye shadows. The result of UGAN is relatively natural when translating it to a non-makeup face. Therefore, we conclude that UGAN has learned the precise color and texture characteristics of different makeups.
Expression editing: Results on the CFEE dataset are demonstrated in Figure 6. We have the following observations. First, UGAN can edit 22 kinds of fine-grained facial expressions well. Also, UGAN captures the subtle differences between basic and compound expressions. For example, “Happily surprised” has bigger eyes and raised eyebrows compared to “Happy”. Besides, the results of StarGAN under various expressions still retain the original expressions. For example, when changing the man from “Hatred” to “Happy”, the result of StarGAN still has tight brows. Comparatively, UGAN can effectively synthesize the “Happy” expression by generating a grin and relaxed brows and erasing the tight brows.
The phenomenon of source retaining often occurs in the image-to-image translation task. To address it, we have proposed the Untraceable GAN (UGAN) model, in which the discriminator estimates the source domain. The translator is trained to fool the discriminator into believing that the generated data is translated from the target domain. In this way, the source domain of the synthesized image is untraceable. In addition, we have presented the prototype of each domain and injected it into the translated image to generate the target characteristics. Extensive experiments on three tasks have shown significant advantages of our method over the state-of-the-art StarGAN.
The source retaining phenomenon is common in various fields, where the UGAN idea may be widely used to alleviate the issue. For example, language translation often preserves the grammatical structure of the source language, and UGAN may serve as a solution to improve translation quality. Furthermore, the prototype injection idea can also be introduced to general conditional generation. We plan to study these ideas in depth and apply them to broader applications.
References
[1] (2017) Wasserstein GAN. arXiv:1701.07875.
[2] (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR.
[3] (2018) Sparse, smart contours to represent and edit images. In CVPR.
[4] (1982) The Fréchet distance between multivariate normal distributions. MA.
[5] (2014) Compound facial expressions of emotion. PNAS.
[6] (2018) A variational U-Net for conditional appearance and shape generation. In CVPR.
[7] (2014) Generative adversarial networks. In NIPS.
[8] (2017) Improved training of Wasserstein GANs. In NIPS.
[9] (2016) Deep residual learning for image recognition. In CVPR.
[10] (2019) AttGAN: facial attribute editing by only changing what you want. IEEE Transactions on Image Processing.
[11] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.
[12] (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
[13] (2014) Illumination-aware age progression. In CVPR.
[14] (2018) Unsupervised machine translation using monolingual corpora only. ICLR.
[15] (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
[16] (2016) Deep identity-aware transfer of facial attributes. arXiv:1610.05586.
[17] (2018) BeautyGAN: instance-level facial makeup transfer with deep generative adversarial network. In MM.
[18] (2017) Face aging with contextual generative adversarial nets. In MM.
[19] (2018) Cross-domain human parsing via adversarial feature and label adaptation. arXiv:1801.01260.
[20] (2014) Conditional generative adversarial nets. arXiv:1411.1784.
[21] (2018) Spectral normalization for generative adversarial networks. arXiv:1802.05957.
[22] (2018) cGANs with projection discriminator. arXiv:1802.05637.
[23] (2016) Invertible conditional GANs for image editing. arXiv:1611.06355.
[24] (2018) GANimation: anatomically-aware facial animation from a single image. In ECCV.
[25] (2017) Learning from simulated and unsupervised images through adversarial training. In CVPR.
[26] (2017) Attention is all you need. In NIPS.
[27] (2017) High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv:1711.11585.
[28] (2018) Non-local neural networks. In CVPR.
[29] (2018) 3D-aware scene manipulation via inverse graphics. arXiv:1808.09351.
[30] (2018) Generative adversarial network with spatial attention for face attribute editing. In ECCV.
[31] (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint.
[32] (2017) Age progression/regression by conditional adversarial autoencoder. In CVPR.
[33] (2018) Modular generative adversarial networks. In ECCV.
[34] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:1703.10593.
[35] (2017) Toward multimodal image-to-image translation. In NIPS.
Appendix A Network Architecture
Appendix B Prototype
Appendix C Qualitative Results
Face aging: the results on the face aging dataset are shown in Figures 12 and 13. In Figure 12, images of women are used as input and the synthesized faces of the seven age groups are shown in the second to seventh columns. Observing the second and last columns, our method generates very realistic results. For example, in the sixth row and fourth column of Figure 12, the woman is successfully transformed into a child with baby teeth, big eyes, etc. In another example, the woman is aged to a senior woman with white hair and wrinkles. Similar conclusions can be drawn by taking men as input, as shown in Figure 13. For example, the beard in the translated images becomes increasingly thick.
Makeup editing: exemplar results of StarGAN and UGAN on MAKEUP-A5 are displayed in Figures 14 and 15, respectively. Observing the images of the fifth column, all makeup can be removed to yield a bare face. Observing the other columns, the makeup results of our method correspond to the specified categories. For example, in Figure 14, one translated face belongs to “Retro” with pink blush, lipstick and eye shadow, and another belongs to “Smoky” with black eyeliner and eye shadow.
Expression editing: exemplar results of expression editing on CFEE are demonstrated in Figures 16 and 17. Our method is able to edit 22 kinds of fine-grained facial expressions well. For example, for the image in the second row of Figure 17, when translating it to “happy”, our method successfully synthesizes realistic teeth and accurately conveys the happy expression. Our method can also vividly synthesize other expressions.