Breaking the cycle—Colleagues are all you need
This paper proposes a novel approach to performing image-to-image translation between unpaired domains. Rather than relying on a cycle constraint, our method takes advantage of collaboration between various GANs. This results in a multi-modal method, in which multiple optional and diverse images are produced for a given image. Our model addresses some of the shortcomings of classical GANs: (1) It is able to remove large objects, such as glasses. (2) Since it does not need to support the cycle constraint, no irrelevant traces of the input are left on the generated image. (3) It manages to translate between domains that require large shape modifications. Our results are shown to outperform those generated by state-of-the-art methods for several challenging applications on commonly-used datasets, both qualitatively and quantitatively.
Mapping between different domains is in line with the human ability to find similarities between features in distinctive, yet associated, classes. Therefore, it is not surprising that image-to-image translation has gained a lot of attention in recent years. Many applications have been demonstrated to benefit from it, yielding beautiful results.
In unsupervised settings, where no paired data is available, shared latent space and cycle-consistency assumptions have been utilized [Anoosheh2017ComboGANUS, StarGAN, Chu2017, Hua2017UnsupervisedCI, Huang2018MultimodalTranslation, kim2017, Liu2017, royer2017xgan, YiDualGAN, Zhu2017UnpairedNetworks]. Despite their successes and benefits, previous methods might suffer from some drawbacks.
First, the cycle constraint often causes source-domain features to be preserved, as can be seen, for example, in Figure LABEL:fig:teaser(c), where facial hair remains on the faces of the women. This is due to the need to go back and forth through the cycle. Second, as discussed in [DBLP:journals/corr/abs-1907-10830], these methods are sometimes unsuccessful for image translation tasks with large shape changes, such as in the case of the anime in Figure LABEL:fig:teaser(b). Finally, as explained in [siddiquee2019learning], it is still a challenge to completely remove large objects, like glasses, from the images, and therefore this task is left for their future work (Figure LABEL:fig:teaser(a)).
We propose a novel approach, termed Council-GAN, which handles these challenges. The key idea is to rely on “collegiality” between GANs, rather than utilizing a cycle. Specifically, instead of using a single generator/discriminator pair of “experts”, it utilizes the collective opinion of a group of pairs (the council) and leverages the variation between the results of the generators.
This leads to a more stable and diverse domain transfer.
To realize this idea, we propose to train a council of multiple members, requiring them to learn from each other. Each generator in the council gets the same input from the source domain and produces its own output. However, the outputs produced by the various generators should have some common denominator. For this to happen across all images, the generators have to find common features in the input, which are used to generate their outputs. Each discriminator learns to distinguish between the generated images of its own generator and images produced by the other generators. This forces each generator to converge to a result that is acceptable to the others. Intuitively, this convergence helps maximize the mutual information between the source domain and the target domain, which explains why the generated images maintain the important features of the source images.
We demonstrate the benefits of our approach for several applications, including glasses removal, face to anime translation, and male to female translation. In all cases we achieve state-of-the-art results.
Hence, this paper makes the following contributions:
We introduce a novel model for unsupervised image-to-image translation, whose key idea is collaboration between multiple generators. Unlike most recent methods, our model avoids cycle-consistency constraints altogether.
Our council manages to achieve state-of-the-art results in a variety of challenging applications.
2 Related work
Generative adversarial networks (GANs).
Since the introduction of the GAN framework [Goodfellow2014], it has been demonstrated to achieve eye-pleasing results in numerous applications. In this framework, a generator is trained to fool a discriminator, whereas the latter attempts to distinguish between the generated samples and real samples.
A variety of modifications have been proposed in recent years in an attempt to improve GAN’s results; see [Arjovsky2017WassersteinNetworks, Denton2015DeepNetworks, dosovitskiy2016generating, Huang2016StackedNetworks, Karras2018ProgressiveVariation, Mao2017LeastNetworks, RadfordDCGANUNSUPERVISEDNETWORKS, Rosca2017VariationalNetworks, Salimans2016ImprovedGANs, Tolstikhin2018WassersteinAuto-Encoders, ZhangStackGAN:Networks] for a few of them.
We are not the first to propose the use of multiple GANs [Durugkar2016GenerativeNetworks, Ghosh2018Multi-agentNetworks, QuanHoangTuDinhNguyenTrungLe2018, Juefei-Xu2017GangRanking]. However, previous approaches differ from ours both in their architectures and in their goals. For instance, some previous architectures consist of multiple discriminators and a single generator; conversely, some propose to have a key discriminator that can evaluate the generators’ results and improve them. We propose a novel architecture to realize the concept of a council, as described in Section 3. Furthermore, the goal of other approaches is either to push the GANs apart, to create diverse solutions, or to improve the results. Our council attempts to find the commonalities between the source and target domains. By requiring the council members to “agree” on each other’s results, they in fact learn to focus on the common traits of the domains.
Image-to-image translation.
The aim is to learn a mapping from a source domain to a target domain. Early approaches adopt a supervised framework, in which the model learns from paired examples, for instance using a conditional GAN to model the mapping function [Isola2016Image-to-ImageNetworks, wang2018pix2pixHD, Zhu2017].
Recently, numerous methods have been proposed that use unpaired examples for the learning task and produce highly impressive results; see for example [Berthelot2017BEGAN:Networks, Gatys2016, Huang2018MultimodalTranslation, kim2017, Lee_2018_ECCV, Liu2017, siddiquee2019learning, Zhu2017UnpairedNetworks], out of a truly extensive literature. This approach is vital to applications for which paired data is unavailable or difficult to obtain. Our model belongs to the class of GAN models that do not require paired training data.
A major concern in the unsupervised approach is which properties of the source domain should be preserved. Examples include pixel values [Shrivastava2017LearningTraining], pixel gradients [bousmalis2017unsupervised], pairwise sample distances [benaim2017one], and, most commonly in recent work, cycle consistency [kim2017, YiDualGAN, Zhu2017UnpairedNetworks]. The latter enforces the constraint that translating an image to the target domain and back should recover the original image. Our method avoids using cycles altogether. This has the benefit of bypassing unnecessary constraints on the generated output, and thus avoids preserving hidden information [Chu2017].
Finally, most existing methods lack diversity in the results. To address this problem, some methods propose to produce multiple outputs for the same given image [Huang2018MultimodalTranslation, Lee_2018_ECCV]. Our method also enables image translation with diverse outputs; however, it does so in a manner in which all GANs in the council “acknowledge”, to some degree, each other’s output.
This section describes our proposed model, which addresses the drawbacks described in Section 1. Our model consists of a set, termed a council, whose members influence each other’s results. Each member of the council has one generator and two discriminators, as described below. The generators need not converge to a specific output; instead, each produces its own results, jointly generating a diverse set of results. During training, they take into account the images produced by the other generators. Intuitively, the mutual influence forces the generators to focus on joint traits of the images in the source domain, which could be matched to those in the target domain. For instance, in Figure LABEL:fig:teaser, to transform a male into a female, the generators focus on the structure of the face, on which they can all agree. Therefore, this feature will be preserved, which can explain the good results.
Furthermore, our model avoids cycle constraints. This means that there is no need to go in both directions between the source domain and the target domain. As a result, there is no need to leave traces on the generated image (e.g. glasses) or to limit the amount of change (e.g. anime).
To realize this idea, we define a council of $N$ members as follows (Figure 1). Each member $i$ of the council is a triplet, whose components are a single generator $G_i$ and two discriminators, $D_i$ and $\hat{D}_i$, $1 \le i \le N$. The task of discriminator $D_i$ is to distinguish between the generator’s output and real examples from the target domain, as done in any classical GAN. The goal of discriminator $\hat{D}_i$ is to distinguish between images produced by $G_i$ and images produced by the other generators in the council. This discriminator is the core of the model and is what differentiates our model from the classical GAN model. It forces the generator to converge to images that could be acknowledged by all council members, that is, images that share similar features.
The loss function of the GAN discriminator $D_i$ is the classical adversarial loss of [Mao2017LeastNetworks]. Hereafter, we focus on the loss function of the council discriminator $\hat{D}_i$, which makes the outputs of the various generators share common traits while still maintaining diversity. At every iteration, $\hat{D}_i$ gets as input pairs of (input, output) from all the generators in the council. Rather than distinguishing between real and fake, $\hat{D}_i$ distinguishes between the result of “my-generator” and the result of “another-generator”. Hence, during training, the generator $G_i$ attempts to minimize the distance between the outputs of the generators. Note that getting the input and not only the output is important to make the connection, for each pair, between the features of the source image and those of the generated image.
Let $S$ be the source domain and $T$ be the target domain. In our model we have $N$ mappings $G_i: S \rightarrow T$. Given an image $x \in S$, a straightforward adaptation of the classical adversarial loss to our case would be:
where $G_i$ tries to generate images $G_i(x)$ that look similar to images from domains $G_j(x)$ for $j \neq i$. In analogy to the classical adversarial loss, in Equation (1), both terms should be minimized, where the left term learns to “identify” its corresponding generator as “fake” and the right term learns to “identify” the other generators as “real”.
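For concreteness, one least-squares form that is consistent with this description of Equation (1) can be written as follows. This is a sketch under assumptions rather than a definitive statement of the loss: the notation ($G_i$ for the generator of member $i$, $\hat{D}_i$ for its council discriminator, $N$ for the council size, $p_S$ for the distribution of source images) and the targets 0 and 1, in the style of [Mao2017LeastNetworks], are assumed here.

```latex
V(G_i, \hat{D}_i) =
  \mathbb{E}_{x \sim p_S}\!\left[ \hat{D}_i\big(G_i(x), x\big)^2 \right]
  + \sum_{\substack{j = 1 \\ j \neq i}}^{N}
    \mathbb{E}_{x \sim p_S}\!\left[ \big(\hat{D}_i\big(G_j(x), x\big) - 1\big)^2 \right]
```

Under this form, minimizing the left term over $\hat{D}_i$ drives its score on the pairs of its own generator toward 0 (“fake”), while minimizing the right term drives its score on the pairs of the other generators toward 1 (“real”).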
[Figure panels: (a) Council discriminator; (b) GAN discriminator]
To allow multimodal translation, we encode the input image, as illustrated in Figure 2, which zooms into the structure of the generator [Huang2018MultimodalTranslation]. The encoded image should carry useful (mutual) information between domains $S$ and $T$. Let $E_i$ be the encoder for the source image and let $z_i$ be the random entropy vector associated with the $i$-th member of the council, $1 \le i \le N$. The vector $z_i$ enables each generator to generate multiple diverse results. Equation (1) is modified so as to get an encoded image (instead of the original input image) and the random entropy vector. The loss function of $\hat{D}_i$ is then defined as:
Here, the loss function gets, as additional inputs, all the encoders and the vectors $z_i$. The parameter $\alpha$ controls the size of the sub-domain of the other generators, which is important in order to converge to “acceptable” images.
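The council discriminator loss described above can be illustrated with a small sketch. It is written under stated assumptions and is not the reference implementation: images are stood in for by plain floats, the outputs of every generator (each already computed from its encoded input and random vector) are passed in as precomputed lists, the loss uses a least-squares form with targets 0 and 1, and `alpha` is applied as a weight on the term of the other generators. The names `council_discriminator_loss` and `PairScore` are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical stand-in: a council discriminator maps a (source, output)
# pair to a score. Real images are replaced by floats for illustration.
PairScore = Callable[[float, float], float]


def council_discriminator_loss(
    member: int,
    d_hat: PairScore,
    sources: Sequence[float],
    outputs: Sequence[Sequence[float]],
    alpha: float,
) -> float:
    """Assumed least-squares council loss for the discriminator of `member`.

    outputs[j][n] is the output of generator j for sources[n].
    """
    n = len(sources)
    # Pairs from the member's own generator are pushed toward score 0.
    own = sum(d_hat(x, y) ** 2 for x, y in zip(sources, outputs[member])) / n
    # Pairs from every other generator are pushed toward score 1.
    others = 0.0
    for j, member_outputs in enumerate(outputs):
        if j != member:
            others += sum(
                (d_hat(x, y) - 1.0) ** 2 for x, y in zip(sources, member_outputs)
            ) / n
    # Assumption: alpha weights the contribution of the other generators.
    return own + alpha * others
```

A discriminator that scores every pair 0.5 is penalized on both terms, whereas one that separates its own generator from the others perfectly reaches a loss of zero. Note that the discriminator always sees the source together with the output, which is what ties the decision to the input image.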
Figure 3 illustrates the differences and the similarities between discriminators $D_i$ and $\hat{D}_i$. Both should distinguish between the generator’s results and other images; in the case of $D_i$, the other images are real images from the target domain, whereas in the case of $\hat{D}_i$, they are images generated by other generators in the council. Another fundamental difference is their input: $\hat{D}_i$ gets not only the generator’s output, but also its input. This aims at producing a resulting image that has common features with the input image.
For each member of the council, we jointly train the generator (assuming the encoder is included) and the discriminators to optimize the final objective. In essence, $G_i$, $D_i$, and $\hat{D}_i$ play a three-way min-max-max game with a value function $V(G_i, D_i, \hat{D}_i)$:
This equation is a weighted sum of the adversarial loss (of $D_i$), as defined in [Mao2017LeastNetworks], and the council loss (of $\hat{D}_i$), as defined in Equation (2). The weight of the council loss controls the relative importance of looking more “real” versus being more in line with the other generators. High values will result in more similar images, whereas low values will require less agreement and result in higher diversity between the generated images.
For some applications, it is preferable to focus on specific areas of the image and modify only them, leaving the rest of the image untouched. This can be easily accommodated into our general scheme, without changing the architecture.
The idea is to let the generator produce not only an image, but also an associated focus map, which essentially segments the learned objects in the domain from the background. All that is needed is to add a fourth channel, $mask_i$, to the generator, which would generate values in the range $[0, 1]$. These values can be interpreted as the likelihood of a pixel to belong to the background (or to an object). To realize this, Equation (3) becomes
In Equation (3), $mask_i[k]$ is the value of the mask channel for pixel $k$. The first term attempts to minimize the size of the focus mask, i.e. make it focus solely on the object. The second term is in charge of segmenting the image into an object and a background ($0$ or $1$). This is done in order to avoid generating semi-transparent pixels. In our implementation . The result is normalized by the image size. The values of the two loss weights are application-dependent and will be defined for each application in Section 5.
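The role of the two terms can be sketched in code. The sketch below rests on assumptions and should not be read as the exact loss: the size term is taken to be a weighted sum of the mask values, the second term is taken to be the reciprocal of the distance from 0.5 (stabilized by a small constant), and the mask is assumed to blend the generated image with the input, with a value of 1 meaning "use the generated pixel". The names `blend_with_focus`, `focus_loss`, `delta`, and `eps` are hypothetical, and images are flattened to lists of floats.

```python
from typing import List, Sequence


def blend_with_focus(
    source: Sequence[float], generated: Sequence[float], mask: Sequence[float]
) -> List[float]:
    """Assumed compositing: mask=1 keeps the generated pixel, mask=0 the source pixel."""
    return [m * g + (1.0 - m) * s for s, g, m in zip(source, generated, mask)]


def focus_loss(mask: Sequence[float], delta: float, eps: float) -> float:
    """Assumed focus loss: small masks and near-binary values are rewarded."""
    # First term: penalize the total size of the focus mask.
    size_term = delta * sum(mask)
    # Second term: penalize values near 0.5 (semi-transparent pixels);
    # eps keeps the reciprocal finite when a value is exactly 0.5.
    binarize_term = sum(1.0 / (abs(m - 0.5) + eps) for m in mask)
    # Normalize by the number of pixels.
    return (size_term + binarize_term) / len(mask)
```

With this form, a mask of the same total size scores lower when its values sit at 0 or 1 than when they hover around 0.5, which matches the stated goal of avoiding semi-transparent pixels.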
Figure 4 illustrates the importance of the various losses.
If only the GAN loss (jointly with the focus loss) is used, the faces of the input and the output are completely unrelated, though the quality of the images is good and the background does not change in most cases. Using only the council loss, the faces of the input and the output are nicely related, but the background might change. Our loss, which combines the above losses, produces the best results.
We note that this idea of adding a channel, which makes the generator focus on the proper areas of the image, can be used in other GAN architectures. It is not limited to our proposed council architecture.
4.1 Experiment setup
We applied our council GAN to several challenging image-to-image translation tasks (Section 4.2).
Baseline models. Depending on the application, we compare our results to those of several state-of-the-art models, including CycleGAN [Zhu2017UnpairedNetworks], MUNIT [Huang2018MultimodalTranslation], DRIT++ [Lee_2018_ECCV, DRIT_plus], U-GAT-IT [DBLP:journals/corr/abs-1907-10830], StarGAN [StarGAN], and Fixed-Point GAN [siddiquee2019learning]. These methods are unsupervised and use cycle constraints. Out of these methods, MUNIT [Huang2018MultimodalTranslation] and DRIT++ [Lee_2018_ECCV, DRIT_plus] are multi-modal and generate several results for a given image. The others produce a single result.
Datasets. We evaluated the performance of our system on the following datasets.
CelebA [liu2015faceattributes]. This dataset contains face images of celebrities, each annotated with binary attributes. We focus on two attributes: (1) the gender attribute and (2) with/without glasses attribute. The training dataset contains (/) images of males (/with glasses) and (/) images of females (/without glasses). The test dataset consists of (/) males (/with glasses) and (/) females (/without glasses).
selfie2anime [DBLP:journals/corr/abs-1907-10830]. The size of the training dataset is selfie images and anime images. The size of the test dataset is selfie images and anime images.
Training. All models were trained using Adam [kingma2014adam] with and . For data augmentation we flipped the images horizontally with a probability of .
For the selfie/anime dataset , where the number of images is small, we augmented the data also with color jittering with up to , random Grayscale with a probability of , random Rotation with up to , random translation of up to of the image, and with random perspective with distortion scale of with a probability of . On the last iterations we trained only on the original data, without augmentation. We performed one generator update after a number of discriminator updates that is equal to the size of the council. The batch size was set to for all experiments. We trained all models with a learning rate of , where the learning rate drops by a factor of after every iterations.
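The update order and the learning-rate schedule described above can be sketched as follows. This is a minimal illustration under assumptions: the function names and all numeric values used when calling them are hypothetical and are not the settings of the experiments, and "drops by a factor" is read as dividing the learning rate by that factor at every step boundary.

```python
from typing import Iterator


def step_decay_lr(
    base_lr: float, iteration: int, step_size: int, drop_factor: float
) -> float:
    """Learning rate at `iteration`, divided by `drop_factor` every `step_size` iterations."""
    return base_lr / (drop_factor ** (iteration // step_size))


def update_schedule(council_size: int, generator_updates: int) -> Iterator[str]:
    """Yield 'D' once per council member, then 'G' once, and repeat."""
    for _ in range(generator_updates):
        # One discriminator update per council member ...
        for _ in range(council_size):
            yield "D"
        # ... followed by a single generator update.
        yield "G"
```

For example, with an illustrative council of three members, `list(update_schedule(3, 2))` alternates three discriminator updates with one generator update, twice.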
Evaluation. We verify our results both qualitatively and quantitatively. For the latter, we use two common measures:
The Fréchet Inception Distance (FID) [heusel2017gans] calculates the distance between the distributions of the feature vectors of the real and the generated images.
The Kernel Inception Distance (KID) [binkowski2018demystifying] improves on FID and measures GAN convergence.
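For reference, the two measures are commonly defined as below. The symbols are notation introduced here, not taken from the text: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features of the real and generated images, $d$ is the feature dimension, and $\mathrm{MMD}^2_k$ is the squared maximum mean discrepancy under kernel $k$, estimated without bias from samples.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{KID} = \mathrm{MMD}^2_k(\text{real features}, \text{generated features}),
\qquad k(u, v) = \left( \tfrac{1}{d}\, u^{\top} v + 1 \right)^{3}
```

For both measures, lower values indicate that the generated images are closer to the real ones.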
4.2 Experimental results
Experimental results for male-to-female translation. Given an image of a male face, the goal is to generate a female face, which resembles the male face [Almahairi2018AugmentedData, lu2018attribute]. As explained in [Almahairi2018AugmentedData], three features make this translation task challenging: (i) There is no predefined correspondence in real data of each domain. (ii) The relationship is many-to-many between domains, as many male-to-female mappings are possible. (iii) Capturing realistic variations in generated faces requires transformations that go beyond simple color and texture changes.
Figure 5 shows our results, generated by a council of four members, and compares them to those of [StarGAN, Huang2018MultimodalTranslation, DRIT_plus, Zhu2017UnpairedNetworks]. Note that each council member may generate multiple results, depending on the random entropy vector. We observe that our generated females are more “feminine” (e.g., the beards completely disappear and the haircuts are longer), while still preserving the main features of the source male face and resembling it. This can be attributed to the fact that we do not use a cycle to go from a male to a female and back, and thus we do not need to preserve any masculine features. More examples are given in the supplementary materials.
Table 1 summarizes our quantitative results, where our results are randomly chosen from those generated by the different members of the council. Our results outperform those of other methods in both evaluation metrics.
|DRIT++ [DRIT_plus, Lee_2018_ECCV]||26.24||0.0016|
Experimental results for selfie-to-anime translation. Given an image of a human face, the goal is to generate an appealing anime, which resembles the human. This is a challenging task, as not only the style differs, but also the geometric structure of the input and the output greatly varies (e.g. the size of the eyes). This might lead to mismatching of the structures, which would lead to distortions and visual artifacts. This difficulty is added to the three challenges mentioned in the previous application: the lack of predefined correspondence of the domains, the many-to-many relationship, and going beyond color and texture.
Figure 6 shows our results using a council of four and compares them to those of [Huang2018MultimodalTranslation, DBLP:journals/corr/abs-1907-10830, Lee_2018_ECCV, DRIT_plus, Zhu2017UnpairedNetworks]. Our generated anime images quite often resemble the input better in terms of expression and face structure (e.g., the shape of the chin). This can be explained by the fact that it is easier for the council members to “agree” on features that exist in the input.
Table 2 shows quantitative results. It can be seen that our results outperform or are competitive with those of other methods in both evaluation metrics.
|DRIT++ [DRIT_plus, Lee_2018_ECCV]||109.22||0.0020|
Experimental results for glasses removal. Given an image of a person with glasses, the goal is to generate an image of the same person, but with the glasses removed. While in the previous application, the whole image changes, here the challenge is to modify only a certain part of the face and leave the rest of the image untouched.
Figure 7 shows our results when using a council of four and compares them to those of [siddiquee2019learning], which shows results for this application, as well as to [Zhu2017UnpairedNetworks]. Our generated images leave considerably fewer traces of the removed glasses. Again, this can be attributed to the lack of a cycle constraint.
Table 3 provides quantitative results. For this application as well, our council manages to outperform other methods and address the challenge of removing large objects.
|Fixed-point GAN [siddiquee2019learning]||55.26||0.0041|
Our code is based on PyTorch; it will be provided as an open-source upon acceptance. We set the major parameters as follows: , which controls diversity (Equation (2)), is set to . , which controls the size of the mask (Equation (3)), is set to . and from Equation (4) are set according to the applications: in male to female & ; in selfie to anime & ; in glasses removal & .
Figure 8 studies the influence of the number of members and the number of iterations on the quality of the results. We focus on the male-to-female application, which is representative. The fewer members the council has, the faster the convergence. However, this comes at a price: the accuracy is worse. Furthermore, it can be seen that the KID improves with iterations, as expected.
Figure 9 demonstrates a limitation of our method. When removing the glasses, the face might also become more feminine. This is attributed to the imbalance inherent to the dataset. Specifically, the ratio of the men to women with glasses is , whereas the ratio of men to women without glasses is only . The result of this imbalance in the target domain is that removing glasses also means becoming more feminine. This problem can be solved by providing a dataset with an equal number of males and females with and without glasses. Handling feature imbalance without changing the number of images in the dataset, is an interesting direction for future research.
[Figure panels: (a) glasses removal; (a) male to female]
This paper introduces the concept of a council of GANs, a novel approach to performing image-to-image translation between unpaired domains. The key idea is to replace the widely-used cycle-consistency constraint by leveraging collaboration between GANs. Council members help each other improve, each refining its own result.
Furthermore, the paper proposes an implementation of this concept and demonstrates its benefits for three challenging applications. The members of the council generate several optional results for a given input. They manage to remove large objects from the images, to avoid leaving redundant traces of the input, and to handle large shape modifications. The results outperform those of state-of-the-art algorithms both quantitatively and qualitatively.