Breaking the cycle—Colleagues are all you need

Ori Nizan
Technion, Israel
   Ayellet Tal
Technion, Israel

This paper proposes a novel approach to image-to-image translation between unpaired domains. Rather than relying on a cycle constraint, our method takes advantage of collaboration between various GANs. This results in a multi-modal method, in which multiple diverse images are produced for a given input. Our model addresses several shortcomings of classical GANs: (1) it is able to remove large objects, such as glasses; (2) since it does not need to support the cycle constraint, no irrelevant traces of the input are left on the generated image; and (3) it manages to translate between domains that require large shape modifications. Our results are shown to outperform those generated by state-of-the-art methods for several challenging applications on commonly-used datasets, both qualitatively and quantitatively.

1 Introduction

Mapping between different domains is in line with the human ability to find similarities between features in distinctive, yet associated, classes. Therefore, it is not surprising that image-to-image translation has gained a lot of attention in recent years. Many applications have been demonstrated to benefit from it, yielding beautiful results.

In unsupervised settings, where no paired data is available, shared latent space and cycle-consistency assumptions have been utilized [Anoosheh2017ComboGANUS, StarGAN, Chu2017, Hua2017UnsupervisedCI, Huang2018MultimodalTranslation, kim2017, Liu2017, royer2017xgan, YiDualGAN, Zhu2017UnpairedNetworks]. Despite their successes and benefits, previous methods suffer from several drawbacks.

In particular, the cycle constraint often causes the preservation of source-domain features, as can be seen, for example, in Figure LABEL:fig:teaser(c), where facial hair remains on the faces of the women. This is due to the need to go back and forth through the cycle. Second, as discussed in [DBLP:journals/corr/abs-1907-10830], these methods are sometimes unsuccessful on translation tasks that require a large shape change, such as the anime in Figure LABEL:fig:teaser(b). Finally, as explained in [siddiquee2019learning], it is still a challenge to completely remove large objects, such as glasses, from images, and that work therefore leaves this task for the future (Figure LABEL:fig:teaser(a)).

We propose a novel approach, termed Council-GAN, which handles these challenges. The key idea is to rely on "collegiality" between GANs, rather than utilizing a cycle. Specifically, instead of using a single generator/discriminator pair of "experts", it utilizes the collective opinion of a group of pairs (the council) and leverages the variation between the results of the generators. This leads to a more stable and diverse domain transfer.

To realize this idea, we propose to train a council of multiple members, requiring them to learn from each other. Each generator in the council gets the same input from the source domain and produces its own output. However, the outputs produced by the various generators should have some common denominator. For this to happen across all images, the generators have to find common features in the input, which are used to generate their outputs. Each discriminator learns to distinguish between the generated images of its own generator and the images produced by the other generators. This forces each generator to converge to a result that is agreeable to the others. Intuitively, this convergence helps maximize the mutual information between the source domain and the target domain, which explains why the generated images maintain the important features of the source images.

We demonstrate the benefits of our approach for several applications, including glasses removal, selfie-to-anime translation, and male-to-female translation. In all cases we achieve state-of-the-art results.

Hence, this paper makes the following contributions:

  1. We introduce a novel model for unsupervised image-to-image translation, whose key idea is collaboration between multiple generators. In contrast to most recent methods, our model avoids cycle-consistency constraints altogether.

  2. Our council manages to achieve state-of-the-art results in a variety of challenging applications.

2 Related work

Generative adversarial networks (GANs).

Since the introduction of the GAN framework [Goodfellow2014], it has been demonstrated to achieve eye-pleasing results in numerous applications. In this framework, a generator is trained to fool a discriminator, whereas the latter attempts to distinguish between the generated samples and real samples.

A variety of modifications have been proposed in recent years in an attempt to improve GAN’s results; see [Arjovsky2017WassersteinNetworks, Denton2015DeepNetworks, dosovitskiy2016generating, Huang2016StackedNetworks, Karras2018ProgressiveVariation, Mao2017LeastNetworks, RadfordDCGANUNSUPERVISEDNETWORKS, Rosca2017VariationalNetworks, Salimans2016ImprovedGANs, Tolstikhin2018WassersteinAuto-Encoders, ZhangStackGAN:Networks] for a few of them.

We are not the first to propose the use of multiple GANs [Durugkar2016GenerativeNetworks, Ghosh2018Multi-agentNetworks, QuanHoangTuDinhNguyenTrungLe2018, Juefei-Xu2017GangRanking]. However, previous approaches differ from ours both in their architectures and in their goals. For instance, some previous architectures consist of multiple discriminators and a single generator; others propose a key discriminator that evaluates the generators' results and improves them. We propose a novel architecture to realize the concept of a council, as described in Section 3. Furthermore, the goal of other approaches is either to push the generators apart, to create diverse solutions, or to improve the results. Our council instead attempts to find the commonalities between the source and target domains. By requiring the council members to "agree" on each other's results, they in fact learn to focus on the common traits of the domains.

Image-to-image translation.

The aim is to learn a mapping from a source domain to a target domain. Early approaches adopt a supervised framework, in which the model learns from paired examples, for instance using a conditional GAN to model the mapping function [Isola2016Image-to-ImageNetworks, wang2018pix2pixHD, Zhu2017].

Recently, numerous methods have been proposed that use unpaired examples for the learning task and produce highly impressive results; see for example [Berthelot2017BEGAN:Networks, Gatys2016, Huang2018MultimodalTranslation, kim2017, Lee_2018_ECCV, Liu2017, siddiquee2019learning, Zhu2017UnpairedNetworks], out of a truly extensive literature. This approach is vital for applications in which paired data is unavailable or difficult to obtain. Our model belongs to the class of GAN models that do not require paired training data.

A major concern in the unsupervised approach is which properties of the source domain should be preserved. Examples include pixel values [Shrivastava2017LearningTraining], pixel gradients [bousmalis2017unsupervised], pairwise sample distances [benaim2017one], and, most commonly of late, cycle consistency [kim2017, YiDualGAN, Zhu2017UnpairedNetworks]. The latter enforces the constraint that translating an image to the target domain and back should recover the original image. Our method avoids using cycles altogether. This has the benefit of bypassing unnecessary constraints on the generated output, and thus avoids the preservation of hidden information [Chu2017].
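For reference, the cycle-consistency constraint that our method drops is typically enforced as an L1 reconstruction penalty; a minimal sketch (function names are ours, not from any specific codebase):

```python
import numpy as np

def cycle_consistency_loss(g_st, g_ts, x):
    """L1 cycle loss used by cycle-based methods (and avoided by ours):
    mapping x to the target domain and back should recover x."""
    return np.abs(g_ts(g_st(x)) - x).mean()
```

Satisfying this penalty forces the round trip to be nearly invertible, which is precisely what can cause hidden traces of the source to be embedded in the translated image [Chu2017].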

Finally, most existing methods lack diversity in their results. To address this problem, some methods propose to produce multiple outputs for the same given image [Huang2018MultimodalTranslation, Lee_2018_ECCV]. Our method also enables image translation with diverse outputs, but it does so in a manner in which all GANs in the council "acknowledge", to some degree, each other's outputs.

3 Model

This section describes our proposed model, which addresses the drawbacks described in Section 1. Our model consists of a set, termed a council, whose members influence each other's results. Each member of the council has one generator and two discriminators, as described below. The generators need not converge to a specific output; instead, each produces its own results, jointly generating a diverse set of results. During training, they take into account the images produced by the other generators. Intuitively, the mutual influence forces the generators to focus on joint traits of the images in the source domain, which can be matched to those in the target domain. For instance, in Figure LABEL:fig:teaser, to transform a male into a female, the generators focus on the structure of the face, on which they can all agree. Therefore, this feature is preserved, which can explain the good results.

Furthermore, our model avoids cycle constraints. This means that there is no need to go in both directions between the source and target domains. As a result, no traces of the source need be left on the generated image (e.g., glasses), and the amount of change need not be limited (e.g., anime).

To realize this idea, we define a council of N members as follows (Figure 1). Each member of the council is a triplet, whose components are a single generator G_i and two discriminators, D_i and D̂_i, 1 ≤ i ≤ N. The task of discriminator D_i is to distinguish between the generator's output and real examples from the target domain, as done in any classical GAN. The goal of discriminator D̂_i is to distinguish between images produced by G_i and images produced by the other generators in the council. This discriminator is the core of the model and is what differentiates our model from the classical GAN model. It forces the generator to converge to images that can be acknowledged by all council members: images that share similar features.
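Schematically, the council is a list of such triplets; a toy illustration (all names are ours, not the authors' code):

```python
from dataclasses import dataclass

@dataclass
class CouncilMember:
    """One council member: a generator plus its two discriminators."""
    generator: object              # G_i: maps source images to the target domain
    gan_discriminator: object      # D_i: generated vs. real target-domain images
    council_discriminator: object  # D̂_i: own generator's outputs vs. other members' outputs

def build_council(n_members, make_generator, make_discriminator):
    """Assemble a council of n_members triplets from factory functions."""
    return [CouncilMember(make_generator(), make_discriminator(), make_discriminator())
            for _ in range(n_members)]
```

In our experiments a council of four members is used; the factories would build the actual networks.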

Figure 1: General approach. The council consists of triplets, each of which contains a generator and two discriminators: D_i distinguishes between the generator's output and real examples, whereas D̂_i distinguishes between images produced by G_i and images produced by other generators in the council. D̂_i is the reason that each of the generators converges to a result that is agreed upon by all other members of the council.

The loss function of D_i is the classical adversarial loss of [Mao2017LeastNetworks]. Hereafter, we focus on the loss function of D̂_i, which makes the outputs of the various generators share common traits, while still maintaining diversity. At every iteration, D̂_i gets as input (input, output) pairs from all the generators in the council. Rather than distinguishing between real and fake, D̂_i distinguishes between the result of "my generator" and the result of "another generator". Hence, during training, it attempts to minimize the distance between the outputs of the generators. Note that getting the input, and not only the output, is important in order to relate, for each pair, the features of the source image to those of the generated image.

Let S be the source domain and T be the target domain. In our model we have N mappings G_i : S → T. Given an image s ∈ S, a straightforward adaptation of the classical adversarial loss to our case would be:


where G_i tries to generate images that look similar to the images generated by the other generators G_j, j ≠ i. In analogy to the classical adversarial loss, in Equation (1) both terms should be minimized: the left term learns to "identify" its corresponding generator as "fake", and the right term learns to "identify" the other generators as "real".
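In least-squares form (following [Mao2017LeastNetworks]), the two terms of Equation (1) can be sketched as follows, with label 0 for "my generator" and 1 for "another generator" (a simplification with scalar discriminator outputs; names are ours):

```python
def council_d_loss(d_hat, own_pairs, other_pairs):
    """Least-squares council loss for member i's discriminator D̂_i.

    own_pairs:   (input, output) pairs from member i's generator   -> target label 0
    other_pairs: (input, output) pairs from the other generators   -> target label 1
    """
    loss = 0.0
    for x, y in own_pairs:
        loss += (d_hat(x, y) - 0.0) ** 2   # "identify" own generator as fake
    for x, y in other_pairs:
        loss += (d_hat(x, y) - 1.0) ** 2   # "identify" the others as real
    return loss / (len(own_pairs) + len(other_pairs))
```

Note that the discriminator receives the pair, not just the output, which is what ties the generated image's features to the source image's features.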

Figure 2: Zoom into the generator. Our generator has an auto-encoder architecture, similar to that of [Huang2018MultimodalTranslation]. The encoder consists of several strided convolutional layers followed by residual blocks. The decoder gets the encoded image (termed the mutual information vector), as well as a random entropy vector. The latter may be interpreted as encoding the leftover information of the target domain. The decoder uses an MLP to produce a set of AdaIN parameters from the random entropy vector [Huang2017ArbitraryST].
(a) Council discriminator (b) GAN discriminator
Figure 3: Differences and similarities between D_i and D̂_i. While the GAN discriminator distinguishes between "real" and "fake" images, the council discriminator distinguishes between outputs of its own generator and those produced by other generators. Furthermore, while the GAN's discriminator gets as input only the generator's output, the council's discriminator also gets the generator's input. This is because we wish the generator to produce a result that bears similarity to the input image, and not only one that looks real in the target domain.

To allow multimodal translation, we encode the input image, as illustrated in Figure 2, which zooms into the structure of the generator [Huang2018MultimodalTranslation]. The encoded image should carry useful (mutual) information between domains S and T. Let E be the encoder of the source image and let z_i be the random entropy vector associated with the i-th member of the council. The vector z_i enables each generator to generate multiple diverse results. Equation (1) is modified so as to get an encoded image (instead of the original input image), as well as the random entropy vector. The loss function of D̂_i is then defined as:


Here, the loss function gets, as additional inputs, the encoders and the vectors z_i. An additional hyperparameter controls the size of the sub-domain of the other generators, which is important in order to converge to "acceptable" images.

Figure 3 illustrates the differences and the similarities between discriminators D_i and D̂_i. Both should distinguish between the generator's results and other images; in the case of D_i the other images are real images from the target domain, whereas in the case of D̂_i they are images generated by other generators in the council. Another fundamental difference is their input: D̂_i gets not only the generator's output, but also its input. This aims at producing a resulting image that shares common features with the input image.
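The multimodal mechanism described above can be sketched as follows: one content code shared by all outputs of a member, and a fresh entropy vector per output (a NumPy sketch under our own naming; the real generator is the AdaIN-based decoder of Figure 2):

```python
import numpy as np

def translate(encode, decode, x, n_outputs=3, z_dim=8, rng=None):
    """Produce diverse translations of x.

    encode(x) extracts the mutual-information (content) code, shared by all
    outputs; decode(code, z) turns it into a target-domain image conditioned
    on a random entropy vector z, which carries the leftover target-domain
    information.
    """
    rng = rng or np.random.default_rng(0)
    code = encode(x)                                   # shared content code
    zs = [rng.standard_normal(z_dim) for _ in range(n_outputs)]
    return [decode(code, z) for z in zs]
```

Sampling several z vectors for the same code is what yields multiple diverse outputs per input image.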

Final loss.

For each member of the council, we jointly train the generator (the encoder included) and the discriminators to optimize the final objective. In essence, G_i, D_i, and D̂_i play a three-way min-max-max game with a value function:


This equation is a weighted sum of the adversarial loss (of D_i), as defined in [Mao2017LeastNetworks], and the council loss (of D̂_i), as defined in Equation (2). A weight controls the importance of looking more "real" versus being more in line with the other generators. High values result in more similar images, whereas low values require less agreement and result in higher diversity between the generated images.
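The weighted combination can be sketched as follows (the weight's symbol and value are omitted in the text above, so the name below is ours):

```python
def member_objective(adv_loss, council_loss, agreement_weight):
    """Weighted objective for one council member.

    adv_loss:          realism term from the GAN discriminator D_i
    council_loss:      agreement term from the council discriminator D̂_i
    agreement_weight:  high -> similar, agreed-upon outputs; low -> more diversity
    """
    return adv_loss + agreement_weight * council_loss
```

Tuning the weight thus trades realism-only behavior against cross-member agreement.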

Focus map.

For some applications, it is preferable to focus on specific areas of the image and modify only them, leaving the rest of the image untouched. This can be easily accommodated into our general scheme, without changing the architecture.

The idea is to let the generator produce not only an image, but also an associated focus map, which essentially segments the learned objects of the domain from the background. All that is needed is to add a fourth channel, m, to the generator, which generates values in the range [0, 1]. These values can be interpreted as the likelihood of a pixel belonging to the background (or to an object). To realize this, Equation (3) becomes



input member1 member2 member3 member4
Figure 4: Importance of the loss function components. This figure shows the results generated by the four council members for the male-to-female application, after iterations. Top: Using the focus loss (jointly with the classical adversarial loss) generates nice images from the target domain, which are not necessarily related to the given image. Middle: Using the council loss instead relates the input and the output faces, but might change the environment (background). Bottom: Our loss, which combines the above losses, both relates the input and the output faces and focuses only on facial modifications.

In Equation (3), m_p is the value of the fourth channel at pixel p. The first term attempts to minimize the size of the focus mask, i.e., make it focus solely on the object. The second term is in charge of segmenting the image into an object and a background (0 or 1), in order to avoid generating semi-transparent pixels. In our implementation . The result is normalized by the image size. The values of the two weights are application-dependent and will be defined for each application in Section 5.
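The two regularizers described above, together with the blending that the focus map induces, can be sketched as follows. The binarization term below uses min(m, 1−m), a common choice for pushing values toward 0 or 1; the paper's exact implementation detail is elided in the text, so treat that term as our assumption:

```python
import numpy as np

def focus_map_loss(mask, w_size, w_binary):
    """Regularize a per-pixel focus mask with values in [0, 1].

    Term 1 shrinks the mask so it covers only the object; term 2 pushes each
    pixel toward 0 or 1 to avoid semi-transparent pixels. Both terms are
    normalized by the image size.
    """
    n = mask.size
    size_term = mask.sum() / n
    binary_term = np.minimum(mask, 1.0 - mask).sum() / n   # assumed form
    return w_size * size_term + w_binary * binary_term

def apply_focus(mask, generated, original):
    """Blend so only masked regions change; the rest stays untouched."""
    return mask * generated + (1.0 - mask) * original
```

A perfectly binary, tight mask incurs a small loss, while a large or semi-transparent mask is penalized.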

Figure 4 illustrates the importance of the various losses.

If only the focus loss (jointly with the adversarial loss) is used, the faces of the input and the output are completely unrelated, though the quality of the images is good and the background does not change in most cases. Using only the council loss, the faces of the input and the output are nicely related, but the background might change. Our loss, which combines the above losses, produces the best results.

We note that this idea of adding a channel that makes the generator focus on the proper areas of the image can be used in other GAN architectures; it is not limited to our proposed council architecture.

4 Experiments

4.1 Experiment setup

We applied our Council-GAN to several challenging image-to-image translation tasks (Section 4.2).

Baseline models. Depending on the application, we compare our results to those of several state-of-the-art models: CycleGAN [Zhu2017UnpairedNetworks], MUNIT [Huang2018MultimodalTranslation], DRIT++ [Lee_2018_ECCV, DRIT_plus], U-GAT-IT [DBLP:journals/corr/abs-1907-10830], StarGAN [StarGAN], and Fixed-Point GAN [siddiquee2019learning]. These methods are unsupervised and use cycle constraints. Among them, MUNIT [Huang2018MultimodalTranslation] and DRIT++ [Lee_2018_ECCV, DRIT_plus] are multi-modal and generate several results for a given image; the others produce a single result.

Datasets. We evaluated the performance of our system on the following datasets.

CelebA [liu2015faceattributes]. This dataset contains face images of celebrities, each annotated with binary attributes. We focus on two attributes: (1) the gender attribute and (2) the with/without-glasses attribute. The training dataset contains (/) images of males (/with glasses) and (/) images of females (/without glasses). The test dataset consists of (/) males (/with glasses) and (/) females (/without glasses).

selfie2anime [DBLP:journals/corr/abs-1907-10830]. The size of the training dataset is selfie images and anime images. The size of the test dataset is selfie images and anime images.

Training. All models were trained using Adam [kingma2014adam] with and . For data augmentation we flipped the images horizontally with a probability of .
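The basic flip augmentation can be sketched as follows (a NumPy sketch; the paper's flip probability value is elided above, so p=0.5 below is only a placeholder):

```python
import numpy as np

def random_hflip(img, p=0.5, rng=None):
    """Flip an (H, W, C) image left-right with probability p."""
    rng = rng or np.random.default_rng()
    return img[:, ::-1, :] if rng.random() < p else img
```

Applied per training image, this doubles the effective pose variety at no annotation cost.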

For the selfie/anime dataset, where the number of images is small, we also augmented the data with color jittering of up to , random grayscale with a probability of , random rotation of up to , random translation of up to of the image, and random perspective with a distortion scale of , applied with a probability of . For the last iterations we trained only on the original data, without augmentation. We performed one generator update after a number of discriminator updates equal to the size of the council. The batch size was set to for all experiments. We trained all models with a learning rate of , dropped by a factor of after every iterations.

input ours-1 ours-2 ours-3 ours-4 cycleGAN MUNIT StarGAN DRIT++
[Zhu2017UnpairedNetworks] [Huang2018MultimodalTranslation] [StarGAN] [DRIT_plus, Lee_2018_ECCV]
Figure 5: Male-to-female translation. Our results are more "feminine" than those generated by other state-of-the-art methods, while still preserving the main facial features of the input images.

Evaluation. We verify our results both qualitatively and quantitatively. For the latter, we use two common measures:

  1. The Fréchet Inception Distance (FID) [heusel2017gans], which calculates the distance between the feature distributions of the real and the generated images.

  2. The Kernel Inception Distance (KID) [binkowski2018demystifying], which improves on FID and measures GAN convergence.
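KID is the squared maximum mean discrepancy between Inception feature sets under a cubic polynomial kernel; an unbiased estimator can be sketched over precomputed features (NumPy sketch):

```python
import numpy as np

def polynomial_kernel(X, Y):
    """k(x, y) = (x.y / d + 1)^3, the kernel used by KID."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased squared-MMD estimate between two (n, d) feature arrays."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # Drop the diagonals for the unbiased within-set terms.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return term_rr + term_ff - 2.0 * k_rf.mean()
```

In practice the features are 2048-dimensional Inception activations; lower KID means the generated distribution is closer to the real one.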

4.2 Experimental results

input ours-1 ours-2 ours-3 ours-4 cycleGAN MUNIT U-GAT-IT DRIT++
[Zhu2017UnpairedNetworks] [Huang2018MultimodalTranslation] [DBLP:journals/corr/abs-1907-10830] [DRIT_plus, Lee_2018_ECCV]
Figure 6: Selfie-to-anime translation. Our results preserve the structure of the face in the input image, while generating the characteristic features of anime, such as the large eyes.

Experimental results for male-to-female translation. Given an image of a male face, the goal is to generate a female face, which resembles the male face [Almahairi2018AugmentedData, lu2018attribute]. As explained in [Almahairi2018AugmentedData], three features make this translation task challenging: (i) There is no predefined correspondence in real data of each domain. (ii) The relationship is many-to-many between domains, as many male-to-female mappings are possible. (iii) Capturing realistic variations in generated faces requires transformations that go beyond simple color and texture changes.

Figure 5 shows our results, generated by a council of four members, and compares them to those of [StarGAN, Huang2018MultimodalTranslation, DRIT_plus, Zhu2017UnpairedNetworks]. Note that each of the council members may generate multiple results, depending on the random entropy vector. We observe that our generated females are more "feminine" (e.g., the beards completely disappear and the haircuts are longer), while still preserving the main features of the source male face and resembling it. This can be attributed to the fact that we do not use a cycle to go from a male to a female and back, and thus we do not need to preserve any masculine features. More examples are given in the supplementary materials.

Table 1 summarizes our quantitative results, where our results are randomly chosen from those generated by the different members of the council. They outperform those of other methods in both evaluation metrics.

Method | FID | KID
CycleGAN [Zhu2017UnpairedNetworks] | 20.91 | 0.0012
MUNIT [Huang2018MultimodalTranslation] | 19.88 | 0.0013
StarGAN [StarGAN] | 35.50 | 0.0027
DRIT++ [DRIT_plus, Lee_2018_ECCV] | 26.24 | 0.0016
Council (ours) | 18.85 | 0.0010
Table 1: Quantitative results for male-to-female translation. Our council generates results that outperform other SOTA results. For both measures, lower is better.

Experimental results for selfie-to-anime translation. Given an image of a human face, the goal is to generate an appealing anime that resembles the human. This is a challenging task, as not only the style differs, but also the geometric structure of the input and the output varies greatly (e.g., the size of the eyes). This might lead to mismatching of the structures, resulting in distortions and visual artifacts. This difficulty adds to the three challenges mentioned in the previous application: the lack of predefined correspondence between the domains, the many-to-many relationship, and the need to go beyond color and texture.

Figure 6 shows our results using a council of four and compares them to those of [Huang2018MultimodalTranslation, DBLP:journals/corr/abs-1907-10830, Lee_2018_ECCV, DRIT_plus, Zhu2017UnpairedNetworks]. Our generated anime images quite often resemble the input better in terms of expression and face structure (e.g., the shape of the chin). This can be explained by the fact that it is easier for the council members to "agree" on features that exist in the input.

Table 2 shows quantitative results. It can be seen that our results outperform or are competitive with those of other methods in both evaluation metrics.

Method | FID | KID
CycleGAN [Zhu2017UnpairedNetworks] | 149.38 | 0.0056
MUNIT [Huang2018MultimodalTranslation] | 131.69 | 0.0057
U-GAT-IT [DBLP:journals/corr/abs-1907-10830] | 115.11 | 0.0043
DRIT++ [DRIT_plus, Lee_2018_ECCV] | 109.22 | 0.0020
Council (ours) | 101.39 | 0.0020
Table 2: Quantitative results for selfie-to-anime translation. Our results outperform those of other methods when FID is considered and are competitive for KID.

Experimental results for glasses removal. Given an image of a person with glasses, the goal is to generate an image of the same person, but with the glasses removed. While in the previous application, the whole image changes, here the challenge is to modify only a certain part of the face and leave the rest of the image untouched.

Figure 7 shows our results when using a council of four and compares them to those of [siddiquee2019learning], which reports results for this application, as well as to [Zhu2017UnpairedNetworks]. Our generated images leave considerably fewer traces of the removed glasses. Again, this can be attributed to the lack of the cycle constraint.

input ours Fixed-Point cycleGAN
 [siddiquee2019learning]  [Zhu2017UnpairedNetworks]
Figure 7: Glasses removal. We show a single result per input, since multi-modality is irrelevant for this application. Our generated images remove the glasses almost completely, whereas traces are left in the results of [siddiquee2019learning] and [Zhu2017UnpairedNetworks].

Table 3 provides quantitative results. For this application as well, our council manages to outperform other methods and address the challenge of removing large objects.

Method | FID | KID
CycleGAN [Zhu2017UnpairedNetworks] | 50.72 | 0.0038
Fixed-Point GAN [siddiquee2019learning] | 55.26 | 0.0041
Council (ours) | 36.38 | 0.0026
Table 3: Quantitative results for glasses removal. The results of our council outperform state-of-the-art results.

5 Implementation

Our code is based on PyTorch; it will be released as open source upon acceptance. We set the major parameters as follows: the parameter that controls diversity (Equation (2)) is set to ; the parameter that controls the size of the mask (Equation (3)) is set to . The two weights of Equation (4) are set per application: for male-to-female, ; for selfie-to-anime, ; for glasses removal, .

Figure 8 studies the influence of the number of members and the number of iterations on the quality of the results. We focus on the male-to-female application, which is representative. The fewer the members in the council, the faster the convergence; however, this comes at a price: the accuracy is worse. Furthermore, it can be seen that the KID improves with iterations, as expected.

Figure 8: KID as a function of the number of iterations. The more iterations, the better the KID. Moreover, with more council members the model converges more slowly, yet the results improve.


Figure 9 demonstrates a limitation of our method. When removing the glasses, the face might also become more feminine. This is attributed to the imbalance inherent in the dataset. Specifically, the ratio of men to women with glasses is , whereas the ratio of men to women without glasses is only . The result of this imbalance in the target domain is that removing glasses also means becoming more feminine. This problem could be solved by providing a dataset with an equal number of males and females with and without glasses. Handling feature imbalance without changing the number of images in the dataset is an interesting direction for future research.

input result input result
(a) glasses removal (b) male to female
Figure 9: Limitation. (a) When removing the glasses, the face also becomes more feminine. (b) Conversely, when transforming a male into a female, the glasses may also be removed. This is attributed to the high imbalance of the relevant features in the dataset.

6 Conclusion

This paper introduces the concept of a council of GANs: a novel approach to image-to-image translation between unpaired domains. The key idea is to replace the widely-used cycle-consistency constraint with collaboration between GANs. Council members assist each other, each improving its own result.

Furthermore, the paper proposes an implementation of this concept and demonstrates its benefits on three challenging applications. The members of the council generate several optional results for a given input. They manage to remove large objects from the images, to avoid leaving redundant traces of the input, and to handle large shape modifications. The results outperform those of SOTA algorithms both quantitatively and qualitatively.

