###### Abstract

Standard adversarial training involves two agents, namely a generator and a discriminator, playing a mini-max game. However, even if the players converge to an equilibrium, the generator may only recover a part of the target data distribution, in a situation commonly referred to as mode collapse. In this work, we present the Domain Partitioning Network (DoPaNet), a new approach to deal with mode collapse in generative adversarial learning. We employ multiple discriminators, each encouraging the generator to cover a different part of the target distribution. To ensure these parts do not overlap and collapse into the same mode, we add a classifier as a third agent in the game. The classifier decides which discriminator the generator is trained against for each sample. Through experiments on toy examples and real images, we show the merits of DoPaNet in covering the real distribution and its superiority with respect to the competing methods. Besides, we also show that we can control the modes from which samples are generated using DoPaNet.


Domain Partitioning Network

Botos Csaba, Adnane Boukhayma, Viveka Kulharia, András Horváth, Philip H. S. Torr University of Oxford, United Kingdom Pázmány Péter Catholic University, Hungary {csbotos,viveka}@robots.ox.ac.uk,{adnane.boukhayma,philip.torr}@eng.ox.ac.uk horvath.andras@itk.ppke.hu

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a) consist of a deep generative model which is trained through a minimax game involving a competing generator and discriminator. The discriminator is tasked to differentiate real from fake samples, whereas the generator strives to maximize the mistakes of the discriminator. At convergence, the generator can sample from an estimate of the underlying real data distribution. The generated images are observed to be of higher quality than those of models trained using maximum likelihood optimization. Consequently, GANs have demonstrated impressive results in various domains such as image generation (Gulrajani et al., 2017), video generation (Vondrick et al., 2016), super-resolution (Ledig et al., 2017), semi-supervised learning (Donahue et al., 2017) and domain adaptation (Zhu et al., 2017).

GANs are trained with the objective of reaching a Nash equilibrium (Mescheder, 2018), which refers to the state where neither the discriminator nor the generator can further enhance its utility unilaterally. However, the generator might miss some modes of the distribution even after reaching the equilibrium, as it can simply fool the discriminator by generating from only a few modes of the real distribution (Goodfellow, 2016; Arjovsky & Bottou, 2017; Che et al., 2017; Chen et al., 2016; Salimans et al., 2016), hence producing samples of limited diversity. To address this problem, the literature explores two main approaches: improving GAN learning to practically reach a better optimum (Arjovsky & Bottou, 2017; Metz et al., 2017; Salimans et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017; Berthelot et al., 2017), or explicitly forcing GANs to produce various modes by design (Chen et al., 2016; Ghosh et al., 2017; Durugkar et al., 2017; Che et al., 2017; Liu & Tuzel, 2016). We follow the latter strategy and propose a new way of dealing with GAN mode collapse. Noticing that using a single discriminator often leads to the generator covering only a part of the data, we bring more discriminators to the game such that each incentivises the generator to cover an additional mode of the data distribution. For each discriminator to focus on a different target mode, we introduce a third player, a classifier that decides which discriminator is trained on a given real sample. To ensure that these various target data modes do not collapse into the same mode, the classifier also decides which discriminator the generator is trained against for a given generated sample. We find that this strategy, illustrated in Figure 2, yields better coverage of the real data distribution at convergence and simultaneously improves the stability of the training.

We showcase our method on demonstrative toy problems and show that it outperforms competing methods in avoiding mode collapse. We show that the network is able to distinguish different modes of the real data and therefore each discriminator works on a separate mode. This ensures that the generator can sample from a different mode for every input code vector. We also show DoPaNet’s ability to generate good quality and diversified images covering various modes present in the datasets of real images.

We also provide a theoretical analysis showing that, at the global optimum of the objective, the generator replicates the real distribution, categorized into different modes, such that it can sample from any mode given the corresponding code vector $c$.

There is a rich literature on improving training stability and increasing sample diversity for GANs. We only focus on a selection of works that relate closely to ours. Arora et al. (2017b) introduce a theoretical formulation stating the importance of multiple generators and discriminators in order to completely model the data distribution. GMAN (Durugkar et al., 2017) proposes using multiple discriminators. They explore three settings where the generator can be trained against the best discriminator, the averaged discriminators, or the weighted averaged discriminators. This helps training the network without modifying the minimax objective. Even though they use multiple discriminators, all of them are trained using all of the available real data, which does not explicitly help in avoiding mode collapse. We improve on this strategy by adding a classifier as a third component, with the task of choosing the discriminator for the given input sample during training, so that each of the multiple discriminators specializes on a different part of the real data distribution. We also compare DoPaNet with GMAN (Durugkar et al., 2017) in our experiments. Triple-GAN (Li et al., 2017) incorporates a classifier in the adversarial training, but it focuses on semi-supervised learning and therefore needs some part of the real data to be labeled during training. It uses only one discriminator, which is also conditioned on the sample labels. In contrast, our aim is to circumvent the mode collapse problem in the general case where the labels of the samples may not be available. InfoGAN (Chen et al., 2016) uses a network $Q$ to maximize the mutual information between the input code to the generator and its generated samples. It helps in disentangling several factors of variation, e.g. writing styles in the case of digits, pose from lighting, etc. It is different from our approach as it uses the network $Q$ to train the generator as well. Hence it is possible that the generator colludes with $Q$ in disentangling the factors of variation while simultaneously fooling the discriminator and sampling from only a few modes of the data. It can therefore still face the mode collapse problem, which we show in the experiments. Several works propose using multiple generators (Arora et al., 2017a; Ghosh et al., 2016; Liu & Tuzel, 2016). For instance, MAD-GAN (Ghosh et al., 2017) improves the learning by compelling the generators to produce diverse modes implicitly using the discriminator. This is achieved by requiring the discriminator to identify the generator that produced the fake samples, along with distinguishing fake samples from real ones. The discriminator does not explicitly force each generator to capture a different mode, while in our case the generator is urged to capture distinct modes by being trained with different discriminators. We also show DoPaNet’s superiority over MAD-GAN in our experiments.

In this section we first briefly discuss the preliminaries: the general objective for training Generative Adversarial Nets, and conditional sampling and training. Then we detail the objective of DoPaNet and how we optimize it.

Generative adversarial networks can be considered as a game, where players in the form of neural networks are optimized against each other. Let $p_d$ be the data distribution and $p_g$ be the distribution learnt by the generator $G$. Different tasks are assigned to the players: firstly, the generator $G$ takes an input noise $z \sim p_z$ and returns a sample $G(z)$. The discriminator $D$ takes an input $x$, which can either be a real sample from the training set or a sample produced by the generator. The discriminator then outputs a conditional probability distribution over the source of the sample $x$. In practice $D$ is a binary classifier that ideally outputs $1$ if the sample is real and $0$ if the sample is fake. Formally the following min-max objective is iteratively optimized:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_d}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] \tag{1}$$

The parameters of $D$ are updated to maximize the objective while the generator $G$ is trained to minimize it.
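As a minimal sketch of how the value in Objective (1) is estimated from mini-batches, the snippet below averages $\log D(x)$ over real samples and $\log(1 - D(G(z)))$ over generated ones. The function name `gan_value`, the `eps` stabilizer and the example discriminator outputs are illustrative assumptions, not part of the method description.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real samples, values in (0, 1)
    d_fake: discriminator outputs on generated samples, values in (0, 1)
    eps:    small constant (an assumption) that avoids log(0)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

# Hypothetical discriminator outputs, for illustration only.
# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
v_confused = gan_value(np.full(4, 0.5), np.full(4, 0.5))
# A confident and correct discriminator pushes the value towards 0, its maximum.
v_confident = gan_value(np.full(4, 0.99), np.full(4, 0.01))
```

The discriminator update increases this quantity, while the generator update decreases it by moving its samples to where the discriminator output is high.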

We can condition the modeled distribution by making $G$ take a code vector $c$ as an additional input to produce a sample $G(z, c)$, as is done in InfoGAN (Chen et al., 2016) and other conditional variants (Mirza & Osindero, 2014). In our case, we restrict the code vector $c$ to have a one-hot encoding. Defining the conditional probability distribution as $Q(\hat{c} \mid x)$, we obtain an objective function for the classifier $Q$, the general cross-entropy loss:

$$\min_{Q} \; \mathbb{E}_{z \sim p_z,\, c \sim p_c}\Big[H\big(c,\, Q(\hat{c} \mid G(z, c))\big)\Big] \tag{2}$$

where $H$ is the cross entropy function. The conditional variants of the standard GAN settings optimize both Objectives (1) and (2), where $G$ may or may not be optimized over Objective (2). We do not optimize $G$ over Objective (2).
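The cross-entropy loss of Objective (2) can be sketched for one-hot codes as follows. The helper name `cross_entropy`, the `eps` constant and the toy probabilities are assumptions made for illustration. In a real setup the probabilities would be the classifier outputs on generated samples.

```python
import numpy as np

def cross_entropy(codes_onehot, q_probs, eps=1e-12):
    """Mean cross entropy H(c, q) = -sum_k c_k * log(q_k) over a batch.

    codes_onehot: (batch, N) one-hot code vectors fed to the generator
    q_probs:      (batch, N) classifier probabilities for the generated samples
    """
    codes_onehot = np.asarray(codes_onehot, dtype=float)
    q_probs = np.asarray(q_probs, dtype=float)
    per_sample = -np.sum(codes_onehot * np.log(q_probs + eps), axis=1)
    return float(np.mean(per_sample))

# Two generated samples, produced from codes with index 0 and index 2 (N = 3).
codes = np.eye(3)[[0, 2]]
q_good = np.array([[0.90, 0.05, 0.05], [0.10, 0.10, 0.80]])  # mostly correct
q_uniform = np.full((2, 3), 1.0 / 3.0)                        # uninformative

loss_good = cross_entropy(codes, q_good)
loss_uniform = cross_entropy(codes, q_uniform)
```

A classifier that recovers the code of each generated sample drives this loss towards zero, while an uninformative classifier stays at $\log N$.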

DoPaNet consists of three main components: a conditional generator $G$, a classifier $Q$ and a set of independent discriminators $D_1, \dots, D_N$. We use categorical code vectors $c$ with one-hot encoding, where $N$ is the number of discriminators used. We use the notation $c_i$ to denote the one-hot code vector with value 1 at the index $i$. As illustrated in Figure 2, $G$ generates a sample $G(z, c)$. Next we feed the sample to the classifier $Q$ to get the categorical probability distribution. For each generated sample we draw an index $\sigma \sim Q(\hat{c} \mid G(z, c))$, which decides the corresponding discriminator $D_\sigma$ that is going to process the generated sample. Similarly, for a real sample $x$, we draw $\sigma \sim Q(\hat{c} \mid x)$, which defines the discriminator $D_\sigma$ for the sample $x$. Thus, for every sample, the discriminator used is decided by the classifier $Q$. This yields a fully-differentiable computational graph, despite the fact that the sampling operation is non-differentiable. In other words, once $D_\sigma$ is selected using predictions from $Q$, the training requires no further modifications to the standard GAN optimization algorithm; therefore it is compatible with all recent advanced variants of GANs. In our experiments we define $p_z$ as a standard normal distribution and $p_c$ as a uniform categorical distribution unless otherwise stated.
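The routing step described above can be sketched as follows: one discriminator index is drawn per sample from the classifier's categorical output, and the batch is split accordingly. The function names `route` and `split_by_discriminator` and the toy probabilities are hypothetical; this is an illustration of the idea under those assumptions, not the authors' implementation.

```python
import numpy as np

def route(q_probs, rng):
    """Draw one discriminator index per sample from the classifier output.

    q_probs: (batch, N) categorical probabilities, each row summing to 1
    """
    q_probs = np.asarray(q_probs, dtype=float)
    n_disc = q_probs.shape[1]
    return np.array([rng.choice(n_disc, p=p) for p in q_probs])

def split_by_discriminator(samples, sigma, n_disc):
    """Group samples so that discriminator i only sees the samples routed to it."""
    samples = np.asarray(samples)
    return [samples[sigma == i] for i in range(n_disc)]

rng = np.random.default_rng(0)
# Hypothetical 1D samples and a fully confident classifier (one-hot rows).
samples = np.array([-2.0, -1.9, 0.1, 2.0, 2.1])
q_probs = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)

sigma = route(q_probs, rng)
batches = split_by_discriminator(samples, sigma, n_disc=3)
```

Because the index is only used to select which discriminator processes a sample, each selected discriminator can then be trained with an unmodified GAN loss on its own sub-batch.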

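Let us define the minimax objective for DoPaNet, where $\sigma(x)$ denotes the discriminator index drawn from the classifier output $Q(\hat{c} \mid x)$ for a sample $x$:

$$\min_{G} \max_{D_1, \dots, D_N} \; \mathbb{E}_{x \sim p_d}\big[\log D_{\sigma(x)}(x)\big] + \mathbb{E}_{z \sim p_z,\, c \sim p_c}\big[\log\big(1 - D_{\sigma(G(z, c))}(G(z, c))\big)\big] \tag{3}$$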

We train DoPaNet by iteratively optimizing the following objective function (refer Algorithm 1):

(4) |

The classifier $Q$ is trained only using Objective (2), and is applied to the generated samples as well as the real samples to decide which discriminator to use. It is optimal when it correctly classifies the generated samples into their corresponding codes $c$. Empirically we observe that the classifier easily reaches its optimum, as can be observed in Figures 6(b) and 6(c): the blue and orange curves (samples predicted as each of the two codes) coincide with the samples forming the blue and green areas (samples generated using the respective codes). Interestingly, we observe that the classifier is able to indirectly control the generator through the discriminators, as $G$ groups its generations according to the code vectors $c$.

Here we provide a formal theoretical formulation of our model, with the proof presented in the Appendix.

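###### Lemma 3.1.

For optimal $Q$ and fixed $G$, the optimal $D_i$, $i \in \{1, \dots, N\}$, is

$$D_i^{*}(x) = \frac{p_{d_i}(x)}{p_{d_i}(x) + p_{g_i}(x)} \tag{5}$$

where $p_{d_i}$ is the probability distribution of the real samples that $Q$ assigns to $D_i$ (the $i$-th mode of the real distribution), and $p_{g_i}$ is the distribution of the samples generated from the code $c_i$.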
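Optimal-discriminator results of this kind rest on a pointwise fact: for positive constants $a$ and $b$, the function $y \mapsto a \log y + b \log(1 - y)$ on $(0, 1)$ is maximized at $y = a / (a + b)$. The sketch below checks this numerically on a grid. The values of `a` and `b` are hypothetical stand-ins for the real and generated densities at a single point, and the grid search is only an illustration, not part of the proof.

```python
import numpy as np

def pointwise_objective(y, a, b):
    """Value of a*log(y) + b*log(1 - y), the per-point discriminator objective."""
    return a * np.log(y) + b * np.log(1.0 - y)

def numeric_argmax(a, b, n_grid=100001):
    """Grid search for the maximizer of the pointwise objective on (0, 1)."""
    ys = np.linspace(1e-6, 1.0 - 1e-6, n_grid)
    return float(ys[np.argmax(pointwise_objective(ys, a, b))])

# Hypothetical density values of the real and generated distributions at one x.
a, b = 0.3, 0.1
y_star = numeric_argmax(a, b)
closed_form = a / (a + b)
```

When the two densities agree at a point, the maximizer is 1/2, which is the situation of a discriminator that can no longer separate real from generated samples.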

We can now reformulate the minimax game as

###### Theorem 3.2.

In case of discriminators, the global minimum of is achieved if and only if , . When , the global minimum value of is .

Sampling from $p_{d_i}$ is the same as sampling from the $i$-th mode of the real distribution, the mode that covers the set of samples assigned to $D_i$. Please note that we can assume that the distributions $p_{d_i}$ have disjoint supports. Figures 6(b) and 6(c) empirically show that this assumption of disjoint support, which is decided by the classifier $Q$, is valid.

So, in theory each $p_{g_i}$ should converge to a different mode, as its target distribution $p_{d_i}$ is itself different. Hence, empirically the modes covered should be at least as diverse as with the standard GAN model. This is also observed in all our experiments, as well as when comparing Figures 6(a) and 6(b).

###### Corollary 3.2.1.

At the global minimum of the objective, the generative model $G$ replicates the real distribution $p_d$, categorized into different modes.

Thus our model DoPaNet can learn the real data distribution while also controlling the diversity of the generations by sampling from a different real mode corresponding to each code $c_i$, which we also verify experimentally in the next section.

We demonstrate the performance of DoPaNet on a diverse set of tasks of increasing complexity, involving probability density estimation and image generation. To illustrate the functioning of DoPaNet, we first set up two low-dimensional experiments using Gaussian Mixture Models (GMMs) as the target probability density function: a 1D GMM and a 2D GMM. For the 1D Gaussian Mixture case, we compare DoPaNet’s robustness against other approaches by reproducing the experiment setting detailed in (Ghosh et al., 2017), and we outperform all competing methods both qualitatively and quantitatively. We also show DoPaNet’s performance using multiple discriminators and show how the training dynamics change according to the number of discriminators. We observe that increasing the number of discriminators improves the performance of the network until the point where the number of discriminators exceeds the number of underlying modes. Using the 2D circular GMM, we show that the classifier $Q$ is able to learn a good partitioning of the distribution and therefore each discriminator acts on samples from a different mode, unlike GMAN (Durugkar et al., 2017). We show that DoPaNet is able to utilize the capacity of multiple discriminators and that we can control the mode the generator samples from using the code $c$. Even in this case, DoPaNet performs better in capturing all the modes.

We finally demonstrate qualitative results on commonly investigated datasets: Stacked-MNIST, CIFAR-10 and CelebA. DoPaNet is able to generate good quality, diverse samples. In the case of CIFAR-10, we also show that we can generate samples from every class given the class label. The information about the network architectures and the implementation details is provided in the Appendix.

In DoPaNet, the role of the classifier is to partition both the real and generated data-points into different clusters or modes, and each discriminator is consequently only trained on a separate cluster. In order to fully understand how this helps the training, we experimented with two toy datasets obtained using mixture of Gaussian variants: a 1D GMM with 5 modes, as used in (Ghosh et al., 2017), and a 2D circular GMM with 3 and 8 modes on the unit circle.

First, we reproduced the 1D setting in (Ghosh et al., 2017) with 5 modes at and standard deviations respectively and we compare to the numbers reported in that paper in Table 1. We sampled data points each from the real distribution and the generator distribution. For each of these two distributions, we created a histogram using bin size of with bins lying in the range of to . We then obtained Chi-square distance as well as KL divergence between the generator distribution and the true data distribution using these two histograms. To compare against GMAN using different number of discriminators, we used samples (instead of above) and show the results in Table 2 and Figure 7.
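The histogram-based evaluation described above can be sketched as follows. Several details are assumptions for illustration: the symmetric form of the Chi-square distance with a factor of 1/2, the direction of the KL divergence (data relative to generator), the `eps` smoothing, and all numeric settings (mode locations, sample count, bin size, range), which are hypothetical and not the values of the original experiment.

```python
import numpy as np

def histogram_distances(real, fake, lo, hi, bin_size, eps=1e-10):
    """Chi-square distance and KL divergence between two binned sample sets."""
    edges = np.arange(lo, hi + bin_size, bin_size)
    p = np.histogram(real, bins=edges)[0].astype(float)
    q = np.histogram(fake, bins=edges)[0].astype(float)
    p /= max(p.sum(), 1.0)  # normalize counts to probabilities
    q /= max(q.sum(), 1.0)
    chi2 = 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
    kl = np.sum(p * np.log((p + eps) / (q + eps)))
    return float(chi2), float(kl)

rng = np.random.default_rng(0)

def sample_mixture(n, means, stds):
    """Sample n points from an equally weighted 1D Gaussian mixture."""
    k = rng.integers(0, len(means), size=n)
    return rng.normal(np.asarray(means)[k], np.asarray(stds)[k])

# Hypothetical two-mode target, chosen only to keep the example small.
means, stds = [-5.0, 5.0], [1.0, 1.0]
real = sample_mixture(20000, means, stds)
good = sample_mixture(20000, means, stds)       # generator covering both modes
collapsed = rng.normal(5.0, 1.0, size=20000)    # generator stuck on one mode

chi2_good, kl_good = histogram_distances(real, good, -10, 10, 1)
chi2_bad, kl_bad = histogram_distances(real, collapsed, -10, 10, 1)
```

Both measures are close to zero when the generator matches the target and grow sharply when a mode is missing, which is why they are useful indicators of mode collapse.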

We then introduce a 2D experiment setting with 2D Gaussian Mixture Model (GMM). It has multiple modes having covariance matrix of , where is an identity matrix, and equally separated means lying on a unit circle (please refer to Figure 8 for the mode case). For Table 3 we consider modes and construct histograms using samples and bin size of with bins lying in the range of .
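A minimal sketch of sampling from such a circular 2D GMM is given below: the means are equally spaced on the unit circle and every mode shares an isotropic covariance. The function name `sample_circular_gmm` and the standard deviation used in the example are hypothetical choices for illustration, not the values of the original experiment.

```python
import numpy as np

def sample_circular_gmm(n_samples, n_modes, std, rng):
    """Sample from a 2D GMM with equally spaced means on the unit circle.

    Every mode has covariance std**2 * I and equal mixture weight.
    Returns the points, their mode labels and the mode means.
    """
    angles = 2.0 * np.pi * np.arange(n_modes) / n_modes
    means = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    labels = rng.integers(0, n_modes, size=n_samples)
    points = means[labels] + std * rng.standard_normal((n_samples, 2))
    return points, labels, means

rng = np.random.default_rng(0)
# Hypothetical settings: 3 modes and a small standard deviation.
points, labels, means = sample_circular_gmm(3000, n_modes=3, std=0.05, rng=rng)
```

The returned labels are not used for training. They only make it easy to check afterwards which mode each real sample came from.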

For these experiments, we use uniform distribution of dimension for and uniform categorical distribution for to get the generations in both 1D and 2D experiments.

GAN Variants | Chi-square | KL-Div |
---|---|---|
DCGAN | 0.90 | 0.322 |
WGAN | 1.32 | 0.614 |
BEGAN | 1.06 | 0.944 |
GoGAN | 2.52 | 0.652 |
Unrolled GAN | 3.98 | 1.321 |
Mode-Reg DCGAN | 1.02 | 0.927 |
InfoGAN | 0.83 | 0.210 |
MA-GAN | 1.39 | 0.526 |
MAD-GAN | 0.24 | 0.145 |
GMAN | 1.44 | 0.69 |
DoPaNet | 0.03 | 0.02 |

In Table 1, we show that DoPaNet outperforms other GAN architectures on the 1D task by a large margin in terms of Chi-square distance and KL-Divergence. We believe that this success is due to the capability of the classifier $Q$ to learn to partition the underlying distribution easily. We also show in Table 3 that in the 2D task DoPaNet achieves better performance than GMAN (Durugkar et al., 2017) in terms of both KL-Divergence and Chi-square.

Number of discriminators | Chi-square (GMAN) | Chi-square (DoPaNet) | KL-Div (GMAN) | KL-Div (DoPaNet) |
---|---|---|---|---|
2 | 5.00 ± 6.80 | 1.89 ± 0.92 | 1.74 ± 0.63 | 0.81 ± 0.27 |
3 | 2.96 ± 2.88 | 1.10 ± 2.43 | 1.50 ± 0.57 | 0.55 ± 0.36 |
4 | 3.41 ± 2.73 | 0.74 ± 0.98 | 1.48 ± 0.42 | 0.50 ± 0.41 |
5 | 4.62 ± 3.92 | 0.27 ± 0.54 | 1.55 ± 0.30 | 0.25 ± 0.26 |
6 | 3.94 ± 3.22 | 0.41 ± 0.50 | 1.56 ± 0.22 | 0.35 ± 0.20 |
7 | 2.84 ± 1.51 | 0.42 ± 0.43 | 1.45 ± 0.38 | 0.36 ± 0.21 |
8 | 2.80 ± 1.55 | 0.93 ± 1.14 | 1.36 ± 0.43 | 0.56 ± 0.31 |

We study the change in performance with regard to the number of discriminators used by both GMAN (Durugkar et al., 2017) and DoPaNet. The clustering mechanism with a varying number of discriminators is illustrated on the 1D task in Figure 7. We see in this experiment that classes of the generated samples are first attracted towards larger clusters of the real data. By adding more discriminators, the quality of the reconstructed modes is refined. The refinement process starts with the easiest separation between peaks, after which further modes are distinguished by the classifier $Q$, and so on. We quantitatively see in Table 2 that increasing the number of discriminators improves the performance of both GMAN and DoPaNet up to a certain point where $N$ (the number of discriminators) matches the number of modes in the data. After this optimal point, increasing $N$ yields a decreasing performance, because already captured modes are oversampled. In Figure 7 we have marked examples of oversampling in the last two columns.

GAN Variants | Chi-square | KL-Div |
---|---|---|
Standard GAN | 3.883 | 2.860 |
GMAN | 1.253 | 0.636 |
DoPaNet | 0.449 | 0.246 |

It is interesting to note that when the same experiment was carried out in MADGAN (Ghosh et al., 2017), which uses multiple generators, their performance peaked at unlike GMAN and ours, both of which logically peaked at considering that there are visible modes. This shows a difficulty in tuning the hyper-parameter in (Ghosh et al., 2017) for different applications.

In 2D experiments, for both GMAN (Durugkar et al., 2017) and DoPaNet we experiment with (where is the number of discriminators, is the number of modes) for both quantitative (listed in Table 3) and qualitative results (see Appendix id1), and the setting for qualitative results (illustrated in Figure 8). In all of the runs, DoPaNet was able to capture, and classify all modes of the true distribution correctly, while GMAN (Durugkar et al., 2017) failed on both the as well as the setting.

In Figure 8 we show a circular 2D GMM with 3 modes on the unit circle, which is used to train GMAN and DoPaNet. In the case of DoPaNet, it can also be observed (see column 1) that the generator generates from a different mode for a different code $c$. We can also visually see that the classifier is indirectly able to control the conditioned samples by routing them to the corresponding discriminators (see columns 2-3). It also illustrates that we are indeed able to utilize the capabilities of multiple discriminators as intended: different discriminators begin to specialize on different modes and therefore provide different gradients for the respective mode as well. Although it is trained with the generated code vectors only, DoPaNet’s classifier achieves a fine partitioning of the original distribution. We suggest that our approach succeeds because each discriminator is fed different samples from the beginning. $Q$ is initialized to assign each real sample to every discriminator with equal probability, but given that the generator samples different points for every code vector, $Q$ quickly learns the different modes that the samples from $p_{g_i}$ are attracted towards (where $p_{g_i}$ refers to the conditional distribution modeled by $G(z, c_i)$). Given that the updated $Q$ is already providing different subsets of the input space to the different discriminators, the discriminators will provide different gradients for each corresponding code vector. Therefore $G$ learns to separate the modes of the learnt distributions conditioned on $c$ from each other. We argue that GMAN is not able to utilize multiple discriminators in this experiment setup and that most of the learning is done by just a few discriminators rather than their effective ensemble (see the Appendix).

After investigating the DoPaNet performance on low dimensional tasks, now we validate DoPaNet on real image generation tasks.

We first investigate how well DoPaNet can reconstruct the real distribution of the data using the Stacked-MNIST dataset (Srivastava et al., 2017). This dataset contains three-channel color images, with a randomly selected sample from the MNIST dataset in each channel. This results in ten possible modes on each channel, so the number of all possible modes in the dataset is $10^3 = 1000$. It was shown in (Ghosh et al., 2017) that various architectures recovered only a small portion of these modes. A qualitative image depicting the recovered modes using the traditional DCGAN (Radford et al., 2016) architecture and DoPaNet can be seen in Figure 2. We have also measured the Kullback-Leibler divergence between the real distribution and the generated distributions. We compare DoPaNet against the other GAN variants in Table 4.

GAN Variants | KL Div |
---|---|
DCGAN (Radford et al., 2016) | 2.15 |
WGAN (Arjovsky et al., 2017) | 1.02 |
BEGAN (Berthelot et al., 2017) | 1.89 |
GoGAN (Juefei-Xu et al., 2017) | 2.89 |
Unrolled GAN (Metz et al., 2017) | 1.29 |
Mode-Reg DCGAN (Che et al., 2017) | 1.79 |
InfoGAN (Chen et al., 2016) | 2.75 |
MA-GAN (Ghosh et al., 2017) | |
MAD-GAN (Ghosh et al., 2017) | 0.91 |
GMAN (Durugkar et al., 2017) | 2.17 |
DoPaNet (ours) | 0.13 |
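The mode count of Stacked-MNIST follows from ten digits on each of three channels. The sketch below shows one way to index the resulting modes and to compare a mode histogram against a reference. It assumes that a digit label per channel is already available for every image (for example from a pre-trained digit classifier, which is an assumption here), and the helper names, the `eps` smoothing and the direction of the KL divergence are illustrative choices.

```python
import itertools
import numpy as np

N_MODES = 10 ** 3  # ten digits per channel, three channels

def mode_ids(digits):
    """Map an (n, 3) array of per-channel digit labels to mode ids in [0, 1000)."""
    digits = np.asarray(digits)
    return digits[:, 0] * 100 + digits[:, 1] * 10 + digits[:, 2]

def mode_histogram(digits):
    """Empirical probability of each of the 1000 modes."""
    return np.bincount(mode_ids(digits), minlength=N_MODES) / len(digits)

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two mode histograms, smoothed by eps."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Reference covering every mode once, and a hypothetical collapsed generator
# that only ever produces the same digit on all three channels.
all_digits = np.array(list(itertools.product(range(10), repeat=3)))
collapsed_digits = np.array([[d, d, d] for d in range(10)])

n_covered_ref = int(np.unique(mode_ids(all_digits)).size)
n_covered_collapsed = int(np.unique(mode_ids(collapsed_digits)).size)
kl_collapsed = kl_divergence(mode_histogram(collapsed_digits), mode_histogram(all_digits))
```

Counting distinct mode ids gives the number of recovered modes, and the histogram comparison gives a divergence of the kind reported in Table 4, up to the assumptions listed above.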

To show the image generation capabilities of DoPaNet, we trained the multi discriminator setting on a lower and a higher complexity image generation task, CIFAR-10 and CelebA respectively. We compare our results qualitatively to the ones reported by GMAN (Durugkar et al., 2017) on both tasks in Figure 14.

While learning the distribution of colored images may sound easy, the main challenge is to learn geometric structures from low resolution and reproduce them in various colors, backgrounds, angles etc. Following (Mescheder et al., 2018), the generator $G$ takes the ground truth label as input along with the code $c$, and the discriminator produces one output per class, of which only the one at the index of the ground truth label is used for training $D$ as well as $G$, while $Q$ is trained using only the code $c$. Thus, the code $c$ helps it learn class invariant features. We illustrate in Figure 12 that DoPaNet is capable of capturing these features, such as different object orientations and colors depicted in various weather conditions. It is also able to recognize minute details like wheels, horse hair, ship textures, etc. We present more generations corresponding to each of the classes in the Appendix.

We also show DoPaNet’s performance on large scale images by training a residual network for 100k iterations on the CelebA dataset. This dataset contains various modes like lighting, pose, gender, hair style, clothing and facial expressions, which are challenging to capture for generative models. In Figure 13 we demonstrate that DoPaNet is capable of recovering the aforementioned visual features.

We conclude that a generator does not need an adversary of equal capacity to converge, meaning that the standard GAN training procedure can be enhanced with multiple (and even weaker) discriminators, each specialized only in attracting the model distribution of the generator to its corresponding mode.

We show experimentally that DoPaNet utilizes the capability of multiple discriminators by partitioning the target distribution into several identifiable modes and making each discriminator work on a separate mode. Thus, it reduces the complexity of the modes to be learnt by each discriminator. We show qualitatively and quantitatively that DoPaNet is able to better cover the real distribution. We observe that the generator is also able to sample from different identifiable modes of the data distribution given the corresponding code vectors.

Acknowledgement This work was supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to acknowledge the Royal Academy of Engineering and FiveAI. Viveka is wholly funded by Toyota Research Institute’s grant.

## References

- Arjovsky & Bottou (2017) Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
- Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- Arora et al. (2017a) Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (gans). In ICML, 2017a.
- Arora et al. (2017b) Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, 2017b.
- Berthelot et al. (2017) Berthelot, D., Schumm, T., and Metz, L. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
- Che et al. (2017) Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. In ICLR, 2017.
- Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
- Donahue et al. (2017) Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. In ICLR, 2017.
- Durugkar et al. (2017) Durugkar, I., Gemp, I., and Mahadevan, S. Generative multi-adversarial networks. In ICLR, 2017.
- Ghosh et al. (2016) Ghosh, A., Kulharia, V., and Namboodiri, V. Message passing multi-agent gans. arXiv preprint arXiv:1612.01294, 2016.
- Ghosh et al. (2017) Ghosh, A., Kulharia, V., Namboodiri, V., Torr, P. H., and Dokania, P. K. Multi-agent diverse generative adversarial networks. arXiv preprint arXiv:1704.02906, 1(4), 2017.
- Goodfellow (2016) Goodfellow, I. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
- Goodfellow et al. (2014a) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, 2014a.
- Goodfellow et al. (2014b) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014b.
- Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In NIPS, 2017.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Juefei-Xu et al. (2017) Juefei-Xu, F., Boddeti, V. N., and Savvides, M. Gang of gans: Generative adversarial networks with maximum margin ranking. arXiv preprint arXiv:1704.04865, 2017.
- Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
- Ledig et al. (2017) Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A. P., Tejani, A., Totz, J., Wang, Z., et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
- Li et al. (2017) Li, C., Xu, K., Zhu, J., and Zhang, B. Triple generative adversarial nets. In NIPS, 2017.
- Liu & Tuzel (2016) Liu, M.-Y. and Tuzel, O. Coupled generative adversarial networks. In NIPS, 2016.
- Mescheder (2018) Mescheder, L. On the convergence properties of gan training. arXiv preprint arXiv:1801.04406, 2018.
- Mescheder et al. (2018) Mescheder, L., Nowozin, S., and Geiger, A. Which training methods for gans do actually converge? In International Conference on Machine Learning (ICML), 2018.
- Metz et al. (2017) Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. In ICLR, 2017.
- Mirza & Osindero (2014) Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In NIPS, 2016.
- Srivastava et al. (2017) Srivastava, A., Valkov, L., Russell, C., Gutmann, M. U., and Sutton, C. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308–3318, 2017.
- Vondrick et al. (2016) Vondrick, C., Pirsiavash, H., and Torralba, A. Generating videos with scene dynamics. In NIPS, 2016.
- Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.

## Appendix

Here we first give the theoretical formulation of our work DoPaNet to show that the modes captured should be different for each categorical code and that the standard GAN can be considered as its lower bound on mode collapse. We then give some more experimental insights from the 2D task. We also give some more generations using CIFAR. Later we provide implementation details of the network architectures we used.

Here we present the theoretical formulation for our proposed method DoPaNet.

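###### Lemma 3.1.

For optimal $Q$ and fixed $G$, the optimal $D_i$, $i \in \{1, \dots, N\}$, is

$$D_i^{*}(x) = \frac{p_{d_i}(x)}{p_{d_i}(x) + p_{g_i}(x)} \tag{6}$$

where $p_{d_i}$ is the probability distribution of the real samples that $Q$ assigns to $D_i$ (the $i$-th mode of the real distribution), and $p_{g_i}$ is the distribution of the samples generated from the code $c_i$.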

###### Proof.

Let us consider the case where we have $N = 2$ discriminators. The theoretical formulation for this case can be trivially extended to a larger number of discriminators. The objective being optimized by the generator and the discriminators is (Obj. 3):

(7) |

When the classifier has converged to its optimal form, the above Equation 7 can be rewritten as:

(8) |

where if . is the categorical distribution and in our case equal probability is assigned to both the values and . Here is that code vector which leads the classifier to pass to for . Please note that we can therefore consider as and as , where and have shared weights except the bias weights in the initial layer. Bias weights in the initial layer are independently trained for and .

The Objective 8 can be rewritten as:

(9) |

where . is a probability distribution such that and where is the set of samples in the mode of the real distribution. So, sampling from is the same as sampling from the mode of the real distribution . Therefore, and .

For a fixed generator , and are also fixed. For a given and , the discriminator tries to maximize the quantity (using Objective 9):

(10) |

where such that for . Therefore, for a fixed generator we get the optimal discriminator as:

(11) |

In case of discriminators, the optimal discriminator can be similarly obtained as:

(12) |

∎

###### Theorem 3.2.

In case of discriminators, the global minimum of is achieved if and only if , . When , the global minimum value of is .

###### Proof.

Given the optimal discriminators and , we can reformulate the Objective 9 as:

(13) |

As noted earlier, the bias weights in the initial layer of and are trained independently, with all other weights shared. Empirically, the shared weights help learn similar features, which are essential in low-level image formation and should be similar even if and were trained independently. We can therefore relax the restriction and consider and to be independent of each other, so Objective 13 can be rewritten as:

(14) |

where,

(15) |

This is the same as optimizing different - pairs on dataset distributions decided by the classifier based on the target real distribution . Figures 6(b) and 6(c) empirically show that the assumption of disjoint support of the distributions and is valid. Equation 15 can be rewritten as:

(16) |

We can further reformulate Equation 16 as:

(17) |

where is the Kullback-Leibler divergence, , and are constants such that () for (the first distribution of the second term) to be a probability distribution. The Kullback-Leibler divergence between two distributions is always non-negative, and it is zero if and only if the two distributions are equal. In the above equation, the two terms are zero simultaneously when and the generator distribution is

where () for to be a probability distribution. Therefore, the global minimum of Eq. 17 is achieved iff . The constants in Eq. 17 are chosen such that:

(18) |

Please note that when , Eq. 17 can be reformulated as:

(19) |

and the global minimum of obtained is . This global minimum value is the same in the general case of discriminators when . ∎
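As a sanity check on the shape of this argument, the sketch below evaluates the classical single-discriminator GAN value function of Goodfellow et al. (2014a) on discrete distributions. It is not the DoPaNet objective: the function names `optimal_discriminator` and `gan_value` and the toy distributions are assumptions made purely for illustration. With the discriminator fixed at its pointwise optimum, the value equals -log 4 when the generator matches the data distribution and is strictly larger otherwise.

```python
import math

def optimal_discriminator(p_d, p_g):
    """Pointwise optimal discriminator p_d / (p_d + p_g) of a standard GAN."""
    # Bins with zero mass under both distributions get 0.5 by convention.
    return [pd / (pd + pg) if pd + pg > 0 else 0.5 for pd, pg in zip(p_d, p_g)]

def gan_value(p_d, p_g):
    """E_{p_d}[log D*] + E_{p_g}[log(1 - D*)] for discrete p_d and p_g."""
    value = 0.0
    for pd, pg, d in zip(p_d, p_g, optimal_discriminator(p_d, p_g)):
        if pd > 0:
            value += pd * math.log(d)
        if pg > 0:
            value += pg * math.log(1.0 - d)
    return value

# Hypothetical three-bin data distribution, used only for illustration.
p_data = [0.25, 0.25, 0.5]
print(gan_value(p_data, p_data))             # generator matches the data
print(gan_value(p_data, [0.5, 0.25, 0.25]))  # mismatched generator
print(-math.log(4.0))                        # reference value
```

When the two distributions have disjoint support, the optimal discriminator separates them perfectly and the value rises to zero, which is the other extreme of the same quantity.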

###### Corollary 3.2.1.

At global minimum of , the generative model replicates the real distribution , categorized into different modes.

###### Proof.

As noted in the proof of Lemma 3.1, sampling from is the same as sampling from the mode of the real distribution . At the global minimum of , we have so is able to sample from the mode of the real distribution. Since the real distribution is categorized into modes in total and each of can sample from the corresponding mode, can replicate the real distribution , categorized into different modes. ∎

As discussed in the main text, here we show qualitative results with (where is the number of discriminators, is the number of modes), whose corresponding quantitative results are reported in Table 3. We illustrate our findings in Figure 17 and Figure 20.

We argue that GMAN is not able to utilize multiple discriminators in this experimental setup and that most of the learning is done by just a few discriminators rather than by their effective ensemble.

In Section 4.1.2, under the Error Analysis paragraph, we claimed that GMAN fails to utilize multiple discriminators to their full potential. Figure 8 already provides visual evidence: the gradient fields of the first two discriminators (top row, red and blue) are almost identical to each other, while the gradient of the third network (top row, green) points towards a completely different mode (lower left) in its non-adjacent area, and around this distant mode the magnitude of the gradient is relatively small.

As discussed in the main text, here we present additional results obtained for each of the classes of CIFAR-10 in Figure 27 and Figure 32.

Here we present the way we structured our experiments and the details about the network architecture we used in the experiments.

First, we reproduced the 1D setting in (Ghosh et al., 2017) with 5 modes at and standard deviations respectively, and we compare to the numbers reported in that paper in Table 1. Second, we compared DoPaNet directly to GMAN (Durugkar et al., 2017) qualitatively in Figure 8 and quantitatively in Table 3 using a circular 2D GMM distribution with 3 and 8 modes respectively, around the unit circle. To illustrate the advantage of DoPaNet over GMAN (Durugkar et al., 2017), we plotted the gradient field to visualize the benefit of using multiple discriminators. The gradient field of this setup can be seen in Figure 8. To get quantitative results, we estimated the probability density using a histogram with 1400 bins over the real and generated samples and computed the Chi-square distance and the KL divergence between the two histograms.
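The histogram-based evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the histogram range, the direction of the KL divergence (real relative to generated), the `eps` smoothing of empty bins, and the symmetric form of the Chi-square distance are all assumptions, as are the toy Gaussian samples standing in for real and generated data. Only the bin count of 1400 comes from the text.

```python
import math
import random

def histogram(samples, bins, lo, hi):
    """Probability mass per bin for samples falling inside [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        if lo <= x <= hi:
            counts[min(int((x - lo) / width), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps avoids log(0) on empty bins (a smoothing assumption)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def chi_square(p, q):
    """Symmetric Chi-square distance: sum of (p - q)^2 / (p + q) over non-empty bins."""
    return sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q) if pi + qi > 0)

# Toy stand-ins for the real and generated 1D samples (hypothetical data).
rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(20000)]
fake = [rng.gauss(0.5, 1.0) for _ in range(20000)]
p_real = histogram(real, bins=1400, lo=-10.0, hi=10.0)
p_fake = histogram(fake, bins=1400, lo=-10.0, hi=10.0)
print(kl_divergence(p_real, p_fake), chi_square(p_real, p_fake))
```

With this many bins, the estimate is sensitive to the sample count, which is consistent with the large numbers of samples used in the experiments.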

When comparing against other GAN variants, we run the 1D experiments using a fixed set of 200,000 samples from the real distribution and generate 65,536 elements from each model.

Since DoPaNet is directly designed to separate different modes, we outperform all the other methods as shown in Table 1.

In our case, we sample the code vectors for the generator from a categorical distribution with uniform probability. For the best results, we use 5 discriminators in both GMAN (Durugkar et al., 2017) and DoPaNet. For both, we train 3 instances and select the best score from each of them.
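Sampling code vectors from a uniform categorical distribution can be sketched as below. The one-hot encoding, the helper name `sample_codes`, and the batch size used in the example are assumptions for illustration; the text only states that codes are drawn with uniform probability.

```python
import random

def sample_codes(batch_size, num_codes, seed=None):
    """Draw one-hot code vectors with equal probability for each category."""
    rng = random.Random(seed)
    codes = []
    for _ in range(batch_size):
        k = rng.randrange(num_codes)  # uniform over {0, ..., num_codes - 1}
        codes.append([1.0 if j == k else 0.0 for j in range(num_codes)])
    return codes

# Example: one code per sample, with as many categories as discriminators
# (assumed here to be 5, matching the number of discriminators above).
batch = sample_codes(batch_size=64, num_codes=5, seed=0)
print(batch[0])
```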

In 1D, for better non-parametric probability density estimation, we increased the number of generated samples from 65,536 to 1,000,000, as done in (Ghosh et al., 2017). For more reliable results on the implied mechanism of both approaches, we ran the training 20 times for each algorithm with number of discriminators , totaling 320 training runs. As in the previous experiment, we chose the best results from each run.

In 2D, for both variants we experiment with (where is the number of discriminators, is the number of Gaussians we used in the mixture) for quantitative results, listed in Table 3, and setting for qualitative results, illustrated in Figure 8. For each experiment we use a fixed set of 1,000,000 samples and perform 5 runs per algorithm, then report the best run. We took care to keep the comparison fair and used the same set of parameters as in the 1D experiments.
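A circular 2D mixture of Gaussians of the kind used in these experiments can be generated roughly as follows. The per-mode standard deviation `std`, the equal mixture weights, and the function name `sample_circular_gmm` are assumptions, since the text only specifies that the modes lie around the unit circle.

```python
import math
import random

def sample_circular_gmm(n_samples, n_modes, std=0.05, radius=1.0, seed=0):
    """Sample 2D points from Gaussians centred evenly on a circle."""
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n_samples):
        k = rng.randrange(n_modes)            # pick a mode uniformly (assumed)
        angle = 2.0 * math.pi * k / n_modes   # evenly spaced mode centres
        cx, cy = radius * math.cos(angle), radius * math.sin(angle)
        samples.append((rng.gauss(cx, std), rng.gauss(cy, std)))
        labels.append(k)
    return samples, labels

points, modes = sample_circular_gmm(n_samples=10000, n_modes=8)
print(points[0], modes[0])
```

Keeping the mode index alongside each point makes it easy to check afterwards which modes a generator covers.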

For both the generator and discriminator we use ResNet architectures (He et al., 2016), with 18 layers each in the CIFAR-10 experiments and 26 layers each in the CelebA experiments. As in (Mescheder, 2018), we multiply the output of the ResNet blocks by 0.1 and use a 256-dimensional unit Gaussian distribution. For categorical conditional image generation we use an embedding network that projects category indices to a 256-dimensional label vector, normalized to the unit sphere. In the case of conditional image generation the classifier is trained on code vectors, so it is constrained to learn the original class labels. We embed the code vector similarly to the ground truth labels in this setting for CIFAR-10. We use Leaky ReLU nonlinearities everywhere, without BatchNorm.
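The label embedding described above, a lookup into 256-dimensional vectors followed by normalization to the unit sphere, might look like this in outline. In the actual model the embedding is part of a trained network; the randomly initialized table, the helper names `make_embedding_table` and `embed_label`, and the order of operations are assumptions for illustration.

```python
import math
import random

def make_embedding_table(num_classes, dim=256, seed=0):
    """Random stand-in for a learned embedding table (one row per class)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]

def embed_label(table, class_index):
    """Look up a class vector and project it onto the unit sphere."""
    vec = table[class_index]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

table = make_embedding_table(num_classes=10)  # e.g. the 10 CIFAR-10 classes
label_vec = embed_label(table, 3)
print(len(label_vec), math.sqrt(sum(v * v for v in label_vec)))
```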

Following the considerations in (Mescheder, 2018) for optimizing parameters of , , we use RMSProp with , , and an initial learning rate of . We use a batch size of 64 and train for 700,000 and 400,000 iterations for the CIFAR-10 and CelebA tasks respectively. Similar to work that provided state-of-the-art results on image generation tasks (Karras et al., 2018; Mescheder, 2018), for visualizing the generator’s progress we use an exponential moving average of the parameters of with decay 0.999.
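The exponential moving average of the generator parameters can be sketched as a simple per-parameter update. The flat list of floats standing in for network weights and the function name `ema_update` are assumptions; only the decay of 0.999 comes from the text.

```python
def ema_update(avg_params, params, decay=0.999):
    """Return decay * average + (1 - decay) * current for each parameter."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]

# Toy example: the averaged copy slowly tracks the current parameters.
avg = [0.0, 0.0]
for step in range(3):
    current = [1.0, 2.0]  # hypothetical parameters after a training step
    avg = ema_update(avg, current)
print(avg)
```

With a decay this close to one, the averaged parameters change slowly, so the averaged copy is typically initialized from the generator's own weights rather than from zero as in this toy example.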