Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis
Abstract
Most conditional generation tasks expect diverse outputs given a single conditional context. However, conditional generative adversarial networks (cGANs) often focus on the prior conditional information and ignore the input noise vectors, which contribute to the output variations. Recent attempts to resolve the mode collapse issue for cGANs are usually task-specific and computationally expensive. In this work, we propose a simple yet effective regularization term to address the mode collapse issue for cGANs. The proposed method explicitly maximizes the ratio of the distance between generated images to the distance between the corresponding latent codes, thus encouraging the generators to explore more minor modes during training. This mode seeking regularization term is readily applicable to various conditional generation tasks without imposing training overhead or modifying the original network structures. We validate the proposed algorithm on three conditional image synthesis tasks, namely categorical generation, image-to-image translation, and text-to-image synthesis, with different baseline models. Both qualitative and quantitative results demonstrate the effectiveness of the proposed regularization method for improving diversity without loss of quality.
1 Introduction
Generative adversarial networks (GANs) [8] have been shown to effectively capture complex, high-dimensional image data and have numerous applications. Built upon GANs, conditional GANs (cGANs) [21] take external information as additional inputs. For image synthesis, cGANs can be applied to various tasks with different conditional contexts. With class labels, cGANs can be applied to categorical image generation. With text sentences, cGANs can be applied to text-to-image synthesis [23, 31]. With images, cGANs have been used in tasks including image-to-image translation [10, 11, 15, 17, 33, 34], semantic manipulation [30], and style transfer [16].
For most conditional generation tasks, the mappings are multimodal in nature, i.e., a single input context corresponds to multiple plausible outputs. A straightforward approach to handle multimodality is to take random noise vectors along with the conditional contexts as inputs, where the contexts determine the main content and the noise vectors are responsible for variations. For instance, in the dog-to-cat image-to-image translation task [15], the input dog images decide contents like orientations of heads and positions of facial landmarks, while the noise vectors help the generation of different species. However, cGANs usually suffer from the mode collapse [8, 26] problem, where generators only produce samples from a single mode or a few modes of the distribution and ignore other modes. The noise vectors are ignored or have only a minor impact, since cGANs pay more attention to learning from the high-dimensional and structured conditional contexts.
There are two main approaches to address the mode collapse problem in GANs. A number of methods focus on discriminators by introducing different divergence metrics [1, 19] and optimization processes [6, 20, 26]. The other methods use auxiliary networks such as multiple generators [7, 18] and additional encoders [2, 4, 5, 27]. However, mode collapse is relatively less studied in cGANs. Some recent efforts have been made in the image-to-image translation task to improve diversity [10, 15, 34]. Similar to the second category in the unconditional setting, these approaches introduce additional encoders and loss functions to encourage a one-to-one relationship between the output and the latent code. These methods either entail heavy computational overheads on training or require auxiliary networks that are often task-specific and cannot be easily extended to other frameworks.
In this work, we propose a mode seeking regularization method that can be applied to cGANs for various tasks to alleviate the mode collapse problem. Given two latent vectors and the corresponding output images, we propose to maximize the ratio of the distance between images with respect to the distance between latent vectors. In other words, this regularization term encourages generators to generate dissimilar images during training. As a result, generators can explore the target distribution, and enhance the chances of generating samples from different modes. On the other hand, we can train the discriminators with dissimilar generated samples to provide gradients from minor modes that are likely to be ignored otherwise. This mode seeking regularization method incurs marginal computational overheads and can be easily embedded in different cGAN frameworks to improve the diversity of synthesized images.
We validate the proposed regularization algorithm through an extensive evaluation of three conditional image synthesis tasks with different baseline models. First, for categorical image generation, we apply the proposed method on DCGAN [22] using the CIFAR-10 [13] dataset. Second, for image-to-image translation, we embed the proposed regularization scheme in Pix2Pix [11] and DRIT [15] using the facades [3], maps [11], Yosemite [33], and cat-dog [15] datasets. Third, for text-to-image synthesis, we incorporate StackGAN++ [31] with the proposed regularization term using the CUB-200-2011 [29] dataset. We evaluate the diversity of synthesized images using perceptual distance metrics [32].
However, the diversity metric alone cannot guarantee the similarity between the distribution of generated images and the distribution of real data. Therefore, we adopt two recently proposed bin-based metrics [24]: the Number of Statistically-Different Bins (NDB) metric, which determines the relative proportions of samples falling into clusters predetermined by real data, and the Jensen-Shannon Divergence (JSD) distance, which measures the similarity between bin distributions. Furthermore, to verify that we do not achieve diversity at the expense of realism, we evaluate our method with the Fréchet Inception Distance (FID) [9] as the metric for quality. Experimental results demonstrate that the proposed regularization method helps existing models from various applications achieve better diversity without loss of image quality. Figure 1 shows the effectiveness of the proposed regularization method for existing models.
The main contributions of this work are:

We propose a simple yet effective mode seeking regularization method to address the mode collapse problem in cGANs. This regularization scheme can be readily extended into existing frameworks with marginal training overheads and modifications.

We demonstrate the generalizability of the proposed regularization method on three different conditional generation tasks: categorical generation, image-to-image translation, and text-to-image synthesis.

Extensive experiments show that the proposed method helps existing models from different tasks achieve better diversity without sacrificing the visual quality of the generated images.
Our code and pretrained models are available at https://github.com/HelenMao/MSGAN/.
2 Related Work
Conditional generative adversarial networks.
Generative adversarial networks [1, 8, 19, 22] have been widely used for image synthesis. With adversarial training, generators are encouraged to capture the distribution of real images. On the basis of GANs, conditional GANs synthesize images based on various contexts. For instance, cGANs can generate high-resolution images conditioned on low-resolution images [14], translate images between different visual domains [10, 11, 15, 17, 33, 34], generate images with a desired style [16], and synthesize images according to sentences [23, 31]. Although cGANs have achieved success in various applications, existing approaches suffer from the mode collapse problem. Since the conditional contexts provide strong structural prior information for the output images and have higher dimensions than the input noise vectors, generators tend to ignore the input noise vectors, which are responsible for the variation of generated images. As a result, the generators are prone to produce images with similar appearances. In this work, we aim to address the mode collapse problem for cGANs.
Reducing mode collapse.
Some methods focus on the discriminator, using different optimization processes [20] and divergence metrics [1, 19] to stabilize the training process. The minibatch discrimination scheme [26] allows the discriminator to discriminate between whole mini-batches of samples instead of between individual samples. In [6], Durugkar et al. use multiple discriminators to address this issue. The other methods use auxiliary networks to alleviate the mode collapse issue. ModeGAN [2] and VEEGAN [27] enforce a bijective mapping between the input noise vectors and the generated images with additional encoder networks. Multiple generators [7] and weight-sharing generators [18] are developed to capture more modes of the distribution. However, these approaches either entail heavy computational overheads or require modifications of the network structure, and may not be easily applicable to cGANs.
In the field of cGANs, some efforts [10, 15, 34] have recently been made to address the mode collapse issue on the image-to-image translation task. Similar to ModeGAN and VEEGAN, additional encoders are introduced to provide a bijection constraint between the generated images and the input noise vectors. However, these approaches require other task-specific networks and objective functions. The additional components make the methods less generalizable and incur extra computational loads on training. In contrast, we propose a simple regularization term that imposes no training overheads and requires no modifications of the network structure. Therefore, the proposed method can be readily applied to various conditional generation tasks.
3 Diverse Conditional Image Synthesis
3.1 Preliminaries
The training process of GANs can be formulated as a minimax problem: a discriminator $D$ learns to be a classifier by assigning higher discriminative values to the real data samples and lower ones to the generated ones. Meanwhile, a generator $G$ aims to fool $D$ by synthesizing realistic examples. Through adversarial training, the gradients from $D$ will guide $G$ toward generating samples with a distribution similar to that of the real data.
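For reference, the minimax game described above is commonly written as the value function introduced in [8]. The form below is the unconditional case; the symbols $p_{data}$ (real data distribution) and $p_{\mathbf{z}}$ (noise prior) are notation assumed here for illustration and do not appear elsewhere in this text.

$$\min_{G} \max_{D} \; \mathbb{E}_{\mathbf{x} \sim p_{data}}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big]$$

The first term rewards the discriminator for scoring real samples highly, and the second rewards it for scoring generated samples low, while the generator tries to make the second term small.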
The mode collapse problem with GANs is well known in the literature. Several methods [2, 26, 27] attribute the missing mode to the lack of penalty when this issue occurs. Since all modes usually have similar discriminative values, larger modes are likely to be favored through the training process based on gradient descent. On the other hand, it is difficult to generate samples from minor modes.
The mode missing problem becomes worse in cGANs. Generally, conditional contexts are high-dimensional and structured (e.g., images and sentences) as opposed to the noise vectors. As such, the generators are likely to focus on the contexts and ignore the noise vectors, which account for diversity.
3.2 Mode Seeking GANs
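In this work, we propose to alleviate the missing mode problem from the generator perspective. Figure 2 illustrates the main ideas of our approach. Let a latent vector $\mathbf{z}$ from the latent code space $\mathcal{Z}$ be mapped to the image space $\mathcal{I}$. When mode collapse occurs, the mapped images are collapsed into a few modes. Furthermore, when two latent codes $\mathbf{z}_1$ and $\mathbf{z}_2$ are closer, the mapped images $\mathbf{I}_1 = G(\mathbf{c}, \mathbf{z}_1)$ and $\mathbf{I}_2 = G(\mathbf{c}, \mathbf{z}_2)$ are more likely to be collapsed into the same mode. To address this issue, we propose a mode seeking regularization term to directly maximize the ratio of the distance between $G(\mathbf{c}, \mathbf{z}_1)$ and $G(\mathbf{c}, \mathbf{z}_2)$ with respect to the distance between $\mathbf{z}_1$ and $\mathbf{z}_2$,
$$\mathcal{L}_{ms} = \max_{G} \left( \frac{d_{\mathbf{I}}\big(G(\mathbf{c}, \mathbf{z}_1), G(\mathbf{c}, \mathbf{z}_2)\big)}{d_{\mathbf{z}}(\mathbf{z}_1, \mathbf{z}_2)} \right), \tag{1}$$
where $d_{*}(\cdot)$ denotes the distance metric and $\mathbf{c}$ the conditional context.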
The regularization term offers a virtuous circle for training cGANs. It encourages the generator to explore the image space and enhances the chances of generating samples of minor modes. On the other hand, the discriminator is forced to pay attention to generated samples from minor modes. Figure 2 shows a mode collapse situation where two close latent codes, $\mathbf{z}_1$ and $\mathbf{z}_2$, are mapped onto the same mode. However, with the proposed regularization term, one of the two codes is mapped to an image that belongs to a previously unexplored mode. With the adversarial mechanism, the generator will thus have better chances to generate samples of that mode in the following training steps.
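As shown in Figure 3, the proposed regularization term can be easily integrated with existing cGANs by appending it to the original objective function,
$$\mathcal{L}_{new} = \mathcal{L}_{ori} + \lambda_{ms} \mathcal{L}_{ms}, \tag{2}$$
where $\mathcal{L}_{ori}$ denotes the original objective function and $\lambda_{ms}$ the weight that controls the importance of the regularization. Here, $\mathcal{L}_{ori}$ can be a simple loss function. For example, in the categorical generation task,
$$\mathcal{L}_{ori} = \mathbb{E}_{\mathbf{c}, \mathbf{y}}\big[\log D(\mathbf{c}, \mathbf{y})\big] + \mathbb{E}_{\mathbf{c}, \mathbf{z}}\big[\log\big(1 - D(\mathbf{c}, G(\mathbf{c}, \mathbf{z}))\big)\big], \tag{3}$$
where $\mathbf{c}$, $\mathbf{y}$, and $\mathbf{z}$ denote class labels, real images, and noise vectors, respectively. In the image-to-image translation task [11],
$$\mathcal{L}_{ori} = \mathcal{L}_{GAN} + \mathcal{L}_{1}\big(G(\mathbf{c}, \mathbf{z}), \mathbf{y}\big), \tag{4}$$
where $\mathbf{c}$ denotes input images and $\mathcal{L}_{GAN}$ is the typical GAN loss. $\mathcal{L}_{ori}$ can also be an arbitrarily complex objective function from any task, as shown in Figure 3 (b). We name the proposed method Mode Seeking GANs (MSGANs).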
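To make the regularization term concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes a mean absolute (L1-style) distance for both images and latent codes, and it turns the maximization of the distance ratio into a quantity to minimize by taking the reciprocal of the ratio; the function names, the `eps` stabilizer, and the default `lambda_ms=1.0` are hypothetical choices made for illustration only.

```python
import numpy as np

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Loss that shrinks as the image-distance / latent-distance ratio grows.

    Assumes z1 != z2; eps (a hypothetical stabilizer) avoids division by zero
    when the two generated images are identical.
    """
    d_img = np.mean(np.abs(img1 - img2))  # image-space distance (assumed L1-style)
    d_z = np.mean(np.abs(z1 - z2))        # latent-space distance (assumed L1-style)
    ratio = d_img / d_z
    return 1.0 / (ratio + eps)            # minimizing this maximizes the ratio

def total_generator_loss(loss_ori, loss_ms, lambda_ms=1.0):
    """Original objective plus the weighted regularization term."""
    return loss_ori + lambda_ms * loss_ms

# Two latent codes and the images a (hypothetical) generator produced for them.
z1, z2 = np.zeros(2), np.full(2, 2.0)
diverse = mode_seeking_loss(np.zeros((2, 2)), np.ones((2, 2)), z1, z2)
collapsed = mode_seeking_loss(np.zeros((2, 2)), np.zeros((2, 2)), z1, z2)
```

In this sketch a generator that ignores the latent code (identical images for different codes) receives a very large penalty, while one whose outputs differ receives a small one, which is the behavior the regularization term is meant to encourage.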
4 Experiments
We evaluate the proposed regularization method through extensive quantitative and qualitative evaluation. We apply MSGAN to the baseline models from three representative conditional image synthesis tasks: categorical generation, image-to-image translation, and text-to-image synthesis. Note that we augment the original objective functions with the proposed regularization term while maintaining the original network architectures and hyperparameters. We employ the $L_1$ norm distance as the distance metric for both $d_{\mathbf{I}}$ and $d_{\mathbf{z}}$ and keep the weight $\lambda_{ms}$ of the regularization term fixed in all experiments. For more implementation and evaluation details, please refer to the appendices.
Table 1: NDB and JSD of DCGAN and MSGAN on each class of the CIFAR-10 dataset (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
4.1 Evaluation Metrics
We conduct evaluations using the following metrics.
FID. To evaluate the quality of the generated images, we use FID [9] to measure the distance between the generated distribution and the real one through features extracted by the Inception Network [28]. Lower FID values indicate better quality of the generated images.
LPIPS. To evaluate diversity, we employ LPIPS [32] following [10, 15, 34]. LPIPS measures the average feature distance between generated samples. A higher LPIPS score indicates better diversity among the generated images.
NDB and JSD. To measure the similarity between the distribution of real images and that of generated ones, we adopt two bin-based metrics, NDB and JSD, proposed in [24]. These metrics evaluate the extent of mode missing in generative models. Following [24], the training samples are first clustered using K-means into different bins, which can be viewed as modes of the real data distribution. Then each generated sample is assigned to the bin of its nearest neighbor. We calculate the bin proportions of the training samples and the synthesized samples to evaluate the difference between the generated distribution and the real data distribution. The NDB score and the JSD of the bin proportions are then computed to measure the mode collapse. A lower NDB score and JSD mean that the generated data distribution approaches the real data distribution better by fitting more modes. Please refer to [24] for more details.
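The bin-based procedure above can be sketched as follows. This is an illustrative NumPy sketch rather than the implementation of [24]: the cluster centers are assumed to be given (for example, from a K-means run on training samples), the per-bin statistical test is assumed to be a two-proportion z-test, and the threshold `z_threshold=1.96` and all function names are hypothetical.

```python
import numpy as np

def bin_proportions(samples, centers):
    """Assign each sample to its nearest center and return the fraction per bin."""
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(centers))
    return counts / counts.sum()

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two bin-proportion vectors."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ndb(p, q, n_p, n_q, z_threshold=1.96):
    """Count bins whose proportions differ under an assumed two-proportion z-test."""
    pooled = (p * n_p + q * n_q) / (n_p + n_q)
    se = np.sqrt(pooled * (1.0 - pooled) * (1.0 / n_p + 1.0 / n_q))
    z = np.abs(p - q) / np.maximum(se, 1e-12)
    return int(np.sum(z > z_threshold))

# Toy data: two modes in the "real" data, but all generated samples fall in one.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
real = np.array([[0.0, 1.0], [1.0, 0.0], [10.0, 9.0], [9.0, 10.0]])
fake = np.array([[0.0, 0.5], [0.5, 0.0], [1.0, 1.0], [0.0, 0.0]])
p_real = bin_proportions(real, centers)
p_fake = bin_proportions(fake, centers)
```

In the toy data, the generated samples cover only one of the two bins, so both the JSD and the NDB count are larger than they would be for a generator whose bin proportions match those of the real data.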
Table 2: FID of DCGAN and MSGAN on the CIFAR-10 dataset.
4.2 Conditioned on Class Label
We first validate the proposed method on categorical generation. In categorical generation, networks take class labels as conditional contexts to synthesize images of different categories. We apply the regularization term to the baseline framework DCGAN [22].
We conduct experiments on the CIFAR-10 [13] dataset, which includes images of ten categories. Since images in the CIFAR-10 dataset are of size $32 \times 32$ and upsampling degrades the image quality, we do not compute LPIPS in this task. Table 1 and Table 2 present the results of NDB, JSD, and FID. MSGAN mitigates the mode collapse issue in most classes while maintaining image quality.
4.3 Conditioned on Image
Image-to-image translation aims to learn the mapping between two visual domains. Conditioned on images from the source domain, models attempt to synthesize corresponding images in the target domain. Despite the multimodal nature of the image-to-image translation task, early work [11, 33] abandons noise vectors and performs one-to-one mapping since the latent codes are easily ignored during training, as shown in [11, 34]. To achieve multimodality, several recent attempts [10, 15, 34] introduce additional encoder networks and objective functions to impose a bijection constraint between the latent code space and the image space. To demonstrate the generalizability, we apply the proposed method to a unimodal model, Pix2Pix [11], using paired training data and a multimodal model, DRIT [15], using unpaired images.
Table 4: FID, NDB, JSD, and LPIPS of DRIT [15] and MSGAN on the Yosemite (Summer2Winter, Winter2Summer) and cat-dog (Cat2Dog, Dog2Cat) datasets.
Table 5: FID, NDB, JSD, and LPIPS of StackGAN++ [31] and MSGAN on the CUB-200-2011 dataset, conditioned on text descriptions and conditioned on text codes.
4.3.1 Conditioned on Paired Images
We take Pix2Pix as the baseline model. We also compare MSGAN to BicycleGAN [34] which generates diverse images with paired training images. For fair comparisons, architectures of the generator and the discriminator in all methods follow the ones in BicycleGAN [34].
We conduct experiments on the facades and maps datasets. MSGAN obtains consistent improvements on all metrics over Pix2Pix. Moreover, MSGAN demonstrates comparable diversity to BicycleGAN, which applies an additional encoder network. Figure 4 and Table 3 present the qualitative and quantitative results, respectively.
4.3.2 Conditioned on Unpaired Images
We choose DRIT [15], one of the state-of-the-art frameworks to generate diverse images with unpaired training data, as the baseline framework. Though DRIT synthesizes diverse images in most cases, mode collapse occurs in some challenging shape-variation cases (e.g., translation between cats and dogs). To demonstrate the robustness of the proposed method, we evaluate on the shape-preserving Yosemite (summer-winter) [33] dataset and the cat-dog [15] dataset that requires shape variations.
As the quantitative results in Table 4 show, MSGAN performs favorably against DRIT in all metrics on both datasets. On the challenging cat-dog dataset in particular, MSGAN obtains substantial diversity gains. From the statistical point of view, we visualize the bin proportions of the dog-to-cat translation in Figure 6. The graph shows the severe mode collapse issue of DRIT and the substantial improvement with the proposed regularization term. Qualitatively, Figure 5 shows that MSGAN discovers more modes without loss of visual quality.
4.4 Conditioned on Text
Text-to-image synthesis aims at generating images conditioned on text descriptions. We integrate the proposed regularization term into StackGAN++ [31] using the CUB-200-2011 [29] dataset. To improve diversity, StackGAN++ introduces a Conditioning Augmentation (CA) module that re-parameterizes text descriptions into text codes of the Gaussian distribution. Instead of applying the regularization term on the semantically meaningful text codes, we focus on exploiting the latent codes randomly sampled from the prior distribution. However, for a fair comparison, we evaluate MSGAN against StackGAN++ in two settings: 1) Perform generation without fixing text codes for text descriptions. In this case, text codes also provide variations for output images. 2) Perform generation with fixed text codes. In this setting, the effects of text codes are excluded.
Table 5 presents quantitative comparisons between MSGAN and StackGAN++. MSGAN improves the diversity of StackGAN++ and maintains visual quality. To better illustrate the role that latent codes play in the diversity, we show qualitative comparisons with the text codes fixed. In this setting, we do not consider the diversity resulting from CA. Figure 7 illustrates that the latent codes of StackGAN++ have minor effects on the variations of the image. In contrast, the latent codes of MSGAN contribute to various appearances and poses of birds.
4.5 Interpolation of Latent Space in MSGAN
We perform linear interpolation between two given latent codes and generate the corresponding images to better understand how well MSGAN exploits the latent space. Figure 8 shows the interpolation results on the dog-to-cat translation and the text-to-image synthesis task. In the dog-to-cat translation, we can see that the coat colors and patterns vary smoothly along with the latent vectors. In the text-to-image synthesis, both the orientations of birds and the appearances of footholds change gradually with the variations of the latent codes.
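The interpolation procedure can be sketched in a few lines. This is a minimal NumPy illustration under the assumption that latent codes are plain vectors; the function name and the number of steps are hypothetical, and each interpolated code would then be passed to the generator together with the same conditional context.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps latent codes evenly spaced on the line from z_start to z_end."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * z_start + a * z_end for a in alphas])

z_start = np.zeros(4)
z_end = np.full(4, 2.0)
codes = interpolate_latents(z_start, z_end, num_steps=5)  # shape (5, 4)
```

The first and last rows of `codes` equal the two given latent codes, and the rows in between move linearly from one to the other.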
5 Conclusions
In this work, we present a simple but effective mode seeking regularization term on the generator to address mode collapse in cGANs. By maximizing the distance between generated images with respect to that between the corresponding latent codes, the regularization term forces the generators to explore more minor modes. The proposed regularization method can be readily integrated with existing cGAN frameworks without imposing training overhead or modifications of network structures. We demonstrate the generalizability of the proposed method on three different conditional generation tasks: categorical generation, image-to-image translation, and text-to-image synthesis. Both qualitative and quantitative results show that the proposed regularization term helps the baseline frameworks improve diversity without sacrificing the visual quality of the generated images.
References
 [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
 [2] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. In ICLR, 2017.
 [3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
 [4] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
 [5] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. In ICLR, 2017.
 [6] I. Durugkar, I. Gemp, and S. Mahadevan. Generative multi-adversarial networks. In ICLR, 2017.
 [7] A. Ghosh, V. Kulharia, V. Namboodiri, P. H. Torr, and P. K. Dokania. Multi-agent diverse generative adversarial networks. In CVPR, 2018.
 [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 [9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
 [10] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
 [11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [13] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [14] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
 [15] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
 [16] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, 2016.
 [17] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
 [18] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
 [19] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
 [20] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.
 [21] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [22] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
 [23] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
 [24] E. Richardson and Y. Weiss. On GANs and GMMs. In NIPS, 2018.
 [25] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
 [26] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
 [27] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In NIPS, 2017.
 [28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [29] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
 [30] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
 [31] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. TPAMI, 2018.
 [32] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
 [33] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
 [34] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017.
Appendix A Implementation Details
Table 6 summarizes the datasets and baseline models used on various tasks. For all of the baseline methods, we augment the original objective functions with the proposed regularization term. Note that we retain the original network architecture design and use the default hyperparameter settings for training.
DCGAN.
Since the images in the CIFAR-10 [13] dataset are of size $32 \times 32$, we modify the structure of the generator and discriminator in DCGAN [22], as shown in Table 7. We train both the baseline and MSGAN networks with the same batch size, learning rate, and Adam [12] optimizer settings.
Pix2Pix.
We adopt the generator and discriminator in BicycleGAN [34] to build the Pix2Pix [11] model. As in BicycleGAN, we use a U-Net network [25] for the generator, and inject the latent codes into every layer of the generator. The architecture of the discriminator is a two-scale PatchGAN network [11]. For training, both the Pix2Pix and MSGAN frameworks use the same hyperparameters as the officially released version (https://github.com/junyanz/BicycleGAN/).
DRIT.
DRIT [15] involves two stages of image-to-image translations in the training process. We only apply the mode seeking regularization term to generators in the first stage. Our implementation is modified from the officially released code (https://github.com/HsinYingLee/DRIT).
StackGAN++.
StackGAN++ [31] is a tree-like structure with multiple generators and discriminators. We use the output images from the last generator and the input latent codes to calculate the mode seeking regularization term. The implementation is based on the officially released code (https://github.com/hanzhanggit/StackGAN-v2).
Appendix B Evaluation Details
We employ the official implementations of FID (https://github.com/bioinf-jku/TTUR), NDB and JSD (https://github.com/eitanrich/gans-n-gmms), and LPIPS (https://github.com/richzhang/PerceptualSimilarity). For NDB and JSD, we use the K-means method on training samples to obtain the clusters. Then the generated samples are assigned to the nearest cluster to compute the bin proportions. As suggested by the authors of [24], each cluster should contain a sufficient number of training samples. Therefore, we set the number of bins in all tasks according to the number of training samples used for computing the clusters. We have verified that the performance is consistent within a large range of bin numbers. For evaluation, we randomly generate images for a given conditional context on various tasks. We conduct five independent trials and report the mean and standard deviation based on the result of each trial. More evaluation details of one trial are presented as follows.
Conditioned on Class Label.
We randomly generate images for each class label. We use all the training samples and the generated samples to compute FID. For NDB and JSD, we employ the training samples in each class to calculate clusters.
Conditioned on Image.
We randomly generate images for each input image in the test set. For LPIPS, we randomly select pairs of the generated images of each context in the test set to compute LPIPS and average all the values for this trial. Then, we randomly choose input images and their corresponding generated images to form the generated samples. We use the generated samples and all samples in the training set to compute FID. For NDB and JSD, we employ all the training samples for clustering, and use a different number of bins for facades than for the other datasets.
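The pair-sampling step used for the diversity score can be sketched as follows. This is a minimal standard-library sketch, not the evaluation code: `distance` is a hypothetical stand-in for a perceptual metric such as LPIPS, and the function name, `num_pairs`, and `seed` are illustrative assumptions.

```python
import random
from statistics import mean

def average_pairwise_distance(images, distance, num_pairs, seed=0):
    """Average a distance over randomly drawn pairs of distinct generated images."""
    rng = random.Random(seed)
    values = []
    for _ in range(num_pairs):
        a, b = rng.sample(range(len(images)), 2)  # two distinct indices
        values.append(distance(images[a], images[b]))
    return mean(values)

# Toy usage with scalars standing in for images and |a - b| standing in for LPIPS.
toy_distance = lambda a, b: abs(a - b)
score_diverse = average_pairwise_distance([0.0, 1.0], toy_distance, num_pairs=10)
score_collapsed = average_pairwise_distance([0.5, 0.5, 0.5], toy_distance, num_pairs=10)
```

A per-context score computed this way would then be averaged over all contexts in the test set for one trial.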
Conditioned on Text.
We randomly select sentences and generate images for each sentence, which together form the generated samples. Then, we randomly select training samples for computing FID, and cluster them into bins for NDB and JSD. For LPIPS, we randomly choose pairs for each sentence and average the values of all the pairs for this trial.
Appendix C Ablation Study on the Regularization Term
C.1 The Weighting Parameter
To analyze the influence of the regularization term, we conduct an ablation study by varying the weighting parameter $\lambda_{ms}$ on the image-to-image translation task using the facades dataset. Figure 9 presents the qualitative and quantitative results. It can be observed that increasing $\lambda_{ms}$ improves both the quality and diversity of the generated images. Nevertheless, as the weighting parameter becomes larger than a threshold value, the training becomes unstable, which yields synthesized images of low quality and even low diversity. As a result, we empirically set the weighting parameter $\lambda_{ms}$ to a fixed value for all experiments.
C.2 The Design Choice of the Distance Metric
We have explored other design choices for the distance metric. We conduct experiments using the discriminator feature distance in our regularization term in a way similar to the feature matching loss [30],
$$d_{\mathbf{I}}\big(G(\mathbf{c}, \mathbf{z}_1), G(\mathbf{c}, \mathbf{z}_2)\big) = \frac{1}{L} \sum_{l=1}^{L} \big\| D^{l}(G(\mathbf{c}, \mathbf{z}_1)) - D^{l}(G(\mathbf{c}, \mathbf{z}_2)) \big\|_{1}, \tag{5}$$
where $D^{l}$ denotes the $l$-th layer of the discriminator. We apply it to Pix2Pix on the facades dataset. Table 8 shows that MSGAN using the feature distance also obtains improvement over Pix2Pix. However, MSGAN using the $L_1$ distance has higher diversity. Therefore, we employ MSGAN using the $L_1$ distance for all experiments.
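A discriminator feature distance of this kind can be sketched as below. This is an illustrative NumPy sketch under stated assumptions: the intermediate discriminator features of each image are assumed to be available as a list of arrays (one per layer), each layer contributes a mean absolute difference, and layers are weighted equally. The exact normalization and the function name are assumptions, not details confirmed by the text.

```python
import numpy as np

def feature_distance(feats1, feats2):
    """Average over layers of the mean absolute difference between layer features."""
    per_layer = [np.mean(np.abs(f1 - f2)) for f1, f2 in zip(feats1, feats2)]
    return float(np.mean(per_layer))

# Toy features from two hypothetical discriminator layers for two generated images.
feats_a = [np.zeros((2, 2)), np.zeros(3)]
feats_b = [np.ones((2, 2)), np.full(3, 3.0)]
d_feat = feature_distance(feats_a, feats_b)  # per-layer distances are 1.0 and 3.0
```

Such a function could replace the pixel-space image distance in the numerator of the regularization ratio while the latent distance in the denominator stays unchanged.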
Appendix D Computational Overheads
We compare MSGAN with Pix2Pix and BicycleGAN in terms of training time, memory consumption, and model parameters on an NVIDIA TITAN X GPU. Table 9 shows that our method incurs marginal overheads. In contrast, BicycleGAN requires a longer time per iteration and more memory because of an additional encoder and another discriminator network.
Appendix E Additional Results
We present more results of categorical generation, image-to-image translation, and text-to-image synthesis in Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, and Figure 16.
Table 6: Datasets and baseline models used for each task.
Context  Dataset  Baseline
Class Label  CIFAR-10 [13]  DCGAN [22]
Paired Images  Facades [3], Maps [11]  Pix2Pix [11]
Unpaired Images  Yosemite [33] (Summer, Winter), Cat-Dog [15] (Cat, Dog)  DRIT [15]
Text  CUB-200-2011 [29]  StackGAN++ [31]
Table 7: Structure of the generator and discriminator of the modified DCGAN.
Layer  Generator  Discriminator
1  Dconv(N512-K4-S1-P0), BN, ReLU  Conv(N128-K4-S2-P1), LeakyReLU
2  Dconv(N256-K4-S2-P1), BN, ReLU  Conv(N256-K4-S2-P1), BN, LeakyReLU
3  Dconv(N128-K4-S2-P1), BN, ReLU  Conv(N512-K4-S2-P1), BN, LeakyReLU
4  Dconv(N3-K4-S2-P1), Tanh  Conv(N1-K4-S1-P0), Sigmoid
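Table 8: FID, NDB, JSD, and LPIPS of Pix2Pix [11], MSGAN with the $L_1$ distance (MSGAN-$L_1$), and MSGAN with the feature distance (MSGAN-FD) on the facades dataset.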