Dual Discriminator Generative Adversarial Nets
Abstract
We propose a novel approach to tackle the problem of mode collapse encountered in generative adversarial networks (GANs). Our idea is intuitive but proves to be very effective, especially in addressing some key limitations of GAN. In essence, it combines the Kullback-Leibler (KL) and reverse KL divergences into a unified objective function, thus exploiting the complementary statistical properties of these divergences to effectively diversify the estimated density in capturing multiple modes. We term our method dual discriminator generative adversarial nets (D2GAN), which, unlike GAN, has two discriminators. Together with a generator, it is also analogous to a minimax game, wherein one discriminator rewards high scores for samples from the data distribution whilst the other, conversely, favors data from the generator, and the generator produces data to fool both discriminators. We develop a theoretical analysis to show that, given the optimal discriminators, optimizing the generator of D2GAN reduces to minimizing both the KL and reverse KL divergences between the data distribution and the distribution induced by the generator, hence effectively avoiding the mode collapse problem. We conduct extensive experiments on synthetic and real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet), where we have made our best effort to compare D2GAN with the latest state-of-the-art GAN variants in comprehensive qualitative and quantitative evaluations. The experimental results demonstrate the competitive and superior performance of our approach in generating good-quality and diverse samples over baselines, and the capability of our method to scale up to the ImageNet database.
Tu Dinh Nguyen, Trung Le, Hung Vu, Dinh Phung
Centre for Pattern Recognition and Data Analytics, Deakin University, Australia
{tu.nguyen,trung.l,hungv,dinh.phung}@deakin.edu.au
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
1 Introduction
Generative models are a subarea of research that has been growing rapidly in recent years and has been successfully applied in a wide range of modern real-world applications (e.g., see chapter 20 in [9]). Their common approach is to address the density estimation problem, where one aims to learn a model distribution $P_{\text{model}}$ that approximates the true, but unknown, data distribution $P_{\text{data}}$. Methods in this approach deal with two fundamental problems. First, the learning behavior and performance of generative models depend on the choice of objective function used to train them [28, 13]. The most widely used objective, considered the de facto standard, follows the principle of maximum likelihood estimation, which seeks model parameters that maximize the likelihood of the training data. This is equivalent to minimizing the Kullback-Leibler (KL) divergence between the data and model distributions: $D_{KL}(P_{\text{data}} \| P_{\text{model}})$. It has been observed that this minimization tends to result in a $P_{\text{model}}$ that covers multiple modes of $P_{\text{data}}$, but may produce completely unseen and potentially undesirable samples [28]. By contrast, another approach is to swap the arguments and instead minimize $D_{KL}(P_{\text{model}} \| P_{\text{data}})$, which is usually referred to as the reverse KL divergence [20, 11, 13, 28]. It is observed that optimization towards the reverse KL divergence criterion mimics a mode-seeking process in which $P_{\text{model}}$ concentrates on a single mode of $P_{\text{data}}$ while ignoring other modes, known as the problem of mode collapse. These behaviors are well studied in [28, 13, 11].
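The mode-covering and mode-seeking behaviors described above can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it discretizes a bimodal data distribution on a 1D grid and compares two hypothetical unimodal model candidates (`q_broad` and `q_narrow`, names and parameters chosen here purely for illustration) under both divergences.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q); q must be positive wherever p is."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gaussian(x, mean, std):
    """Unnormalized Gaussian bump evaluated on the grid x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2)

def normalize(weights):
    return weights / weights.sum()

x = np.linspace(-8.0, 8.0, 1601)
# Bimodal "data" distribution with modes at -3 and +3 (illustrative choice).
p_data = normalize(gaussian(x, -3.0, 0.5) + gaussian(x, 3.0, 0.5))
# Two unimodal model candidates.
q_broad = normalize(gaussian(x, 0.0, 3.0))   # spreads over both modes and the gap
q_narrow = normalize(gaussian(x, 3.0, 0.5))  # sits on a single mode

forward = {"broad": kl(p_data, q_broad), "narrow": kl(p_data, q_narrow)}
reverse = {"broad": kl(q_broad, p_data), "narrow": kl(q_narrow, p_data)}
print("KL(data || model):", forward)
print("KL(model || data):", reverse)
```

In this toy setting the forward KL prefers the broad candidate, which also places mass in the low-density gap, whereas the reverse KL prefers the narrow candidate that ignores one mode.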
The second problem is the choice of formulation for the density function of $P_{\text{model}}$ [9]. One might choose to define an explicit density function and then straightforwardly follow the maximum likelihood framework to estimate its parameters. Another approach is to estimate the data distribution using an implicit density function, without the need for an analytical form of $P_{\text{model}}$ (e.g., see [11] for further discussion). One idea is to borrow the principle of the minimal enclosing ball [26] to train a generator in such a way that both training and generated data, after being mapped to the feature space, are enclosed in the same sphere [27]. However, the most notable pioneer of this class is the generative adversarial network (GAN) [10], an expressive generative model that is capable of producing sharp and realistic images of natural scenes. Unlike most generative models, which maximize the data likelihood or its lower bound, GAN takes a radical approach that simulates a game between two players: a generator $G$ that generates data by mapping samples from a noise space to the input space, and a discriminator $D$ that acts as a classifier to distinguish real samples of a dataset from fake samples produced by the generator $G$. Both $G$ and $D$ are parameterized by neural networks, thus this method can be categorized into the family of deep generative models or generative neural models [9].
The optimization of GAN formulates a minimax problem wherein, given an optimal $D$, the learning objective turns into finding the $G$ that minimizes the Jensen-Shannon divergence (JSD): $D_{JS}(P_{\text{data}} \| P_{\text{model}})$. The behavior of JSD minimization has been empirically observed to be more similar to that of reverse KL than to that of KL divergence [28, 13]. This, however, leads to the aforementioned issue of mode collapse, which is indeed a notorious failure of GAN [11], where the generator only produces similar-looking images, yielding a low-entropy distribution with poor variety of samples.
Recent attempts have been made to solve the mode collapse problem by improving the training of GAN. One idea is to use the minibatch discrimination trick [24] to allow the discriminator to detect samples that are unusually similar to other generated samples. Although this heuristic helps to generate visually appealing samples very quickly, it is computationally expensive and is thus normally used only in the last hidden layer of the discriminator. Another approach is to unroll the optimization of the discriminator by several steps to create a surrogate objective for the update of the generator during training [18]. A third approach is to train many generators that discover different modes of the data [31]. Alternatively, around the same time, there have been various attempts to employ autoencoders as regularizers or auxiliary losses to penalize missing modes [5, 30, 4, 29]. These models can avoid the mode collapse problem to a certain extent, but at the cost of computational complexity (with the exception of DFM in [30]), which prevents them from scaling up to ImageNet, a large-scale and challenging visual dataset.
Addressing these challenges, we propose a novel approach that both effectively avoids mode collapse and efficiently scales up to very large datasets (e.g., ImageNet). Our approach combines the KL and reverse KL divergences into a unified objective function, thus exploiting the complementary statistical properties of these divergences to effectively diversify the estimated density in capturing multiple modes. We materialize our idea using GAN's framework, resulting in a novel generative adversarial architecture containing three players: a discriminator $D_1$ that rewards high scores for data sampled from $P_{\text{data}}$ rather than generated from the generator distribution $P_G$; another discriminator $D_2$ that, conversely, favors data from $P_G$ rather than $P_{\text{data}}$; and a generator $G$ that generates data to fool both discriminators. We term our proposed model dual discriminator generative adversarial network (D2GAN).
It turns out that training D2GAN shares the same minimax problem as in GAN, which can be solved by alternately updating the generator and discriminators. We provide a theoretical analysis showing that, given $G$, $D_1$ and $D_2$ with enough capacity, i.e., in the nonparametric limit, at the optimal points, the training criterion indeed results in the minimal distance between the data and model distributions with respect to both their KL and reverse KL divergences. This helps the model place a fair distribution of probability mass across the modes of the data-generating distribution, thus allowing one to recover the data distribution and generate diverse samples using the generator in a single shot. In addition, we further introduce hyperparameters to stabilize the learning and control the effect of each divergence.
We conduct extensive experiments on one synthetic dataset and four real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet) of very different natures. Since evaluating generative models is notoriously hard [28], we have made our best effort to adopt a number of evaluation metrics from the literature to quantitatively compare our proposed model with the latest state-of-the-art baselines whenever possible. The experimental results reveal that our method is capable of improving the diversity while keeping good quality of generated samples. More importantly, our proposed model can be scaled up to train on the large-scale ImageNet database, obtain a competitive variety score and generate reasonably good-quality images.
In short, our main contributions are: (i) a novel generative adversarial model that encourages the diversity of samples produced by the generator; (ii) a theoretical analysis proving that our objective is optimized towards minimizing both the KL and reverse KL divergences and has a global optimum where $P_G = P_{\text{data}}$; and (iii) a comprehensive evaluation of the effectiveness of our proposed method using a wide range of quantitative criteria on large-scale datasets.
2 Generative Adversarial Nets
We first review the generative adversarial network (GAN) that was introduced in [10] to formulate a game of two players: a discriminator $D$ and a generator $G$. The discriminator, $D(x)$, takes a point $x$ in data space and computes the probability that $x$ is sampled from the data distribution $P_{\text{data}}$ rather than generated by the generator $G$. At the same time, the generator first maps a noise vector $z$ drawn from a prior $P_z$ to the data space, obtaining a sample $G(z)$ that resembles the training data, and then uses this sample to challenge the discriminator. The mapping $G(z)$ induces a generator distribution $P_G$ in the data domain with probability density function $p_G(x)$. Both $G$ and $D$ are parameterized by neural networks (see Fig. 1(a) for an illustration) and learned by solving the following minimax optimization:

$$\min_G \max_D \mathcal{J}(G, D) = \mathbb{E}_{x \sim P_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim P_z}\left[\log\left(1 - D(G(z))\right)\right]$$
The learning follows an iterative procedure wherein the discriminator and generator are alternately updated. Given a fixed $G$, the maximization with respect to $D$ results in the optimal discriminator $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$, whilst given this optimal $D^*$, the minimization of $\mathcal{J}(G, D^*)$ turns into minimizing the Jensen-Shannon (JS) divergence between the data and model distributions: $D_{JS}(P_{\text{data}} \| P_G)$ [10]. At the Nash equilibrium of the game, the model distribution recovers the data distribution exactly: $P_G = P_{\text{data}}$, thus the discriminator now fails to differentiate real from fake data, as $D^*(x) = 0.5$ for all $x$.
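As a sanity check of the relationship between the optimal discriminator and the JS divergence, the following sketch evaluates the GAN value function on a small finite support. It relies on the standard identity from [10] that, at the optimal discriminator $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$, the value equals $-\log 4 + 2\,D_{JS}(P_{\text{data}} \| P_G)$; the two three-point distributions are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence between two strictly positive distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(p_data, p_g, d):
    """E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 - D(x))] on a finite support."""
    return float(np.sum(p_data * np.log(d)) + np.sum(p_g * np.log(1.0 - d)))

p_data = np.array([0.5, 0.3, 0.2])  # illustrative data distribution
p_g = np.array([0.2, 0.2, 0.6])     # illustrative generator distribution

d_star = p_data / (p_data + p_g)
value_at_d_star = gan_value(p_data, p_g, d_star)
closed_form = -np.log(4.0) + 2.0 * jsd(p_data, p_g)
print(value_at_d_star, closed_form)
```

Perturbing `d_star` in any direction lowers the value, which is consistent with it being the maximizer for a fixed generator.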
3 Dual Discriminator Generative Adversarial Nets
To tackle GAN's problem of mode collapse, in what follows we present our main contribution: a framework that seeks an approximated distribution to effectively cover many modes of the multimodal data. Our intuition is based on GAN, but we formulate a three-player game that consists of two different discriminators $D_1$ and $D_2$, and one generator $G$. Given a sample $x$ in data space, $D_1(x)$ rewards a high score if $x$ is drawn from the data distribution $P_{\text{data}}$, and gives a low score if it is generated from the model distribution $P_G$. In contrast, $D_2(x)$ returns a high score for $x$ generated from $P_G$ whilst giving a low score for a sample drawn from $P_{\text{data}}$. Unlike GAN, the scores returned by our discriminators are values in $\mathbb{R}^+$ rather than probabilities in $[0, 1]$. Our generator $G$ performs a similar role to that of GAN, i.e., producing data mapped from a noise space to synthesize the real data and then fool both discriminators $D_1$ and $D_2$. All three players are parameterized by neural networks, wherein $D_1$ and $D_2$ do not share their parameters. We term our proposed model dual discriminator generative adversarial network (D2GAN). Fig. 1(b) shows an illustration of D2GAN.
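More formally, $D_1$, $D_2$ and $G$ now play the following three-player minimax optimization game:

$$\min_G \max_{D_1, D_2} \mathcal{J}(G, D_1, D_2) = \alpha\, \mathbb{E}_{x \sim P_{\text{data}}}\left[\log D_1(x)\right] + \mathbb{E}_{z \sim P_z}\left[-D_1(G(z))\right] + \mathbb{E}_{x \sim P_{\text{data}}}\left[-D_2(x)\right] + \beta\, \mathbb{E}_{z \sim P_z}\left[\log D_2(G(z))\right] \tag{1}$$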
wherein we have introduced hyperparameters $0 < \alpha, \beta \le 1$ to serve two purposes. The first is to stabilize the learning of our model. As the output values of the two discriminators are positive and unbounded, $D_1(G(z))$ and $D_2(x)$ in Eq. (1) can become very large and have an exponentially stronger impact on the optimization than $\log D_1(x)$ and $\log D_2(G(z))$ do, rendering the learning unstable. To overcome this issue, we can decrease $\alpha$ and $\beta$, in effect making the optimization penalize $D_1(G(z))$ and $D_2(x)$, thus helping to stabilize the learning. The second purpose of introducing $\alpha$ and $\beta$ is to control the effect of the KL and reverse KL divergences on the optimization problem. This will be discussed in the following part once we have derived our optimal solution.
Similar to GAN [10], our proposed network can be trained by alternately updating $D_1$, $D_2$ and $G$. We refer to the supplementary material for the pseudocode for learning the parameters of D2GAN.
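To make the roles of the three players concrete, the sketch below estimates the objective of Eq. (1) from minibatch discriminator scores and splits it into the losses each player would minimize. It is an illustrative NumPy sketch rather than the authors' TensorFlow implementation: the function names are hypothetical, the inputs are assumed to be positive scores (e.g., softplus outputs), and gradient computation and the actual alternating update loop are left to whichever framework is used.

```python
import numpy as np

def d2gan_objective(d1_real, d1_fake, d2_real, d2_fake, alpha, beta):
    """Monte Carlo estimate of J(G, D1, D2) in Eq. (1).

    d1_real = D1(x), d2_real = D2(x) for real x;
    d1_fake = D1(G(z)), d2_fake = D2(G(z)) for generated samples.
    All scores are assumed to be strictly positive.
    """
    return float(alpha * np.mean(np.log(d1_real)) - np.mean(d1_fake)
                 - np.mean(d2_real) + beta * np.mean(np.log(d2_fake)))

def discriminator_loss(d1_real, d1_fake, d2_real, d2_fake, alpha, beta):
    """Both discriminators maximize J, so they minimize -J."""
    return -d2gan_objective(d1_real, d1_fake, d2_real, d2_fake, alpha, beta)

def generator_loss(d1_fake, d2_fake, beta):
    """The generator minimizes the terms of J that depend on G(z)."""
    return float(-np.mean(d1_fake) + beta * np.mean(np.log(d2_fake)))

# Toy scores for a minibatch of two real and two generated samples.
d1_real, d1_fake = np.array([2.0, 2.0]), np.array([0.5, 0.5])
d2_real, d2_fake = np.array([0.5, 0.5]), np.array([2.0, 2.0])
j_value = d2gan_objective(d1_real, d1_fake, d2_real, d2_fake, alpha=0.1, beta=0.1)
g_loss = generator_loss(d1_fake, d2_fake, beta=0.1)
d_loss = discriminator_loss(d1_real, d1_fake, d2_real, d2_fake, alpha=0.1, beta=0.1)
print(j_value, g_loss, d_loss)
```

In an alternating scheme, one step would lower `d_loss` with respect to the parameters of both discriminators, and the next would lower `g_loss` with respect to the generator parameters.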
3.1 Theoretical analysis
We now provide a formal theoretical analysis of our proposed model, which essentially shows that, given that $G$, $D_1$ and $D_2$ have enough capacity, i.e., in the nonparametric limit, at the optimal points, $G$ can recover the data distribution by minimizing both the KL and reverse KL divergences between the model and data distributions. We first consider the optimization problem with respect to (w.r.t.) the discriminators given a fixed generator.
Proposition 1.
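Given a fixed $G$, maximizing $\mathcal{J}(G, D_1, D_2)$ yields the following closed-form optimal discriminators $D_1^*, D_2^*$:

$$D_1^*(x) = \frac{\alpha\, p_{\text{data}}(x)}{p_G(x)} \quad \text{and} \quad D_2^*(x) = \frac{\beta\, p_G(x)}{p_{\text{data}}(x)}$$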
Proof.
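According to the induced measure theorem [12], two expectations are equal: $\mathbb{E}_{z \sim P_z}\left[f(G(z))\right] = \mathbb{E}_{x \sim P_G}\left[f(x)\right]$, where $f(x) = -D_1(x)$ or $f(x) = \log D_2(x)$. The objective function can be rewritten as below:

$$\mathcal{J}(G, D_1, D_2) = \alpha\, \mathbb{E}_{x \sim P_{\text{data}}}\left[\log D_1(x)\right] + \mathbb{E}_{x \sim P_G}\left[-D_1(x)\right] + \mathbb{E}_{x \sim P_{\text{data}}}\left[-D_2(x)\right] + \beta\, \mathbb{E}_{x \sim P_G}\left[\log D_2(x)\right]$$

$$= \int_x \left[\alpha\, p_{\text{data}}(x) \log D_1(x) - p_G(x) D_1(x) - p_{\text{data}}(x) D_2(x) + \beta\, p_G(x) \log D_2(x)\right] dx$$

Considering the function inside the integral, given $x$, we maximize this function w.r.t. the two variables $D_1, D_2$ to find $D_1^*(x)$ and $D_2^*(x)$. Setting the derivatives w.r.t. $D_1$ and $D_2$ to $0$, we gain:

$$\frac{\alpha\, p_{\text{data}}(x)}{D_1} - p_G(x) = 0 \quad \text{and} \quad \frac{\beta\, p_G(x)}{D_2} - p_{\text{data}}(x) = 0 \tag{2}$$

The second derivatives, $-\alpha\, p_{\text{data}}(x) / D_1^2$ and $-\beta\, p_G(x) / D_2^2$, are nonpositive, thus verifying that we have obtained the maximum solution and concluding the proof. ∎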
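The closed-form maximizers in Proposition 1 can be checked numerically at a single point $x$. The sketch below is an illustration under assumed values: the density values `p_data_x` and `p_g_x` and the hyperparameters are arbitrary, and a brute-force grid search over the pair of scores stands in for the calculus argument of the proof.

```python
import numpy as np

def integrand(d1, d2, p_data_x, p_g_x, alpha, beta):
    """Pointwise integrand of J for fixed densities at one point x."""
    return (alpha * p_data_x * np.log(d1) - p_g_x * d1
            - p_data_x * d2 + beta * p_g_x * np.log(d2))

alpha, beta = 0.2, 0.1
p_data_x, p_g_x = 0.6, 0.3  # arbitrary positive density values at one x

# Closed-form maximizers from Proposition 1.
d1_star = alpha * p_data_x / p_g_x
d2_star = beta * p_g_x / p_data_x

# Brute-force search over positive scores on a grid with step 0.01.
grid = np.linspace(0.01, 2.0, 200)
d1_grid, d2_grid = np.meshgrid(grid, grid, indexing="ij")
values = integrand(d1_grid, d2_grid, p_data_x, p_g_x, alpha, beta)
i, j = np.unravel_index(np.argmax(values), values.shape)
best_d1, best_d2 = grid[i], grid[j]
print((d1_star, d2_star), (best_d1, best_d2))
```

Because the integrand is separable and concave in each score, the grid maximizer coincides with the closed-form solution up to the grid resolution.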
Next, we fix $D_1 = D_1^*$, $D_2 = D_2^*$ and find the optimal solution $G^*$ for the generator $G$.
Theorem 2.
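Given $D_1^*, D_2^*$, at the Nash equilibrium point $(G^*, D_1^*, D_2^*)$ for the minimax optimization problem of D2GAN, we have the following form for each component:

$$\mathcal{J}(G^*, D_1^*, D_2^*) = \alpha(\log \alpha - 1) + \beta(\log \beta - 1), \qquad D_1^*(x) = \alpha \ \text{and} \ D_2^*(x) = \beta, \ \forall x, \ \text{at} \ p_{G^*} = p_{\text{data}}$$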
Proof.
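Substituting $D_1^*, D_2^*$ from Eq. (2) into the objective function in Eq. (1) of the minimax problem, we gain:

$$\mathcal{J}(G, D_1^*, D_2^*) = \alpha(\log \alpha - 1) + \beta(\log \beta - 1) + \alpha\, D_{KL}(P_{\text{data}} \| P_G) + \beta\, D_{KL}(P_G \| P_{\text{data}}) \tag{3}$$

where $D_{KL}(P_{\text{data}} \| P_G)$ and $D_{KL}(P_G \| P_{\text{data}})$ are the KL and reverse KL divergences between the data and model (generator) distributions, respectively. These divergences are always nonnegative and are zero only when the two distributions are equal: $p_{G^*} = p_{\text{data}}$. In other words, the generator induces a distribution $P_{G^*}$ that is identical to the data distribution $P_{\text{data}}$, and the two discriminators now fail to distinguish real from fake samples, since each returns the same score for both kinds of samples ($D_1^*(x) = \alpha$ and $D_2^*(x) = \beta$, which equals 1 when $\alpha = \beta = 1$). This concludes the proof. ∎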
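The substitution step in the proof can be spelled out term by term. The following worked sketch uses $p_{\text{data}}$ and $p_G$ for the data and generator densities, $\alpha$ and $\beta$ for the two hyperparameters, and the optimal discriminators of Proposition 1, $D_1^* = \alpha\, p_{\text{data}} / p_G$ and $D_2^* = \beta\, p_G / p_{\text{data}}$:

```latex
\begin{align*}
\alpha\, \mathbb{E}_{x \sim P_{\text{data}}}\left[\log D_1^*(x)\right]
  &= \alpha \int p_{\text{data}}(x) \left[\log \alpha + \log \frac{p_{\text{data}}(x)}{p_G(x)}\right] dx
   = \alpha \log \alpha + \alpha\, D_{KL}(P_{\text{data}} \| P_G), \\
\mathbb{E}_{x \sim P_G}\left[-D_1^*(x)\right]
  &= -\int p_G(x)\, \frac{\alpha\, p_{\text{data}}(x)}{p_G(x)}\, dx = -\alpha, \\
\mathbb{E}_{x \sim P_{\text{data}}}\left[-D_2^*(x)\right]
  &= -\int p_{\text{data}}(x)\, \frac{\beta\, p_G(x)}{p_{\text{data}}(x)}\, dx = -\beta, \\
\beta\, \mathbb{E}_{x \sim P_G}\left[\log D_2^*(x)\right]
  &= \beta \int p_G(x) \left[\log \beta + \log \frac{p_G(x)}{p_{\text{data}}(x)}\right] dx
   = \beta \log \beta + \beta\, D_{KL}(P_G \| P_{\text{data}}).
\end{align*}
```

Summing the four lines gives $\alpha(\log \alpha - 1) + \beta(\log \beta - 1) + \alpha\, D_{KL}(P_{\text{data}} \| P_G) + \beta\, D_{KL}(P_G \| P_{\text{data}})$, so the only terms that depend on the generator are the two weighted divergences.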
The loss of the generator in Eq. (3) shows that increasing $\alpha$ promotes the optimization towards minimizing the KL divergence $D_{KL}(P_{\text{data}} \| P_G)$, thus helping the generative distribution cover multiple modes, although it may include potentially undesirable samples; whereas increasing $\beta$ encourages the minimization of the reverse KL divergence $D_{KL}(P_G \| P_{\text{data}})$, hence enabling the generator to capture a single mode better, although it may miss many modes. By empirically adjusting these two hyperparameters, we can balance the effect of the two divergences, and hence effectively avoid the mode collapse issue.
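The trade-off between the two weights can be seen on a toy example. The sketch below is not from the paper: it uses a hypothetical four-bin data distribution with two modes and two hand-picked generator candidates, and scores each candidate with a weighted sum of the KL and reverse KL divergences, here weighted by `alpha` and `beta` respectively (the constant terms of the generator loss are dropped since they do not affect the comparison).

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def weighted_divergence(p_data, p_g, alpha, beta):
    """alpha * KL(P_data || P_G) + beta * KL(P_G || P_data)."""
    return alpha * kl(p_data, p_g) + beta * kl(p_g, p_data)

# Hypothetical data distribution with two modes (bins 0 and 3).
p_data = np.array([0.45, 0.05, 0.05, 0.45])
candidates = {
    "broad": np.array([0.25, 0.25, 0.25, 0.25]),   # covers everything evenly
    "narrow": np.array([0.85, 0.05, 0.05, 0.05]),  # concentrates on one mode
}

def preferred(alpha, beta):
    """Name of the candidate with the lower weighted divergence."""
    return min(candidates,
               key=lambda name: weighted_divergence(p_data, candidates[name], alpha, beta))

kl_heavy = preferred(alpha=1.0, beta=0.1)       # emphasizes KL
reverse_heavy = preferred(alpha=0.1, beta=1.0)  # emphasizes reverse KL
print(kl_heavy, reverse_heavy)
```

With the KL term dominant the broad candidate wins, while with the reverse KL term dominant the single-mode candidate wins, which mirrors the qualitative behavior described above.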
4 Experiments
In this section, we conduct comprehensive experiments to demonstrate our proposed model's capability of improving mode coverage and its scalability on large-scale datasets. We use a synthetic 2D dataset for both visual and numerical verification, and four datasets of increasing diversity and size for numerical verification. We have made our best effort to compare the results of our method with those of the latest state-of-the-art GAN variants by replicating the experimental settings of the original work whenever possible.
For each experiment, we refer to the supplementary material for model architectures and additional results. Common points are: (i) discriminators' outputs with softplus activations, $\text{softplus}(x) = \log(1 + \exp(x))$, i.e., a smooth, positive version of ReLU; (ii) Adam optimizer [14] with learning rate 0.0002 and first-order momentum 0.5; (iii) minibatch size of 64 samples for training both the generator and the discriminators; (iv) Leaky ReLU with a slope of 0.2; and (v) weights initialized from an isotropic Gaussian, $\mathcal{N}(0, 0.01)$, and zero biases. Our implementation is in TensorFlow [1] and will be released once published. We now present our experiments on synthetic data, followed by those on large-scale real-world datasets.
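Since the discriminator outputs must be positive, the softplus activation is worth spelling out. A direct evaluation of $\log(1 + \exp(x))$ overflows for large inputs, so the sketch below uses NumPy's `logaddexp`, which computes $\log(\exp(a) + \exp(b))$ stably; this is one possible implementation and not necessarily the one used in the authors' code.

```python
import numpy as np

def softplus(t):
    """softplus(t) = log(1 + exp(t)), evaluated without overflow."""
    return np.logaddexp(0.0, t)

logits = np.array([-5.0, 0.0, 5.0, 1000.0])
scores = softplus(logits)  # discriminator scores derived from raw logits
print(scores)
```

For large positive inputs softplus approaches the identity, and for large negative inputs it approaches zero from above, which is why it behaves like a smooth ReLU.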
4.1 Synthetic data
In the first experiment, we reuse the experimental design proposed in [18] to investigate how well our D2GAN can deal with multiple modes in the data. More specifically, we sample training data from a 2D mixture of 8 Gaussian distributions with covariance matrix $0.02\mathbf{I}$ and means arranged on a circle centered at the origin with radius $2.0$. Data in these low-variance mixture components are separated by an area of very low density. The aim is to examine properties such as low-probability regions and low separation of modes.
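A dataset of this kind can be generated in a few lines. The sketch below is an illustration under assumptions: the function name is hypothetical, the radius default of 2.0 and the uniform mixture weights are assumed (following the common setup of [18]), and the stated covariance of 0.02 is interpreted as a per-dimension variance.

```python
import numpy as np

def sample_ring_mixture(n, num_modes=8, radius=2.0, variance=0.02, seed=0):
    """Draw n points from a 2D mixture of Gaussians with means on a circle."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(num_modes) / num_modes
    means = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    component = rng.integers(0, num_modes, size=n)  # uniform mixture weights
    noise = rng.normal(0.0, np.sqrt(variance), size=(n, 2))
    return means[component] + noise, component, means

samples, labels, means = sample_ring_mixture(512)
print(samples.shape, np.bincount(labels, minlength=8))
```

Because the per-component standard deviation (about 0.14) is small relative to the spacing between neighboring means, the components are well separated by low-density regions.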
We use a simple architecture of a generator with two fully connected hidden layers and discriminators with one hidden layer of ReLU activations. This setting is identical to that of UnrolledGAN [18] and thus ensures a fair comparison (we obtained the code of UnrolledGAN for 2D data from the link the authors provide in [18]). Fig. 2(c) shows the evolution of 512 samples generated by our model and the baselines through time. It can be seen that the regular GAN generates data collapsing into a single mode hovering around the valid modes of the data distribution, thus reflecting the mode collapse in GAN. At the same time, UnrolledGAN and D2GAN distribute data around all 8 mixture components, hence demonstrating the ability to successfully learn multimodal data in this case. At the last steps, our D2GAN captures the data modes more precisely than UnrolledGAN since, in each mode, UnrolledGAN generates data that concentrate only on several points around the mode's centroid, and thus seems to produce fewer distinct samples than D2GAN, whose samples spread fairly over the entire mode.
Next, we further quantitatively compare the quality of the generated data. Since we know the true distribution $P_{\text{data}}$ in this case, we employ two measures, namely symmetric KL divergence and Wasserstein distance. These measures compute the distance between the normalized histograms of 10,000 points generated from our D2GAN, UnrolledGAN and GAN and that of the true $P_{\text{data}}$. Figs. 2(a) and 2(b) again clearly demonstrate the superiority of our approach over GAN and UnrolledGAN w.r.t. both distances (lower is better); notably, with the Wasserstein metric, the distance from ours to the true distribution reduces almost to zero. These figures also demonstrate the stability of our D2GAN (red curves) during training, as it fluctuates much less than GAN (green curves) and UnrolledGAN (blue curves).
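The histogram-based symmetric KL measure can be sketched as follows. The paper does not specify the binning or smoothing, so the bin count, the histogram extent, the additive smoothing constant `eps`, and the choice to sum (rather than average) the two KL terms are all assumptions made here for illustration.

```python
import numpy as np

def normalized_histogram(points, bins=20, extent=3.0, eps=1e-8):
    """2D histogram over [-extent, extent]^2, smoothed and normalized to sum to 1."""
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins,
                                range=[[-extent, extent], [-extent, extent]])
    hist = hist + eps  # avoid log(0) in empty bins (smoothing assumption)
    return hist / hist.sum()

def symmetric_kl(p, q):
    """KL(p || q) + KL(q || p) for two strictly positive histograms."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(10000, 2))  # stand-in for true samples
shifted = reference + 1.5                          # stand-in for a poor model

h_ref = normalized_histogram(reference)
same_distance = symmetric_kl(h_ref, normalized_histogram(reference))
shifted_distance = symmetric_kl(h_ref, normalized_histogram(shifted))
print(same_distance, shifted_distance)
```

The measure is zero when the two histograms coincide and grows as the generated points drift away from the reference distribution.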
4.2 Realworld datasets
We now examine the performance of our proposed method on real-world datasets of increasing diversity and size. For networks containing convolutional layers, we closely follow the DCGAN design [21]. We use strided convolutions for the discriminators and fractional-strided convolutions for the generator instead of pooling layers. Batch normalization is applied to each layer, except the generator output layer and the discriminator input layers. We also use Leaky ReLU activations for the discriminators and ReLU for the generator, except that its output activation is tanh, since we rescale the pixel intensities into the range $[-1, 1]$ before feeding images to our model. The only difference is that, for our model, initializing the weights from $\mathcal{N}(0, 0.01)$ yields slightly better results than from $\mathcal{N}(0, 0.02)$. We again refer to the supplementary material for detailed architectures.
4.2.1 Evaluation protocol
Evaluating the quality of images produced by generative models is notoriously challenging due to the variety of probability criteria and the lack of a perceptually meaningful image similarity metric [28]. Even when a model can generate plausible images, it is not useful if those images are visually similar. Therefore, in order to quantify the performance in covering data modes as well as producing high-quality samples, we use several different ad hoc metrics for different experiments to compare with other baselines.

First, we adopt the Inception score proposed in [24], which is computed by $\exp\left(\mathbb{E}_{x}\left[D_{KL}\left(p(y \mid x) \,\|\, p(y)\right)\right]\right)$, where $p(y \mid x)$ is the conditional label distribution for image $x$ estimated using a pretrained Inception model [25], and $p(y)$ is the marginal distribution: $p(y) \approx \frac{1}{N} \sum_{n=1}^{N} p(y \mid x_n = G(z_n))$. This metric rewards good and varied samples, but it is sometimes easily fooled by a model that collapses and generates a very low-quality image, and thus fails to measure whether a model has been trapped in one bad mode. To address this problem, for labeled datasets, we further adopt the so-called MODE score introduced in [5]:

$$\exp\left(\mathbb{E}_{x}\left[D_{KL}\left(p(y \mid x) \,\|\, \tilde{p}(y)\right)\right] - D_{KL}\left(p(y) \,\|\, \tilde{p}(y)\right)\right)$$

where $\tilde{p}(y)$ is the empirical distribution of labels estimated from the training data. The score can adequately reflect the variety and visual quality of images, as discussed in [5].
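Both scores can be computed directly from a matrix of predicted class probabilities. The sketch below is illustrative only: it assumes the MODE score takes the form $\exp\left(\mathbb{E}_x[D_{KL}(p(y \mid x) \| \tilde{p}(y))] - D_{KL}(p(y) \| \tilde{p}(y))\right)$, with $\tilde{p}(y)$ the training label distribution (the exact definition should be checked against [5]), and it takes the classifier outputs as a given array rather than running a pretrained network.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """D_KL(p || q) along the last axis; q broadcasts against p."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def inception_score(probs):
    """probs[n, k] = p(y = k | x_n) for N generated samples."""
    marginal = probs.mean(axis=0)
    return float(np.exp(np.mean(kl_rows(probs, marginal))))

def mode_score(probs, train_label_dist):
    """MODE score under the assumed form stated in the lead-in."""
    marginal = probs.mean(axis=0)
    return float(np.exp(np.mean(kl_rows(probs, train_label_dist))
                        - kl_rows(marginal, train_label_dist)))

uniform = np.full(10, 0.1)                        # balanced training labels
diverse = np.eye(10)                              # one confident sample per class
collapsed = np.tile(np.eye(10)[0], (10, 1))       # every sample predicted as class 0

print(inception_score(diverse), mode_score(diverse, uniform))
print(inception_score(collapsed), mode_score(collapsed, uniform))
```

On these toy inputs, confident and evenly spread predictions reach the maximum of 10 for ten classes, while a generator collapsed onto a single class scores 1 on both metrics.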
4.2.2 Handwritten digit images
We start with the handwritten digit images of MNIST [17], which consists of 60,000 training and 10,000 testing 28×28 grayscale images of digits from 0 to 9. Following the setting in [5], we first assume that MNIST has 10 modes, representing connected components in the data manifold, associated with the 10 digit classes. We then also perform an extensive grid search over different hyperparameter configurations, wherein our two regularization constants $\alpha$, $\beta$ in Eq. (1) are varied in {0.01, 0.05, 0.1, 0.2}. For a fair comparison, we use the same parameter ranges and fully connected layers for our network (cf. the supplementary material for more details), and adopt the results of GAN and mode regularized GAN (RegGAN) from [5].
For evaluation, we first train a simple yet effective 3-layer convolutional network (its architecture is similar to https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py) that obtains 0.65% error on the MNIST testing set, and then employ it to predict the label probabilities and compute MODE scores for generated samples. Fig. 3 (left) shows the distributions of MODE scores obtained by the three models. Clearly, our proposed D2GAN significantly outperforms the standard GAN and RegGAN by achieving scores mostly around the maximum [8.0-9.0]. It is worth noting that we did not observe substantial differences in the average MODE scores obtained by varying the network size through the parameter search. We here report the result of the minimal network with the smallest number of layers and hidden units.
To study the effect of $\alpha$ and $\beta$, we inspect the results obtained by this minimal network with varied $\alpha$, $\beta$ in Fig. 3 (right). There is a pattern that, given a fixed $\alpha$, our D2GAN obtains a better MODE score when increasing $\beta$ up to a certain value, after which the score could significantly decrease.
MNIST-1K.
The standard MNIST data with the 10-mode assumption seems to be fairly trivial. Hence, based on this data, we test our proposed model on a more challenging one. We continue following the technique used in [5, 18] to construct a new 1000-class MNIST dataset (MNIST-1K) by stacking three randomly selected digits to form an RGB image with a different digit image in each channel. The resulting data can be assumed to contain 1,000 distinct modes, corresponding to the combinations of digits in the 3 channels from 000 to 999.
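The stacking construction can be sketched as below. This is an illustration, not the authors' code: the function name is hypothetical, random arrays stand in for the MNIST images so the snippet runs without downloading data, and the mapping from channel order to the three-digit label (first channel as the hundreds digit) is an assumption.

```python
import numpy as np

def stack_digits(images, labels, n, seed=0):
    """Build n three-channel images by stacking randomly chosen digit images.

    images: array of shape (num_images, H, W); labels: digits in 0..9.
    Returns stacked images of shape (n, H, W, 3) and labels in 0..999.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(images), size=(n, 3))
    stacked = np.stack([images[idx[:, c]] for c in range(3)], axis=-1)
    stacked_labels = (100 * labels[idx[:, 0]] + 10 * labels[idx[:, 1]]
                      + labels[idx[:, 2]])
    return stacked, stacked_labels, idx

# Random stand-ins for MNIST images and labels.
rng = np.random.default_rng(1)
fake_images = rng.random((50, 28, 28))
fake_labels = np.arange(50) % 10

stacked, stacked_labels, idx = stack_digits(fake_images, fake_labels, n=16)
print(stacked.shape, stacked_labels[:5])
```

Each of the 1,000 possible labels corresponds to one combination of digits across the three channels, which is what makes the mode count easy to measure afterwards.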
In this experiment, we use a more powerful model with convolutional layers for the discriminators and transposed convolutions for the generator. We measure the performance by the number of modes for which the model generates at least one sample out of a total of 25,600 samples, and by the reverse KL divergence between the model distribution (i.e., the label distribution predicted by the pretrained MNIST classifier used in the previous experiment) and the expected data distribution. Tab. 1 reports the results of our D2GAN compared with those of GAN and UnrolledGAN taken from [18], and DCGAN and RegGAN from [5]. Our proposed method again clearly demonstrates its superiority over the baselines by covering all modes and achieving the best distance, which is close to zero.
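Both measurements follow directly from the predicted labels of the generated samples. The sketch below is a minimal illustration with a hypothetical function name; it assumes the expected data distribution over the 1,000 modes is uniform, which matches digits being chosen at random but is not stated explicitly in the text.

```python
import numpy as np

def mode_coverage_and_reverse_kl(pred_labels, num_modes=1000):
    """Number of modes hit at least once, and KL(model || data) over labels.

    The data label distribution is assumed uniform over num_modes.
    """
    counts = np.bincount(pred_labels, minlength=num_modes).astype(float)
    model = counts / counts.sum()
    data = np.full(num_modes, 1.0 / num_modes)
    covered = int(np.count_nonzero(counts))
    nonzero = model > 0  # terms with zero model mass contribute nothing
    reverse_kl = float(np.sum(model[nonzero] * np.log(model[nonzero] / data[nonzero])))
    return covered, reverse_kl

# Perfectly balanced predictions versus a fully collapsed generator.
balanced = np.arange(25000) % 1000
collapsed = np.zeros(100, dtype=int)
balanced_result = mode_coverage_and_reverse_kl(balanced)
collapsed_result = mode_coverage_and_reverse_kl(collapsed)
print(balanced_result, collapsed_result)
```

A generator that covers every mode evenly reaches 1,000 modes with a divergence of zero, while one that emits a single mode yields a divergence of $\log 1000$.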
4.2.3 Natural scene images
We now extend our experiments to investigate the scalability of our proposed method on much more challenging large-scale image databases of natural scenes. We use three widely adopted datasets: CIFAR-10 [15], STL-10 [6] and ImageNet [23]. CIFAR-10 is a well-studied dataset of 50,000 32×32 training images of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. STL-10, a subset of ImageNet, contains about 100,000 unlabeled 96×96 images, which is more diverse than CIFAR-10, but less so than the full ImageNet. We rescale all images down 3 times and train our networks at 32×32 resolution. ImageNet is a very large database of about 1.2 million natural images from 1,000 classes, normally used as the most challenging benchmark to validate the scalability of deep models. We follow the preprocessing in [16], except for subsampling to 32×32 resolution. We use the code provided in [24] to compute the Inception score for 10 independent partitions of 50,000 generated samples.
Tab. 2 and Fig. 4 show the Inception scores on the CIFAR-10, STL-10 and ImageNet datasets obtained by our model and by baselines collected from recent work in the literature. It is worth noting that we only compare with methods trained in a completely unsupervised manner without label information. As a result, there exist 8 baselines on CIFAR-10, whilst only DCGAN [21] and denoising feature matching (DFM) [30] are available on STL-10 and ImageNet. We use our own TensorFlow implementation of DCGAN with the same network architecture as our model for fair comparisons. In all 3 experiments, D2GAN fails to beat DFM, but outperforms the other baselines by large margins. The lower results compared with DFM suggest that using autoencoders for matching high-level features appears to be an effective way to encourage diversity. This technique is compatible with our method, thus integrating it could be a promising avenue for our future work.
Finally, we show several samples generated by our proposed model trained on these three datasets in Fig. 5. Samples are fair random draws, not cherry-picked. It can be seen that our D2GAN is able to produce visually recognizable images of cars, trucks, boats and horses on CIFAR-10. The objects become harder to recognize on STL-10, but the shapes of airplanes, cars, trucks and animals can still be identified, and images with various backgrounds such as sky, underwater, mountain and forest are shown on ImageNet. This confirms the diversity of samples generated by our model.



5 Conclusion
To summarize, we have introduced a novel approach to combine the Kullback-Leibler (KL) and reverse KL divergences in a unified objective function for the density estimation problem. Our idea is to exploit the complementary statistical properties of the two divergences to improve both the quality and diversity of samples generated from the estimator. To that end, we propose a novel framework based on generative adversarial nets (GANs), which formulates a minimax game of three players: two discriminators and one generator, thus termed dual discriminator GAN (D2GAN). Given the two discriminators fixed, the learning of the generator moves towards optimizing both the KL and reverse KL divergences simultaneously, and thus can help avoid mode collapse, a notorious drawback of GANs.
We have conducted extensive experiments to demonstrate the effectiveness and scalability of our proposed approach using synthetic and large-scale real-world datasets. Compared with the latest state-of-the-art baselines, our model is more scalable, can be trained on the large-scale ImageNet dataset, and obtains Inception scores lower than those of the combination of denoising autoencoder and GAN (DFM), but significantly higher than the others. Finally, we note that our method is orthogonal to, and could integrate, techniques used in those baselines such as semi-supervised learning [24], conditional architectures [19, 7, 22] and autoencoders [5, 30].
References
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 [3] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.
 [4] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
 [5] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
 [6] Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 215–223, 2011.
 [7] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems (NIPS), pages 1486–1494, 2015.
 [8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 [9] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27 (NIPS), pages 2672–2680. Curran Associates, Inc., 2014.
 [11] Ian J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. CoRR, 2017.
 [12] Somesh Das Gupta and Jun Shao. Mathematical statistics, 2000.
 [13] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.
 [14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [15] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 1(4), 2009.
 [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), volume 2, pages 1097–1105, Lake Tahoe, United States, December 3–6 2012.
 [17] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. 1998.
 [18] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
 [19] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [20] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 271–279. Curran Associates, Inc., 2016.
 [21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [22] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In Proceedings of The 33rd International Conference on Machine Learning (ICML), volume 3, 2016.
 [23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [24] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NIPS), pages 2226–2234, 2016.
 [25] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
 [26] Asa Ben-Hur, David Horn, Hava T. Siegelmann, and Vladimir Vapnik. Support Vector Clustering. In Journal of Machine Learning Research (JMLR), pages 125–137, 2001.
 [27] Trung Le, Hung Vu, Tu Dinh Nguyen, and Dinh Phung. Geometric Enclosing Networks. arXiv preprint arXiv:1708.04733, 2017.
 [28] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
 [29] Ruohan Wang, Antoine Cully, Hyung Jin Chang, and Yiannis Demiris. MAGAN: Margin adaptation for generative adversarial networks. arXiv preprint arXiv:1704.03817, 2017.
 [30] D Warde-Farley and Y Bengio. Improving generative adversarial networks with denoising feature matching. ICLR submissions, 8, 2017.
 [31] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. Multi-Generator Generative Adversarial Nets. arXiv preprint arXiv:1708.02556, 2017.