Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step
Abstract
Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other players’ parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.
1 Introduction
Generative adversarial networks (GANs) [8] are generative models based on a competition between a generator network $G$ and a discriminator network $D$. The generator network $G$ represents a probability distribution $p_{\text{model}}(x)$. To obtain a sample from this distribution, we apply the generator network to a noise vector $z$ sampled from $p_z$, that is, $x = G(z)$. Typically, $z$ is drawn from a Gaussian or uniform distribution, but any distribution with sufficient diversity is possible. The discriminator $D(x)$ attempts to distinguish whether an input value $x$ is real (came from the training data) or fake (came from the generator).
The goal of the training process is to recover the true distribution $p_{\text{data}}(x)$ that generated the data. Several variants of the GAN training process have been proposed. Different variants of GANs have been interpreted as approximately minimizing different divergences or distances between $p_{\text{data}}$ and $p_{\text{model}}$. However, it has been difficult to understand whether the improvements are caused by a change in the underlying divergence or the learning dynamics.
We conduct several experiments to assess whether the improvements associated with new GAN methods are due to the reasons cited in their design motivation. We perform a comprehensive study of GANs on simplified, synthetic tasks for which the true $p_{\text{data}}$ is known and the relevant distances are straightforward to calculate, to assess the performance of proposed models against baseline methods. We also evaluate GANs using several independent evaluation measures on real data to better understand new approaches. Our contributions are:

We aim to clarify terminology used in recent papers, where the terms “standard GAN,” “regular GAN,” or “traditional GAN” are used without definition (e.g., [2, 5, 22, 6]). The original GAN paper described two different losses: the “minimax” loss and the “nonsaturating” loss, equations (10) and (13) of Goodfellow [7], respectively. Recently, it has become important to clarify this terminology, because many of the criticisms of “standard GANs”, e.g. Arjovsky et al. [2], are applicable only to the minimax GAN, while the nonsaturating GAN is the standard for GAN implementations. The nonsaturating GAN was recommended for use in practice and implemented in the original paper of Goodfellow et al. [8], and is the default in subsequent papers [19, 22, 6, 17] (see footnote 1). To avoid confusion we will always indicate whether we mean minimax GAN (M-GAN) or nonsaturating GAN (NS-GAN).
Footnote 1: The original GAN paper implements both the minimax and nonsaturating cost but uses the nonsaturating cost for the published configurations of experiments: https://github.com/goodfeli/adversarial/blob/master/cifar10_convolutional.yaml#L139. To the best of our knowledge, the DCGAN codebase implements only the nonsaturating cost: https://github.com/soumith/dcgan.torch/blob/master/main.lua#L215. Likewise, the improvedgan codebase implements only the nonsaturating cost: https://github.com/openai/improvedgan/blob/master/imagenet/build_model.py#L114. If only one of these two costs were to be called “standard,” it should be the nonsaturating version.

We demonstrate that gradient penalties designed in the divergence minimization framework—to improve Wasserstein GANs [9] or justified from a game theory perspective to improve minimax GANs [12]—also improve the nonsaturating GAN on both synthetic and real data. We observe improved sample quality and diversity.

We find that nonsaturating GANs are able to fit problems that cannot be fit by Jensen-Shannon divergence minimization. Specifically, Figure 1 shows a GAN using the loss from the original nonsaturating GAN succeeding on a task where the Jensen-Shannon divergence provides no useful gradient. Figure 2 shows that the nonsaturating GAN does not suffer from vanishing gradients when applied to two widely separated Gaussian distributions.
2 Variants of Generative Adversarial Networks
2.1 Nonsaturating and Minimax GANs
In the original GAN formulation [8], the output of the discriminator is a probability and the cost function for the discriminator is given by the negative log-likelihood of the binary discrimination task of classifying samples as real or fake:
$$J^{(D)} = -\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] - \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$
The theoretical analysis in [8] is based on a zero-sum game in which the generator maximizes $J^{(D)}$, a situation that we refer to here as “minimax GANs”. In minimax GANs the generator attempts to generate samples that have low probability of being fake, by minimizing the objective (2). However, in practice, Goodfellow et al. [8] recommend implementing an alternative cost function that instead ensures that generated samples have high probability of being real, and the generator instead minimizes an alternative objective (3).
Minimax: $$J^{(G)} = \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right] \tag{2}$$
Nonsaturating: $$J^{(G)} = -\mathbb{E}_{z \sim p_z}\left[\log D(G(z))\right] \tag{3}$$
We refer to the alternative objective as nonsaturating because its gradient does not saturate (see Figure 2); it was the implementation used in the code of the original paper. We use the nonsaturating objective (3) in all our experiments.
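A short worked derivation may make the name concrete. It assumes, as a simplification not spelled out above, that the discriminator output is $D = \sigma(a)$ for a logit $a$, and it looks at the loss contribution of a single generated sample:

```latex
\begin{align*}
\text{Minimax:} \quad & \frac{\partial}{\partial a} \log\bigl(1 - \sigma(a)\bigr) = -\sigma(a) = -D, \\
\text{Nonsaturating:} \quad & \frac{\partial}{\partial a} \bigl(-\log \sigma(a)\bigr) = -\bigl(1 - \sigma(a)\bigr) = -(1 - D).
\end{align*}
```

Under this assumption, when the discriminator confidently rejects a sample ($D \to 0$) the minimax gradient with respect to the logit goes to zero, while the nonsaturating gradient approaches $-1$.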
2.2 Wasserstein GAN
Wasserstein GANs [2] modify the discriminator to emit an unconstrained real number rather than a probability (analogous to emitting the logits rather than the probabilities used in the original GAN paper). The cost function for the WGAN then omits the log-sigmoid functions used in the original GAN paper. The cost function for the discriminator is now:
$$J^{(D)} = \mathbb{E}_{z \sim p_z}\left[D(G(z))\right] - \mathbb{E}_{x \sim p_{\text{data}}}\left[D(x)\right] \tag{4}$$
The cost function for the generator is simply $J^{(G)} = -J^{(D)}$. When the discriminator is Lipschitz smooth, this approach approximately minimizes the earth mover’s distance between $p_{\text{data}}$ and $p_{\text{model}}$. To enforce Lipschitz smoothness, the weights of $D$ are clipped to lie within $[-c, c]$, where $c$ is some small real number.
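The WGAN costs and the clipping step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the critic scores are assumed to be precomputed arrays, and the default `c=0.01` is an assumed value, not one taken from this text.

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    """Critic cost: mean score on generated samples minus mean score on real data."""
    return float(np.mean(scores_fake) - np.mean(scores_real))

def generator_loss(scores_fake):
    """Generator cost: the negated critic cost, keeping only the term that
    depends on the generator (the real-data term is constant for it)."""
    return float(-np.mean(scores_fake))

def clip_weights(params, c=0.01):
    """Clip every critic parameter array to [-c, c] (c is an assumed default)."""
    return [np.clip(p, -c, c) for p in params]
```

In a training loop, `clip_weights` would be applied to the critic parameters after each critic update.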
2.3 Gradient Penalties for Generative Adversarial Networks
Multiple formulations of gradient penalties have been proposed for GANs. As introduced in Gulrajani et al. [9], the gradient penalty is justified from the perspective of the Wasserstein distance, by imposing properties which hold for an optimal critic as an additional training criterion. In this approach, the gradient penalty is typically a penalty on the gradient norm, and is applied on a linear interpolation between data points and samples, thus smoothing out the space between the two distributions.
Kodali et al. [12] introduce a gradient penalty from the perspective of regret minimization, by setting the regularization function to be a gradient penalty on points around the data manifold, as in Follow The Regularized Leader [3], a standard no-regret algorithm. This encourages the discriminator to be close to linear around the data manifold, thus bringing the set of possible discriminators closer to a convex set, the set of linear functions. We also note that they used the minimax version of the game to define the loss, in which the generator maximizes $J^{(D)}$ rather than minimizing $J^{(G)}$.
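To formalize the above, both proposed gradient penalties are of the form:
$$\mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{5}$$
where $p_{\hat{x}}$ is defined as the distribution defined by the sampling process:
$$x \sim p_{\text{data}}; \quad x_{\text{model}} \sim p_{\text{model}}; \quad x_{\text{noise}} \sim p_{\text{noise}} \tag{6}$$
DRAGAN: $$\tilde{x} = x + x_{\text{noise}} \tag{7a}$$
WGAN-GP: $$\tilde{x} = x_{\text{model}} \tag{7b}$$
$$\alpha \sim U(0, 1) \tag{8}$$
$$\hat{x} = \alpha \tilde{x} + (1 - \alpha) x \tag{9}$$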
As we will note in our experimental section, Kodali et al. [12] also reported that mode collapse is reduced using their version of the gradient penalty.
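The two ways of choosing penalty points can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names are hypothetical, the DRAGAN noise is assumed to be isotropic Gaussian, and `noise_std` is an assumed parameter.

```python
import numpy as np

def interpolate(x_real, x_other, rng):
    """Return alpha * x_other + (1 - alpha) * x_real with one alpha ~ U(0, 1) per example."""
    alpha_shape = (x_real.shape[0],) + (1,) * (x_real.ndim - 1)
    alpha = rng.uniform(size=alpha_shape)
    return alpha * x_other + (1.0 - alpha) * x_real

def wgan_gp_points(x_real, x_model, rng):
    """WGAN-GP style: interpolate between data points and generator samples."""
    return interpolate(x_real, x_model, rng)

def dragan_points(x_real, rng, noise_std=0.1):
    """DRAGAN style: interpolate between data points and noisy copies of them.
    Gaussian noise and noise_std are assumptions made for this sketch."""
    x_noisy = x_real + noise_std * rng.standard_normal(x_real.shape)
    return interpolate(x_real, x_noisy, rng)
```

The only difference between the two samplers is the second endpoint of the interpolation: a generator sample in one case, a perturbed data point in the other.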
2.3.1 Nonsaturating GAN with Gradient Penalty
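We consider the nonsaturating GAN objective (3) supplemented by two gradient penalties: the penalty proposed by Gulrajani et al. [9], which we refer to as “GAN-GP”, and the gradient penalty proposed by DRAGAN [12], which we refer to as DRAGAN-NS to emphasize that we use the nonsaturating generator loss function. In both cases, the gradient penalty applies only to the discriminator, with the generator loss remaining unchanged (as defined in Equation 3). In this setting, the loss of the discriminator becomes:
$$J^{(D)} = -\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] - \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{10}$$
where $\lambda$ is the gradient penalty coefficient.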
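As a rough sketch of how such a penalized discriminator loss could be computed, the snippet below uses a one-hidden-layer ReLU discriminator whose input gradient is available in closed form. Several choices here are assumptions made for illustration: the function names, the decision to penalize the gradient of the logit rather than of the probability, and the default `lam=10.0`.

```python
import numpy as np

def disc_logit_and_input_grad(x, W1, b1, w2, b2):
    """Logit a(x) = w2 . relu(W1 x + b1) + b2 and its gradient with respect to x."""
    pre = x @ W1.T + b1                       # (n, h) pre-activations
    logits = np.maximum(pre, 0.0) @ w2 + b2   # (n,)
    active = (pre > 0).astype(x.dtype)        # ReLU derivative, (n, h)
    grad_x = (active * w2) @ W1               # (n, d)
    return logits, grad_x

def log_sigmoid(a):
    """Numerically stable log(sigmoid(a))."""
    return -np.logaddexp(0.0, -a)

def penalized_disc_loss(x_real, x_fake, x_hat, params, lam=10.0):
    """Negative log-likelihood of real/fake classification plus a gradient
    penalty evaluated at the points x_hat (lam is an assumed default)."""
    a_real, _ = disc_logit_and_input_grad(x_real, *params)
    a_fake, _ = disc_logit_and_input_grad(x_fake, *params)
    _, g_hat = disc_logit_and_input_grad(x_hat, *params)
    # log(1 - sigmoid(a)) equals log_sigmoid(-a)
    gan_loss = -np.mean(log_sigmoid(a_real)) - np.mean(log_sigmoid(-a_fake))
    penalty = np.mean((np.linalg.norm(g_hat, axis=1) - 1.0) ** 2)
    return float(gan_loss + lam * penalty)
```

The generator loss is untouched by the penalty, so only the discriminator update would use `penalized_disc_loss`.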
We consider these GAN variants because:

We want to assess whether gradient penalties are effective outside their original defining scope. Namely, we perform experiments to determine whether the benefit obtained by applying the gradient penalty for Wasserstein GANs is obtained from properties of the earth mover’s distance, or from the penalty itself. Similarly, we evaluate whether the DRAGAN gradient penalty is beneficial outside the minimax GAN setting.

We want to assess whether the exact form of the gradient penalty matters.

We compare three models to control for different aspects of training: the same gradient penalty but different underlying adversarial losses (GAN-GP versus WGAN-GP), as well as the same underlying adversarial loss but different gradient penalties (GAN-GP versus DRAGAN-NS).
We note that we do not compare with the original DRAGAN formulation, which uses the minimax GAN formulation, since in this work we focus on nonsaturating GAN variants.
3 Many Paths to Equilibrium
The original GAN paper [8] used the correspondence between $J^{(D)}$ and the Jensen-Shannon divergence to characterize the Nash equilibrium of minimax GANs. It is important to keep in mind that there are many ways for the learning process to approach this equilibrium point, and the majority of them do not correspond to gradually reducing the Jensen-Shannon divergence at each step. Divergence minimization is useful for understanding the outcome of training, but GAN training is not the same thing as running gradient descent on a divergence, and GAN training may not encounter the same problems as gradient descent applied to a divergence.
Arjovsky et al. [2] describe the learning process of GANs from the perspective of divergence minimization and show that the Jensen-Shannon divergence is unable to provide a gradient that will bring $p_{\text{data}}$ and $p_{\text{model}}$ together if both are sharp manifolds that do not overlap early in the learning process. Following this line of reasoning, they suggest that when applied to probability distributions that are supported only on low-dimensional manifolds, the Kullback-Leibler (KL), Jensen-Shannon (JS) and Total Variation (TV) distance metrics do not provide a useful gradient for learning algorithms based on gradient descent, and that the “traditional GAN” is therefore inappropriate for fitting such low-dimensional manifolds. (“Traditional GAN” seems to refer to the minimax version of GANs used for theoretical analysis in the original paper, and there is no explicit statement about whether the argument is intended to apply to the nonsaturating GAN implemented in the code accompanying the original GAN paper.) In Section 4 we show that nonsaturating GANs are able to learn on tasks where the data distribution lies on a low-dimensional manifold.
We show that nonsaturating GANs do not suffer from vanishing gradients for two widely separated Gaussians in Figure 2. The fact that the gradient of the recommended loss does not actually vanish explains why GANs with the nonsaturating objective (3) are able to bring together two widely separated Gaussian distributions. Note that the gradient for this loss does not vanish even when the discriminator is optimal. The discriminator has vanishing gradients but the generator loss amplifies small differences in discriminator outputs to recover strong gradients. This means it is possible to train the GAN by changing the loss rather than the discriminator.
For the parallel lines thought experiment [2] (see Figure 1), the main problem with the Jensen-Shannon divergence is that it is parameterized in terms of the density function, and the two density functions have no support in common. Most GANs, and many other models, can solve this problem by parameterizing their loss functions in terms of samples from the two distributions rather than in terms of their density functions.
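A small numerical check illustrates the point about the optimal discriminator. The setup is an assumption chosen for simplicity, not the configuration behind Figure 2: data drawn from a one-dimensional $\mathcal{N}(0, 1)$, the model distribution $\mathcal{N}(\mu, 1)$, and the closed-form optimal discriminator $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_{\text{model}}(x))$. All function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def optimal_logit(x, mu_model):
    """log N(x; 0, 1) - log N(x; mu_model, 1), the logit of the optimal discriminator."""
    return 0.5 * mu_model ** 2 - mu_model * x

def generator_sample_gradients(x, mu_model):
    """Gradients of the two generator losses with respect to a generated sample x."""
    d = sigmoid(optimal_logit(x, mu_model))
    dlogit_dx = -mu_model
    grad_minimax = -d * dlogit_dx          # d/dx log(1 - D(x))
    grad_nonsat = -(1.0 - d) * dlogit_dx   # d/dx [-log D(x)]
    return grad_minimax, grad_nonsat

# A generated sample sitting at the model mean, far from the data.
g_minimax, g_nonsat = generator_sample_gradients(np.array([10.0]), mu_model=10.0)
print(g_minimax, g_nonsat)
```

In this toy setting the minimax gradient is numerically negligible while the nonsaturating gradient stays close to $\mu$ in magnitude, so a gradient descent step still moves the sample toward the data.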
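The lack of a useful gradient can be seen from a short calculation. Assuming two distributions $p$ and $q$ with densities whose supports are disjoint, and writing $m$ for their mixture:

```latex
\begin{align*}
m &= \tfrac{1}{2}(p + q), \qquad
\mathrm{JS}(p, q) = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m), \\
\mathrm{KL}(p \,\|\, m) &= \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{\tfrac{1}{2}\,p(x)}\right] = \log 2
\quad \text{(since } q = 0 \text{ on the support of } p\text{)}, \\
\mathrm{JS}(p, q) &= \tfrac{1}{2}\log 2 + \tfrac{1}{2}\log 2 = \log 2 .
\end{align*}
```

The value $\log 2$ does not depend on how far apart the two supports are, so the divergence is constant with respect to the offset between the lines until they coincide.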
4 Synthetic Experiments
To assess the learning process of GANs we empirically examine GAN training on pathological tasks where the data is constructed to lie on a low dimensional manifold, and show the model is able to learn the data distribution in cases where using the underlying divergence obtained at optimality would not provide useful gradients. We then evaluate convergence properties of common GAN variants on this task where the parameters generating the distribution are known.
4.1 Experiment I: 1D Data Manifold and 1D generator
In our first experiment, we generate synthetic training data that lies along a one-dimensional line and design a one-dimensional generative model; however, we embed the problem in a higher-dimensional space of dimension $d$. This experiment is essentially an implementation of a thought experiment from Arjovsky et al. [2].
Specifically, in a $d$-dimensional space, we define $p_{\text{data}}$ by randomly generating parameters defining the distribution once at the beginning of the experiment. We generate a random $b_r \in \mathbb{R}^d$ and random $w_r \in \mathbb{R}^d$. Our latent $z \sim \mathcal{N}(0, \sigma^2)$, where $\sigma$ is the standard deviation of the normal distribution. The synthetic training data of $m$ examples is then given by
$$x^{(i)} = b_r + z^{(i)} w_r, \quad i = 1, \dots, m \tag{11}$$
The real synthetic data is therefore Gaussian distributed on a 1D surface within the space, where the position is determined by $b_r$ and the orientation is determined by $w_r$.
The generator also assumes the same functional form, that is, it is also intrinsically one dimensional,
$$G(z) = b_g + z\, w_g \tag{12}$$
where $b_g \in \mathbb{R}^d$ and $w_g \in \mathbb{R}^d$. The discriminator is a single hidden layer ReLU network, which is of higher complexity than the generator so that it may learn nonlinear boundaries in the space.
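The data-generating process and the one-dimensional generator can be sketched as below. How the offset and direction vectors are drawn is not stated here, so standard normal draws are an assumption, as are the function names.

```python
import numpy as np

def make_line_data(d, m, sigma, rng):
    """Draw m points on a random 1D line embedded in d dimensions."""
    b_r = rng.standard_normal(d)              # offset (assumed standard normal)
    w_r = rng.standard_normal(d)              # direction (assumed standard normal)
    z = sigma * rng.standard_normal((m, 1))   # one scalar latent per example
    x = b_r + z * w_r                         # broadcasts to shape (m, d)
    return x, b_r, w_r

def line_generator(z, b_g, w_g):
    """Generator with the same functional form: an offset plus a scaled direction."""
    return b_g + z * w_g

rng = np.random.default_rng(0)
x, b_r, w_r = make_line_data(d=50, m=200, sigma=2.0, rng=rng)
```

Because every example differs from the offset only by a multiple of one direction vector, the centered data matrix has rank one regardless of the ambient dimension.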
This experiment captures the idea of sharp, nonoverlapping manifolds that motivate alternative GAN losses. Further, because we know the true generating parameters of the training data, we may explicitly test convergence properties of the various methodologies.
4.2 Experiment II: 1D Data Manifold and overcomplete generator
In our second experiment, the synthetic training data is the same (lying on a 1D line) and given by Eq. 11, but now the generator is overcomplete for this task and has a higher latent dimension $g$, where $g > 1$.
$$G(z) = b_g + W_g z \tag{13}$$
where matrix $W_g \in \mathbb{R}^{d \times g}$ and vector $b_g \in \mathbb{R}^d$, so that the generator is able to represent a manifold with too high a dimensionality. The generator parameterizes a multivariate Gaussian $\mathcal{N}(x; \mu_g, \Sigma_g)$ with $\mu_g = b_g$. The covariance matrix elements are $(\Sigma_g)_{ij} = \sum_k (W_g)_{ik} (W_g)_{jk}$. In vector notation, $\Sigma_g = W_g W_g^\top$.
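The relation between the generator weights and the Gaussian it parameterizes can be checked empirically. The sketch assumes a latent $z \sim \mathcal{N}(0, I)$, which is what makes the covariance equal to the product of the weight matrix with its transpose; the function names and the small example matrix are hypothetical.

```python
import numpy as np

def overcomplete_generator(z, W_g, b_g):
    """Apply an affine map to a batch of latents z with shape (n, g)."""
    return z @ W_g.T + b_g

def generator_gaussian_params(W_g, b_g):
    """Mean and covariance of the generator output for z ~ N(0, I)."""
    return b_g, W_g @ W_g.T

W_g = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # d = 3, g = 2
b_g = np.array([1.0, 2.0, 3.0])
mu_g, sigma_g = generator_gaussian_params(W_g, b_g)

rng = np.random.default_rng(0)
samples = overcomplete_generator(rng.standard_normal((200000, 2)), W_g, b_g)
empirical_cov = np.cov(samples, rowvar=False)   # close to sigma_g for large n
```

With enough samples, `empirical_cov` approaches `sigma_g`, which is why the fitted generator can be compared with the data distribution directly through its parameters.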
4.3 Results
To evaluate the convergence of an experimental trial, we report the squared error (squared norm) between the known Gaussian parameters generating the synthetic data and the fitted generator Gaussian parameters. In our notation, the subscript $r$ denotes real data and the subscript $g$ denotes the generator Gaussian parameters: $\lVert \mu_r - \mu_g \rVert^2$ and $\lVert \Sigma_r - \Sigma_g \rVert^2$.
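A possible implementation of this error measure is sketched below. The choice of the squared Frobenius norm for the covariance term is an assumption, as are the function names; the data covariance follows from a line with a scalar Gaussian latent.

```python
import numpy as np

def line_gaussian_params(b, w, sigma):
    """Mean and covariance of x = b + z * w with z ~ N(0, sigma^2)."""
    return b, sigma ** 2 * np.outer(w, w)

def parameter_errors(mu_r, cov_r, mu_g, cov_g):
    """Squared error between real and generator Gaussian parameters.
    The covariance error uses the squared Frobenius norm (an assumption)."""
    mean_err = float(np.sum((mu_r - mu_g) ** 2))
    cov_err = float(np.sum((cov_r - cov_g) ** 2))
    return mean_err, cov_err
```

Both errors are zero exactly when the generator reproduces the mean and covariance of the data distribution.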
Every GAN variant was trained for 200000 iterations, and 5 discriminator updates were done for each generator update.
The main conclusions from our synthetic data experiments are:

Despite the inability of Jensen-Shannon divergence minimization to solve this problem, we find that the nonsaturating GAN succeeds in converging to the 1D data manifold (Figure 3). However, in higher dimensions the resulting fit is not as strong as the other methods: Figure 11 shows that increasing the number of dimensions while keeping the learning rate fixed can decrease the performance of the nonsaturating GAN model.

Nonsaturating GANs are able to learn data distributions which are disjoint from the training sample distribution at initialization (or another point in training), as demonstrated in Figure 1.

Updating the discriminator 5 times per generator update does not result in vanishing gradients when using the nonsaturating cost. This dispels the notion that a strong discriminator would not provide useful gradients during training.

An overcapacity generator with the ability to have more directions of high variance than the underlying data is able to capture the data distribution using nonsaturating GAN training (Figure 4).






4.4 Hyperparameter sensitivity
We assess the robustness of the considered models by looking at results across hyperparameters for both experiment 1 and experiment 2. In one setting, we keep the input dimension fixed while varying the learning rate (Figure 10); in another setting, we keep the learning rate fixed, while varying the input dimension (Figure 11). In both cases, the results are averaged out over 1000 runs per setting, each starting from a different random seed. We notice that:

The nonsaturating GAN model (with no gradient penalty) is most sensitive to hyperparameters.

Gradient penalties make the nonsaturating GAN model more robust.

Both Wasserstein GAN formulations are quite robust to hyperparameter changes.

For certain hyperparameter settings, there is no performance difference between the two gradient penalties for the nonsaturating GAN, when averaging across random seeds. This is especially visible in Experiment 1, when the number of latent variables is 1. This could be due to the fact that the data sits on a low dimensional manifold, and because the discriminator is a small, shallow network.
5 Real data experiments
To assess the effectiveness of the gradient penalty on standard datasets for the nonsaturating GAN formulation, we train a nonsaturating GAN, a nonsaturating GAN with the gradient penalty introduced by [9] (denoted by GAN-GP), a nonsaturating GAN with the gradient penalty introduced by [12] (denoted by DRAGAN-NS), and a Wasserstein GAN with gradient penalty (WGAN-GP) on three datasets: Color MNIST [15]  data dimensionality , CelebA [14]  data dimensionality and CIFAR-10 [13]  data dimensionality , as seen in Figure 7.
For all our experiments we used as the gradient penalty coefficient and used batch normalization [10]; Kodali et al. [12] suggest that batch normalization is not needed for DRAGAN, but we found that it also improved our DRAGAN-NS results. We used the Adam optimizer [11] with and and a batch size of 64. The input data was scaled to be between -1 and 1. We did not add any noise to the discriminator inputs or activations, as that regularization technique can be interpreted as having the same goal as gradient penalties, and we wanted to avoid a confounding factor. We trained all Color MNIST models for 100000 iterations, and CelebA and CIFAR-10 models for 200000 iterations. We note that the experimental results on real data for the nonsaturating GAN and for the Improved Wasserstein GAN (WGAN-GP) are quoted with permission from an earlier publication by Rosca et al. [20].
We note that the WGAN-GP model was the only model for which we did 5 discriminator updates per generator update in real data experiments. All other models (DCGAN, DRAGAN-NS, GAN-GP) used one discriminator update per generator update.
For all reported results, we sweep over two hyperparameters:

Learning rates for the discriminator and generator. Following Radford et al. [19], we tried learning rates of 0.0001, 0.0002, 0.0003 for both the discriminator and the generator. We note that this is consistent with WGAN-GP, where the authors use 0.0002 for CIFAR-10 experiments.

Number of latents. For CelebA and CIFAR-10 we try latent sizes 100, 128 and 150, while for Color MNIST we try 10, 50, 75.
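The resulting grid can be enumerated with a few lines of Python. The dictionary keys and the function name are hypothetical; the values are the ones listed above.

```python
from itertools import product

LEARNING_RATES = (0.0001, 0.0002, 0.0003)
LATENT_SIZES = {
    "color_mnist": (10, 50, 75),
    "celeba": (100, 128, 150),
    "cifar10": (100, 128, 150),
}

def sweep(dataset):
    """All combinations of discriminator rate, generator rate and latent size."""
    return [
        {"disc_lr": d_lr, "gen_lr": g_lr, "latent_size": n}
        for d_lr, g_lr, n in product(LEARNING_RATES, LEARNING_RATES, LATENT_SIZES[dataset])
    ]

print(len(sweep("celeba")))
```

Each dataset therefore gets 3 x 3 x 3 = 27 settings per model.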
5.1 Evaluation
Unlike the synthetic case, here we are unable to evaluate the performance of our models relative to the true solution, since that is unknown. Moreover, there is no single metric that can evaluate the performance of GANs. We thus complement visual inspection with three metrics, each measuring a different criterion related to model performance. We use the Inception Score [22] to measure how visually appealing CIFAR-10 samples are, the MS-SSIM metric [24, 18] to check sample diversity, and an Improved Wasserstein independent critic to assess overfitting, as well as sample quality [4]. For a more detailed explanation of these metrics, we refer to Rosca et al. [20]. In all our experiments, we control for discriminator and generator architectures, using the ones used by DCGAN [19] and the original WGAN paper [2] (see footnote 2). We note that the WGAN-GP paper used a different architecture when reporting the Inception Score on CIFAR-10, and thus their results are not directly comparable.
Footnote 2: Code at: https://github.com/martinarjovsky/WassersteinGAN/blob/master/models/dcgan.py.
For all the metrics, we report both the hyperparameter sensitivity of the model (by showing quartile statistics), as well as the 10 best results according to the metric. The sample diversity measure needs to be seen in context with the value reported on the test set: too high diversity can mean failure to capture the data distribution. For all other metrics, higher is better.
5.2 Visual sample inspection
By visually inspecting the results of our models, we noticed that applying gradient penalties to the nonsaturating GAN results in more stable training across the board. When training the nonsaturating GAN with no gradient penalty, we did observe cases of severe mode collapse (see Figure 14). Gradient penalties improve upon that, but we can still observe mode collapse: each nonsaturating GAN variant with gradient penalty (DRAGAN-NS and GAN-GP) produced mode collapse on only one dataset (see Figure 18). We also noticed that for certain learning rates, WGAN-GPs fail to learn the data distribution (Figure 19). For the GAN-GP and DRAGAN-NS models, most hyperparameters produced samples of equal quality; the models are quite robust. We show samples from the GAN-GP, DRAGAN-NS and WGAN-GP models in Figures 15, 16 and 17.
5.3 Metrics
We show that gradient penalties make nonsaturating GANs more robust to hyperparameter changes. For this, we report not only the best obtained results but also a box plot showing the quartiles obtained by each sweep, with the top 10 results shown explicitly in the graph (note that for each model we tried 27 different hyperparameter settings, corresponding to 3 discriminator learning rates, 3 generator learning rates and 3 generator input sizes). We report two Inception Score metrics for CIFAR-10: one using the standard Inception network used when the metric was introduced [22], trained on the ImageNet dataset, and one using a VGG-style network trained on CIFAR-10 (for details on the architecture, we refer the reader to Rosca et al. [20]). We report the former to be compatible with existing literature, and the latter to obtain a more meaningful metric: since the network doing the evaluation was trained on the same dataset as the one we evaluate, the learned features will be more relevant for the task at hand. When reporting sample diversity, we subtract the average pairwise image similarity (as reported by MS-SSIM), computed as the mean of the similarity of every pair of images from 5 batches from the test set. Note that we can only apply this measure to CelebA, since for datasets such as CIFAR-10 different classes are represented by very different images, making this metric meaningless across class borders. Since our models are completely unsupervised, we do not compute the similarity across samples of the same class as in [18]. The Inception Score and sample diversity metric results can be seen in Figure 9. The results obtained using the Independent Wasserstein critic on all datasets can be found in Figure 8.
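For readers unfamiliar with the Inception Score, the following sketch computes it from a matrix of class probabilities using the usual definition, the exponential of the average KL divergence between per-sample class predictions and their marginal. The classifier that produces the probabilities is outside the scope of the sketch, and the function name and the `eps` smoothing constant are assumptions.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities of shape (n_samples, n_classes)."""
    marginal = probs.mean(axis=0, keepdims=True)   # p(y), averaged over samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

confident_and_diverse = np.eye(4)           # each sample confidently in its own class
uninformative = np.full((4, 4), 0.25)       # every prediction is uniform
print(inception_score(confident_and_diverse), inception_score(uninformative))
```

The score ranges from 1, when every prediction equals the marginal, up to the number of classes, when predictions are confident and evenly spread over the classes.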
5.4 Key takeaways from real data experiments
When analyzing the results obtained by training nonsaturating GANs using gradient penalties (GAN-GP and DRAGAN-NS), we notice that:

Both gradient penalties help when training nonsaturating GANs, by making the models more robust to hyperparameters.

On CelebA, for various hyperparameter settings WGAN-GP fails to learn the data distribution and produces samples that do not look like faces (Figure 19). This results in a higher sample diversity than the reference diversity obtained on the test set, as reported by our diversity metric; see Figure 9(a), which compares sample diversity for the considered models across hyperparameters. The same figure shows that for most hyperparameter values, the WGAN-GP model produces higher diversity than the one obtained on the test set (indicating failure to capture the data distribution), while for most hyperparameters nonsaturating GAN variants produce samples with lower diversity than that of the test set (indicating mode collapse). However, WGAN-GP is closer to the reference value for more hyperparameters, compared to the nonsaturating GAN variants.

Even if we are only interested in the best results (without looking across the hyperparameter sweep), we see that the gradient penalties tend to improve results for nonsaturating GANs.

The nonsaturating GAN trained with gradient penalties produces better samples which give better Inception Scores, both when looking at the results obtained from the best set of hyperparameters and when looking at the entire sweep.

While the nonsaturating GAN variants are much faster to train than the WGAN-GP model (since we do only one discriminator update per generator update), they perform similarly to the WGAN-GP model. Thus, nonsaturating GANs with penalties offer a better computation versus performance tradeoff. When we trained WGAN-GP models in which we update the discriminator only once per generator update, we noticed a decrease in sample quality for all datasets, reflected by our reported metrics, as seen in Figure 12.

When looking at the Independent Wasserstein critic results, we see that the WGAN-GP models perform best on Color MNIST and CIFAR-10. However, on CelebA the Independent Wasserstein critic can distinguish between validation data examples and samples from the model (see Figure 8(b)). This is consistent with what we have seen by examining samples: the hyperparameters which result in samples of reduced quality are the same ones that yield a reduced negative Wasserstein distance.

The sample diversity metric and the Independent Wasserstein critic detect mode collapse. When DRAGAN-NS collapses for two hyperparameter settings, the negative Wasserstein distance reported by the critic for these jobs is low, showing that the critic captures the difference in distributions, and the sample diversity reported for those settings is greatly reduced (Figure 13).
6 Discussion
We have shown that viewing the training dynamics of GANs through the lens of the underlying divergence at optimality can be misleading. On low-dimensional synthetic problems, we showed that nonsaturating GANs are able to learn the true data distribution where Jensen-Shannon divergence minimization would fail. We also showed that gradient penalty regularizers help improve the training dynamics and robustness of nonsaturating GANs. It is worth noting that one of the gradient penalty regularizers was originally proposed for Wasserstein GANs, motivated by properties of the Wasserstein distance; evaluating nonsaturating GANs with similar gradient penalty regularizers helps disentangle the improvements arising from optimizing a different divergence (or distance) and the improvements from better training dynamics.
Comparison between explored gradient penalties:
As described in Section 2.3, we have evaluated two gradient penalties on nonsaturating GANs. We now turn our attention to the distinction between the two gradient penalties. We have already noted that for a few hyperparameter settings, DRAGAN-NS produced samples with mode collapse, while the GAN-GP model did not. By looking at the resulting metrics, we note that there is no clear winner between the two types of gradient penalties. To assess whether the two penalties have a different regularization effect, we also tried applying both (with a gradient penalty coefficient of 10 for both, or of 5 for both), but that did not result in better models. This could be because the two penalties have a very similar effect, or due to optimization considerations (they might conflict with each other).
Other gradient penalties:
Besides the gradient penalties explored in this work, several other regularizers have been proposed for stabilizing GAN training.
Roth et al. [21] proposed a gradient penalty aiming to smooth the discriminator of $f$-GANs (including the minimax GAN), which we refer to as $f$-GAN-GP, inspired by Sønderby et al. [23] and Arjovsky and Bottou [1]. Their gradient penalty is different from the ones explored here; specifically, their gradient penalty is weighted by the square of the discriminator’s probability of real for each data instance and the penalty is applied to data and samples (no noise is added).
In Fisher GAN [16], the magnitude of the discriminator's output on data as well as samples is constrained directly through an added equality constraint, as opposed to penalizing the magnitude of the discriminator gradients, as in WGAN-GP. Similar to WGAN-GP, the penalty was introduced in the framework of integral probability metrics, but it can be directly applied to other approaches to GAN training. Unlike WGAN-GP, Fisher GAN uses augmented Lagrangians to impose the equality constraint, instead of a penalty method. To the best of our knowledge, applying this constraint to other approaches to GAN training has not been tried yet and we leave it for future work.
The regularizers assessed in this work (the penalties proposed by DRAGAN and WGAN-GP), as well as others (such as the penalty of Roth et al. [21] and Fisher GAN) are similar in spirit, but have been proposed from distinct theoretical considerations. Future study of GAN regularizers will determine how these regularizers interact, and help us understand the mechanism by which they stabilize GAN training and motivate new approaches.
Acknowledgements
We thank Ivo Danihelka and Jascha Sohl-Dickstein for helpful feedback and discussions.
References
[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[4] I. Danihelka, B. Lakshminarayanan, B. Uria, D. Wierstra, and P. Dayan. Comparison of maximum likelihood and GAN-based training of real NVPs. arXiv preprint arXiv:1705.05263, 2017.
[5] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.
[6] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[7] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[9] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] N. Kodali, J. Abernethy, J. Hays, and Z. Kira. How to train your DRAGAN. arXiv preprint arXiv:1705.07215, 2017.
[13] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[14] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[15] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[16] Y. Mroueh and T. Sercu. Fisher GAN. arXiv preprint arXiv:1705.09675, 2017.
[17] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[18] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
[19] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[20] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
[21] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. arXiv preprint arXiv:1705.09367, 2017.
[22] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[23] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
[24] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pages 1398–1402. IEEE, 2003.