Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step

Abstract

Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other player’s parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.

1 Introduction

Generative adversarial networks (GANs) [8] are generative models based on a competition between a generator network $G$ and a discriminator network $D$. The generator network represents a probability distribution $p_{\text{model}}$. To obtain a sample from this distribution, we apply the generator network to a noise vector $z$ sampled from a prior $p(z)$, that is, $x = G(z)$. Typically, $z$ is drawn from a Gaussian or uniform distribution, but any distribution with sufficient diversity is possible. The discriminator $D$ attempts to distinguish whether an input value $x$ is real (came from the training data) or fake (came from the generator).

The goal of the training process is to recover the true distribution $p_{\text{data}}$ that generated the data. Several variants of the GAN training process have been proposed. Different variants of GANs have been interpreted as approximately minimizing different divergences or distances between $p_{\text{data}}$ and $p_{\text{model}}$. However, it has been difficult to understand whether the improvements are caused by a change in the underlying divergence or the learning dynamics.

We conduct several experiments to assess whether the improvements associated with new GAN methods are due to the reasons cited in their design motivation. We perform a comprehensive study of GANs on simplified, synthetic tasks for which the true is known and the relevant distances are straightforward to calculate, to assess the performance of proposed models against baseline methods. We also evaluate GANs using several independent evaluation measures on real data to better understand new approaches. Our contributions are:

  • We aim to clarify terminology used in recent papers, where the terms “standard GAN,” “regular GAN,” or “traditional GAN” are used without definition (e.g., [2]). The original GAN paper described two different losses: the “minimax” loss and the “non-saturating” loss, equations (10) and (13) of [7], respectively. It has recently become important to clarify this terminology, because many of the criticisms of “standard GANs,” e.g. [2], are applicable only to the minimax GAN, while the non-saturating GAN is the standard in GAN implementations. The non-saturating GAN was recommended for use in practice and implemented in the original paper [8], and is the default in subsequent papers [19]¹. To avoid confusion, we always indicate whether we mean the minimax GAN (M-GAN) or the non-saturating GAN (NS-GAN).

  • We demonstrate that gradient penalties designed in the divergence minimization framework—to improve Wasserstein GANs [9] or justified from a game theory perspective to improve minimax GANs [12]—also improve the non-saturating GAN on both synthetic and real data. We observe improved sample quality and diversity.

  • We find that non-saturating GANs are able to fit problems that cannot be fit by Jensen-Shannon divergence minimization. Specifically, Figure ? shows a GAN trained with the original non-saturating loss succeeding on a task where the Jensen-Shannon divergence provides no useful gradient. Figure 1 shows that the non-saturating GAN does not suffer from vanishing gradients when applied to two widely separated Gaussian distributions.

2 Variants of Generative Adversarial Networks

2.1 Non-Saturating and Minimax GANs

In the original GAN formulation [8], the output of the discriminator is a probability and the cost function for the discriminator is given by the negative log-likelihood of the binary discrimination task of classifying samples as real or fake:

$$J^{(D)} = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] - \frac{1}{2}\,\mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]. \qquad (1)$$
The theoretical analysis in [8] is based on a zero-sum game in which the generator maximizes $J^{(D)}$, a situation that we refer to here as the “minimax GAN”. In the minimax GAN the generator attempts to generate samples that have low probability of being fake, by minimizing the objective

$$J^{(G)} = \frac{1}{2}\,\mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]. \qquad (2)$$

However, in practice, [8] recommend implementing an alternative cost function that instead ensures that generated samples have high probability of being real, so the generator instead minimizes

$$J^{(G)} = -\frac{1}{2}\,\mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]. \qquad (3)$$

We refer to the alternative objective (Equation 3) as non-saturating, due to the non-saturating behavior of its gradient (see Figure 1); it was the implementation used in the code of the original paper. We use the non-saturating objective (Equation 3) in all our experiments.

As shown in [8], when the discriminator $D$ is trained to optimality, minimizing Equation 2 with respect to the generator is equivalent to minimizing the Jensen-Shannon divergence between $p_{\text{data}}$ and $p_{\text{model}}$. [8] use this observation to establish that there is a unique Nash equilibrium in function space, corresponding to $p_{\text{model}} = p_{\text{data}}$.
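To make the distinction between the two generator losses concrete, the following minimal sketch (ours, not from the original paper) evaluates the gradient of each loss with respect to the discriminator logit; the specific logit values are arbitrary illustrative inputs:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Discriminator logits o for generated samples, with D(G(z)) = sigmoid(o).
# Very negative logits mean D confidently labels the samples as fake.
logits = np.array([-10.0, -5.0, 0.0, 5.0])

# Minimax generator loss (Equation 2): 1/2 log(1 - sigmoid(o)).
# Its gradient w.r.t. o is -1/2 sigmoid(o), which vanishes as o -> -inf.
minimax_grad = -0.5 * sigmoid(logits)

# Non-saturating loss (Equation 3): -1/2 log sigmoid(o).
# Its gradient w.r.t. o is -1/2 (1 - sigmoid(o)), which stays near -1/2
# exactly where the minimax gradient vanishes.
non_saturating_grad = -0.5 * (1.0 - sigmoid(logits))

print(minimax_grad)         # ~[-2.3e-05 -3.3e-03 -2.5e-01 -4.97e-01]
print(non_saturating_grad)  # ~[-5.0e-01 -4.97e-01 -2.5e-01 -3.3e-03]
```

The two losses agree where the discriminator is uncertain, but provide strong gradients in complementary regimes: the non-saturating loss is strongest precisely where generated samples are confidently rejected.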

2.2 Wasserstein GAN

Wasserstein GANs [2] modify the discriminator to emit an unconstrained real number rather than a probability (analogous to emitting the logits rather than the probabilities used in the original GAN paper). The cost function for the WGAN discriminator then omits the log-sigmoid functions used in the original GAN paper:

$$J^{(D)} = -\mathbb{E}_{x \sim p_{\text{data}}}\left[D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[D(G(z))\right].$$
The cost function for the generator is simply $J^{(G)} = -\mathbb{E}_{z \sim p(z)}\left[D(G(z))\right]$. When the discriminator is Lipschitz continuous, this approach approximately minimizes the earth mover’s distance between $p_{\text{data}}$ and $p_{\text{model}}$. To enforce the Lipschitz constraint, the weights of $D$ are clipped to lie within $[-c, c]$, where $c$ is some small real number.
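As a concrete illustration, here is a minimal PyTorch sketch of one WGAN critic update with weight clipping; the network definitions and optimizer are assumed to exist, and `clip=0.01` follows the value suggested in [2]:

```python
import torch

def wgan_critic_step(D, G, x_real, z, opt_D, clip=0.01):
    # Critic cost: E[D(G(z))] - E[D(x)].
    opt_D.zero_grad()
    loss = D(G(z).detach()).mean() - D(x_real).mean()
    loss.backward()
    opt_D.step()
    # Weight clipping to roughly enforce the Lipschitz constraint.
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-clip, clip)
    return loss.item()

def wgan_generator_loss(D, G, z):
    # Generator cost: -E[D(G(z))].
    return -D(G(z)).mean()
```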

2.3 Gradient Penalties for Generative Adversarial Networks

Multiple formulations of gradient penalties have been proposed for GANs. As introduced in [9], the gradient penalty is justified from the perspective of the Wasserstein distance, by imposing properties which hold for an optimal critic as an additional training criterion. In this approach, the gradient penalty is typically a penalty on the gradient norm, and is applied on a linear interpolation between data points and samples, thus smoothing out the space between the two distributions.

[12] introduce a gradient penalty from the perspective of regret minimization, by setting the regularization function to be a gradient penalty on points around the data manifold, as in Follow The Regularized Leader [3], a standard no-regret algorithm. This encourages the discriminator to be close to linear around the data manifold, thus bringing the set of possible discriminators closer to a convex set, the set of linear functions. We also note that they use the minimax version of the game to define the loss, in which the generator maximizes $J^{(D)}$ rather than minimizing the non-saturating objective (Equation 3).

To formalize the above, both proposed gradient penalties are of the form

$$\lambda\,\mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\left\lVert\nabla_{\hat{x}} D(\hat{x})\right\rVert_2 - 1\right)^2\right],$$

where $p_{\hat{x}}$ is the distribution defined by the sampling process: for the penalty of [9], $\hat{x} = \epsilon x + (1 - \epsilon) G(z)$ with $x \sim p_{\text{data}}$, $z \sim p(z)$ and $\epsilon \sim U[0, 1]$ (a linear interpolation between data points and samples); for the DRAGAN penalty [12], $\hat{x} = x + \delta$ with $x \sim p_{\text{data}}$ and $\delta$ a noise vector (a perturbation around the data manifold).
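A minimal PyTorch sketch of the shared penalty term and both sampling processes follows; the DRAGAN noise scale used here is our assumption, not the tuned value from [12]:

```python
import torch

def gradient_penalty(D, x_hat):
    # E[(||grad_x D(x_hat)||_2 - 1)^2], evaluated at the points x_hat.
    x_hat = x_hat.detach().requires_grad_(True)
    grads, = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)
    norms = grads.flatten(1).norm(2, dim=1)
    return ((norms - 1.0) ** 2).mean()

def wgan_gp_points(x_real, x_fake):
    # [9]: linear interpolation between data points and samples.
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)),
                     device=x_real.device)
    return eps * x_real + (1.0 - eps) * x_fake

def dragan_points(x_real, noise_scale=0.5):
    # [12]: perturbations around the data manifold (assumed noise scale).
    return x_real + noise_scale * x_real.std() * torch.randn_like(x_real)
```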

As we will note in our experimental section, [12] also reported that mode collapse is reduced using their version of the gradient penalty.

Non-saturating GAN with Gradient Penalty

We consider the non-saturating GAN objective (Equation 3) supplemented by two gradient penalties: the penalty proposed by [9], which we refer to as “GAN-GP”, and the gradient penalty proposed by DRAGAN [12], which we refer to as DRAGAN-NS to emphasize that we use the non-saturating generator loss. In both cases, the gradient penalty applies only to the discriminator, with the generator loss remaining unchanged (as defined in Equation 3). In this setting, the loss of the discriminator becomes:

$$J^{(D)}_{\text{GP}} = J^{(D)} + \lambda\,\mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\left\lVert\nabla_{\hat{x}} D(\hat{x})\right\rVert_2 - 1\right)^2\right].$$

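Putting the pieces together, a sketch of the resulting discriminator loss might look as follows, reusing the `gradient_penalty` helper sketched above; the coefficient `lam=10.0` matches the value assumed in our experiments (Section 5):

```python
import torch
import torch.nn.functional as F

def ns_gp_discriminator_loss(D, x_real, x_fake, x_hat, lam=10.0):
    # Negative log-likelihood discriminator loss (Equation 1), with D
    # returning logits, plus the gradient penalty term.
    real_logits = D(x_real)
    fake_logits = D(x_fake.detach())
    nll = 0.5 * (F.binary_cross_entropy_with_logits(
                     real_logits, torch.ones_like(real_logits))
                 + F.binary_cross_entropy_with_logits(
                     fake_logits, torch.zeros_like(fake_logits)))
    return nll + lam * gradient_penalty(D, x_hat)
```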
We consider these GAN variants because:

  • We want to assess whether gradient penalties are effective outside their original defining scope. Namely, we perform experiments to determine whether the benefit obtained by applying the gradient penalty for Wasserstein GANs is obtained from properties of the earth mover’s distance, or from the penalty itself. Similarly, we evaluate whether the DRAGAN gradient penalty is beneficial outside the minimax GAN setting.

  • We want to assess whether the exact form of the gradient penalty matters.

  • We compare three models, to control for different aspects of training: the same gradient penalty but different underlying adversarial losses (GAN-GP versus WGAN-GP), as well as the same underlying adversarial loss but different gradient penalties (GAN-GP versus DRAGAN-NS).

We note that we do not compare with the original DRAGAN formulation, which uses the minimax GAN formulation, since in this work we focus on non-saturating GAN variants.

Figure 1: (Left) A recreation of Figure 2 of [2]. This figure is used by [2] to show that a model they call the “traditional GAN” suffers from vanishing gradients in the areas where $D(x)$ is flat. This plot is correct if “traditional GAN” is used to refer to the minimax GAN, but it does not apply to the non-saturating GAN. (Right) A plot of both generator losses from the original GAN paper, as a function of the generator output. Even when the model distribution is highly separated from the data distribution, non-saturating GANs are able to bring the model distribution closer to the data distribution, because the loss function has a strong gradient when the generator samples are far from the data samples, even when the discriminator itself has nearly zero gradient. While it is true that the $\frac{1}{2} \log(1 - D(x))$ loss has a vanishing gradient on the right half of the plot, the original GAN paper instead recommends implementing $-\frac{1}{2} \log D(x)$. This latter, recommended loss function has a vanishing gradient only on the left side of the plot. It makes sense for the gradient to vanish on the left, because generator samples in that area have already reached the area where data samples lie.

3 Many Paths to Equilibrium

The original GAN paper [8] used the correspondence between the minimax GAN objective and the Jensen-Shannon divergence to characterize the Nash equilibrium of minimax GANs. It is important to keep in mind that there are many ways for the learning process to approach this equilibrium point, and the majority of them do not correspond to gradually reducing the Jensen-Shannon divergence at each step. Divergence minimization is useful for understanding the outcome of training, but GAN training is not the same thing as running gradient descent on a divergence, and GAN training may not encounter the same problems as gradient descent applied to a divergence.

[2] describe the learning process of GANs from the perspective of divergence minimization and show that the Jensen-Shannon divergence is unable to provide a gradient that will bring $p_{\text{data}}$ and $p_{\text{model}}$ together if both are sharp manifolds that do not overlap early in the learning process. Following this line of reasoning, they suggest that when applied to probability distributions supported only on low-dimensional manifolds, the Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences and the Total Variation (TV) distance do not provide a useful gradient for learning algorithms based on gradient descent, and hence that the “traditional GAN” is inappropriate for fitting such low-dimensional manifolds. (“Traditional GAN” seems to refer to the minimax version of GANs used for theoretical analysis in the original paper; there is no explicit statement about whether the argument is intended to apply to the non-saturating GAN implemented in the code accompanying the original GAN paper.) In Section 4 we show that non-saturating GANs are able to learn on tasks where the data distribution lies on a low-dimensional manifold.

We show in Figure 1 that non-saturating GANs do not suffer from vanishing gradients for two widely separated Gaussians. The fact that the gradient of the recommended loss does not actually vanish explains why GANs with the non-saturating objective (Equation 3) are able to bring together two widely separated Gaussian distributions. Note that the gradient of this loss does not vanish even when the discriminator is optimal: the discriminator has vanishing gradients, but the generator loss amplifies small differences in discriminator outputs to recover strong gradients. This means it is possible to train the GAN by changing the generator loss rather than the discriminator.
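The following minimal 1-D sketch of the Figure 1 argument makes this concrete, assuming unit-variance Gaussians $p_{\text{data}} = \mathcal{N}(10, 1)$ and $p_{\text{model}} = \mathcal{N}(0, 1)$, for which the optimal discriminator has logit $l(x) = \log p_{\text{data}}(x) - \log p_{\text{model}}(x) = 10x - 50$ (constants of $\frac{1}{2}$ in the losses omitted):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(x):
    # Optimal discriminator logit for N(10, 1) data vs N(0, 1) model.
    return 10.0 * x - 50.0

x = 0.0                 # a typical generated sample, far from the data
l, dl_dx = logit(x), 10.0

# Gradient of each generator loss w.r.t. the sample x under the optimal D:
minimax_grad = -sigmoid(l) * dl_dx                  # ~ -1.9e-21: vanishes
non_saturating_grad = -(1.0 - sigmoid(l)) * dl_dx   # ~ -10: pulls x toward data
print(minimax_grad, non_saturating_grad)
```

Even with a perfectly optimal discriminator, the non-saturating loss yields a gradient of magnitude roughly 10 at a typical generated sample, while the minimax loss gradient is numerically zero.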

For the parallel lines thought experiment [2] (see Figure ?), the main problem with the Jensen-Shannon divergence is that it is parameterized in terms of the density function, and the two density functions have no support in common. Most GANs, and many other models, can solve this problem by parameterizing their loss functions in terms of samples from the two distributions rather than in terms of their density functions.

4 Synthetic Experiments

To assess the learning process of GANs we empirically examine GAN training on pathological tasks where the data is constructed to lie on a low dimensional manifold, and show the model is able to learn the data distribution in cases where using the underlying divergence obtained at optimality would not provide useful gradients. We then evaluate convergence properties of common GAN variants on this task where the parameters generating the distribution are known.

4.1 Experiment I: 1-D Data Manifold and 1-D generator

In our first experiment, we generate synthetic training data that lies along a one-dimensional line and design a one-dimensional generative model; however, we embed the problem in a higher-dimensional space of dimension $n$, where $n > 1$. This experiment is essentially an implementation of a thought experiment from [2].

Specifically, in the $n$-dimensional space, we define $p_{\text{data}}$ by randomly generating the parameters defining the distribution once at the beginning of the experiment. We generate a random direction $a \in \mathbb{R}^n$ and a random offset $b \in \mathbb{R}^n$. Our latent is $z \sim \mathcal{N}(0, \sigma^2)$, where $\sigma$ is the standard deviation of the normal distribution. The synthetic training data of $m$ examples is then given by

$$x_i = z_i a + b, \qquad i = 1, \dots, m. \qquad (4)$$

The real synthetic data is therefore Gaussian distributed on a 1-D line within the $n$-dimensional space, where the position is determined by $b$ and the orientation is determined by $a$.

The generator also assumes the same functional form; that is, it is also intrinsically one-dimensional,

$$G(z) = z a_g + b_g,$$

where $a_g \in \mathbb{R}^n$ and $b_g \in \mathbb{R}^n$ are learned parameters. The discriminator is a single-hidden-layer ReLU network, which is of higher complexity than the generator so that it may learn non-linear boundaries in the space.
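A sketch of the data-generating process and generator for this experiment is given below; the ambient dimension, sample count, and $\sigma$ are assumed illustrative values, not the settings used in our sweeps:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 8, 10000, 1.0        # assumed ambient dim, sample count, std

# Fixed true parameters, drawn once: the data lies on a 1-D line in R^n.
a_true = rng.standard_normal(n)
b_true = rng.standard_normal(n)

z = rng.normal(0.0, sigma, size=(m, 1))
x_data = z * a_true + b_true        # Equation 4: x = z a + b

def generator(z, a_g, b_g):
    # Same functional form as the data: intrinsically one-dimensional.
    return z * a_g + b_g
```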

This experiment captures the idea of sharp, non-overlapping manifolds that motivate alternative GAN losses. Further, because we know the true generating parameters of the training data, we may explicitly test convergence properties of the various methodologies.

4.2 Experiment II: 1-D Data Manifold and overcomplete generator

In our second experiment, the synthetic training data is still the same (lying on a 1-D line) and given by Equation 4, but now the generator is overcomplete for this task and has a higher latent dimension $k$, where $k > 1$:

$$G(z) = A z + b,$$

where $A \in \mathbb{R}^{n \times k}$ is a matrix and $b \in \mathbb{R}^n$ is a vector, so that the generator is able to represent a manifold with too high of a dimensionality. With $z \sim \mathcal{N}(0, I_k)$, the generator parameterizes a multivariate Gaussian $\mathcal{N}(\mu_g, \Sigma_g)$ with mean $\mu_g = b$ and covariance $\Sigma_g = A A^{\top}$.
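A corresponding sketch of the overcomplete generator, with assumed dimensions, is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4                        # assumed ambient dim n and latent dim k > 1

A = rng.standard_normal((n, k))    # learned in practice; random here
b = rng.standard_normal(n)

def generator(z):
    # z has shape (batch, k); G(z) = A z + b can span up to k directions
    # of variance, more than the 1-D data manifold requires.
    return z @ A.T + b

# Implied model distribution for z ~ N(0, I_k):
mu_g, Sigma_g = b, A @ A.T
```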

4.3 Results

To evaluate the convergence of an experimental trial, we report the squared error (squared $\ell_2$ norm) between the known Gaussian parameters generating the synthetic data and the fitted generator Gaussian parameters. In our notation, the subscript $r$ denotes the real data and the subscript $g$ denotes the generator Gaussian parameters: $\lVert\mu_r - \mu_g\rVert_2^2$ and $\lVert\Sigma_r - \Sigma_g\rVert_2^2$.

Every GAN variant was trained for 200000 iterations, with 5 discriminator updates for each generator update, following the schematic loop below.
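Schematically, the update schedule is the standard alternating loop, where `discriminator_step` and `generator_step` are assumed helpers that apply one optimizer update to the corresponding player:

```python
# 5 discriminator updates per generator update, for 200000 iterations.
for iteration in range(200000):
    for _ in range(5):
        discriminator_step()   # update D on fresh data and samples, G fixed
    generator_step()           # update G once, holding D fixed
```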

The main conclusions from our synthetic data experiments are:

  • Gradient penalties (both applied near the data manifold, as in DRAGAN-NS, and at an interpolation between data and samples, as in GAN-GP) stabilize training and improve convergence (Figures ?, 5 and 6).

  • Despite the inability of Jensen-Shannon divergence minimization to solve this problem, we find that the non-saturating GAN succeeds in converging to the 1-D data manifold (Figure ?). However, in higher dimensions the resulting fit is not as strong as that of the other methods: Figure 6 shows that increasing the number of dimensions while keeping the learning rate fixed can decrease the performance of the non-saturating GAN model.

  • Non-saturating GANs are able to learn data distributions which are disjoint from the distribution of generated samples at initialization (or at another point in training), as demonstrated in Figure ?.

  • Updating the discriminator 5 times per generator update does not result in vanishing gradients when using the non-saturating cost. This dispels the notion that a strong discriminator does not provide useful gradients during training.

  • An over-capacity generator with the ability to have more directions of high variance than the underlying data is able to capture the data distribution using non-saturating GAN training (Figure ?).

4.4 Hyperparameter sensitivity

We assess the robustness of the considered models by looking at results across hyperparameters for both Experiment I and Experiment II. In one setting, we keep the input dimension fixed while varying the learning rate (Figure 5); in another, we keep the learning rate fixed while varying the input dimension (Figure 6). In both cases, the results are averaged over 1000 runs per setting, each starting from a different random seed. We notice that:

  • The non-saturating GAN model (with no gradient penalty) is most sensitive to hyperparameters.

  • Gradient penalties make the non-saturating GAN model more robust.

  • Both Wasserstein GAN formulations are quite robust to hyperparameter changes.

  • For certain hyperparameter settings, there is no performance difference between the two gradient penalties for the non-saturating GAN when averaging across random seeds. This is especially visible in Experiment I when the number of latent variables is 1. This could be because the data sits on a low-dimensional manifold and the discriminator is a small, shallow network.

Figure 2: Synthetic Experiment I. The squared norm difference between the generated Gaussian parameters and the true Gaussian parameters for different GAN variants. For reference, we also plot the average error obtained by a randomly initialized generator with the same architecture as the trained generators. Lower values are better.
Figure 3: Synthetic Experiment II. The squared norm difference between the generated Gaussian parameters and the true Gaussian parameters for different GAN variants. For reference, we also plot the average error obtained by a randomly initialized generator with the same architecture as the trained generators. Lower values are better.

5 Real data experiments

To assess the effectiveness of the gradient penalty on standard datasets for the non-saturating GAN formulation, we train a non-saturating GAN, a non-saturating GAN with the gradient penalty introduced by [9] (denoted GAN-GP), a non-saturating GAN with the gradient penalty introduced by [12] (denoted DRAGAN-NS), and a Wasserstein GAN with gradient penalty (WGAN-GP) on three datasets: Color MNIST [15], CelebA [14], and CIFAR-10 [13], as seen in Figure 4.

Figure 4: Examples from the three datasets explored in this paper: Color MNIST (left), CIFAR-10 (middle) and CelebA (right).

For all our experiments we used a gradient penalty coefficient of 10 and used batch normalization [10]; [12] suggest that batch normalization is not needed for DRAGAN, but we found that it also improved our DRAGAN-NS results. We used the Adam optimizer [11] with a batch size of 64. The input data was scaled to lie between -1 and 1. We did not add any noise to the discriminator inputs or activations, as that regularization technique can be interpreted as having the same goal as gradient penalties, and we wanted to avoid a confounding factor. We trained all Color MNIST models for 100000 iterations, and CelebA and CIFAR-10 models for 200000 iterations. We note that the experimental results on real data for the non-saturating GAN and for the Improved Wasserstein GAN (WGAN-GP) are quoted with permission from an earlier publication [20].

We note that the WGAN-GP model was the only model for which we did 5 discriminator updates per generator update in the real data experiments. All other models (DCGAN, DRAGAN-NS, GAN-GP) used one discriminator update per generator update.

For all reported results, we sweep over two hyperparameters:

  • Learning rates for the discriminator and generator. Following [19], we tried learning rates of 0.0001, 0.0002, 0.0003 for both the discriminator and the generator. We note that this is consistent with WGAN-GP, where the authors use 0.0002 for CIFAR-10 experiments.

  • Number of latents. For CelebA and CIFAR-10 we try latent sizes 100, 128 and 150, while for Color MNIST we try 10, 50, 75.

5.1 Evaluation

Unlike in the synthetic case, here we are unable to evaluate the performance of our models relative to the true solution, since it is unknown. Moreover, there is no single metric that can evaluate the performance of GANs. We thus complement visual inspection with three metrics, each measuring a different criterion related to model performance. We use the Inception Score [22] to measure how visually appealing CIFAR-10 samples are, the MS-SSIM metric [24] to check sample diversity, and an independent Wasserstein critic to assess overfitting as well as sample quality [4]. For a more detailed explanation of these metrics, we refer to [20]. In all our experiments, we control for discriminator and generator architectures, using the ones used by DCGAN [19] and the original WGAN paper [2]². We note that the WGAN-GP paper used a different architecture when reporting the Inception Score on CIFAR-10, and thus their results are not directly comparable.

For all the metrics, we report both the hyperparameter sensitivity of the model (by showing quartile statistics) and the 10 best results according to the metric. The sample diversity measure needs to be seen in context with the value reported on the test set: diversity that is too high can mean failure to capture the data distribution. For all other metrics, higher is better.

5.2 Visual sample inspection

By visually inspecting the results of our models, we noticed that applying gradient penalties to the non-saturating GAN results in more stable training across the board. When training the non-saturating GAN with no gradient penalty, we observed cases of severe mode collapse (see Figure 7). Gradient penalties improve upon that, but we can still observe mode collapse: each non-saturating GAN variant with gradient penalty (DRAGAN-NS and GAN-GP) produced mode collapse on only one dataset (see Figure 11). We also noticed that for certain learning rates, WGAN-GP fails to learn the data distribution (Figure 12). For the GAN-GP and DRAGAN-NS models, most hyperparameters produced samples of equal quality; the models are quite robust. We show samples from the GAN-GP, DRAGAN-NS and WGAN-GP models in Figures 8, 9 and 10.

5.3 Metrics

We show that gradient penalties make non-saturating GANs more robust to hyperparameter changes. For this, we report not only the best obtained results, but a box plot of the obtained results showing the quartiles of each sweep, along with the top 10 best results explicitly shown in the graph (note that for each model we tried 27 different hyperparameter settings, corresponding to 3 discriminator learning rates, 3 generator learning rates and 3 generator input sizes). We report two Inception Score metrics for CIFAR-10: one using the standard Inception network used when the metric was introduced [22], trained on the ImageNet dataset, and one using a VGG-style network trained on CIFAR-10 (for details on the architecture, we refer the reader to [20]). We report the former to be comparable with existing literature, and the latter to obtain a more meaningful metric, since the network doing the evaluation was trained on the same dataset as the one we evaluate, hence the learned features are more relevant for the task at hand. When reporting sample diversity, we subtract the average pairwise image similarity (as reported by MS-SSIM), computed as the mean of the similarity of every pair of images from 5 batches from the test set; a sketch of this computation appears below. Note that we can only apply this measure to CelebA, since for datasets such as CIFAR-10 different classes are represented by very different images, making this metric meaningless across class borders. Since our models are completely unsupervised, we do not compute the similarity across samples of the same class as in [18]. The Inception Score and sample diversity results can be seen in Figure ?. The results obtained using the independent Wasserstein critic on all datasets can be found in Figure ?.
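For concreteness, a sketch of the diversity computation follows, where `ms_ssim` is an assumed helper returning a similarity in [0, 1] for a pair of images:

```python
import itertools
import numpy as np

def mean_pairwise_similarity(images, ms_ssim):
    # Average MS-SSIM over every pair of images in the batch.
    sims = [ms_ssim(a, b) for a, b in itertools.combinations(images, 2)]
    return float(np.mean(sims))

def relative_diversity(samples, test_batches, ms_ssim):
    # Reference similarity from test batches; samples more similar than
    # the reference suggest mode collapse, samples much less similar
    # suggest failure to capture the data distribution.
    reference = np.mean([mean_pairwise_similarity(b, ms_ssim)
                         for b in test_batches])
    return reference - mean_pairwise_similarity(samples, ms_ssim)
```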

5.4 Key takeaways from real data experiments

When analyzing the results obtained by training non-saturating GANs using gradient penalties (GAN-GP and DRAGAN-NS), we notice that:

  • Both gradient penalties help when training non-saturating GANs, by making the models more robust to hyperparameters.

  • On CelebA, for various hyperparameter settings WGAN-GP fails to learn the data distribution and produces samples that do not look like faces (Figure 12). This results in a higher sample diversity than the reference diversity obtained on the test set, as reported by our diversity metric; see Figure ?, which compares sample diversity for the considered models across hyperparameters. The same figure shows that for most hyperparameter values, the WGAN-GP model produces higher diversity than that of the test set (indicating failure to capture the data distribution), while for most hyperparameters the non-saturating GAN variants produce samples with lower diversity than that of the test set (indicating mode collapse). However, WGAN-GP is closer to the reference value for more hyperparameters than the non-saturating GAN variants.

  • Even if we are only interested in the best results (without looking across the hyperparameter sweep), we see that the gradient penalties tend to improve results for non-saturating GANs.

  • The non-saturating GAN trained with gradient penalties produces better samples which give better Inception Scores, both when looking at the results obtained from the best set of hyperparameters and when looking at the entire sweep.

  • While the non-saturating GAN variants are much faster to train than the WGAN-GP model (since we do only one discriminator update per generator update), they perform similarly to the WGAN-GP model. Thus, non-saturating GANs with penalties offer a better computation versus performance tradeoff. When we trained WGAN-GP models in which we update the discriminator only once per generator update, we noticed a decrease in sample quality for all datasets, reflected by our reported metrics, as seen in Figure ?.

  • When looking at the independent Wasserstein critic results, we see that the WGAN-GP models perform best on Color MNIST and CIFAR-10. However, on CelebA the independent Wasserstein critic can distinguish between validation data examples and samples from the model (see Figure ?). This is consistent with what we observed by examining samples: the hyperparameters that result in samples of reduced quality are the same ones that show a reduced negative Wasserstein distance.

  • The sample diversity metric and the independent Wasserstein critic detect mode collapse. When DRAGAN-NS collapses for two hyperparameter settings, the negative Wasserstein distance reported by the critic for these settings is low, showing that the critic captures the difference in distributions, and the sample diversity reported for those settings is greatly reduced (Figure ?).

6 Discussion

We have shown that viewing the training dynamics of GANs through the lens of the underlying divergence at optimality can be misleading. On low-dimensional synthetic problems, we showed that non-saturating GANs are able to learn the true data distribution where Jensen-Shannon divergence minimization would fail. We also showed that gradient penalty regularizers help improve the training dynamics and robustness of non-saturating GANs. It is worth noting that one of the gradient penalty regularizers was originally proposed for Wasserstein GANs, motivated by properties of the Wasserstein distance; evaluating non-saturating GANs with similar gradient penalty regularizers helps disentangle the improvements arising from optimizing a different divergence (or distance) and the improvements from better training dynamics.
Comparison between explored gradient penalties: As described in Section 2.3, we have evaluated two gradient penalties on non-saturating GANs. We now turn our attention to the distinction between the two gradient penalties. We have already noted that for a few hyperparameter settings, DRAGAN-NS produced samples with mode collapse, while the GAN-GP model did not. By looking at the resulting metrics, we note that there is no clear winner between the two types of gradient penalties. To assess whether the two penalties have a different regularization effect, we also tried applying both (with a gradient penalty coefficient of 10 for both, or of 5 for both), but that did not result in better models. This could be because the two penalties have a very similar effect, or due to optimization considerations (they might conflict with each other).
Other gradient penalties: Besides the gradient penalties explored in this work, several other regularizers have been proposed for stabilizing GAN training. [21] proposed a gradient penalty aiming to smooth the discriminator of $f$-GANs (including the minimax GAN), which we refer to as $f$-GAN-GP, inspired by [23] and [1]. Their gradient penalty is different from the ones explored here; specifically, it is weighted by the square of the discriminator’s probability of real for each data instance, and the penalty is applied to data and samples directly (no noise is added). Fisher GAN [16] directly penalizes, via an equality constraint, the magnitude of the discriminator’s output on data as well as samples, as opposed to the magnitude of the discriminator gradients as in WGAN-GP. Similar to WGAN-GP, the penalty was introduced in the framework of integral probability metrics, but it can be directly applied to other approaches to GAN training. Unlike WGAN-GP, Fisher GAN uses an augmented Lagrangian to impose the equality constraint, instead of a penalty method. To the best of our knowledge, applying this constraint to other GAN losses has not been tried yet, and we leave it for future work.
The regularizers assessed in this work (the penalties proposed by DRAGAN and WGAN-GP), as well as others (such as $f$-GAN-GP and Fisher GAN), are similar in spirit but have been proposed from distinct theoretical considerations. Future study of GAN regularizers will determine how these regularizers interact, help us understand the mechanism by which they stabilize GAN training, and motivate new approaches.

Acknowledgements

We thank Ivo Danihelka and Jascha Sohl-Dickstein for helpful feedback and discussions.

Figure 5: Synthetic Experiment I. The squared norm difference between the generated Gaussian parameters and the true Gaussian parameters for different GAN variants, when varying the learning rate while keeping the input dimension fixed. Results averaged over 1000 runs. Lower values are better.
Figure 6: Synthetic Experiment II. The squared norm difference between the generated Gaussian parameters and the true Gaussian parameters for different GAN variants, when varying the input dimensions while keeping the learning rate fixed. Results averaged over 1000 runs. Lower values are better.
Figure 7: Examples of mode collapse obtained for some hyperparameter settings with the non-saturating GAN.
Figure 8: CIFAR-10 samples obtained from the GAN-GP, DRAGAN-NS, and WGAN-GP models.
Figure 9: CelebA samples obtained from the GAN-GP, DRAGAN-NS, and WGAN-GP models.
Figure 10: Color MNIST samples obtained from the GAN-GP, DRAGAN-NS, and WGAN-GP models.
Figure 11: Mode collapse when adding gradient penalties to non-saturating GANs. GAN-GP only had two instances of mode collapse, namely color mode collapse on Color MNIST (left), while DRAGAN-NS only had two instances of mode collapse, which occurred when trained on CelebA (right and middle).
Figure 12: Examples of failure to capture the data distribution with WGAN-GP. The model puts too much mass around the data distribution when trained on the CelebA dataset.

Footnotes

  1. The original GAN paper implements both the minimax and non-saturating cost but uses the non-saturating cost for the published configurations of experiments: https://github.com/goodfeli/adversarial/blob/master/cifar10_convolutional.yaml#L139. To the best of our knowledge, the DCGAN codebase implements only the non-saturating cost: https://github.com/soumith/dcgan.torch/blob/master/main.lua#L215. Likewise, the improved-gan codebase implements only the non-saturating cost: https://github.com/openai/improved-gan/blob/master/imagenet/build_model.py#L114. If only one of these two costs were to be called “standard,” it should be the non-saturating version.
  2. Code at: https://github.com/martinarjovsky/WassersteinGAN/blob/master/models/dcgan.py

References

  1. Towards principled methods for training generative adversarial networks.
    M. Arjovsky and L. Bottou. arXiv preprint arXiv:1701.04862
  2. Wasserstein GAN.
    M. Arjovsky, S. Chintala, and L. Bottou. arXiv preprint arXiv:1701.07875
  3. Prediction, learning, and games.
    N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.
  4. Comparison of maximum likelihood and GAN-based training of real NVPs.
    I. Danihelka, B. Lakshminarayanan, B. Uria, D. Wierstra, and P. Dayan. arXiv preprint arXiv:1705.05263
  5. Deep generative image models using a Laplacian pyramid of adversarial networks.
    E. L. Denton, S. Chintala, R. Fergus, et al. In Advances in neural information processing systems, pages 1486–1494, 2015.
  6. Adversarial feature learning.
    J. Donahue, P. Krähenbühl, and T. Darrell. arXiv preprint arXiv:1605.09782
  7. NIPS 2016 tutorial: Generative adversarial networks.
    I. Goodfellow. arXiv preprint arXiv:1701.00160
  8. Generative adversarial nets.
    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. In NIPS, 2014.
  9. Improved training of Wasserstein GANs.
    I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. arXiv preprint arXiv:1704.00028
  10. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
    S. Ioffe and C. Szegedy. In ICML, 2015.
  11. Adam: A method for stochastic optimization.
    D. Kingma and J. Ba. arXiv preprint arXiv:1412.6980
  12. How to train your DRAGAN.
    N. Kodali, J. Abernethy, J. Hays, and Z. Kira. arXiv preprint arXiv:1705.07215
  13. Learning multiple layers of features from tiny images.
    A. Krizhevsky. 2009.
  14. Deep learning face attributes in the wild.
    Z. Liu, P. Luo, X. Wang, and X. Tang. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  15. Unrolled generative adversarial networks.
    L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. arXiv preprint arXiv:1611.02163
  16. Fisher GAN.
    Y. Mroueh and T. Sercu. arXiv preprint arXiv:1705.09675
  17. f-GAN: Training generative neural samplers using variational divergence minimization.
    S. Nowozin, B. Cseke, and R. Tomioka. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
  18. Conditional image synthesis with auxiliary classifier GANs.
    A. Odena, C. Olah, and J. Shlens. arXiv preprint arXiv:1610.09585
  19. Unsupervised representation learning with deep convolutional generative adversarial networks.
    A. Radford, L. Metz, and S. Chintala. arXiv preprint arXiv:1511.06434
  20. Variational approaches for auto-encoding generative adversarial networks.
    M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. arXiv preprint arXiv:1706.04987
  21. Stabilizing training of generative adversarial networks through regularization.
    K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. arXiv preprint arXiv:1705.09367
  22. Improved techniques for training GANs.
    T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. arXiv preprint arXiv:1606.03498
  23. Amortised MAP inference for image super-resolution.
    C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. arXiv preprint arXiv:1610.04490
  24. Multiscale structural similarity for image quality assessment.
    Z. Wang, E. P. Simoncelli, and A. C. Bovik. In Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, volume 2, pages 1398–1402. IEEE, 2003.