New Losses for Generative Adversarial Learning

Victor BERGER
TAU, CNRS-INRIA-LRI,
Univ. Paris Sud, Univ. Paris Saclay
victor.berger@inria.fr
&Michèle SEBAG
TAU, CNRS-INRIA-LRI,
Univ. Paris Sud, Univ. Paris Saclay
Michele.Sebag@lri.fr
Abstract

Generative Adversarial Networks (Goodfellow et al., 2014), a major breakthrough in the field of generative modeling, learn a discriminator to estimate some distance between the target and the candidate distributions.

This paper examines mathematical issues regarding the way the gradients for the generative model are computed in this context, and notably how to take into account how the discriminator itself depends on the generator parameters.

A unifying methodology is presented to define mathematically sound training objectives for generative models taking this dependency into account in a robust way, covering both GAN, VAE and some GAN variants as particular cases.

Preprint. Work in progress.

1 Introduction

Generative modeling aims at a distribution pθ best fitting the observational distribution pdata. A main approach, exemplified by the Variational Auto-Encoder (VAE) [14], relies on maximizing the (log-)likelihood of the observational sample after pθ. In Generative Adversarial Networks (GANs) [6], the log-likelihood optimization is replaced by a 2-sample test-like criterion: a discriminator D is trained to discriminate between the generated and the observational samples, minimizing a discrimination loss such as:

 LD=Ex∼pdata[−log(D(x))]+Ex∼pθ[−log(1−D(x))] (1)

that admits an optimal solution analytically defined from pdata and pθ as:

 D⋆(x)=pdata(x)/(pdata(x)+pθ(x)) (2)
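The optimal discriminator of Eq. 2 can be checked numerically. The sketch below (an illustration, not part of the paper's experiments) minimizes the pointwise integrand of Eq. 1 over D(x) for fixed density values and recovers pdata(x)/(pdata(x)+pθ(x)):

```python
import numpy as np

# Pointwise, the discriminator loss L_D (Eq. 1) integrates
# p_data(x)*(-log D) + p_theta(x)*(-log(1 - D)) over x. For fixed densities
# p = p_data(x) and q = p_theta(x), the minimizer over D(x) in (0, 1)
# should be the optimal discriminator of Eq. 2: D* = p / (p + q).
d_grid = np.linspace(1e-4, 1 - 1e-4, 100001)

def best_d(p, q):
    loss = -p * np.log(d_grid) - q * np.log(1.0 - d_grid)
    return d_grid[np.argmin(loss)]

for p, q in [(0.8, 0.2), (0.3, 0.7), (0.5, 0.5)]:
    assert abs(best_d(p, q) - p / (p + q)) < 1e-3
```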

The generator is then trained to minimize a loss opposed to the discriminator loss:

 LG=−LD (3)

This loss depends on θ in two ways: through the expectation on pθ, and through the value of D⋆. However, GAN and its many variants [6, 20, 2] only consider the first dependency when computing the gradient ∇θLG, differentiating the loss while considering D independent of θ. This suggests that they may actually optimize another loss than intended (section 2).

A main contribution of the paper is to propose a mathematically sound methodology to transform any generative loss expressed as some divergence between the generative and the sample distributions, and involving a discriminator D, such that the loss gradient properly accounts for the dependency of D w.r.t. θ (section 3). A first benefit thereof is to be able to analytically understand the optimization objective; a second benefit is the generality of the approach. Specifically, by applying this methodology to various divergences, it is shown that the proposed theoretical framework unifies and includes as particular cases several generative approaches, ranging from GAN [6] and VAE [14] to some GAN variants [10]. Furthermore, this framework offers the same flexibility as f-GAN [20] in minimizing an arbitrary divergence, and the defined loss is more robust w.r.t. the optimization process.

As a proof of concept, the proposed methodology is applied to the symmetrized KL-divergence (SDKL) in section 4. The resulting loss yields a generative model with performance similar to VAEGAN [15], confirming that the proposed improvement in terms of mathematical interpretation comes at no cost in terms of performance compared to [21, 14, 15] (section 5).

2 Generative Modeling Losses: Formal Background and Discussion

This section briefly presents and discusses the training losses used in GAN, VAE and f-GAN, and underlines the core mathematical issues with the way GAN gradients are derived.

2.1 Generative Adversarial Networks (GANs)

As said previously, GANs [6] train a generative model pθ given a dataset sampled after distribution pdata, where pθ is modelled as follows: i) a latent variable z is sampled after a given prior ρ(z); ii) a deterministic transformation Gθ is applied to z and yields sample x. Gθ is trained in an adversarial fashion, i.e. using the opposite of the discriminator loss (Eq. 1). Assuming an optimal discriminator (Eq. 2), the discriminator loss (to be optimized by the generator) is the Jensen-Shannon divergence between the ground truth and the generator distributions, up to a constant:

 LD⋆=2log2−2JSD(pdata∥pθ) (4)

This loss is however difficult to use, mostly due to the fact that the Jensen-Shannon divergence saturates when the two distributions are very different, killing its gradient. Thus alternative objectives are considered, such as minimizing Ex∼pθ[−logD(x)] [6], which is closer to optimizing DKL(pθ∥pdata) than the Jensen-Shannon divergence, and brings empirically better results. The dynamics of GANs have been the subject of extensive research over the past years, considering the learning dynamics of the model and their associated difficulties [1, 18], and defining alternative training objectives for the generator and the discriminator [23, 20, 2, 7]. However, it is still unclear whether any of these variants dominates the others [17].

2.2 Variational AutoEncoders (VAEs)

VAEs [14] learn model pθ by maximizing the likelihood pθ(x). As pθ(x) is hardly computable except for very simple models, VAEs consider a decomposition of the generative model where: i) z is sampled from a fixed prior ρ(z); ii) z is used to define distribution pθ(x|z); iii) x is drawn according to distribution pθ(x|z). The efficiency of the approach comes from defining pθ(x|z) as a relatively simple distribution for any z, enabling to analytically compute log pθ(x|z). The expressive power of the model lies in the so-called probabilistic decoder, mapping z on pθ(x|z). This decoder is coupled with a probabilistic encoder q(z|x), and trained using the evidence lower bound (ELBO) criterion [14]:

 Ez∼q(z|x)[logpθ(x|z)]−DKL(q(z|x)∥ρ(z))≤logpθ(x) (5)

The left-hand side of Eq. 5 involves two terms: the reconstruction error of the auto-encoder, and the KL divergence between the prior and the posterior distribution of z, acting as a regularization on the latent representation. This loss can be analytically computed and optimized by gradient descent, and its maximization does increase the log-likelihood as intended.
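For intuition, here is a minimal Monte-Carlo sketch of the ELBO of Eq. 5 for a diagonal-Gaussian encoder and a standard-normal prior; the helper names (`elbo_estimate`, `decode`) are illustrative, not from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu_q, logvar_q, decode, n_samples=64):
    """Monte-Carlo ELBO (Eq. 5) for a diagonal-Gaussian encoder q(z|x)
    and a standard-normal prior rho(z) = N(0, I).

    decode(z) is assumed to return the mean of a unit-variance Gaussian
    p_theta(x|z), so log p_theta(x|z) is a squared error up to a constant."""
    d = mu_q.shape[0]
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(logvar_q) + mu_q**2 - 1.0 - logvar_q)
    # Reconstruction term E_q[log p(x|z)] by reparameterized sampling.
    eps = rng.standard_normal((n_samples, d))
    z = mu_q + np.exp(0.5 * logvar_q) * eps
    recon = np.mean([-0.5 * np.sum((x - decode(zi))**2) for zi in z])
    return recon - kl
```

With an identity decoder and x = μq = 0, logvarq = 0, the KL term vanishes and the reconstruction term concentrates around −d/2, which gives a quick sanity check of the estimator.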

In VAEs, a key design issue is the class of distributions for pθ(x|z); an inappropriate model class can even prevent pθ from efficiently approximating pdata. For instance, Gaussian models are often restricted to diagonal normal distributions in high dimensional spaces. In such cases, log pθ(x|z) boils down to the Euclidean distance between x and the output of the decoder. In the domain of image generation however, the use of independent Gaussian noise on each pixel is a poor choice, as natural images are known to have very spatially correlated features. This issue is addressed by replacing Gaussian models with more sophisticated ones, e.g. using perceptual distances based on another neural network [15, 9, 5], or modelling pθ(x|z) using autoregressive models [8].

2.3 f-GAN

f-GAN [20] proposes a general framework to optimize any f-divergence along a GAN setup, as a min-max game:

 minθmaxω Ex∼pdata[Tω(x)]−Ex∼pθ[f⋆(Tω(x))] (6)

where f⋆ is the Fenchel conjugate of f, and Tω represents the discriminator, learned depending on ω. This loss is maximized w.r.t. the parameter ω of the discriminator, and minimized w.r.t. the parameter θ of the generator: the minimization of the target divergence is obtained by reaching a saddle point.

2.4 Mathematical and algorithmic issues

As described in 2.1, GAN optimizes the generator parameter θ by gradient descent on the discriminator loss, which is computed in practice as if D⋆ were independent of θ [6]. Still, after Eq. 2, the gradient of D⋆ w.r.t. θ reads:

 ∇θD⋆(x)=−(pdata(x)/(pdata(x)+pθ(x))²)∇θpθ(x)=−D⋆(x)(1−D⋆(x))∇θlogpθ(x) (7)

As D⋆ is not independent of θ (unless D⋆ is trivial: if D⋆ is uniformly 0 or 1, its gradient is indeed 0, delivering no information to train the generator; this is however a GAN failure case [1]), the gradient of the generator loss reads:

 ∇θLG=Ex∼pθ[(D⋆(x)+log(1−D⋆(x)))∇θlogpθ(x)]−Ex∼pdata[(1−D⋆(x))∇θlogpθ(x)] (8)

The expectation w.r.t. pdata does not disappear any more, and is difficult to estimate unless pθ(x) can analytically be computed, which is not the case in practice except for extremely simple models.

To try and sidestep this issue, algebraic calculations can be used to transform the expectation under pdata into an expectation under pθ. It comes, for all F:

 Ex∼pdata[F(x)]=Ex∼pθ[(pdata(x)/pθ(x))F(x)]=Ex∼pθ[(D⋆(x)/(1−D⋆(x)))F(x)] (9)

yielding:

 ∇θLG=Ex∼pθ[log(1−D⋆(x))∇θlogpθ(x)] (10)

Interestingly, the original GAN generator loss [6], LG=Ex∼pθ[log(1−D(x))], yields the same gradient when differentiated while assuming that D does not depend on θ.

Unfortunately, while Eq. 9 mathematically holds true, it is less so when the expectations are approximated from a handful of minibatch examples. This is particularly the case when pdata and pθ are far from each other, or even have different supports, which causes this estimation to have a very high variance [1].

Whether the GAN training loss (Eq. 10) does minimize the JS divergence between pdata and pθ then becomes questionable. The same question arises concerning f-GAN, which extensively uses expectation rewritings as in Eq. 9 to define the considered optimization objectives [20].

3 Proposed Approach

3.1 Methodology

Considering a loss aimed at minimizing a divergence Q between the generative and the sample distributions, let D be a discriminator approximating the ratio of both distributions. This section aims at constructing a loss which, when differentiated for training while taking D independent of θ, would still yield the correct gradient accounting for this dependence. For the divergences considered here, the gradient of Q can be written as:

 ∇θQ=Ex∼pθ[f(pθ(x)/pdata(x))∇θlogpθ(x)]−Ex∼pdata[g(pθ(x)/pdata(x))∇θlogpθ(x)] (11)

where f and g are some fixed functions (possibly zero if Q does not involve the associated expectation). Substituting the density ratio pθ(x)/pdata(x) with its discriminator-based estimate (1−D(x))/D(x) (after Eq. 2):

 ∇θQ=Ex∼pθ[f((1−D(x))/D(x))∇θlogpθ(x)]−Ex∼pdata[g((1−D(x))/D(x))∇θlogpθ(x)] (12)

Then, integrating the above gradient w.r.t. θ while considering D as a constant function of θ yields the loss LP:

 LP=Ex∼pθ[f((1−D(x))/D(x))]−Ex∼pdata[g((1−D(x))/D(x))logpθ(x)] (13)

By construction, LP: i) can be computed directly using the neural network estimate of D, taken as independent of θ; ii) admits the proper gradient (Eq. 12) associated to loss Q; iii) minimizes divergence Q. Note that the left-hand term in LP can be estimated by sampling pθ, similarly to a GAN. The right-hand term (as long as g is a positive function) can be approximated as in a VAE, by using the ELBO lower bound on log pθ(x) (section 2.2). In brief, this methodology defines a mathematically sound loss for minimizing any such divergence, with VAEs and GANs as particular cases.
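A minibatch estimator of LP (Eq. 13) can be sketched as follows; the function and argument names are hypothetical, and log pθ(x) would in practice be replaced by its ELBO estimate:

```python
import numpy as np

def proposed_loss(f, g, d_gen, d_real, log_p_real):
    """Minibatch estimate of L_P (Eq. 13); hypothetical helper.

    d_gen      -- discriminator outputs D(x) on samples x ~ p_theta
    d_real     -- discriminator outputs D(x) on samples x ~ p_data
    log_p_real -- log p_theta(x) (or its ELBO lower bound) on real samples
    f, g       -- the fixed functions applied to the ratio (1-D)/D
    """
    ratio_gen = (1.0 - d_gen) / d_gen
    ratio_real = (1.0 - d_real) / d_real
    return np.mean(f(ratio_gen)) - np.mean(g(ratio_real) * log_p_real)
```

For instance, f = log and g = 0 recovers the GAN-style term of Eq. 16, while f = 0 and g = 1 reduces LP to the negative log-likelihood optimized by a VAE.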

An important point remains to consider: the model of pθ. Indeed, this distribution appears twice in Eq. 13: once as the basis for an expectation, representing samples from the model in a GAN fashion, and once as a likelihood log pθ(x), from the VAE framework. It is crucial for the foundation of our theory that these two terms refer to the same pθ.

However, VAE and GAN consider different models for pθ: in both cases, a latent variable z is sampled and decoded into a sample x (which clearly hints at fusing the VAE decoder and the GAN generator); still, the decoder and the generator differ in slight but significant ways (Section 2). For the sake of a well-founded theoretical framework, both models thus need to be reconciled. The proposed approach uses the VAE generative model in all cases, only providing the discriminator with generated samples from pθ. This requirement adds some constraints regarding the model pθ(x|z): while its likelihood must remain computable, it must also be possible to sample the distribution and backpropagate a gradient through this sampling [14, 11].
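Backpropagating through sampling is typically done with the reparameterization trick of [14]: a sample from N(μ, σ²) is written as μ + σ·ε with ε ∼ N(0, 1), so the sample is differentiable w.r.t. (μ, σ). A minimal numerical check (illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)

# Reparameterization: x = mu + sigma * eps, eps ~ N(0,1), makes a Gaussian
# sample differentiable w.r.t. (mu, sigma). Finite-difference check, with
# common random numbers, that the gradient of E[x**2] w.r.t. mu matches
# the analytic value 2*mu (up to Monte-Carlo error on E[eps]).
def expected_sq(mu, sigma, eps):
    return np.mean((mu + sigma * eps)**2)

eps = rng.standard_normal(200000)
mu, sigma, h = 1.5, 0.7, 1e-4
grad_fd = (expected_sq(mu + h, sigma, eps)
           - expected_sq(mu - h, sigma, eps)) / (2 * h)
# Analytic: d/dmu E[(mu + sigma*eps)^2] = 2*mu + 2*sigma*E[eps] ~ 2*mu
```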

Though related to former approaches combining VAEs and GANs [22, 15, 5], the proposed approach differs in goal and method. In [22, 15, 5], the motivation is to address algorithmic issues. The proposed approach, aimed at a common theoretical framework unifying VAE and GAN learning dynamics, yields different models (Section 4).

Discussion w.r.t. f-GAN.

Like f-GAN, the proposed approach aims at general formulations based on f-divergences. A first difference is that f-GAN extensively relies on Eq. (9) to change expectations over pdata into expectations over pθ, and vice versa. f-GAN thus faces some difficulties when both expectations are approximated by minibatch sampling on the one hand, and pdata and pθ have poor support overlap on the other hand (as explained in section 2.4). However, such a poor overlap occurs in early learning phases, and possibly later on [1]. A second difference is that the f-GAN generative model is only provided with gradient information based on its own generated samples; unfortunately, this gradient provides little information unless pθ already is a good approximation of pdata.

The proposed approach avoids both pitfalls, by using the VAE to approximate expectations over pdata (with no further need to rewrite the expectations), and by providing the generator with information based on both real data and generated samples.

3.2 A unifying framework

The loss rewriting is applied to three divergences, showing that the framework unifies approaches of the state of the art.

Minimizing DKL(pθ||pdata).

 ∇θDKL(pθ||pdata)=Ex∼pθ[log(pθ(x)/pdata(x))∇θlogpθ(x)] (14)

Substituting the density ratio with its discriminator-based estimate (Eq. 2), it comes:

 ∇θDKL(pθ||pdata)=Ex∼pθ[log((1−D⋆(x))/D⋆(x))∇θlogpθ(x)] (15)

Most interestingly, this gradient is the same as for the loss proposed by [10] (assuming again that D does not depend on θ), widely and quite successfully used to train GANs:

 LG=Ex∼pθ[log((1−D(x))/D(x))] (16)

Minimizing DKL(pdata||pθ).

 ∇θDKL(pdata||pθ)=Ex∼pdata[−∇θlogpθ(x)] (17)

This term is the same as the VAE loss [14], which is indeed very successful, and at the core of the proposed methodology. Note that tackling this loss within a GAN requires swapping the expectations (Eq. (9)):

 ∇θDKL(pdata||pθ)=Ex∼pθ[−(D⋆(x)/(1−D⋆(x)))∇θlogpθ(x)] (18)

recovering the same gradient as the following loss (assuming again that D does not depend on θ):

 LG=Ex∼pθ[−D⋆(x)/(1−D⋆(x))] (19)

This loss does not only suffer from the already mentioned expectation swap issues. It is also hardly suited to gradient-based optimization, as the expectation involves the discriminator value rather than its logarithm, suggesting that numerical instabilities will adversely affect the learning trajectory as the sigmoid activation of the discriminator saturates (this was empirically confirmed: as D(x) is close to 0 on generated samples in the early learning phase, gradients from the discriminator to the generator vanish, preventing any further progress; using instead the f-GAN formalism also yields negative results, due to numerical instability and divergence of the models, as discussed in section 2.4). For both reasons, the VAE is better suited to this loss optimization.

Minimizing JSD(pθ||pdata).

 ∇θJSD(pθ||pdata)=Ex∼pdata[−(1−D⋆(x))∇θlogpθ(x)]+Ex∼pθ[(D⋆(x)+log(1−D⋆(x)))∇θlogpθ(x)] (20)

that coincides with the gradient of the following loss:

 LG=Ex∼pdata[−(1−D⋆(x))logpθ(x)]+Ex∼pθ[(D⋆(x)+log(1−D⋆(x)))] (21)

combining a VAE-like (left) term and a GAN-like (right) term. Expectedly, the above loss yields a difficult optimization problem, especially so in the early optimization stages. The left-hand term is drawn to 0 if D⋆ is close to 1, and does not provide any gradient information to the VAE ELBO. The right-hand term suffers from the same issues as the original GAN loss, as D⋆(x)+log(1−D⋆(x)) provides very small gradients if D⋆ is close to 0.

These issues are direct consequences of the saturation of the Jensen-Shannon divergence when the generator is not sufficiently good.

4 A Proof of Concept: The Symmetrized KL Divergence

Considering a symmetrical divergence between pdata and pθ is appealing. It thus comes naturally to consider the symmetrized Kullback-Leibler divergence, summing DKL(pθ||pdata) (generating very realistic samples, but with a tendency to mode-dropping) and DKL(pdata||pθ) (ensuring a good coverage of all modes, but generating samples of lower quality). This section illustrates the proposed methodology on this divergence, yielding the so-called SDKL loss. The experimental setting used for its comparative empirical validation is thereafter described.

4.1 The SDKL loss

The SDKL loss derived from the symmetrized Kullback-Leibler divergence reads:

 LG=Ex∼pθ[log((1−D(x))/D(x))]+Ex∼pdata[−logpθ(x)] (22)
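A minibatch estimator of Eq. 22 can be sketched as follows (hypothetical helper; the pdata term is estimated via the ELBO, as in section 3.1):

```python
import numpy as np

def sdkl_loss(d_gen, elbo_real):
    """Minibatch SDKL generator loss (Eq. 22); hypothetical helper.

    d_gen     -- discriminator outputs D(x) on samples x ~ p_theta
                 (the DKL(p_theta || p_data) term, GAN-style, Eq. 16)
    elbo_real -- ELBO lower bounds on log p_theta(x) for real samples,
                 standing in for the intractable log-likelihood in the
                 DKL(p_data || p_theta) term (VAE-style, Eq. 17)
    """
    gan_term = np.mean(np.log(1.0 - d_gen) - np.log(d_gen))
    vae_term = np.mean(-elbo_real)
    return gan_term + vae_term
```

At the discriminator's indifference point D = 1/2, the GAN term vanishes and the loss reduces to the VAE negative ELBO, which matches the decomposition of Eq. 22.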

This loss is trained by combining VAE and GAN, as for the Jensen-Shannon divergence. While this loss is similar to VAEGAN [15], the training procedure (Algorithm 1) differs in two respects. Firstly, VAEGAN uses samples generated from the raw output of the decoder, rather than sampling them from pθ(x|z). As a consequence, the loss actually considers two different models in lieu of a single one, without unifying them as discussed in section 3.1. Secondly, the VAEGAN discriminator is trained on the reconstructed images; it thus no longer learns to approximate the optimal discriminator D⋆ (Eq. 2), as reconstructed samples differ from samples generated from the prior. Finally, whenever the discriminator is powerful, VAEGAN might suffer from large training instabilities after the following proposition:

Proposition 1.

A discriminator accurately discriminating between a real sample x and its reconstruction x̂ by the VAE is prone to cause exploding gradients for the loss log((1−D(x))/D(x)) as x̂ gets close to x.

Proof: In appendix A.

Note that the VAEGAN loss, −logD(x), is equivalent to log((1−D(x))/D(x)) when D(x) is close to 0, and thus suffers from the same numerical issues: when facing a powerful discriminator, VAEGAN is subject to the risk of exploding gradients when considering the reconstructed samples.

4.2 Experimental setting

The proposed SDKL is empirically validated against VAEGAN. The goal of the experiments is twofold. The first comparison regards the choice of generator model: the theoretically sound option is to sample z after prior ρ(z), and then sample x from pθ(x|z), as done in SDKL; the VAEGAN-like option (referred to as SDKL-without-sampling) is to sample z and take the raw decoder output with no further sampling. The second question regards the impact of using reconstructed samples to both train the discriminator and apply the GAN generative loss; this impact is assessed by comparing VAEGAN and SDKL-without-sampling.

Experiments are conducted on the thoroughly studied CelebA dataset [16]; the code for reproducing them is provided online (repository: https://gitlab.inria.fr/vberger/new-losses-gan). Simple convolutional networks are used for both the encoder and the discriminator, and a simple deconvolutional network is used for the decoder. All networks use ELU activations [4]; the detailed architectures are given in appendix B. The optimizer used for all three models is ADAM [12], with its hyperparameters set separately for the Encoder/Decoder networks and for the Discriminator network. The discriminator is trained for several minibatches between each generator update; this ensures that D remains an acceptable approximation of D⋆, though it comes at the risk of vanishing gradients.

The compared models thus only differ in terms of their training objectives. They include: i) a GAN; ii) a VAE using an adversarial perceptual loss [9, 5]; iii) VAEGAN [15]; iv) the SDKL proof-of-concept model; v) SDKL-without-sampling, where the samples are not drawn from pθ(x|z) but taken as the raw output of the decoder.

All models are defined using a perceptual loss [15, 9, 5]. As discussed in section 2.2, the Gaussian noise is applied in the second internal convolution layer (as opposed to directly on the image pixels), to yield a more relevant noise model and improve pθ(x|z). This noise model is also used for the VAE (where the discriminator is only trained to define the perceptual distance).

For all models that require sampling pθ(x|z), this sampling is emulated by adding the corresponding Gaussian noise in the second layer of the discriminator. The discriminator discriminates real samples from generated or reconstructed samples. It includes an Orthogonal Regularization [3], ensuring that for every layer, its internal activations are of the same order of magnitude, and that each layer is K-Lipschitz with a reasonably small value of K.
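As a sketch of such a penalty (in the spirit of [3]; the exact form used in the experiments is not detailed here):

```python
import numpy as np

def orthogonal_penalty(W, beta=1e-4):
    """Orthogonal Regularization penalty in the spirit of Brock et al. [3]:
    beta * || W W^T - I ||_F^2, pushing each layer's weight rows toward an
    orthonormal set. This bounds how much the layer can stretch its input,
    keeping it roughly K-Lipschitz with a small K. beta is illustrative."""
    k = W.shape[0]
    gram = W @ W.T
    return beta * np.sum((gram - np.eye(k))**2)
```

An orthogonal weight matrix incurs zero penalty, while a matrix that uniformly scales its input (e.g. 2·I) is penalized, which is the Lipschitz-control behaviour described above.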

5 Empirical Analysis

This section compares the dynamics of the discriminator and losses of all five models (Fig. 1), and discusses the results based on the generated images (Fig. 2; more in supplementary material).

5.1 Dynamics of the Losses

The discriminator loss along the learning trajectory, depicted in Fig. 1.Left, is interpreted in terms of the pressure exerted by the generator on the discriminator.

The first striking observation is that SDKL does not challenge its discriminator in any way. Indeed, the discriminator only needs to distinguish samples from pθ(x|z) from original samples. Even though pθ(x|z) is enriched using a perceptual distance, it involves Gaussian noise in the final (or intermediate) layers, making the discriminator task easy. As the discriminator only needs to recognize the presence of Gaussian noise to classify the image, it provides poor feedback to the generator, since the considered generative model cannot help but include Gaussian noise. More sophisticated models for pθ(x|z) [8] are thus needed; these are outside the scope of the current paper, as described in section 3.1, and left for future work.

Interestingly, GAN and VAEGAN exert much more pressure on their discriminator than VAE and SDKL. Indeed, the GAN's single objective is to fool its discriminator. The VAEGAN discriminator has to discriminate between real and reconstructed images, a much more difficult task than for the SDKL discriminator, since the reconstructed images are overall of a much higher quality than the generated ones. VAE faces a mixed loss, and the discriminator loss is not necessarily the dominant one: empirically, the reconstruction loss is higher than the discrimination loss by orders of magnitude, which is mostly explained by the poor quality of model pθ(x|z). On the other hand, the VAE is never directly trained to fool the discriminator, hence the relatively easy task of its adversary. SDKL-without-sampling yields a moderate discriminator loss: on the one hand, the discriminator loss is not its only goal; on the other hand, the discriminator task is far more difficult than for SDKL, as the generated samples do not suffer from the Gaussian noise.

The analysis of the losses confirms the previous interpretation. SDKL has a good compression score, which is expected given how much detail is lost to its uninformative discriminator. Both VAEGAN and SDKL-without-sampling need to store more information in their latent space than the VAE, a consequence of the GAN feedback on their generation.

5.2 Qualitative comparisons

The qualitative inspection of the generated samples (Fig. 1.Right) first shows that SDKL suffers from the blurry effects long associated with VAEs, which is blamed on the poor quality of the discriminator: SDKL therefore cannot take advantage of its feedback, nor of the adversarial loss. Regarding GAN, as expected it generates sharp images, at the cost of some loss of diversity in the generated images (the mode-dropping effect).

Both VAEGAN and SDKL-without sampling improve on SDKL and VAE with perceptual distance, generating more detailed images, though still blurrier than the GAN.

6 Discussion and Perspectives

The contribution of the paper is a new theoretical methodology to derive training objectives for generative modelling, leveraging both the VAE's ELBO to estimate densities and the GAN discriminator to estimate the density ratio pdata/pθ. This methodology can deal with a large class of divergences and yields mathematically sound and algorithmically efficient losses, avoiding mathematical improprieties in dealing with or approximating expectations based on minibatch sampling.

A proof of concept of the proposed framework is established for the symmetrized Kullback-Leibler divergence. Its experimental validation shows that this theoretical improvement comes at no cost in terms of performance compared with state-of-the-art approaches.

The main perspective for further research concerns the choice of more powerful model spaces for pθ(x|z), particularly so when considering natural images, as Gaussian observation models are ill-suited to such data. More sophisticated model spaces, e.g. inspired from [8], can hardly be used as-is in the proposed framework due to its requirements on pθ(x|z), and thus would require further work. Improving the inference model of the VAE-based part of our approach using more elaborate models [13, 19] would also improve the ELBO estimation, and the overall model quality.

References

• [1] Martin Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. arXiv:1701.04862 [cs, stat], January 2017. arXiv: 1701.04862.
• [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875 [cs, stat], January 2017. arXiv: 1701.07875.
• [3] Andrew Brock, Theodore Lim, J. M. Ritchie, and Nick Weston. Neural Photo Editing with Introspective Adversarial Networks. arXiv:1609.07093 [cs, stat], September 2016. arXiv: 1609.07093.
• [4] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289 [cs], November 2015. arXiv: 1511.07289.
• [5] Alexey Dosovitskiy and Thomas Brox. Generating Images with Perceptual Similarity Metrics based on Deep Networks. arXiv:1602.02644 [cs], February 2016. arXiv: 1602.02644.
• [6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
• [7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved Training of Wasserstein GANs. arXiv:1704.00028 [cs, stat], March 2017. arXiv: 1704.00028.
• [8] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A Latent Variable Model for Natural Images. arXiv:1611.05013 [cs], November 2016. arXiv: 1611.05013.
• [9] Xianxu Hou, Linlin Shen, Ke Sun, and Guoping Qiu. Deep Feature Consistent Variational Autoencoder. arXiv:1610.00291 [cs], October 2016. arXiv: 1610.00291.
• [10] Ferenc Huszár. An Alternative Update Rule for Generative Adversarial Networks, 2018.
• [11] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax. arXiv:1611.01144 [cs, stat], November 2016. arXiv: 1611.01144.
• [12] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014. arXiv: 1412.6980.
• [13] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improving Variational Inference with Inverse Autoregressive Flow. arXiv:1606.04934 [cs, stat], June 2016. arXiv: 1606.04934.
• [14] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], December 2013. arXiv: 1312.6114.
• [15] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv:1512.09300 [cs, stat], December 2015. arXiv: 1512.09300.
• [16] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
• [17] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs Created Equal? A Large-Scale Study. arXiv:1711.10337 [cs, stat], November 2017. arXiv: 1711.10337.
• [18] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The Numerics of GANs. arXiv:1705.10461 [cs], May 2017. arXiv: 1705.10461.
• [19] Eric Nalisnick and Padhraic Smyth. Stick-Breaking Variational Autoencoders. arXiv:1605.06197 [stat], May 2016. arXiv: 1605.06197.
• [20] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. June 2016.
• [21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 [cs], November 2015. arXiv: 1511.06434.
• [22] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational Approaches for Auto-Encoding Generative Adversarial Networks. arXiv:1706.04987 [cs, stat], June 2017. arXiv: 1706.04987.
• [23] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based Generative Adversarial Network. arXiv:1609.03126 [cs, stat], September 2016. arXiv: 1609.03126.

Appendix A Proof for Proposition 1

Proposition.

A discriminator that can well discriminate between a real sample x and its reconstruction x̂ by the VAE is prone to exploding gradients for the loss log((1−D(x))/D(x)) as x̂ gets close to x.

Proof.

Assume the discriminator discriminates well between x and x̂, meaning there exists some ϵ∈(0,1/2) such that D(x)≥1−ϵ and D(x̂)≤ϵ.

Assuming this discriminator is a neural network, we decompose its last layer as follows (σ being the sigmoid function):

 D(x)=σ(w⋅ϕ(x)+b)

We further assume that the previous layers, represented by ϕ, are, as a whole, a K-Lipschitz function, which is a mild assumption given that the discriminator is a neural network.

Now, we can compute the gradient of log((1−D(x))/D(x)) w.r.t. x:

 ∇xlog((1−D(x))/D(x))=∇x(−(w⋅ϕ(x)+b))=−w⋅∇xϕ(x)

Furthermore:

 D(x)≥1−ϵ⇔w⋅ϕ(x)+b≥log((1−ϵ)/ϵ)
 D(x̂)≤ϵ⇔w⋅ϕ(x̂)+b≤log(ϵ/(1−ϵ))

As a result:

 |w⋅(ϕ(x̂)−ϕ(x))|≥2log((1−ϵ)/ϵ)

And notably:

 ∥w∥≥2log((1−ϵ)/ϵ)/∥ϕ(x̂)−ϕ(x)∥≥2log((1−ϵ)/ϵ)/(K∥x̂−x∥)

By hypothesis, ∥∇xlog((1−D(x))/D(x))∥=∥w⋅∇xϕ(x)∥≤K∥w∥, and assuming K has been chosen tight enough, in general we can expect ∥w⋅∇xϕ(x)∥∼K∥w∥.

Putting all this together, we can expect in general to have:

 ∥∇xlog((1−D(x))/D(x))∥∼2log((1−ϵ)/ϵ)/∥x̂−x∥

which can grow arbitrarily large as x̂ gets closer to x if the discriminator remains good. ∎

Appendix B Architecture of the used neural networks

