# Approximability of Discriminators Implies Diversity in GANs

Yu Bai Department of Statistics, Stanford University. yub@stanford.edu    Tengyu Ma Facebook AI Research. tengyuma@stanford.edu    Andrej Risteski MIT, Applied Mathematics and IDSS. risteski@mit.edu
###### Abstract

While Generative Adversarial Networks (GANs) have empirically produced impressive results on learning complex real-world distributions, recent work has shown that they suffer from lack of diversity or mode collapse. The theoretical work of Arora et al. [2] suggests a dilemma about GANs’ statistical properties: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode collapse.

In contrast, we show in this paper that GANs can in principle learn distributions in Wasserstein distance (or KL-divergence in many cases) with polynomial sample complexity, if the discriminator class has strong distinguishing power against the particular generator class (instead of against all possible generators). For various generator classes such as mixtures of Gaussians, exponential families, and invertible neural network generators, we design corresponding discriminators (which are often neural nets of specific architectures) such that the Integral Probability Metric (IPM) induced by the discriminators can provably approximate the Wasserstein distance and/or KL-divergence. This implies that if the training is successful, then the learned distribution is close to the true distribution in Wasserstein distance or KL divergence, and thus cannot drop modes. Our preliminary experiments show that on synthetic datasets the test IPM is well correlated with KL divergence, indicating that the lack of diversity may be caused by sub-optimality in optimization rather than by statistical inefficiency.

## 1 Introduction

In the past few years, we have witnessed the great empirical success of Generative Adversarial Networks (GANs) [12] in generating high-quality examples. Various ideas have been proposed to further improve the quality of the learned distributions and the stability of the training. (See e.g. [1, 24, 15, 26, 33, 28, 16, 9, 13] and the references therein.)

However, our understanding of GANs is still in its infancy. Do GANs actually learn the target distribution? Recent work [2, 3, 8] has raised the concern, both theoretically and empirically, that distributions learned by GANs suffer from mode collapse or lack of diversity: the learned distribution tends to miss a significant number of modes of the target distribution. The main message of this paper is that mode collapse can in principle be alleviated by designing proper discriminators with strong distinguishing power against specific families of generators (such as special subclasses of neural network generators).

### 1.1 Background on mode collapse in GANs

We mostly focus on the Wasserstein GAN (WGAN) formulation [1] in this paper. Define the $\mathcal{F}$-Integral Probability Metric ($\mathcal{F}$-IPM) [22] between distributions $p, q$ as

$$W_{\mathcal{F}}(p,q) := \sup_{f\in\mathcal{F}} \left| \mathbb{E}_{X\sim p}[f(X)] - \mathbb{E}_{X\sim q}[f(X)] \right|. \tag{1}$$

Given samples from a distribution $p$, WGAN sets up a family of generators $\mathcal{G}$, a family of discriminators $\mathcal{F}$, and aims to learn the data distribution $p$ by solving

$$\min_{q\in\mathcal{G}} \; W_{\mathcal{F}}(\hat{p}^n, \hat{q}^m) \tag{2}$$

where $\hat{p}^n$ denotes “the empirical version of the distribution $p$”, meaning the uniform distribution over a set of $n$ i.i.d. samples from $p$ (and similarly $\hat{q}^m$).

When $\mathcal{F}$ is the set of all 1-Lipschitz functions, the IPM reduces to the Wasserstein-1 distance $W_1$. In practice, parametric families of functions such as multi-layer neural networks are used for approximating Lipschitz functions, so that we can empirically optimize the objective (2) via gradient-based algorithms as long as distributions in the family $\mathcal{G}$ have parameterized samplers. (See Section 2 for more details.)
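To make definition (1) concrete, the following minimal sketch evaluates an empirical IPM over a small finite discriminator family. The family of clipped ramps, the helper names `empirical_ipm` and `make_family`, and the one-dimensional Gaussian data are illustrative assumptions, not the discriminators studied in this paper.

```python
import random

def empirical_ipm(xs, ys, discriminators):
    """Max over a finite family of |mean f(xs) - mean f(ys)|, cf. equation (1)."""
    best = 0.0
    for f in discriminators:
        gap = abs(sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys))
        best = max(best, gap)
    return best

def make_family(num=21, lo=-3.0, hi=3.0):
    """Hypothetical family: clipped ramps x -> clip(x - c, -1, 1), each 1-Lipschitz."""
    centers = [lo + (hi - lo) * k / (num - 1) for k in range(num)]
    return [lambda x, c=c: max(-1.0, min(1.0, x - c)) for c in centers]

rng = random.Random(0)
family = make_family()
p_samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]
q_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # same distribution as p
q_shift = [rng.gauss(2.0, 1.0) for _ in range(5000)]   # mean shifted by 2

ipm_same = empirical_ipm(p_samples, q_same, family)
ipm_shift = empirical_ipm(p_samples, q_shift, family)
```

Because every function in this toy family is 1-Lipschitz, the population IPM is at most the Wasserstein-1 distance; the empirical value for the shifted pair should be clearly larger than for the identical pair.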

One of the main theoretical and empirical concerns with GANs is the issue of “mode collapse” [2, 28]: the learned distribution tends to generate high-quality but low-diversity examples. Mathematically, the problem apparently arises from the fact that the IPM $W_{\mathcal{F}}$ is weaker than $W_1$, and a mode-dropped distribution can fool the former [2]: for a typical distribution $p$, there exists a distribution $q$ such that the following hold simultaneously:

$$W_{\mathcal{F}}(p,q) \lesssim \varepsilon \quad\text{and}\quad W_1(p,q) \gtrsim 1. \tag{3}$$

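where $\lesssim, \gtrsim$ hide constant factors. In fact, setting $q = \hat{p}^N$ with $N = R(\mathcal{F})/\varepsilon^2$, where $R(\mathcal{F})$ is a complexity measure of $\mathcal{F}$ (such as Rademacher complexity), $q$ satisfies equation (3) but is clearly a mode-dropped version of $p$ when $p$ has an exponential number of modes.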

Reasoning that the problem lies in the strength of the discriminator, a natural solution is to enlarge the discriminator class to bigger families such as all 1-Lipschitz functions. However, Arora et al. [2] point out that the Wasserstein-1 distance doesn’t have good generalization properties: the empirical Wasserstein distance used in the optimization is very far from the population distance. Even for a spherical Gaussian distribution $p$ (or many other typical distributions), when the distribution $q$ is exactly equal to $p$, letting $\hat{p}^n$ and $\hat{q}^m$ be two empirical versions of $p$ and $q$ with $n, m = \mathrm{poly}(d)$, we have with high probability,

$$W_1(\hat{p}^n, \hat{q}^m) \gtrsim 1 \quad\text{even though}\quad W_1(p,q) = 0. \tag{4}$$

Therefore even when learning succeeds ($p = q$), this cannot be gleaned from the empirical version of $W_1$.

The observations above pose a dilemma in establishing the statistical properties of GANs: powerful discriminators cause overfitting, whereas weak discriminators result in diversity issues because IPM doesn’t approximate the Wasserstein distance. The lack of diversity has also been observed empirically by [31, 7, 5, 3].

### 1.2 An approach to diversity: conjoined discriminator families

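This paper advocates that the conundrum can be solved by designing a discriminator class $\mathcal{F}$ that is particularly strong against a specific generator class $\mathcal{G}$. We call a discriminator class $\mathcal{F}$ conjoined with a generator class $\mathcal{G}$ if $\mathcal{F}$ can distinguish any pair of distributions $p, q \in \mathcal{G}$ approximately as well as all 1-Lipschitz functions can:

$$\mathcal{F} \text{ is conjoined with } \mathcal{G} \;\triangleq\; \forall p,q\in\mathcal{G},\ \ \gamma_L(W_1(p,q)) \lesssim W_{\mathcal{F}}(p,q) \lesssim \gamma_U(W_1(p,q)), \tag{5}$$

where $\gamma_L(\cdot)$ and $\gamma_U(\cdot)$ are two monotone nonnegative functions with $\gamma_L(0) = \gamma_U(0) = 0$. The paper mostly focuses on $\gamma_L(t) = t^{\alpha}$ with $1 \le \alpha \le 2$ and $\gamma_U(t) = t$, although we use the term “conjoined” more generally for this type of result (without tying it to a concrete definition of $\gamma_L, \gamma_U$). In other words, we are looking for discriminators $\mathcal{F}$ so that the $\mathcal{F}$-IPM can approximate the Wasserstein distance $W_1$ for pairs of distributions $p, q \in \mathcal{G}$.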

A conjoined discriminator class resolves the dilemma in the following way.

First, $\mathcal{F}$ avoids mode collapse: if the IPM between $p$ and $q$ is small, then by the left hand side of (5), $p$ and $q$ are also close in Wasserstein distance and therefore significant mode-dropping cannot happen. (Informally, if most of the modes of $p$ are $\varepsilon$-far away from each other, then as long as $W_1(p,q) \ll \varepsilon$, $q$ has to contain most of the modes of $p$.)

Second, we can pass from population-level guarantees to empirical-level guarantees: as shown in Arora et al. [2], classical capacity bounds such as the Rademacher complexity of $\mathcal{F}$ relate $W_{\mathcal{F}}(p,q)$ to $W_{\mathcal{F}}(\hat{p}^n, \hat{q}^m)$. Therefore, as long as the capacity is bounded, we can expand on equation (5) to get a full picture of the statistical properties of Wasserstein GANs:

$$\forall p,q\in\mathcal{G},\ \ \gamma_L(W_1(p,q)) \lesssim W_{\mathcal{F}}(p,q) \approx W_{\mathcal{F}}(\hat{p}^n, \hat{q}^m) \lesssim \gamma_U(W_1(p,q)).$$

Here the first inequality addresses the diversity property of the distance $W_{\mathcal{F}}$, the second approximation addresses the generalization of the distance, and the third inequality provides the reverse guarantee that if the training fails to find a solution with small IPM, then indeed $p$ and $q$ are far away in Wasserstein distance. (We also note that the third inequality can hold for all $p, q$ as long as $\mathcal{F}$ is a subset of Lipschitz functions.) To the best of our knowledge, this is the first result that tackles the statistical theory of GANs with polynomial samples. Previous works in the non-parametric setting [11, 19] require sample complexity exponential in the dimension $d$.

The main body of the paper develops techniques for designing conjoined discriminator classes for several examples of generator classes, including mixtures of Gaussians, exponential families, and especially distributions generated by neural networks. In the next subsection, we show that a properly chosen $\mathcal{F}$ provides diversity guarantees such as inequalities (5).

### 1.3 Conjoined discriminator design

We start with relatively simple families of distributions $\mathcal{G}$ such as Gaussian distributions and exponential families, where we can directly design $\mathcal{F}$ to distinguish pairs of distributions in $\mathcal{G}$. As we show in Section 3, for Gaussians it suffices to use one-layer neural networks with ReLU activation as discriminators, and for exponential families to use linear combinations of the sufficient statistics.

In Section 4, we study the family of distributions generated by invertible neural networks. We show that a special type of neural network discriminator with one more layer than the generator forms a conjoined family. (This is consistent with the empirical finding that generators and discriminators with similar depths are often near-optimal choices of architectures.) We show this discriminator class guarantees that $W_1(p,q)^2 \lesssim W_{\mathcal{F}}(p,q) \lesssim W_1(p,q)$, where we hide polynomial dependencies on relevant parameters (Theorem 4.2). We remark that such networks can also produce an exponentially large number of modes due to the non-linearities, and our results imply that if $W_{\mathcal{F}}(p,q)$ is small, then most of these exponentially many modes will show up in the learned distribution $q$.

A key intermediate step towards the result above is a very simple but powerful observation (Lemma 4.1): if the log density of every distribution $p \in \mathcal{G}$ can be approximated by the family of discriminators in $\mathcal{F}$, then the KL-divergence is bounded from above by $W_{\mathcal{F}}(p,q)$. Indeed, the log-density of a neural net generator can be computed by another neural network as long as each weight matrix and the activation function are invertible. With this lemma, it remains to show that the KL-divergence is bounded from below by the Wasserstein distance, which follows from the transportation inequalities for sub-Gaussian measures such as the Bobkov-Götze and Gozlan theorems ([34], restated in Theorem D.1).

One limitation of the invertibility assumption is that it only produces distributions supported on the entire space. The distribution of natural images is often believed to reside approximately on a low-dimensional manifold. When the distribution $p$ has a Lebesgue measure-zero support, the KL-divergence (or the reverse KL-divergence) is infinite unless the support of the estimated distribution coincides with the support of $p$. (The formal mathematical statement is that $D_{\mathrm{kl}}(p\|q)$ is infinite unless $p$ is absolutely continuous with respect to $q$.) Therefore, the KL-divergence is fundamentally not the proper measurement of statistical distance when both $p$ and $q$ have low-dimensional supports.

The crux of the technical part of the paper is to establish the approximation of the Wasserstein distance by IPMs for generators with low-dimensional supports. We will show that some variant of the IPM can still be sandwiched by the Wasserstein distance in the form of (5) without relating to the KL-divergence (Theorem 4.3). This demonstrates the advantage of GANs over the MLE approach for learning distributions with low-dimensional supports. As the main proof technique, we develop tools for approximating the log-density of a smoothed neural network generator.

In Section 5, we demonstrate in synthetic and controlled experiments that the IPM correlates with the KL-divergence for the invertible generator family (where computation of KL is feasible). The theory suggests the possibility that when the KL-divergence is not measurable in more complicated settings, the test IPM could serve as a candidate alternative for measuring the diversity and quality of the learned distribution. We also remark that on real datasets, the optimizer is often tuned to carefully balance the learning of generators and discriminators, and therefore the reported training loss is often not the test IPM (which requires optimizing the discriminator to optimality). Anecdotally, the distributions learned by GANs can often be distinguished from the data distribution by a well-trained discriminator, which suggests that the IPM is not well-optimized (see [20] for an analysis for the original GAN formulation). The authors conjecture that the lack of diversity in real experiments may be caused by sub-optimality in optimization rather than by statistical inefficiency.

### 1.4 Related work

Empirical proxy tests for diversity, memorization, and generalization have been developed, such as interpolation between images [26], semantic combination of images via arithmetic in latent space [4], classification tests [29], etc. These results by and large indicate that while “memorization” is not an issue with most GANs, lack of diversity frequently is.

As discussed thoroughly in the introduction, Arora et al. [2, 3] formalized the potential theoretical sources of mode collapse from a weak discriminator, and proposed a “birthday paradox” that convincingly demonstrates this phenomenon is real. Many architectures and algorithms have been proposed to remedy or ameliorate mode collapse ([8, 31, 7, 5]) with varying success. Feizi et al. [10] showed provable guarantees of training GANs with quadratic discriminators when the generators are Gaussians. However, to the best of our knowledge, there are no provable solutions to this problem in more generality.

The invertible generator structure was used in Flow-GAN [13], which observes that GAN training blows up the KL on real datasets. In contrast, our theoretical result and experiments show that successful GAN training (in terms of the IPM) does imply learning in KL-divergence when the data distribution is realized in this generator class, therefore suggesting that the KL problem happens because the “true data generator” is not realizable in this class.

## 2 Preliminaries and Notation

The notion of IPM (recall the definition in (1)) includes a number of statistical distances such as TV (total variation) and Wasserstein-1 by taking $\mathcal{F}$ to be the 1-bounded and the 1-Lipschitz functions respectively. When $\mathcal{F}$ is a class of neural networks, we refer to the $\mathcal{F}$-IPM as the neural net IPM. (This was defined as the neural net distance in [2].)

There are many distances between distributions of interest that are not IPMs, two of which we mostly focus on: the KL divergence $D_{\mathrm{kl}}(p\|q) = \mathbb{E}_{p}[\log p(X) - \log q(X)]$ (when the densities exist), and the Wasserstein-2 distance, defined as $W_2(p,q)^2 = \inf_{\pi\in\Pi} \mathbb{E}_{(X,Y)\sim\pi}\left[\|X-Y\|_2^2\right]$ where $\Pi$ is the set of all couplings of $(p,q)$. We will only consider distributions with finite second moments, so that $W_1$ and $W_2$ exist.

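For any distribution $p$, we let $\hat{p}^n$ be the empirical distribution of $n$ i.i.d. samples from $p$. The Rademacher complexity of a function class $\mathcal{F}$ on a distribution $p$ is $R_n(\mathcal{F}, p) = \mathbb{E}\left[\sup_{f\in\mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right|\right]$ where $X_i \sim p$ i.i.d. and $\varepsilon_i \sim \{\pm 1\}$ are independent. We define $R_n(\mathcal{F},\mathcal{G}) = \sup_{p\in\mathcal{G}} R_n(\mathcal{F},p)$ to be the largest Rademacher complexity over $p \in \mathcal{G}$. The training IPM loss (over the entire dataset) for the Wasserstein GAN, assuming the discriminator reaches optimality, is $\mathbb{E}_{\hat{q}^n}[W_{\mathcal{F}}(\hat{p}^n, \hat{q}^n)]$. (In the ideal case we can take the expectation over $\hat{q}^n$, as the generator is able to generate infinitely many samples.) Generalization of the IPM is governed by the quantity $R_n(\mathcal{F},\mathcal{G})$, as stated in the following result (see Appendix A.1 for the proof).

Theorem (Generalization, c.f. [2]). For any $p \in \mathcal{G}$, we have that

$$\forall q\in\mathcal{G},\quad \mathbb{E}_{\hat{p}^n}\left| W_{\mathcal{F}}(p,q) - \mathbb{E}_{\hat{q}^n}\left[W_{\mathcal{F}}(\hat{p}^n,\hat{q}^n)\right] \right| \le 4 R_n(\mathcal{F},\mathcal{G}).$$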

Miscellaneous notation. We let $N(\mu, \Sigma)$ denote a (multivariate) Gaussian distribution with mean $\mu$ and covariance $\Sigma$. For quantities $a, b > 0$, $a \lesssim b$ denotes that $a \le C b$ for a universal constant $C > 0$ unless otherwise stated explicitly.
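As an illustration of the Rademacher complexity defined above, the following sketch estimates it by Monte Carlo for a small finite function class. The class of ten clipped ramps, the standard Gaussian data, and the function name `rademacher_estimate` are hypothetical choices made only for this example.

```python
import random

def rademacher_estimate(sample_fn, functions, n, trials, rng):
    """Average over trials of sup_f |(1/n) sum_i eps_i f(X_i)| for a finite class."""
    total = 0.0
    for _ in range(trials):
        xs = [sample_fn(rng) for _ in range(n)]
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # Rademacher signs
        total += max(
            abs(sum(e * f(x) for e, x in zip(eps, xs))) / n for f in functions
        )
    return total / trials

# Hypothetical class: clipped ramps x -> clip(x - c, -1, 1) for integer centers c.
functions = [lambda x, c=c: max(-1.0, min(1.0, x - c)) for c in range(-5, 5)]

def standard_gauss(r):
    return r.gauss(0.0, 1.0)

rng = random.Random(0)
r_small = rademacher_estimate(standard_gauss, functions, n=50, trials=30, rng=rng)
r_large = rademacher_estimate(standard_gauss, functions, n=2000, trials=30, rng=rng)
```

For a bounded class like this one the estimate shrinks as $n$ grows, which is the behavior the generalization theorem relies on.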

## 3 Conjoined Discriminators for Basic Distributions

### 3.1 Gaussian distributions

As a warm-up, we design conjoined discriminators for relatively simple parameterized distributions such as Gaussian distributions, exponential families, and mixtures of Gaussians. We first prove that one-layer neural networks with ReLU activation are strong enough to distinguish Gaussian distributions with the conjoinedness guarantees.

We consider the set of Gaussian distributions with bounded mean and well-conditioned covariance, $\mathcal{G} = \{N(\mu,\Sigma) : \|\mu\|_2 \le D,\ \sigma_{\min}^2 I_d \preceq \Sigma \preceq \sigma_{\max}^2 I_d\}$. Here $D$, $\sigma_{\min}$ and $\sigma_{\max}$ are considered as given hyper-parameters. We will show that the following discriminators are conjoined with $\mathcal{G}$:

$$\mathcal{F} := \left\{ x \mapsto \mathrm{ReLU}(v^\top x + b) : \|v\|_2 \le 1,\ |b| \le D \right\}. \tag{6}$$

Theorem. The set of one-layer neural networks ($\mathcal{F}$ defined in equation (6)) is conjoined with the Gaussian distributions $\mathcal{G}$ defined above, in the sense that for any $p, q \in \mathcal{G}$,

$$\kappa \cdot W_1(p,q) \lesssim W_{\mathcal{F}}(p,q) \le W_1(p,q),$$

where $\kappa$ is a factor depending on the dimension $d$ and on the conditioning of the covariances. Moreover, the Rademacher complexity $R_n(\mathcal{F},\mathcal{G})$ is bounded. Apart from absolute constants, the lower and upper bounds differ by a factor of $1/\sqrt{d}$. (As shown in [10], the optimal discriminators for Gaussian distributions are quadratic functions.) We point out that the $1/\sqrt{d}$ factor is not improvable unless one uses functions more sophisticated than Lipschitz functions of one-dimensional projections of $x$. Indeed, $W_{\mathcal{F}}(p,q)$ is upper bounded by the maximum Wasserstein distance between one-dimensional projections of $p, q$, which is on the order of $W_1(p,q)/\sqrt{d}$ when $p, q$ have spherical covariances. The proof is deferred to Section B.1.
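For a single discriminator of the one-layer ReLU form, the expectation under a Gaussian has a closed form, because $v^\top X + b$ is a one-dimensional Gaussian and $\mathbb{E}[\max(Y,0)] = m\,\Phi(m/s) + s\,\varphi(m/s)$ for $Y \sim N(m, s^2)$. The sketch below uses this standard identity to evaluate the gap that one discriminator achieves between two Gaussians; the helper names and the two example distributions are assumptions made for illustration, and the sketch does not search for the maximizing $(v, b)$.

```python
import math

def expected_relu(m, s):
    """E[max(Y, 0)] for Y ~ N(m, s^2) with s > 0."""
    t = m / s
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return m * cdf + s * pdf

def projected_params(v, b, mu, cov):
    """Mean and standard deviation of v^T X + b for X ~ N(mu, cov)."""
    m = sum(vi * mi for vi, mi in zip(v, mu)) + b
    var = sum(v[i] * cov[i][j] * v[j]
              for i in range(len(v)) for j in range(len(v)))
    return m, math.sqrt(var)

def relu_gap(v, b, gauss_p, gauss_q):
    """|E_p ReLU(v^T X + b) - E_q ReLU(v^T X + b)| for two Gaussians (mu, cov)."""
    return abs(expected_relu(*projected_params(v, b, *gauss_p))
               - expected_relu(*projected_params(v, b, *gauss_q)))

identity = [[1.0, 0.0], [0.0, 1.0]]
p = ([0.0, 0.0], identity)
q = ([1.0, 0.0], identity)   # same covariance, mean shifted along the first axis

gap_along = relu_gap([1.0, 0.0], 0.0, p, q)  # projection along the mean shift
gap_ortho = relu_gap([0.0, 1.0], 0.0, p, q)  # projection orthogonal to it
```

The projection aligned with the mean shift separates the two distributions, while the orthogonal one sees identical marginals and gives a zero gap, which is why the supremum over directions $v$ matters.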

Extension to mixtures of Gaussians. Conjoined discriminator families can also be designed for mixtures of Gaussians. We defer this result and the proof to Appendix C.

### 3.2 Exponential families

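Now we consider exponential families and show that the linear combinations of the sufficient statistics are naturally a family of conjoined discriminators. Concretely, let $\mathcal{G} = \{p_\theta : \theta \in \Theta \subset \mathbb{R}^k\}$ be an exponential family where $p_\theta(x) = \frac{1}{Z(\theta)}\exp(\langle \theta, T(x)\rangle)$: here $T(x)$ is the vector of sufficient statistics, and $Z(\theta)$ is the partition function. Let the discriminator family be all linear functionals over the features $T(x)$: $\mathcal{F} = \{x \mapsto \langle v, T(x)\rangle : \|v\|_2 \le 1\}$.

Theorem. Let $\mathcal{G}$ be the exponential family and $\mathcal{F}$ be the discriminators defined above. Assume that the log partition function $\log Z(\theta)$ satisfies $\gamma I \preceq \nabla^2 \log Z(\theta) \preceq \beta I$. Then we have for any $p, q \in \mathcal{G}$,

$$\frac{\gamma}{\sqrt{\beta}} \sqrt{D_{\mathrm{kl}}(p\|q)} \le W_{\mathcal{F}}(p,q) \le \frac{\beta}{\sqrt{\gamma}} \sqrt{D_{\mathrm{kl}}(p\|q)}. \tag{7}$$

If we further assume that the domain $\mathcal{X}$ has diameter $D$ and that $T(x)$ is $L$-Lipschitz in $\mathcal{X}$, then

$$\frac{\gamma}{D\sqrt{\beta}} W_1(p,q) \lesssim W_{\mathcal{F}}(p,q) \le L \cdot W_1(p,q). \tag{8}$$

Moreover, $\mathcal{F}$ has the Rademacher complexity bound $R_n(\mathcal{F},\mathcal{G}) \le \sqrt{\sup_{\theta\in\Theta} \mathbb{E}_{p_\theta}\left[\|T(X)\|_2^2\right] / n}$. We note that the log partition function is always convex, and therefore our assumptions only require in addition that the curvature is strictly positive. Some geometric assumptions on the sufficient statistics are necessary for the second bound because the Wasserstein distance intrinsically depends on the geometry, whereas the exponential family does not encode such information. The proof of equation (7) follows straightforwardly from the standard theory of exponential families. The proof of equation (8) requires machinery that we will develop in Section 4 and is therefore deferred to Section B.2.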

## 4 Conjoined Discriminators for Neural Net Generators

In this section, we design conjoined discriminators for neural net generators, a family of distributions that are widely used in GANs to model the distribution of real data.

In Section 4.1, we provide a helper inequality that bounds the KL divergence from above by IPMs for generators with a proper density under the Lebesgue measure, and in Section 4.2 we apply it to the setting of invertible neural network generators. In Section 4.3, we extend the results to the more general setting of degenerate neural network generators, where the latent variables are allowed to have lower dimension than the observable dimensions (Theorem 4.3).

### 4.1 Bounding IPMs from below by KL (for distributions with proper density)

In this subsection, we assume that the true distribution $p$ and all distributions $q \in \mathcal{G}$ are absolutely continuous with respect to the Lebesgue measure, and we will use $p$ and $q$ to denote the densities of $p$ and $q$. In Section 4.3 we will relax this assumption and consider distributions without a proper density with respect to the Lebesgue measure.

A priori, the optimal way to design the discriminator class $\mathcal{F}$ for a general $\mathcal{G}$ is to include parameterized approximations of the optimal discriminators for all possible pairs $p, q \in \mathcal{G}$. However, the optimal discriminator $f^\star$ for a pair of distributions $p$ and $q$ may in general be a very complicated Lipschitz function, so it seems hard to argue about its approximability by parameterized families of discriminators.

The key observation here is that we can approximate, instead of $f^\star$, other functions that have similar distinguishing power to $f^\star$. In particular, one such function is $\log p - \log q$. This leads to the following generic lemma, which states that as long as $\mathcal{F}$ contains all functions of the form $\log p - \log q$, we can approximate the Wasserstein distance by the IPM $W_{\mathcal{F}}$.

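Lemma 4.1. Let $\varepsilon > 0$. Suppose $\mathcal{F}$ satisfies that for every $q \in \mathcal{G}$, there exists $f \in \mathcal{F}$ such that $\|f - (\log p - \log q)\|_\infty \le \varepsilon/2$, and that all the functions in $\mathcal{F}$ are $L$-Lipschitz. Then,

$$D_{\mathrm{kl}}(p\|q) + D_{\mathrm{kl}}(q\|p) - \varepsilon \le W_{\mathcal{F}}(p,q) \le L \cdot W_1(p,q). \tag{9}$$

Proof: The upper bound follows directly from the definition of $W_{\mathcal{F}}$ and the Lipschitzness of the functions in $\mathcal{F}$. For the lower bound, note that for any two distributions $p, q$, we have

$$\mathbb{E}_{p}[\log p(X) - \log q(X)] - \mathbb{E}_{q}[\log p(X) - \log q(X)] = D_{\mathrm{kl}}(p\|q) + D_{\mathrm{kl}}(q\|p). \tag{10}$$

By assumption there exists an $f \in \mathcal{F}$ that approximates $\log p - \log q$ pointwise to accuracy $\varepsilon/2$, so each of the two expectations changes by at most $\varepsilon/2$ when $\log p - \log q$ is replaced by $f$, and we have

$$W_{\mathcal{F}}(p,q) \ge \mathbb{E}_{p}[f(X)] - \mathbb{E}_{q}[f(X)] \ge D_{\mathrm{kl}}(p\|q) + D_{\mathrm{kl}}(q\|p) - \varepsilon. \qquad \square$$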
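Identity (10) can be checked numerically in a case where everything is available in closed form. The sketch below does so for two one-dimensional Gaussians, using the standard formula for the KL divergence between Gaussians; the particular means and standard deviations are arbitrary illustrative values.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def mean_log_ratio(m, s, m1, s1, m2, s2):
    """E_{X ~ N(m, s^2)}[log p(X) - log q(X)] with p = N(m1, s1^2), q = N(m2, s2^2)."""
    e_sq_p = s ** 2 + (m - m1) ** 2   # E[(X - m1)^2]
    e_sq_q = s ** 2 + (m - m2) ** 2   # E[(X - m2)^2]
    return math.log(s2 / s1) - e_sq_p / (2 * s1 ** 2) + e_sq_q / (2 * s2 ** 2)

m1, s1, m2, s2 = 0.0, 1.0, 1.5, 2.0

# Left-hand side of (10): E_p[log p - log q] - E_q[log p - log q].
lhs = (mean_log_ratio(m1, s1, m1, s1, m2, s2)
       - mean_log_ratio(m2, s2, m1, s1, m2, s2))

# Right-hand side of (10): the symmetrized KL divergence.
rhs = kl_gauss(m1, s1, m2, s2) + kl_gauss(m2, s2, m1, s1)
```

Here the discriminator $f = \log p - \log q$ is a quadratic function of $x$, so in this toy case a quadratic discriminator attains the symmetrized KL divergence exactly.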

Remark (Comparison to MLE and variational inference). We first remark that the assumption on the approximability of the log density doesn’t imply that the distributions in $\mathcal{G}$ can be trained by MLE. Applying MLE requires a differentiable formula to compute $\log p_\theta(x)$. However, the assumption in Lemma 4.1 only requires that $\log p_\theta$ be approximated by some parametric function $f_\phi$, where $\phi$ doesn’t have to be computable from $\theta$. Exponential families are a clean example: it’s intractable to evaluate the log density, but the log density is trivially approximable by $\langle \theta, T(x)\rangle + c$, where $c$ is a constant that corresponds to the log partition function.

The assumption, however, is closely related to variational inference. Variational inference or variational auto-encoders [17] assume that the posterior distribution $p(z|x)$ of the latent variable can be approximated by some parametric function (e.g., neural networks). This implies our assumption: $\log p(x)$ can be approximated because 1) $\log p(x) = \log p(z) + \log p(x|z) - \log p(z|x)$, 2) $p(z)$ and $p(x|z)$ are computable by definition, and 3) $\log p(z|x)$ can be approximated.

This connection indicates that GANs may be as powerful as variational auto-encoders, and are likely to be stronger because the discriminators may have more distinguishing power beyond approximating the log density. Our theory doesn’t justify this possibility yet and it is left for future work. Finally, another advantage of GANs over VAEs or MLE is that they can be applied to generated distributions with low-dimensional supports, as will be shown in Section 4.3.

### 4.2 Invertible neural network generators

In this section, we consider generators that are parameterized by invertible neural networks. (Our techniques also apply to other parameterized invertible generators, but for simplicity we only focus on neural networks.) Concretely, let $\mathcal{G}$ be a family of neural networks $\mathcal{G} = \{G_\theta : \theta \in \Theta\}$. Let $p_\theta$ be the distribution of

$$X = G_\theta(Z), \quad Z \sim N(0, \mathrm{diag}(\gamma^2)), \tag{11}$$

where $G_\theta$ is a neural network with parameters $\theta$ and $\gamma \in \mathbb{R}^d$ is the vector of standard deviations of the hidden factors. By allowing the variances to be non-spherical, we allow each hidden dimension to have a different impact on the output distribution. In particular, the case $\gamma = [\mathbf{1}_k, \delta \mathbf{1}_{d-k}]$ for some $\delta < 1$ has the ability to model data around a “$k$-dimensional manifold” with some noise on the level of $\delta$.

We are interested in the set of invertible neural networks $G_\theta$. We let our family $\mathcal{G}$ consist of standard $\ell$-layer feedforward nets $x = G_\theta(z)$ of the form

$$x = W_\ell\, \sigma\big(W_{\ell-1}\, \sigma(\cdots \sigma(W_1 z + b_1) \cdots) + b_{\ell-1}\big) + b_\ell,$$

where $W_i \in \mathbb{R}^{d\times d}$ are invertible, $b_i \in \mathbb{R}^d$, and $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function, on which we make the following assumption.

Assumption 4.2 (Invertible generators). Let $R_W, R_b, \kappa_\sigma, \beta_\sigma > 0$ and $\delta \in (0, 1]$ be parameters which are considered as constants (that may depend on the dimension). We consider neural networks $G_\theta$ that are parameterized by parameters $\theta = (W_i, b_i)_{i\in[\ell]}$ belonging to the set

$$\Theta = \left\{ (W_i, b_i)_{i\in[\ell]} : \max\left\{\|W_i\|_{\mathrm{op}}, \|W_i^{-1}\|_{\mathrm{op}}\right\} \le R_W,\ \|b_i\|_2 \le R_b,\ \forall i \in [\ell] \right\}.$$

The activation function $\sigma$ is twice-differentiable with $\sigma(0) = 0$, $\sigma'(t) \in [\kappa_\sigma^{-1}, 1]$, and $|(\sigma^{-1})''(t)| \le \beta_\sigma$. The standard deviations of the hidden factors satisfy $\gamma_i \in [\delta, 1]$.

Clearly, such a neural net is invertible, and its inverse is also a feedforward neural net with activation $\sigma^{-1}$. We note that a smoothed version of Leaky ReLU [37] satisfies all the conditions on the activation function. Further, some assumptions on the neural networks are necessary, because arbitrary neural networks are likely to be able to implement pseudo-random functions, which cannot be distinguished from random functions by any polynomial time algorithm.

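Lemma 4.2. For any $\theta \in \Theta$, the function $\log p_\theta$ can be computed by a neural network with at most $\ell + 1$ layers, $O(\ell d^2)$ parameters, and activation functions among $\{\sigma^{-1}, \log (\sigma^{-1})', (\cdot)^2\}$, of the form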

 fϕ(x)=12\

where , for , and the parameter satisfies . As a direct consequence, the following family of neural networks with the activation functions above and at most $\ell + 1$ layers contains all the functions $\{\log p - \log q : p, q \in \mathcal{G}\}$:

$$\mathcal{F} = \left\{ f_{\phi_1} - f_{\phi_2} : \phi_1, \phi_2 \in \Phi \right\}. \tag{13}$$

We note that the exact form of the parameterized family is likely not very important in practice, since other families of neural nets possibly also contain good approximations of $\log p - \log q$ (which can be seen partly from the experiments in Section 5).

The proof builds on the change-of-variable formula $\log p_\theta(x) = \log p_Z(G_\theta^{-1}(x)) + \log \left|\det \frac{\partial G_\theta^{-1}(x)}{\partial x}\right|$ (where $p_Z$ is the density of $Z$) and the observation that $G_\theta^{-1}$ is a feedforward neural net with $\ell$ layers. Note that the log-det of the Jacobian involves computing the determinant of the (inverse) weight matrices. A priori such a computation is non-trivial for a given $G_\theta$. However, it is just a constant that does not depend on the input, and therefore it can be represented by adding a bias on the final output layer. This frees us from further structural assumptions on the weight matrices (in contrast to the architectures in Flow-GAN [13]). We defer the proof of Lemma 4.2 to Section D.2.
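The change-of-variable argument can be illustrated in one dimension. The sketch below computes the log density of a toy one-hidden-layer generator $x = w_2\,\sigma(w_1 z + b_1) + b_2$ with $z \sim N(0, \gamma^2)$; the scalar weights, the plain (non-smoothed) Leaky ReLU with slope 0.5, and all function names are assumptions chosen for illustration rather than the construction of Lemma 4.2. Note how the weight terms enter only through an input-independent constant.

```python
import math

SLOPE = 0.5  # negative slope of the Leaky ReLU used in this toy example

def leaky(h):
    return h if h >= 0 else SLOPE * h

def leaky_inv(y):
    return y if y >= 0 else y / SLOPE

def leaky_grad(h):
    return 1.0 if h >= 0 else SLOPE

def generator(z, w1, b1, w2, b2):
    """Toy invertible generator in one dimension."""
    return w2 * leaky(w1 * z + b1) + b2

def log_density(x, w1, b1, w2, b2, gamma):
    """log p(x) via change of variables: invert the net, then add the log-Jacobian."""
    h = leaky_inv((x - b2) / w2)          # pre-activation recovered from x
    z = (h - b1) / w1                     # latent recovered from x
    log_latent = -0.5 * (z / gamma) ** 2 - math.log(gamma * math.sqrt(2 * math.pi))
    log_det_const = math.log(abs(w1)) + math.log(abs(w2))  # does not depend on x
    return log_latent - math.log(leaky_grad(h)) - log_det_const

params = dict(w1=1.5, b1=0.3, w2=-0.8, b2=1.0)
gamma = 1.0

# Riemann-sum check that the density integrates to approximately one.
step = 0.001
grid = [-20.0 + step * k for k in range(40001)]
mass = sum(math.exp(log_density(x, gamma=gamma, **params)) for x in grid) * step
```

The only input-dependent parts of `log_density` are the squared latent and the log-derivative of the activation, which mirrors why a bias term suffices to absorb the weight determinants.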

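Theorem 4.2. Suppose $\mathcal{G}$ is the set of invertible-generator distributions as defined in (11) satisfying Assumption 4.2. Then, the discriminator class $\mathcal{F}$ defined in Lemma 4.2 is conjoined with $\mathcal{G}$ in the sense that for any $p, q \in \mathcal{G}$,

$$W_1(p,q)^2 \lesssim D_{\mathrm{kl}}(p\|q) + D_{\mathrm{kl}}(q\|p) \le W_{\mathcal{F}}(p,q) \lesssim \frac{\sqrt{d}}{\delta^2}\left( W_1(p,q) + d\exp(-10d) \right).$$

When $n$ is sufficiently large, we have the generalization bound $R_n(\mathcal{F},\mathcal{G}) \le \varepsilon_{\mathrm{gen}}$.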

We outline a proof sketch below and defer the full proof to Appendix D.3. We choose the discriminator class as in Lemma 4.2. As it implements $\log p - \log q$ for any $p, q \in \mathcal{G}$, by Lemma 4.1, $W_{\mathcal{F}}(p,q)$ is lower bounded by $D_{\mathrm{kl}}(p\|q) + D_{\mathrm{kl}}(q\|p)$. It thus suffices to (1) lower bound this quantity by the Wasserstein distance and (2) upper bound $W_{\mathcal{F}}(p,q)$ by the Wasserstein distance.

To establish (1), we will prove in Lemma D.3 that for any $p, q \in \mathcal{G}$,

$$W_1(p,q)^2 \le W_2(p,q)^2 \lesssim D_{\mathrm{kl}}(p\|q) + D_{\mathrm{kl}}(q\|p).$$

Such a result is a simple implication of the transportation inequalities by Bobkov-Götze and Gozlan (Theorem D.1), which state that if, for $X \sim p$ (or $q$), $f(X)$ is sub-Gaussian whenever $f$ is 1-Lipschitz, then the inequality above holds. In our invertible generator case, we have $X = G_\theta(Z)$ where the coordinates of $Z$ are independent Gaussians, so as long as $G_\theta$ is suitably Lipschitz, $f(X) = f(G_\theta(Z))$ is a sub-Gaussian random variable by the standard Gaussian concentration result [35].

The upper bound (2) would have been immediate if the functions in $\mathcal{F}$ were globally Lipschitz on the whole space. While this is not strictly true, we give two workarounds: either a truncation argument that yields a bound with some tail probability, or a bound that only requires the Lipschitz constant to grow at most linearly in $\|x\|$. This is done in Theorem D.1 as a straightforward extension of the result in [25].

Combining the conjoinedness and the generalization bound, we immediately obtain that if the training succeeds with small expected IPM (over the randomness of the learned distributions), then the estimated distribution is close to the true distribution in Wasserstein distance.

Corollary. In the setting of Theorem 4.2, with high probability over the choice of training data $\hat{p}^n$, we have that if the training process returns a distribution $q \in \mathcal{G}$ such that $\mathbb{E}_{\hat{q}^n}[W_{\mathcal{F}}(\hat{p}^n, \hat{q}^n)] \le \varepsilon_{\mathrm{train}}$, then with $\varepsilon_{\mathrm{gen}}$ denoting the generalization bound of Theorem 4.2, we have

$$W_1(p,q) \le W_2(p,q) \lesssim (\varepsilon_{\mathrm{train}} + \varepsilon_{\mathrm{gen}})^{1/2}. \tag{14}$$

We note that the training error is measured by $\mathbb{E}_{\hat{q}^n}[W_{\mathcal{F}}(\hat{p}^n, \hat{q}^n)]$, the expected IPM over the randomness of the learned distributions, which is a measurable value because one can draw fresh samples from $q$ to estimate the expectation. It is an important open question to design efficient algorithms that achieve a small training error according to this definition, and this is left for future work.
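The chain of inequalities behind (14) can be sketched as follows. This is an informal outline under the assumptions of Theorem 4.2, writing $\varepsilon_{\mathrm{train}}$ for the bound on the expected empirical IPM and $\varepsilon_{\mathrm{gen}}$ for the bound on the generalization gap. The generalization theorem of Section 2 controls the deviation term in expectation; turning this into the high-probability statement of the corollary needs an additional concentration step that this sketch does not reproduce.

```latex
\begin{align*}
W_1(p,q)^2 \le W_2(p,q)^2
  &\lesssim D_{\mathrm{kl}}(p\|q) + D_{\mathrm{kl}}(q\|p)
     && \text{(transportation inequality)} \\
  &\le W_{\mathcal{F}}(p,q)
     && \text{(Lemma 4.1)} \\
  &\le \mathbb{E}_{\hat{q}^n}\!\left[W_{\mathcal{F}}(\hat{p}^n,\hat{q}^n)\right]
     + \left| W_{\mathcal{F}}(p,q)
     - \mathbb{E}_{\hat{q}^n}\!\left[W_{\mathcal{F}}(\hat{p}^n,\hat{q}^n)\right] \right| \\
  &\lesssim \varepsilon_{\mathrm{train}} + \varepsilon_{\mathrm{gen}} .
\end{align*}
```

Taking square roots on both ends gives the rate $(\varepsilon_{\mathrm{train}} + \varepsilon_{\mathrm{gen}})^{1/2}$ stated in (14).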

### 4.3 Degenerate neural network generators

In this section we consider the degenerate neural network generator which results in distributions residing on a low dimensional manifold. This is a more realistic setting than Section 4.2 for modeling real images, but technically more challenging because the KL divergence becomes infinity, rendering Lemma 4.1 useless. Nevertheless, we design a novel divergence between two distributions that is sandwiched by Wasserstein distance and can be optimized as IPM.

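Concretely, we consider a family of neural net generators $\{G_\theta : \mathbb{R}^k \to \mathbb{R}^d\}$ where $k < d$ and $G_\theta$ is an injective function. (In other words, $G_\theta(x) \ne G_\theta(y)$ if $x \ne y$.) Therefore, $G_\theta$ is invertible only on the image of $G_\theta$, which is a $k$-dimensional manifold in $\mathbb{R}^d$. Let $\mathcal{G}$ be the corresponding family of distributions generated by these neural nets.

Our key idea is to design a variant of the IPM which provably approximates the Wasserstein distance. Let $p_\beta$ denote the convolution of the distribution $p$ with an isotropic Gaussian distribution at noise level $\beta$. We define a smoothed $\mathcal{F}$-IPM between $p, q$ as

$$\tilde{d}_{\mathcal{F}}(p,q) \triangleq \inf_{\beta \ge 0} \left( W_{\mathcal{F}}(p_\beta, q_\beta) + \beta \log(1/\beta) \right)^{1/2}. \tag{15}$$

Clearly $\tilde{d}_{\mathcal{F}}$ can be optimized in the same way as $W_{\mathcal{F}}$, with an additional variable $\beta$ introduced in the optimization. We show that for a certain discriminator class $\mathcal{F}$ (see Section E for the details of the construction), $\tilde{d}_{\mathcal{F}}$ approximates the Wasserstein distance.

Theorem 4.3 (Informal version of Theorem E.1). Let $\mathcal{G}$ be defined as above. There exists a discriminator class $\mathcal{F}$ such that for any pair of distributions $p, q \in \mathcal{G}$, we have

$$W_1(p,q) \lesssim \tilde{d}_{\mathcal{F}}(p,q) \lesssim \mathrm{poly}(d) \cdot W_1(p,q)^{1/6} + \exp(-\Omega(d)). \tag{16}$$

Furthermore, when $n$ is sufficiently large, we have the generalization bound

$$R_n(\mathcal{F},\mathcal{G}) \lesssim \mathrm{poly}(d) \sqrt{\frac{\log n}{n}}.$$

Here $\mathrm{poly}(d)$ hides polynomial dependencies on $d$ and several other parameters that will be formally defined in the formal version (Theorem E.1). The direct implication of the theorem is that if $\tilde{d}_{\mathcal{F}}(p,q)$ is small for $p, q \in \mathcal{G}$, then $W_1(p,q)$ is guaranteed to be small as well, and thus we don’t have mode collapse.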

## 5 Simulations

We perform synthetic WGAN experiments with an invertible neural net generator (cf. Section 4.2) and the conjoined discriminator design (Lemma 4.2). Our main goal is to demonstrate that the empirical IPM $W_{\mathcal{F}}(p,q)$ is well correlated with the KL-divergence between $p$ and $q$ on synthetic data, for various pairs of $p$ and $q$. (The true distribution $p$ is generated randomly from a ground-truth neural net, and the distribution $q$ is either learned using various algorithms or is a perturbed version of $p$.)

### 5.1 Setup

##### Data

The data is generated from a ground-truth invertible neural net generator (cf. Section 4.2), i.e. $X = G_\theta(Z)$, where $G_\theta$ is an $\ell$-layer layer-wise invertible feedforward net, and $Z$ is a spherical Gaussian. We use the Leaky ReLU with negative slope 0.5 as the activation function $\sigma$, whose derivative and inverse can be computed very efficiently. The weight matrices of the layers are set to be well-conditioned, with singular values in a bounded range.

We choose the discriminator architecture according to the conjoined discriminator design (Lemma 4.2, equations (12) and (13)). As $\log (\sigma^{-1})'$ is a piecewise constant function that is not differentiable, we instead model it as a trainable one-hidden-layer neural network that maps reals to reals. We add constraints on all the parameters in accordance with Assumption 4.2.

##### Training

To train the generator and discriminator networks, we generate stochastic batches (with batch size 64) from both the ground-truth generator and the trained generator, and solve the min-max problem in the Wasserstein GAN formulation. We perform 10 updates of the discriminator in between each generator step, with various regularization methods for discriminator training (specified later). We use the RMSProp optimizer [32] as our update rule.
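The alternating schedule described above (batches of 64, ten discriminator updates per generator step, discriminator weights kept in a norm ball) can be sketched on a one-dimensional toy problem. Everything in this sketch is a simplifying assumption made for illustration: the generator is a pure shift $x = z + \theta$, the critic is linear $f(x) = wx$ with $|w| \le 1$, and plain gradient steps are used in place of RMSProp.

```python
import random

def train_toy_wgan(mu_true=3.0, gen_steps=500, critic_steps=10,
                   batch=64, lr_d=0.1, lr_g=0.05, clamp=1.0, seed=0):
    """Toy WGAN: generator x = z + theta, linear critic f(x) = w * x, |w| <= clamp."""
    rng = random.Random(seed)
    theta, w = 0.0, 0.0
    for _ in range(gen_steps):
        for _ in range(critic_steps):
            real = [rng.gauss(mu_true, 1.0) for _ in range(batch)]
            fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
            # Critic ascent on E_real[f] - E_fake[f]; its gradient in w is the mean gap.
            w += lr_d * (sum(real) / batch - sum(fake) / batch)
            w = max(-clamp, min(clamp, w))  # weight clamping
        # Generator descent on -E_fake[f] = -w * (E[z] + theta); gradient in theta is -w.
        theta += lr_g * w
    return theta, w

theta_hat, w_hat = train_toy_wgan()
```

In this toy problem the learned shift `theta_hat` should end up near the true mean, typically with a small residual oscillation around it that reflects the min-max dynamics.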

##### Evaluation metric

We evaluate the following metrics between the true and learned generator.

1. The KL divergence. As the density of our invertible neural net generator can be analytically computed, we can compute their KL divergence from empirical averages of the difference of the log densities:

 \what\dkl(p⋆,p)=\EX∼\whatp⋆n[logp⋆(X)−logp(X)],

where and are the densities of the true generator and the learned generator. We regard the KL divergence as the “correct” and rather strong criterion for distributional closedness.

2. The training loss (IPM train). This is the (unregularized) GAN loss during training. Note: as is typical in GAN training, we carefully balance the number of discriminator and generator steps, so the training IPM is potentially very far from the true $W_{\mathcal{F}}$ (which requires sufficient training of the discriminator).

3. The neural net IPM ($W_{\mathcal{F}}$ eval). From time to time we report a separately optimized WGAN loss in which the learned generator is held fixed and the discriminator is trained from scratch to optimality. Unlike the training loss, here the discriminator is trained in norm balls but with no other regularization. By doing this, we search for the $f \in \mathcal{F}$ that maximizes the contrast; we regard the $f$ found by stochastic optimization as an approximate maximizer, and the loss obtained as an approximation of $W_{\mathcal{F}}$.
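The empirical KL estimate in item 1 above is just a sample average of log-density differences. The sketch below illustrates it under a simplifying assumption: one-dimensional Gaussians stand in for the invertible-net densities, so that the Monte Carlo estimate can be compared with a closed-form value. The function names are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mu, sigma):
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))


def kl_estimate(samples, log_p_star, log_p):
    """Empirical average of log p*(X) - log p(X) over samples X ~ p*."""
    return sum(log_p_star(x) - log_p(x) for x in samples) / len(samples)


def kl_gaussians(mu1, s1, mu2, s2):
    """Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2)), used only as a reference."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # X ~ p* = N(0, 1)
estimate = kl_estimate(samples,
                       lambda x: log_normal_pdf(x, 0.0, 1.0),   # log p*
                       lambda x: log_normal_pdf(x, 1.0, 1.5))   # log p
exact = kl_gaussians(0.0, 1.0, 1.0, 1.5)
```

The same `kl_estimate` applies unchanged to any pair of models with tractable log densities, which is the property of invertible generators that the experiments rely on.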

Our theory shows that for this conjoined choice of $\mathcal{G}$ and $\mathcal{F}$, WGAN is able to learn the true generator in KL divergence, and the $\mathcal{F}$-IPM (in evaluation instead of training) should be indicative of the KL divergence. We test this hypothesis in the following experiments.

### 5.2 Convergence of generators in KL divergence

In our first experiment, $G$ is a two-layer net in $d$ dimensions. Though the generator is only a shallow neural net, the presence of the nonlinearity makes the estimation problem non-trivial. We train a discriminator with the conjoined architecture, using either Vanilla WGAN (clamping the weights into norm balls) or WGAN-GP [14] (adding a gradient penalty). We fix the same ground-truth generator and run each method from 6 different random initializations. Results are plotted in Figure 1.

Our main findings are two-fold:

1. WGAN training with the conjoined discriminator design is able to learn the true distribution in KL divergence. Indeed, the KL divergence starts at around 10 to 30, and the best run reaches a KL below 1. As KL is a rather strong metric between distributions, this is strong evidence that GANs are finding the true distribution and mode collapse is not happening.

2. The $W_{\mathcal{F}}$ (eval) and the KL divergence are highly correlated with each other, both along each training run and across different runs. In particular, adding the gradient penalty improves the optimization significantly (which we see in the KL curve), and this improvement is also reflected by the $W_{\mathcal{F}}$ (eval) curve. Therefore the quantity $W_{\mathcal{F}}$ (eval) can serve as a good metric for monitoring convergence and is at least much better than the training loss curve.

To test the necessity of the specific form of the conjoined discriminator we designed, we redo the same experiment with vanilla fully-connected discriminator nets. Results (in Appendix F) show that the IPM with vanilla discriminators also correlates well with the KL divergence. This is not surprising from a theoretical point of view, because a standard fully-connected discriminator net (with some over-parameterization) is likely to be able to approximate the log density of the generator distributions (which is essentially the only requirement of Lemma 4.1).

For this synthetic case, we can see that the inferior performance in KL of the WGAN-Vanilla algorithm does not come from the statistical properties of GANs, but rather from the inferior training performance in terms of the convergence of the IPM. We conjecture that a similar phenomenon occurs in training GANs with real-life data as well.

### 5.3 Perturbed generators

In this section, we remove the effect of the optimization and directly test the correlation on a generator and its perturbations. We compare the KL divergence and neural net IPM on pairs of perturbed generators. In each instance, we generate a pair of generators $(G, G')$ (with the same architecture as above), where $G'$ is a perturbation of $G$ obtained by adding small Gaussian noise. We compute the KL divergence and the neural net IPM between the distributions of $G$ and $G'$. To denoise the unstable training process for computing the neural net IPM, we optimize the discriminator from 5 random initializations and pick the largest value as the output.
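The restart-and-take-the-maximum step can be sketched as below. Since an IPM is a supremum over the discriminator class, every optimization run yields a lower bound, so the largest value over restarts is the best available estimate. As a loudly labeled simplification, the sketch replaces the neural net discriminator with the linear class $\{x \mapsto v^\top x : \|v\|_2 \le 1\}$ in two dimensions, trained by projected gradient ascent; all names and hyperparameters are hypothetical.

```python
import math
import random


def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def project_to_unit_ball(v):
    norm = math.hypot(v[0], v[1])
    return v if norm <= 1.0 else [v[0] / norm, v[1] / norm]


def ipm_one_run(diff, rng, steps=200, lr=0.5):
    """Projected gradient ascent on v -> v . diff from a random start."""
    v = project_to_unit_ball([rng.gauss(0, 1), rng.gauss(0, 1)])
    for _ in range(steps):
        # The objective is linear in v, so its gradient is just `diff`.
        v = project_to_unit_ball([v[0] + lr * diff[0], v[1] + lr * diff[1]])
    return abs(v[0] * diff[0] + v[1] * diff[1])


def ipm_with_restarts(xs, ys, restarts=5, seed=0):
    """Every run lower-bounds the supremum, so keep the largest value."""
    rng = random.Random(seed)
    mx, my = mean(xs), mean(ys)
    diff = [mx[0] - my[0], mx[1] - my[1]]
    return max(ipm_one_run(diff, rng) for _ in range(restarts))


data_rng = random.Random(1)
xs = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(2000)]
ys = [[data_rng.gauss(1, 1), data_rng.gauss(-1, 1)] for _ in range(2000)]
value = ipm_with_restarts(xs, ys)
```

For this linear class the supremum equals the Euclidean norm of the difference of sample means, so all restarts agree; with a non-concave neural net objective the runs would differ, and the maximum is what removes the downward noise.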

As shown in Figure 2, there is a clear positive correlation between the (symmetric) KL divergence and the neural net IPM. In particular, the majority of the points fall around a straight line, which is consistent with our theory that the neural net distance scales linearly in the KL divergence. Note that there are a few outlying points with unusually large KL. This happens mostly because the perturbation is accidentally too large, so that the weight matrices become poorly conditioned; in view of our theory, they fall out of the good constraint set as defined in Assumption 4.2.

## 6 Conclusion

We present the first polynomial-in-dimension sample complexity bounds for learning various distributions (such as Gaussians, exponential families, invertible neural networks generators) using GANs with convergence guarantees in Wasserstein distance (for distributions with low-dimensional supports) or KL divergence. The analysis technique proceeds via designing conjoined discriminators – a class of discriminators tailored to the generator class in consideration which have good generalization and mode collapse avoidance properties.

We hope our techniques can be extended in the future to other families of distributions with tighter sample complexity bounds. This would entail designing better conjoined discriminators, and generally exploring and generalizing approximation theory results in the context of GANs. We hope such explorations will prove as rich and satisfying as they have been in the vanilla functional approximation settings.

### Acknowledgments

The authors would like to thank Leon Bottou and John Duchi for many insightful discussions.

## References

• Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
• Arora et al. [2017a] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (gans). In International Conference on Machine Learning, pages 224–232, 2017a.
• Arora et al. [2017b] S. Arora, A. Risteski, and Y. Zhang. Do gans learn the distribution? some theory and empirics. ICLR, 2017b.
• Bojanowski et al. [2017] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam. Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776, 2017.
• Borji [2018] A. Borji. Pros and cons of gan evaluation measures. arXiv preprint arXiv:1802.03446, 2018.
• Demmel et al. [2007] J. Demmel, I. Dumitriu, and O. Holtz. Fast linear algebra is stable. Numerische Mathematik, 108(1):59–91, 2007.
• Di and Yu [2017] X. Di and P. Yu. Max-boost-gan: Max operation to boost generative ability of generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1156–1164, 2017.
• Dumoulin et al. [2016] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
• Durugkar et al. [2016] I. Durugkar, I. Gemp, and S. Mahadevan. Generative Multi-Adversarial Networks. ArXiv e-prints, Nov. 2016.
• Feizi et al. [2017] S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding gans: the lqg setting. arXiv preprint arXiv:1710.10793, 2017.
• Fournier and Guillin [2015] N. Fournier and A. Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.
• Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
• Grover et al. [2018] A. Grover, M. Dhar, and S. Ermon. Flow-gan: Combining maximum likelihood and adversarial learning in generative models. In AAAI Conference on Artificial Intelligence, 2018.
• Gulrajani et al. [2017] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
• Huang et al. [2017] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In Computer Vision and Pattern Recognition, 2017.
• Jiwoong Im et al. [2016] D. Jiwoong Im, H. Ma, C. Dongjoo Kim, and G. Taylor. Generative Adversarial Parallelization. ArXiv e-prints, Dec. 2016.
• Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
• Ledoux and Talagrand [2013] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
• Liang [2017] T. Liang. How well can generative adversarial networks (gan) learn densities: A nonparametric view. arXiv preprint arXiv:1712.08244, 2017.
• Lopez-Paz and Oquab [2016] D. Lopez-Paz and M. Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
• Masarotto et al. [2018] V. Masarotto, V. M. Panaretos, and Y. Zemel. Procrustes metrics on covariance operators and optimal transportation of gaussian processes. arXiv preprint arXiv:1801.01990, 2018.
• Müller [1997] A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
• Nguyen et al. [2017] H. Nguyen, T. Tao, and V. Vu. Random matrices: tail bounds for gaps between eigenvalues. Probability Theory and Related Fields, 167(3-4):777–816, 2017.
• Odena et al. [2016] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
• Polyanskiy and Wu [2016] Y. Polyanskiy and Y. Wu. Wasserstein continuity of entropy and outer bounds for interference channels. IEEE Transactions on Information Theory, 62(7):3992–4002, 2016.
• Radford et al. [2016] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2016.
• Rumelhart et al. [1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
• Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, 2016.
• Santurkar et al. [2017] S. Santurkar, L. Schmidt, and A. Madry. A classification-based perspective on gan distributions. arXiv preprint arXiv:1711.00970, 2017.
• Schmitt [1992] B. A. Schmitt. Perturbation bounds for matrix square roots and pythagorean sums. Linear algebra and its applications, 174:215–227, 1992.
• Srivastava et al. [2017] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, pages 3310–3320, 2017.
• Tieleman and Hinton [2012] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
• Tolstikhin et al. [2017] I. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf. Adagan: Boosting generative models. arXiv preprint arXiv:1701.02386, 2017.
• van Handel [2014] R. van Handel. Probability in high dimension. Technical report, PRINCETON UNIV NJ, 2014.
• Vershynin [2010] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
• Wainwright [2018] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. To appear, 2018.
• Xu et al. [2015] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

## Appendix A Proofs for Section 2

### A.1 Proof of Theorem 6

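Fixing $\hat{p}^n$, consider a random sample $\hat{q}^n$. It is easy to verify that the $\mathcal{F}$-distance satisfies the triangle inequality, so we have

$$
\begin{aligned}
\left| W_{\mathcal{F}}(p,q) - \mathbb{E}_{\hat{q}^n}\left[W_{\mathcal{F}}(\hat{p}^n,\hat{q}^n)\right] \right|
&\le \mathbb{E}_{\hat{q}^n}\left[ \left| W_{\mathcal{F}}(p,q) - W_{\mathcal{F}}(\hat{p}^n,\hat{q}^n) \right| \right] \\
&\le \mathbb{E}_{\hat{q}^n}\left[ W_{\mathcal{F}}(p,\hat{p}^n) + W_{\mathcal{F}}(q,\hat{q}^n) \right] \\
&= W_{\mathcal{F}}(p,\hat{p}^n) + \mathbb{E}_{\hat{q}^n}\left[W_{\mathcal{F}}(q,\hat{q}^n)\right].
\end{aligned}
$$

Taking expectation over $\hat{p}^n$ on the above bound yields

$$
\mathbb{E}_{\hat{p}^n}\left[ \left| W_{\mathcal{F}}(p,q) - \mathbb{E}_{\hat{q}^n}\left[W_{\mathcal{F}}(\hat{p}^n,\hat{q}^n)\right] \right| \right] \le \mathbb{E}_{\hat{p}^n}\left[W_{\mathcal{F}}(p,\hat{p}^n)\right] + \mathbb{E}_{\hat{q}^n}\left[W_{\mathcal{F}}(q,\hat{q}^n)\right].
$$

So it suffices to bound $\mathbb{E}_{\hat{p}^n}[W_{\mathcal{F}}(p,\hat{p}^n)]$ by $2R_n(\mathcal{F},\mathcal{G})$; the same bound then holds for $\mathbb{E}_{\hat{q}^n}[W_{\mathcal{F}}(q,\hat{q}^n)]$. Let $X_1,\dots,X_n$ be the samples in $\hat{p}^n$. By symmetrization, we have

$$
\mathbb{E}_{\hat{p}^n}\left[W_{\mathcal{F}}(p,\hat{p}^n)\right]
= \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}_p[f(X)] \right| \right]
\le 2\,\mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \epsilon_i f(X_i) \right| \right]
= 2\,\mathbb{E}\left[R_n(\mathcal{F},p)\right] \le 2R_n(\mathcal{F},\mathcal{G}).
$$

Adding up this bound and the same bound for $q$ gives the desired result.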

## Appendix B Proofs for Section 3

### B.1 Proof of Theorem 3.1

Recall that our discriminator family is

$$\mathcal{F} = \left\{ x \mapsto \sigma(v^\top x + b) : \|v\|_2 \le 1,\ |b| \le D \right\}.$$

##### Conjoinedness

The upper bound follows directly from the fact that functions in $\mathcal{F}$ are 1-Lipschitz.

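We now establish the lower bound. First, we recover the mean distance, for which we use the following simple fact: a linear discriminator is the difference of two ReLU discriminators, or mathematically $t = \sigma(t) - \sigma(-t)$. Taking $v = (\mu_1 - \mu_2)/\|\mu_1 - \mu_2\|_2$, we have

$$
\begin{aligned}
\|\mu_1 - \mu_2\|_2 &= v^\top\mu_1 - v^\top\mu_2 = \mathbb{E}_{p_1}[v^\top X] - \mathbb{E}_{p_2}[v^\top X] \\
&= \left( \mathbb{E}_{p_1}[\sigma(v^\top X)] - \mathbb{E}_{p_2}[\sigma(v^\top X)] \right) + \left( -\mathbb{E}_{p_1}[\sigma(-v^\top X)] + \mathbb{E}_{p_2}[\sigma(-v^\top X)] \right) \\
&\le \left| \mathbb{E}_{p_1}[\sigma(v^\top X)] - \mathbb{E}_{p_2}[\sigma(v^\top X)] \right| + \left| \mathbb{E}_{p_1}[\sigma(-v^\top X)] - \mathbb{E}_{p_2}[\sigma(-v^\top X)] \right|.
\end{aligned}
$$

Therefore at least one of the above two terms is at least $\|\mu_1 - \mu_2\|_2/2$, which shows that $W_{\mathcal{F}}(p_1,p_2) \ge \|\mu_1 - \mu_2\|_2/2$.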

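For the covariance distance, we need to actually compute $\mathbb{E}_p[\sigma(v^\top X + b)]$ for $p = N(\mu,\Sigma)$. Note that $X \stackrel{d}{=} \Sigma^{1/2} Z + \mu$, where $Z \sim N(0, I_d)$. Further, we have $v^\top \Sigma^{1/2} Z \stackrel{d}{=} \|\Sigma^{1/2} v\|_2 W$ for $W \sim N(0,1)$, therefore

$$
\begin{aligned}
\mathbb{E}_p[\sigma(v^\top X + b)] &= \mathbb{E}\left[\sigma\left(\|\Sigma^{1/2}v\|_2 W + v^\top\mu + b\right)\right] \\
&= \|\Sigma^{1/2}v\|_2\, \mathbb{E}\left[\sigma\left(W + \frac{v^\top\mu + b}{\|\Sigma^{1/2}v\|_2}\right)\right]
= \|\Sigma^{1/2}v\|_2\, R\left(\frac{v^\top\mu + b}{\|\Sigma^{1/2}v\|_2}\right).
\end{aligned}
$$

(Here we define $R(a) = \mathbb{E}[\sigma(W + a)]$ for $W \sim N(0,1)$.) Therefore, the neuron distance between the two Gaussians is

$$
W_{\mathcal{F}}(p_1,p_2) = \sup_{\|v\|_2 \le 1,\ |b| \le D} \left| \|\Sigma_1^{1/2}v\|_2\, R\left(\frac{v^\top\mu_1 + b}{\|\Sigma_1^{1/2}v\|_2}\right) - \|\Sigma_2^{1/2}v\|_2\, R\left(\frac{v^\top\mu_2 + b}{\|\Sigma_2^{1/2}v\|_2}\right) \right|.
$$

As $a \mapsto \sigma(w + a)$ is non-decreasing for all $w$ (and strictly increasing whenever $w + a > 0$), the function $R$ is strictly increasing. It is also a basic fact that $R(0) = 1/\sqrt{2\pi}$.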
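The expected ReLU of a shifted Gaussian used in this step has a standard closed form, $\mathbb{E}[\max(W + a, 0)] = a\,\Phi(a) + \varphi(a)$ with $\Phi$ and $\varphi$ the standard normal CDF and density. The sketch below (function names are hypothetical, and the Monte Carlo check is only a sanity test) evaluates it, applies the scaling identity from the display above in one dimension, and confirms the value at zero.

```python
import math
import random


def std_normal_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)


def std_normal_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2)))


def R(a):
    """Closed form of E[max(W + a, 0)] for W ~ N(0, 1)."""
    return a * std_normal_cdf(a) + std_normal_pdf(a)


def relu_mean_gaussian(m, s, b):
    """E[max(Y + b, 0)] for Y ~ N(m, s^2), via the identity s * R((m + b) / s)."""
    return s * R((m + b) / s)


def monte_carlo(m, s, b, n=200_000, seed=0):
    """Plain sample average of max(Y + b, 0), used only as a sanity check."""
    rng = random.Random(seed)
    return sum(max(rng.gauss(m, s) + b, 0.0) for _ in range(n)) / n


# Here m plays the role of v^T mu and s the role of ||Sigma^{1/2} v||_2.
closed = relu_mean_gaussian(0.5, 2.0, -0.3)
simulated = monte_carlo(0.5, 2.0, -0.3)
r_at_zero = R(0.0)  # equals 1 / sqrt(2 * pi)
```

The derivative of this closed form is $\Phi(a) > 0$, which gives another way to see that the function is strictly increasing.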

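Consider any fixed $v$. By flipping the sign of $v$, we can let $v^\top\mu_1 \ge v^\top\mu_2$ without changing $\|\Sigma_i^{1/2}v\|_2$. Now, letting $b = -v^\top(\mu_1 + \mu_2)/2$ (note that this is a valid choice as long as $|b| \le D$), we have

$$
v^\top\mu_1 + b = \frac{v^\top(\mu_1 - \mu_2)}{2} \ge 0, \qquad v^\top\mu_2 + b = -\frac{v^\top(\mu_1 - \mu_2)}{2} \le 0.
$$

As $R$ is strictly increasing, for this choice of $(v, b)$ we have

$$
\|\Sigma_1^{1/2}v\|_2\, R\left(\frac{v^\top\mu_1 + b}{\|\Sigma_1^{1/2}v\|_2}\right) - \|\Sigma_2^{1/2}v\|_2\, R\left(\frac{v^\top\mu_2 + b}{\|\Sigma_2^{1/2}v\|_2}\right)
\ge R(0)\left( \|\Sigma_1^{1/2}v\|_2 - \|\Sigma_2^{1/2}v\|_2 \right) = \frac{1}{\sqrt{2\pi}}\left( \|\Sigma_1^{1/2}v\|_2 - \|\Sigma_2^{1/2}v\|_2 \right).
$$

Ranging over $\|v\|_2 \le 1$ we then have

$$
W_{\mathcal{F}}(p_1,p_2) \ge \frac{1}{\sqrt{2\pi}} \sup_{\|v\|_2 \le 1} \left| \|\Sigma_1^{1/2}v\|_2 - \|\Sigma_2^{1/2}v\|_2 \right|.
$$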

The quantity in the supremum can be further bounded as

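$$
\left| \|\Sigma_1^{1/2}v\|_2 - \|\Sigma_2^{1/2}v\|_2 \right| = \frac{|v^\top(\Sigma_1 - \Sigma_2)v|}{\|\Sigma_1^{1/2}v\|_2 + \|\Sigma_2^{1/2}v\|_2} \ge \frac{|v^\top(\Sigma_1 - \Sigma_2)v|}{\lambda_{\max}(\Sigma_1^{1/2}) + \lambda_{\max}(\Sigma_2^{1/2})}.
$$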
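The equality in the display above is the difference-of-squares identity applied to the two norms, and the inequality bounds each norm in the denominator. A short worked version follows, where the shorthand $s_1, s_2$ is introduced here only for illustration and $s_1 + s_2 > 0$ is assumed:

```latex
\begin{aligned}
s_i &:= \|\Sigma_i^{1/2} v\|_2, \qquad s_i^2 = v^\top \Sigma_i v, \quad i = 1, 2, \\
|s_1 - s_2| &= \frac{|s_1^2 - s_2^2|}{s_1 + s_2} = \frac{|v^\top (\Sigma_1 - \Sigma_2) v|}{s_1 + s_2}, \\
s_i &\le \lambda_{\max}(\Sigma_i^{1/2}) \, \|v\|_2 \le \lambda_{\max}(\Sigma_i^{1/2}) \quad \text{for } \|v\|_2 \le 1.
\end{aligned}
```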

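Choosing $v$ to be a unit eigenvector of $\Sigma_1 - \Sigma_2$ with the largest eigenvalue in absolute value gives

$$
W_{\mathcal{F}}(p_1,p_2) \ge \frac{1}{\sqrt{2\pi}} \sup_{\|v\|_2 \le 1} \left| \|\Sigma_1^{1/2}v\|_2 - \|\Sigma_2^{1/2}v\|_2 \right| \ge \frac{\|\Sigma_1 - \Sigma_2\|_{\mathrm{op}}}{\sqrt{2\pi} \cdot 2\sigma_{\max}}.
$$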

Now, using the perturbation bound

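$$
\|\Sigma_1^{1/2} - \Sigma_2^{1/2}\|_{\mathrm{op}} \le \frac{1}{\lambda_{\min}(\Sigma_1) + \lambda_{\min}(\Sigma_2)} \cdot \|\Sigma_1 - \Sigma_2\|_{\mathrm{op}} \le \frac{1}{2\sigma_{\min}} \|\Sigma_1 - \Sigma_2\|_{\mathrm{op}},
$$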

(cf. [30, Lemma 2.2]), we get

 WF</