How Well Can Generative Adversarial Networks (GAN) Learn Densities: A Nonparametric View
We study in this paper the rate of convergence for learning densities under the Generative Adversarial Networks (GANs) framework, borrowing insights from nonparametric statistics. We introduce an improved GAN estimator that achieves a faster rate by leveraging the level of smoothness in the target density and the evaluation metric, which in theory remedies the mode collapse problem reported in the literature. A minimax lower bound is constructed to show that when the dimension is large, the exponent in the rate for the new GAN estimator is near optimal. One can view our results as answering in a quantitative way how well GANs learn a wide range of densities with different smoothness properties, under a hierarchy of evaluation metrics. As a byproduct, we also obtain improved bounds for GANs with deeper ReLU discriminator networks.
Generative Adversarial Networks (GANs) have stood out as an important unsupervised method for learning and efficiently sampling from a complicated, multi-modal target data distribution. Despite their celebrated empirical success in image tasks, many theoretical questions remain to be answered.
One convenient formulation of the GAN framework  solves the following minimax problem, at the population level,
In plain language, given a target real distribution , one seeks a distribution from a probability distribution generator class , such that it minimizes the loss incurred by the best discriminator function in the discriminator class . In practice, both the generator class and the discriminator class are represented by neural networks: quantifies the transformed distributions realized by a network with random inputs (either Gaussian or uniform distribution), and represents the functions that are realizable by a certain neural network architecture. We refer the reader to  for other, more general formulations of GAN.
In practice, one only has access to finite samples of the real data distribution . Denote to be a measure estimate based on i.i.d. samples from , where the empirical density is typically used. Given the samples, the GAN solves the following problem
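For concreteness, the two problems described in plain language above can be written out as follows. This is only a sketch in assumed notation, since the symbols are not fixed by the surrounding text: $\nu$ stands for the target distribution, $\tilde{\nu}_n$ for its estimate based on $n$ samples, $\mathcal{G}$ for the generator class, and $\mathcal{F}$ for the discriminator class.

```latex
% Population-level GAN problem (all symbols are assumed for illustration)
\min_{\mu \in \mathcal{G}} \; \max_{f \in \mathcal{F}} \;
  \mathbb{E}_{Y \sim \mu} f(Y) - \mathbb{E}_{X \sim \nu} f(X)

% Sample version: the target \nu is replaced by the estimate \tilde{\nu}_n
\min_{\mu \in \mathcal{G}} \; \max_{f \in \mathcal{F}} \;
  \mathbb{E}_{Y \sim \mu} f(Y) - \mathbb{E}_{X \sim \tilde{\nu}_n} f(X)
```

The only difference between the two displays is the measure under which the real data are averaged, which is where the choice of the measure estimate enters.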
Two fundamental yet basic questions that puzzle machine learning theorists are: (1) how well does GAN learn the densities statistically, overlooking the optimization difficulty? (2) how well does the iterative optimization approach of solving the minimax problem approximate the optimal solution?
Density estimation has been a central topic in nonparametric statistics . The minimax optimal rate of convergence has been understood fairly well, for a wide range of density function classes quantified by their smoothness. We would like to point out that in nonparametric statistics, the model grows in size to accommodate the complexity of the data, which is reminiscent of the model complexity in GANs, and more generally in deep neural networks.
The current paper studies the GAN framework for learning densities from a nonparametric point of view. The focus of the current paper is not on the optimization side of how to solve for in a computationally efficient way, but rather on the theoretical front: how well estimates a wide range of nonparametric distributions under a wide collection of objective metrics, and how to improve the GAN procedure with better theoretical guarantees such as the rate of convergence.
Note the GAN framework mentioned above is flexible. Define the following metric induced by the function class ,
If contains all Lipschitz- functions, then is the Wasserstein-1 metric (Wasserstein-GAN ). When represents all functions bounded by , then is equivalent to the total variation metric (Radon metric). Let be a reproducing kernel Hilbert space (RKHS) and be its kernel. If consists of functions in the closure of the span of the set , then (MMD-GAN ).
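As an illustration of the last case, the RKHS-induced metric can be estimated from two samples in closed form. The following is a minimal sketch, assuming a Gaussian kernel with a hand-picked bandwidth and the biased (V-statistic) estimator; the function names and all parameter values are hypothetical and not taken from the paper.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel k(x, y) for two points given as equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average of k(x, y) over all pairs (x in xs, y in ys)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

def draw_gaussian(rng, n, dim, shift=0.0):
    """n points in R^dim with independent N(shift, 1) coordinates."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
# Two samples from the same distribution versus a mean-shifted one.
same = mmd_squared(draw_gaussian(rng, 200, 2), draw_gaussian(rng, 200, 2))
shifted = mmd_squared(draw_gaussian(rng, 200, 2),
                      draw_gaussian(rng, 200, 2, shift=1.0))
```

In this sketch `shifted` comes out clearly larger than `same`, reflecting that the metric separates the two distributions, while the estimate between a sample and itself is zero up to rounding.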
We summarize the main contributions here. Before introducing the results, let’s mention the Sobolev spaces formally defined in Def. ? and ?. We use to denote the Sobolev space with smoothness , with -Sobolev norm bounded by , for . Denote the Wasserstein- metric as .
Consider the density function that lies in Sobolev space with smoothness parameter
and the discriminator function class with smoothness . As varies, describes a wide range of nonparametric densities, and provides a rich hierarchy of metrics.
Smoothness motivates new procedures. We introduce a new GAN estimator based on a regularized/smoothed version of the empirical measure , which enjoys the rate
In contrast, as long as , the GAN estimator with empirical measure only achieves a considerably slower rate
which does not adapt to the smoothness of the target measure . We remark that the regularized/smoothed empirical measure estimate can theoretically be viewed as a remedy to the known mode-collapse (or memorizing data ) problem reported in GAN.
Nonparametric estimation with GAN framework. The GAN framework for nonparametric density estimation enjoys the upper bound
One may wonder whether this rate can be significantly improved by other approaches. We show that for any procedure based on samples, the minimax lower bound under the metric reads
In the case when is large, the exponents in the upper and lower bounds are very close to each other, in the sense that .
In the case when both the generator and discriminator are realized by neural networks, we establish the following results using the insights gained from above.
GAN can estimate densities. We make progress on answering how well generative adversarial networks estimate nonparametric densities with smoothness parameter , under the Wasserstein metric. Let be the sample size, be the dimension, and be the approximation accuracy. Consider the GAN estimator in . Assume the discriminator neural network can approximate functions in within , and the generator neural network can approximate densities in within . Then learns the true distribution with accuracy
In addition, for all estimators based on samples, the minimax rate cannot be smaller than
Here are constants independent of . Note that by , one can approximate within accuracy by ReLU networks with depth and units for integer . The formal and more general statement can be found in Thm. ?.
Error rates for ReLU discriminator networks. Let be the collection of functions realized by the feedforward ReLU network with depth , where for each unit in the network, the vector of weights associated with that unit has . Let the true density . Then there exists a regularized/smoothed GAN estimator that satisfies the upper bound,
This improves significantly upon the known bound on obtained using the empirical density, which is  , as long as , and allows for a better guarantee for deeper networks.
Let’s introduce the notation and function spaces (for functions and densities) used in this paper. Throughout the discussion, we restrict the feature space to be . We use to denote measures (or distributions), and the Radon-Nikodym derivatives , to denote the corresponding density functions. denotes the -norm under the Lebesgue measure, for . With slight abuse of notation, for probability measures we denote and similarly for . For a vector , denotes the vector -norm.
The definition naturally extends to fractional through the Bessel potential, where denotes the Fourier transform of , with as its inverse.
Another equivalent definition of Sobolev space for is through the coefficients of the generalized Fourier series, which is also referred to as the Sobolev ellipsoid.
It is clear that (frequency domain) is an equivalent representation of (spatial domain) in for trigonometric Fourier series. For more details on Sobolev spaces, we refer to .
We denote , if holding other parameters fixed, similarly if . Denote if and . We use the notation to denote the index set, for any .
2 A nonparametric view of the GAN framework
2.1 An oracle inequality
The following oracle inequality holds for GAN.
Let’s remark on the decompositions. Eqn. uses as the objective evaluation metric. The first term is the best approximation error within the generator density class, even if we have population access to the true measure . The second term is the stochastic error, also called generalization error, due to the fact that we have only finite -samples. Eqn. uses a different as the objective metric, while using the discriminator class in the GAN procedure. The first term describes the approximation error induced by the generator, the second term corresponds to how well the discriminator serves as a surrogate for the objective metric, and the third term is the stochastic/generalization error.
Using the above theorem, if we use the as the objective metric, we need to study the excess loss
We will start with a crude bound when is chosen as the empirical density. It turns out that one can significantly improve this bound through choosing a better “regularized” or “smoother” version of empirical measure, in the context of learning nonparametric densities. The empirical measure is not optimal because one can leverage the complexity/smoothness of the measure , and the complexity of to improve upon the generalization error.
2.2 Upper bound for arbitrary density
We start with a simple bound on with , the empirical measure. We will illustrate why it is suboptimal in using GAN to learn smooth densities, and further how to improve by using a regularized estimate as a plug-in for GAN.
The above lemma is a standard symmetrization followed by the Dudley entropy integral. When applying the symmetrization lemma, one often discards the distribution information about , and thus ends up with a distribution-independent guarantee. The reason is that one bounds the empirical -covering number by the -covering number on the function class (independent of )
Plugging in the entropy estimate for various functional classes ( and references therein), one can easily derive the following corollary.
It is easy to see that overlooking the distributional information, by simply going through symmetrization and empirical processes theory, can lead to a suboptimal result when the true density is smooth. Roughly speaking, the symmetrization/empirical processes approach treats the true density as a very non-smooth one with (as the empirical density). The next sections investigate how to improve GAN when the true density lies in Sobolev spaces .
2.3 Smoothness helps: improved upper bound for Sobolev spaces
In this section we will show that one can achieve faster rates in the GAN framework for density estimation, leveraging the smoothness information in the density and the metric .
Suppose the density function , and . We claim that there exists , a smoothed/regularized version of the empirical measure, such that plugging into GAN will result in a faster rate that adapts to the smoothness of the true density. We remark that in practice, one can take as a kernel density estimate , for which sampling a mini-batch of data from is just as simple for stochastic gradient updates in GAN optimization.
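The remark about mini-batches can be made concrete. Here is a minimal sketch, assuming a Gaussian kernel with a fixed bandwidth (both are illustrative choices, and the function name is hypothetical): drawing from the kernel density estimate amounts to picking a data point uniformly at random and perturbing it with kernel-shaped noise, so no density ever needs to be evaluated.

```python
import random

def sample_from_kde(data, batch_size, bandwidth, rng):
    """Draw a mini-batch from the Gaussian kernel density estimate built on `data`.

    Sampling from the KDE is the same as picking a data point uniformly at
    random and adding independent N(0, bandwidth^2) noise to each coordinate.
    """
    batch = []
    for _ in range(batch_size):
        center = rng.choice(data)
        batch.append([c + rng.gauss(0.0, bandwidth) for c in center])
    return batch

rng = random.Random(0)
# Toy data set: 1000 points drawn uniformly from the unit square.
data = [[rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)] for _ in range(1000)]
mini_batch = sample_from_kde(data, batch_size=64, bandwidth=0.05, rng=rng)
```

Setting `bandwidth=0.0` recovers sampling from the empirical measure, so the smoothed procedure costs one extra noise draw per sample.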
Let’s broaden our discussions slightly by considering different base measures beyond the Lebesgue measure, and the generalized Fourier basis. One can think of as the uniform measure on or the product Gaussian measure. An equivalent formulation of the GAN problem that translates learning a distribution to learning the importance score/density ratio function with respect to a base measure , can be viewed as
where , and . Consider the generalized Fourier basis w.r.t the base measure that satisfies
where is the multi-index. For any function , one can represent the function in the generalized Fourier basis
The Sobolev ellipsoid — the coefficients lie in — quantifies the smoothness of the function . As a special case, when the base measure is uniform distribution on ,
In general, for any base measure and its corresponding Fourier series , one can easily extend Theorem ? using Sobolev ellipsoid .
2.4 Minimax lower bound
Can the rate obtained by GAN framework be significantly improved by any other procedure? We consider in this section the minimax lower bound for the intrinsic difficulty of nonparametric estimation under the GAN discriminator metric. Here we consider a nonparametric function estimation problem on the Sobolev ellipsoid , which is statistically equivalent to the density estimation problem over the Sobolev space asymptotically . A similar lower bound for the density estimation within the Hölder class is proved in Thm. ?.
Consider the problem of estimating the function , where the coefficients belong to a Sobolev ellipsoid . What one observes are i.i.d. normal sequences
Based on observations , we want to know how well one can estimate w.r.t the following metric
3 How well GAN learns densities
In this section, we show in a quantitative way that when both the generator class and the discriminator class, represented by deep ReLU networks, are rich enough, one does learn the distribution. One can view our results as establishing the rate of convergence and the fundamental difficulty of learning the distribution under the GAN framework, for a wide range of densities with different smoothness properties. It builds a more detailed theory upon the seminal work of , where it was proved that when sample sizes, generator sizes, and discriminator sizes are all infinite, one learns the distribution.
3.1 Deep ReLU networks can learn densities
Let’s remark on the universal approximation property of deep networks first before introducing our result. It is well known that deep neural networks are universal approximators . In particular,  constructed a fixed ReLU network architecture, denoted as , that enjoys the following two properties:
It approximates all functions in in the following sense
It has depth at most and at most weights and computation units.
With the above in mind, we are ready to state the following theorem.
3.2 Error rates for deeper ReLU discriminator networks
In the GAN formulation, the discriminator class is represented by functions realized by a neural network with a certain architecture. In this section, we will apply our theory to obtain bounds for a wide range of feed-forward multi-layer networks with ReLU activation as the discriminator metric.
Denote as the ReLU activation. Consider the following feed-forward multi-layer network:
The network has layers, input
There is a constant such that for each unit in the network, the vector of weights associated with that unit has .
Mathematically, one can define the function class induced by the network through the recursive definition: , and for any ,
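The recursive description above can be sketched in a few lines. The following is an illustration under stated assumptions, not the paper's exact definition: the first layer is taken to be linear, each later unit applies its weight vector to the ReLU of the previous layer's outputs, and the per-unit norm constraint is assumed to be an l1 bound (the text does not specify the norm here). All names and sizes are hypothetical.

```python
import random

def relu(z):
    """ReLU activation."""
    return max(0.0, z)

def project_l1(weights, bound):
    """Rescale a unit's weight vector so that its l1 norm is at most `bound`."""
    norm = sum(abs(w) for w in weights)
    if norm <= bound:
        return list(weights)
    return [w * bound / norm for w in weights]

def random_network(widths, input_dim, bound, rng):
    """Random weights, one vector per unit, each projected onto the l1 ball."""
    layers, fan_in = [], input_dim
    for width in widths:
        layers.append([
            project_l1([rng.gauss(0.0, 1.0) for _ in range(fan_in)], bound)
            for _ in range(width)
        ])
        fan_in = width
    return layers

def forward(layers, x):
    """First layer: linear units <w, x>. Later layers: sum_j w_j * relu(f_j(x))."""
    values = [sum(w * xi for w, xi in zip(unit, x)) for unit in layers[0]]
    for layer in layers[1:]:
        activated = [relu(v) for v in values]
        values = [sum(w * a for w, a in zip(unit, activated)) for unit in layer]
    return values

rng = random.Random(0)
bound = 1.5
network = random_network(widths=[8, 8, 1], input_dim=4, bound=bound, rng=rng)
x = [rng.uniform(0.0, 1.0) for _ in range(4)]
output = forward(network, x)[0]
```

Under the assumed l1 constraint, the output magnitude is at most `bound ** depth` times the largest input coordinate, because each layer can scale the sup norm of the previous layer by at most `bound` and ReLU never increases magnitude.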
For any , we know that due to the optimality of GAN
Therefore for any ,
where we use the following fact and
It is easy to bound the following using similar logic
If consists of -Lipschitz functions (Wasserstein GAN) on , , plug in the -covering number bound for Lipschitz functions,
This matches the best known bound as in  (Section 2.1.1).
Let’s consider when denotes Sobolev space on . Recall the entropy number estimate for , we have
Remark in addition that the parametric rate is inevitable, which can be easily seen from the Sudakov minoration,
Denote as the density ratio, let’s construct the following smoothed version , with a cut-off parameter to be determined later. Define the regularized/smoothed density estimate
where based on i.i.d. samples
In other words, filters out all the high frequency (less smooth) components, when the multi-index has largest coordinate larger than . Now, for any , write the Fourier coefficients of
For any (or equivalently ), we have
For the first term,
as we know
Note for trigonometric series .
For the second term, the following inequality holds
Combining two terms, we have for any (or equivalently ),
And the optimal choice of . Note that this is a more aggressive smoothing scheme than classic nonparametric estimation with smoothness , due to the fact that we are utilizing the smoothness in the evaluation metric at the same time.
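The frequency cut-off construction used in this argument can be illustrated numerically. This is a minimal one-dimensional sketch, assuming the uniform base measure on the unit interval and the real trigonometric basis; the cut-off value and the Beta-distributed toy sample are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math
import random

def basis(k, x):
    """Real trigonometric basis on [0, 1], orthonormal under the uniform measure.

    k = 0 is the constant; odd k are cosines and even k > 0 are sines,
    both at frequency (k + 1) // 2.
    """
    if k == 0:
        return 1.0
    freq = (k + 1) // 2
    angle = 2.0 * math.pi * freq * x
    return math.sqrt(2.0) * (math.cos(angle) if k % 2 == 1 else math.sin(angle))

def empirical_coefficients(samples, cutoff):
    """Empirical Fourier coefficients for all frequencies up to `cutoff`."""
    n = len(samples)
    return [sum(basis(k, x) for x in samples) / n for k in range(2 * cutoff + 1)]

def smoothed_density(samples, cutoff):
    """Truncated-series density estimate: frequencies above `cutoff` are set to zero."""
    coeffs = empirical_coefficients(samples, cutoff)

    def estimate(x):
        return sum(c * basis(k, x) for k, c in enumerate(coeffs))

    return estimate

rng = random.Random(0)
samples = [rng.betavariate(2.0, 5.0) for _ in range(2000)]
density_hat = smoothed_density(samples, cutoff=5)
```

The estimate always integrates to one, since only the constant basis function has a nonzero integral, but it can dip below zero in places; a larger cut-off reduces the bias from discarded frequencies at the price of more variance from the retained ones.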
The proof uses the standard Fano’s inequality.
Let’s construct a mixture of hypotheses on the function space , and a subset of discriminator functions in , such that the multiple testing problem among the mixture of hypotheses is hard, and thus the loss induced by the best discriminator among the subset provides a lower bound on the estimation rate. Let’s construct this mixture in the frequency domain. Choose to be determined later, and denote the hypothesis class of interest by
It is easy to verify that because for any , we have
Similarly, let’s consider
Take any , we know that
where denotes the Hamming distance between and on the hypercube . Now we need to construct a subset over the hypercube such that for any pair , they are separated in terms of Hamming distance.
The Varshamov-Gilbert bound (Lemma 2.9 in ) does the job. We know that there exists a subset such that ,
In our case .
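To get a feel for the separated subset, one can build one explicitly in a small case. The sketch below uses a greedy search, which is a swapped-in illustration rather than the argument behind the lemma (the lemma is an existence result). It assumes the usual form of the bound, namely pairwise Hamming distance at least m/8 and at least 2^(m/8) elements on the m-dimensional hypercube; the value of m and the function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two bit strings stored as integers."""
    return bin(a ^ b).count("1")

def greedy_packing(m, min_distance):
    """Greedily collect points of {0,1}^m with pairwise distance >= min_distance."""
    chosen = []
    for candidate in range(2 ** m):
        if all(hamming(candidate, c) >= min_distance for c in chosen):
            chosen.append(candidate)
    return chosen

m = 10
packing = greedy_packing(m, min_distance=m / 8)
```

The greedy subset starts from the all-zeros vertex and, in this small case, is far larger than the guaranteed 2^(m/8); the lower-bound argument only needs the guaranteed size together with the separation.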
Now let’s calculate the probability distance induced by hypotheses and , for all to show that, information theoretically, it is hard to distinguish the hypotheses in the mixture. The following holds,
If we choose , we know
In this case, for any
Therefore using Fano’s Lemma, we reach