# Some Theoretical Properties of GANs

G. Biau
Sorbonne Université, CNRS, LPSM
Paris, France
gerard.biau@upmc.fr

B. Cadre
Univ Rennes, CNRS, IRMAR
Rennes, France
benoit.cadre@ens-rennes.fr

M. Sangnier
Sorbonne Université, CNRS, LPSM
Paris, France
maxime.sangnier@upmc.fr

U. Tanielian
Sorbonne Université, CNRS, LPSM, Criteo
Paris, France
u.tanielian@criteo.com

###### Abstract

Generative Adversarial Networks (GANs) are a class of generative algorithms that have been shown to produce state-of-the-art samples, especially in the domain of image creation. The fundamental principle of GANs is to approximate the unknown distribution of a given data set by optimizing an objective function through an adversarial game between a family of generators and a family of discriminators. In this paper, we offer a better theoretical understanding of GANs by analyzing some of their mathematical and statistical properties. We study the deep connection between the adversarial principle underlying GANs and the Jensen-Shannon divergence, together with some optimality characteristics of the problem. An analysis of the role of the discriminator family via approximation arguments is also provided. In addition, taking a statistical point of view, we study the large sample properties of the estimated distribution and prove in particular a central limit theorem. Some of our results are illustrated with simulated examples.

## 1 Introduction

The fields of machine learning and artificial intelligence have seen spectacular advances in recent years, one of the most promising being perhaps the success of Generative Adversarial Networks (GANs), introduced by Goodfellow et al. (2014). GANs are a class of generative algorithms implemented by a system of two neural networks contesting with each other in a zero-sum game framework. This technique is now recognized as being capable of generating photographs that look authentic to human observers (e.g., Salimans et al., 2016), and its spectrum of applications is growing at a fast pace, with impressive results in the domains of inpainting, speech, and 3D modeling, to name but a few. A survey of the most recent advances is given by Goodfellow (2016).

The objective of GANs is to generate fake observations of a target distribution $p^\star$ from which only a true sample (e.g., real-life images represented using raw pixels) is available. It should be pointed out at the outset that the data involved in the domain are usually so complex that no exhaustive description of $p^\star$ by a classical parametric model is appropriate, nor its estimation by a traditional maximum likelihood approach. Similarly, the dimension of the samples is often very large, and this effectively excludes a strategy based on nonparametric density estimation techniques such as kernel or nearest neighbor smoothing, for example. In order to generate according to $p^\star$, GANs proceed by an adversarial scheme involving two components: a family of generators and a family of discriminators, which are both implemented by neural networks. The generators admit low-dimensional random observations with a known distribution (typically Gaussian or uniform) as input, and attempt to transform them into fake data that can match the distribution $p^\star$; on the other hand, the discriminators aim to accurately discriminate between the true observations from $p^\star$ and those produced by the generators. The generators and the discriminators are calibrated by optimizing an objective function in such a way that the distribution of the generated sample is as indistinguishable as possible from that of the original data. In pictorial terms, this process is often compared to a game of cops and robbers, in which a team of counterfeiters illegally produces banknotes and tries to make them undetectable in the eyes of a team of police officers, whose objective is of course the opposite. The competition pushes both teams to improve their methods until counterfeit money becomes indistinguishable (or not) from genuine currency.

From a mathematical point of view, here is how the generative process of GANs can be represented. All the densities that we consider in the article are supposed to be dominated by a fixed, known, measure $\mu$ on $E$, where $E$ is a Borel subset of $\mathbb{R}^d$. This dominating measure is typically the Lebesgue or the counting measure, but, depending on the practical context, it can be a more complex measure. We assume to have at hand an i.i.d. sample $X_1, \dots, X_n$, drawn according to some unknown density $p^\star$ on $E$. These random variables model the available data, such as images or video sequences; they typically take their values in a high-dimensional space, so that the ambient dimension $d$ must be thought of as large. The generators as a whole have the form of a parametric family of functions from $\mathbb{R}^{d'}$ to $E$ ($d' < d$), say $\mathscr{G} = \{G_\theta\}_{\theta \in \Theta}$, $\Theta \subset \mathbb{R}^p$. Each function $G_\theta$ is intended to be applied to a $d'$-dimensional random variable $Z$ (sometimes called the noise, in most cases Gaussian or uniform), so that there is a natural family of densities associated with the generators, say $\mathscr{P} = \{p_\theta\}_{\theta \in \Theta}$, where, by definition, $G_\theta(Z) \stackrel{\mathscr{L}}{=} p_\theta \, d\mu$. In this model, each density $p_\theta$ is a potential candidate to represent $p^\star$. On the other hand, the discriminators are described by a family of Borel functions from $E$ to $[0,1]$, say $\mathscr{D}$, where each $D \in \mathscr{D}$ must be thought of as the probability that an observation comes from $p^\star$ (the higher $D(x)$, the higher the probability that $x$ is drawn from $p^\star$). At some point, but not always, we will assume that $\mathscr{D}$ is in fact a parametric class, of the form $\{D_\alpha\}_{\alpha \in \Lambda}$, $\Lambda \subset \mathbb{R}^q$, as is certainly always the case in practice. In GANs algorithms, both parametric models $\{G_\theta\}_{\theta \in \Theta}$ and $\{D_\alpha\}_{\alpha \in \Lambda}$ take the form of neural networks, but this does not play a fundamental role in this paper. We will simply remember that the dimensions $p$ and $q$ are potentially very large, which takes us away from a classical parametric setting. We also insist on the fact that it is not assumed that $p^\star$ belongs to $\mathscr{P}$.

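Let $Z_1, \dots, Z_n$ be an i.i.d. sample of random variables, all distributed as the noise $Z$. The objective is to solve in $\theta$ the problem

$$
\inf_{\theta \in \Theta} \sup_{D \in \mathscr{D}} \left[ \prod_{i=1}^n D(X_i) \times \prod_{i=1}^n \bigl(1 - D \circ G_\theta(Z_i)\bigr) \right],
$$

or, equivalently, to find $\hat\theta \in \Theta$ such that

$$
\sup_{D \in \mathscr{D}} \hat L(\hat\theta, D) \le \sup_{D \in \mathscr{D}} \hat L(\theta, D), \quad \forall \theta \in \Theta, \tag{1}
$$

where

$$
\hat L(\theta, D) \stackrel{\text{def}}{=} \sum_{i=1}^n \ln D(X_i) + \sum_{i=1}^n \ln\bigl(1 - D \circ G_\theta(Z_i)\bigr)
$$

($\ln$ is the natural logarithm). In this problem, $D(x)$ represents the probability that an observation $x$ comes from $p^\star$ rather than from $p_\theta$. Therefore, for each $\theta$, the discriminators (the police team) try to distinguish the original sample $X_1, \dots, X_n$ from the fake one $G_\theta(Z_1), \dots, G_\theta(Z_n)$ produced by the generators (the counterfeiters' team), by maximizing $D$ on the $X_i$ and minimizing it on the $G_\theta(Z_i)$. Of course, the generators have an exact opposite objective, and adapt the fake data in such a way as to mislead the discriminators' likelihood. All in all, we see that the criterion $\hat L(\theta, D)$ seeks to find the right balance between the conflicting interests of the generators and the discriminators. The hope is that the $\hat\theta$ achieving equilibrium will make it possible to generate observations $G_{\hat\theta}(Z_1), \dots, G_{\hat\theta}(Z_n)$ indistinguishable from reality, i.e., observations with a law close to the unknown $p^\star$.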
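To make the empirical criterion concrete, here is a minimal Python sketch that evaluates the sum of log-terms appearing in (1) on toy one-dimensional data. The affine generator, the logistic discriminator, the Gaussian stand-in for the data, and all parameter values are assumptions chosen for illustration only; they are not the networks used in the paper.

```python
import math
import random


def generator(theta, z):
    """Hypothetical 1-D generator: G_theta(z) = theta[0] + theta[1] * z."""
    return theta[0] + theta[1] * z


def discriminator(alpha, x):
    """Hypothetical logistic discriminator with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(alpha[0] + alpha[1] * x)))


def empirical_criterion(theta, alpha, xs, zs):
    """Sum of ln D(X_i) plus sum of ln(1 - D(G_theta(Z_i)))."""
    real_term = sum(math.log(discriminator(alpha, x)) for x in xs)
    fake_term = sum(
        math.log(1.0 - discriminator(alpha, generator(theta, z))) for z in zs
    )
    return real_term + fake_term


rng = random.Random(0)
n = 200
xs = [rng.gauss(2.0, 1.0) for _ in range(n)]    # stand-in for the data X_1..X_n
zs = [rng.uniform(0.0, 1.0) for _ in range(n)]  # noise sample Z_1..Z_n

theta = (0.0, 1.0)  # the generated points are simply the Z_i here

# The constant discriminator D = 1/2 always yields -2 n ln 2.
baseline = empirical_criterion(theta, (0.0, 0.0), xs, zs)
# A discriminator that rates large x as "real" separates the two samples better.
better = empirical_criterion(theta, (-3.0, 2.0), xs, zs)
```

For a fixed generator, the discriminators try to push this quantity up; the generator then picks the parameter for which the best achievable value is smallest.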

The criterion $\hat L(\theta, D)$ involved in (1) is the criterion originally proposed in the adversarial framework of Goodfellow et al. (2014). Since then, the success of GANs in applications has led to a large volume of literature on variants, which all have many desirable properties but are based on different optimization criteria; examples are MMD-GANs (Dziugaite et al., 2015), f-GANs (Nowozin et al., 2016), Wasserstein-GANs (Arjovsky et al., 2017), and an approach based on scattering transforms (Angles and Mallat, 2018). All these variations and their innumerable algorithmic versions constitute the galaxy of GANs. That being said, despite increasingly spectacular applications, little is known about the mathematical and statistical forces behind these algorithms (e.g., Arjovsky and Bottou, 2017; Liu et al., 2017; Zhang et al., 2018), and, in fact, nearly nothing about the primary adversarial problem (1). As acknowledged by Liu et al. (2017), basic questions on how well GANs can approximate the target distribution $p^\star$ remain largely unanswered. In particular, the role and impact of the discriminators on the quality of the approximation are still a mystery, and simple but fundamental questions regarding statistical consistency and rates of convergence remain open.

In the present article, we propose to take a small step towards a better theoretical understanding of GANs by analyzing some of the mathematical and statistical properties of the original adversarial problem (1). In Section 2, we study the deep connection between the population version of (1) and the Jensen-Shannon divergence, together with some optimality characteristics of the problem, often referred to in the literature but in fact poorly understood. Section 3 is devoted to a better comprehension of the role of the discriminator family via approximation arguments. Finally, taking a statistical point of view, we study in Section 4 the large sample properties of the distribution $p_{\hat\theta}$ and of $\hat\theta$, and prove in particular a central limit theorem for this parameter. Some of our results are illustrated with simulated examples. For clarity, most technical proofs are gathered in Section 5.

## 2 Optimality properties

We start by studying some important properties of the adversarial principle, emphasizing the role played by the Jensen-Shannon divergence. We recall that if $P$ and $Q$ are probability measures on $E$, and $P$ is absolutely continuous with respect to $Q$, then the Kullback-Leibler divergence from $Q$ to $P$ is defined as

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \int \ln \frac{dP}{dQ} \, dP,
$$

where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative of $P$ with respect to $Q$. The Kullback-Leibler divergence is always nonnegative, with $D_{\mathrm{KL}}(P \,\|\, Q)$ zero if and only if $P = Q$. If $p = \frac{dP}{d\mu}$ and $q = \frac{dQ}{d\mu}$ exist (meaning that $P$ and $Q$ are absolutely continuous with respect to $\mu$, with densities $p$ and $q$), then the Kullback-Leibler divergence is given as

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \int p \ln \frac{p}{q} \, d\mu,
$$

and alternatively denoted by $D_{\mathrm{KL}}(p \,\|\, q)$. We also recall that the Jensen-Shannon divergence is a symmetrized version of the Kullback-Leibler divergence. It is defined for any probability measures $P$ and $Q$ on $E$ by

$$
D_{\mathrm{JS}}(P, Q) = \frac{1}{2} D_{\mathrm{KL}}\Bigl(P \,\Big\|\, \frac{P+Q}{2}\Bigr) + \frac{1}{2} D_{\mathrm{KL}}\Bigl(Q \,\Big\|\, \frac{P+Q}{2}\Bigr),
$$

and satisfies $0 \le D_{\mathrm{JS}}(P, Q) \le \ln 2$. The square root of the Jensen-Shannon divergence is a metric often referred to as Jensen-Shannon distance (Endres and Schindelin, 2003). When $P$ and $Q$ have densities $p$ and $q$ with respect to $\mu$, we use the notation $D_{\mathrm{JS}}(p, q)$ in place of $D_{\mathrm{JS}}(P, Q)$.
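As a small illustration of these two definitions, the following Python sketch computes both divergences for distributions on a finite set, that is, taking the counting measure as dominating measure. The function names and the two example vectors are assumptions made for this sketch.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence of P from Q on a finite set (0 ln 0 = 0)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                # P is not absolutely continuous with respect to Q.
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def js(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint."""
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


p = [0.5, 0.3, 0.2, 0.0]
q = [0.1, 0.2, 0.3, 0.4]

js_pq = js(p, q)                          # symmetric and finite
kl_qp = kl(q, p)                          # infinite: q charges a point p does not
js_disjoint = js([1.0, 0.0], [0.0, 1.0])  # disjoint supports reach the bound ln 2
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence stays finite even when neither distribution is absolutely continuous with respect to the other, because both are always dominated by their midpoint.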

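For a generator $G_\theta$ and an arbitrary discriminator $D \in \mathscr{D}$, the criterion $\hat L(\theta, D)$ to be optimized in (1) is but the empirical version of the probabilistic criterion

$$
L(\theta, D) \stackrel{\text{def}}{=} \int \ln(D) \, p^\star \, d\mu + \int \ln(1 - D) \, p_\theta \, d\mu.
$$

We assume for the moment that the discriminator class $\mathscr{D}$ is not restricted and equals $\mathscr{D}_\infty$, the set of all Borel functions from $E$ to $[0,1]$. We note however that, for all $\theta \in \Theta$,

$$
0 \ge \sup_{D \in \mathscr{D}_\infty} L(\theta, D) \ge -\ln 2 \left( \int p^\star \, d\mu + \int p_\theta \, d\mu \right) = -\ln 4,
$$

where the lower bound is obtained with the constant discriminator $D \equiv 1/2$, so that $\inf_{\theta \in \Theta} \sup_{D \in \mathscr{D}_\infty} L(\theta, D) \in [-\ln 4, 0]$. Thus,

$$
\inf_{\theta \in \Theta} \sup_{D \in \mathscr{D}_\infty} L(\theta, D) = \inf_{\theta \in \Theta} \sup_{D \in \mathscr{D}_\infty : \, L(\theta, D) > -\infty} L(\theta, D).
$$

This identity points out the importance of discriminators such that $L(\theta, D) > -\infty$, which we call $\theta$-admissible. In the sequel, in order to avoid unnecessary problems of integrability, we only consider such discriminators, keeping in mind that the others are of no interest.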

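Of course, working with $\mathscr{D}_\infty$ is a somewhat idealized vision, since in practice the discriminators are always parameterized by some parameter $\alpha \in \Lambda$, $\Lambda \subset \mathbb{R}^q$. Nevertheless, this point of view is informative and, in fact, is at the core of the connection between our generative problem and the Jensen-Shannon divergence. Indeed, taking the supremum of $L(\theta, D)$ over $\mathscr{D}_\infty$, we have

$$
\begin{aligned}
\sup_{D \in \mathscr{D}_\infty} L(\theta, D)
&= \sup_{D \in \mathscr{D}_\infty} \int \bigl[\ln(D) \, p^\star + \ln(1 - D) \, p_\theta\bigr] \, d\mu \\
&\le \int \sup_{D \in \mathscr{D}_\infty} \bigl[\ln(D) \, p^\star + \ln(1 - D) \, p_\theta\bigr] \, d\mu \\
&= L(\theta, D^\star_\theta),
\end{aligned}
$$

where

$$
D^\star_\theta \stackrel{\text{def}}{=} \frac{p^\star}{p^\star + p_\theta}. \tag{2}
$$

By observing that $L(\theta, D^\star_\theta) = 2 D_{\mathrm{JS}}(p^\star, p_\theta) - \ln 4$, we conclude that, for all $\theta \in \Theta$,

$$
\sup_{D \in \mathscr{D}_\infty} L(\theta, D) = L(\theta, D^\star_\theta) = 2 D_{\mathrm{JS}}(p^\star, p_\theta) - \ln 4.
$$

In particular, $D^\star_\theta$ is $\theta$-admissible. The fact that $D^\star_\theta$ realizes the supremum of $L(\theta, D)$ over $\mathscr{D}_\infty$ and that this supremum is connected to the Jensen-Shannon divergence between $p^\star$ and $p_\theta$ appears in the original article by Goodfellow et al. (2014). This remark has given rise to many developments that interpret the adversarial problem (1) as the empirical version of the minimization problem $\inf_{\theta} D_{\mathrm{JS}}(p^\star, p_\theta)$ over $\Theta$. Accordingly, many GANs algorithms try to learn the optimal function $D^\star_\theta$, using for example stochastic gradient descent techniques and mini-batch approaches. However, it has not been known until now whether $D^\star_\theta$ is unique as a maximizer of $L(\theta, D)$ over all $D$. Our first result shows that this is indeed the case.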
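The identity linking the criterion at the optimal discriminator to the Jensen-Shannon divergence can be checked numerically on a finite set with the counting measure. The sketch below is only such a sanity check; the two probability vectors are arbitrary values chosen for illustration.

```python
import math


def criterion(p_star, p_theta, d):
    """L(theta, D) on a finite set: sum of p* ln D + p_theta ln(1 - D)."""
    return sum(
        ps * math.log(dx) + pt * math.log(1.0 - dx)
        for ps, pt, dx in zip(p_star, p_theta, d)
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence between two strictly positive vectors."""
    total = 0.0
    for pi, qi in zip(p, q):
        mid = (pi + qi) / 2.0
        total += 0.5 * pi * math.log(pi / mid) + 0.5 * qi * math.log(qi / mid)
    return total


p_star = [0.5, 0.3, 0.2]   # illustrative target density
p_theta = [0.2, 0.2, 0.6]  # illustrative model density

# Optimal discriminator from (2): D* = p* / (p* + p_theta), pointwise.
d_star = [ps / (ps + pt) for ps, pt in zip(p_star, p_theta)]

lhs = criterion(p_star, p_theta, d_star)
rhs = 2.0 * js_divergence(p_star, p_theta) - math.log(4.0)

# Any other discriminator, e.g. the constant 1/2, gives a smaller value.
at_half = criterion(p_star, p_theta, [0.5, 0.5, 0.5])
```

Here `lhs` and `rhs` agree up to rounding, and the constant discriminator attains exactly the lower bound $-\ln 4$.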

###### Theorem 2.1.

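Let $\theta \in \Theta$ be such that $p_\theta > 0$ $\mu$-almost everywhere. Then the function $D^\star_\theta$ is the unique discriminator that achieves the supremum of the functional $D \mapsto L(\theta, D)$ over $\mathscr{D}_\infty$, i.e.,

$$
\{D^\star_\theta\} = \underset{D \in \mathscr{D}_\infty}{\arg\max} \; L(\theta, D).
$$

###### Proof.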

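Let $D \in \mathscr{D}_\infty$ be a discriminator such that $L(\theta, D) = L(\theta, D^\star_\theta)$. In particular, $L(\theta, D) > -\infty$ and $D$ is $\theta$-admissible. We have to show that $D = D^\star_\theta$. Notice that

$$
\int \ln(D) \, p^\star \, d\mu + \int \ln(1 - D) \, p_\theta \, d\mu = \int \ln(D^\star_\theta) \, p^\star \, d\mu + \int \ln(1 - D^\star_\theta) \, p_\theta \, d\mu. \tag{3}
$$

Thus,

$$
-\int \ln\Bigl(\frac{D^\star_\theta}{D}\Bigr) p^\star \, d\mu = \int \ln\Bigl(\frac{1 - D^\star_\theta}{1 - D}\Bigr) p_\theta \, d\mu,
$$

i.e., by definition of $D^\star_\theta$,

$$
-\int \ln\Bigl(\frac{p^\star}{D(p^\star + p_\theta)}\Bigr) p^\star \, d\mu = \int \ln\Bigl(\frac{p_\theta}{(1 - D)(p^\star + p_\theta)}\Bigr) p_\theta \, d\mu. \tag{4}
$$

Let $dP^\star = p^\star \, d\mu$, $dP_\theta = p_\theta \, d\mu$,

$$
d\kappa = \frac{D(p^\star + p_\theta)}{\int D(p^\star + p_\theta) \, d\mu} \, d\mu, \quad \text{and} \quad d\kappa' = \frac{(1 - D)(p^\star + p_\theta)}{\int (1 - D)(p^\star + p_\theta) \, d\mu} \, d\mu.
$$

With this notation, identity (4) becomes

$$
-D_{\mathrm{KL}}(P^\star \,\|\, \kappa) + \ln\Bigl[\int D(p^\star + p_\theta) \, d\mu\Bigr] = D_{\mathrm{KL}}(P_\theta \,\|\, \kappa') - \ln\Bigl[\int (1 - D)(p^\star + p_\theta) \, d\mu\Bigr].
$$

Upon noting that

$$
\int (1 - D)(p^\star + p_\theta) \, d\mu = 2 - \int D(p^\star + p_\theta) \, d\mu,
$$

we obtain

$$
D_{\mathrm{KL}}(P^\star \,\|\, \kappa) + D_{\mathrm{KL}}(P_\theta \,\|\, \kappa') = \ln\Bigl[\int D(p^\star + p_\theta) \, d\mu \, \Bigl(2 - \int D(p^\star + p_\theta) \, d\mu\Bigr)\Bigr].
$$

Since $\int D(p^\star + p_\theta) \, d\mu \in [0, 2]$, we find that $D_{\mathrm{KL}}(P^\star \,\|\, \kappa) + D_{\mathrm{KL}}(P_\theta \,\|\, \kappa') \le 0$, which implies

$$
D_{\mathrm{KL}}(P^\star \,\|\, \kappa) = 0 \quad \text{and} \quad D_{\mathrm{KL}}(P_\theta \,\|\, \kappa') = 0.
$$

Consequently,

$$
p^\star = \frac{D(p^\star + p_\theta)}{\int D(p^\star + p_\theta) \, d\mu} \quad \text{and} \quad p_\theta = \frac{(1 - D)(p^\star + p_\theta)}{2 - \int D(p^\star + p_\theta) \, d\mu},
$$

that is,

$$
\int D(p^\star + p_\theta) \, d\mu = \frac{D(p^\star + p_\theta)}{p^\star} \quad \text{and} \quad 1 - D = \frac{p_\theta}{p^\star + p_\theta} \Bigl(2 - \int D(p^\star + p_\theta) \, d\mu\Bigr).
$$

We conclude that

$$
1 - D = \frac{p_\theta}{p^\star + p_\theta} \Bigl(2 - \frac{D(p^\star + p_\theta)}{p^\star}\Bigr),
$$

i.e., $D = \frac{p^\star}{p^\star + p_\theta}$ whenever $p^\star \ne p_\theta$.

To complete the proof, it remains to show that $D = 1/2$ $\mu$-almost everywhere on the set $A \stackrel{\text{def}}{=} \{p_\theta = p^\star\}$. Using the result above together with equality (3), we see that

$$
\int_A \ln(D) \, p^\star \, d\mu + \int_A \ln(1 - D) \, p_\theta \, d\mu = \int_A \ln(1/2) \, p^\star \, d\mu + \int_A \ln(1/2) \, p_\theta \, d\mu,
$$

that is,

$$
\int_A \bigl[\ln(1/4) - \ln(D(1 - D))\bigr] \, p_\theta \, d\mu = 0.
$$

Observing that $D(1 - D) \le 1/4$ since $D$ takes values in $[0, 1]$, we deduce that $\bigl[\ln(1/4) - \ln(D(1 - D))\bigr] p_\theta \mathbf{1}_A = 0$ $\mu$-almost everywhere. Therefore, $D = 1/2$ on the set $A$, since $p_\theta > 0$ $\mu$-almost everywhere by assumption. ∎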
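The step of the proof in which the two Kullback-Leibler divergences are shown to vanish rests on an elementary bound that is left implicit. Writing $c = \int D(p^\star + p_\theta)\,d\mu \in [0, 2]$ (the shorthand $c$ is introduced only for this sketch), it can be spelled out as follows.

```latex
\[
\begin{aligned}
c\,(2 - c) &= 1 - (1 - c)^2 \;\le\; 1
  \quad\Longrightarrow\quad \ln\bigl[c\,(2 - c)\bigr] \le 0, \\
0 \;\le\; D_{\mathrm{KL}}(P^\star \,\|\, \kappa) + D_{\mathrm{KL}}(P_\theta \,\|\, \kappa')
  &= \ln\bigl[c\,(2 - c)\bigr] \;\le\; 0 .
\end{aligned}
\]
```

Both divergences are nonnegative and their sum is zero, so each of them vanishes; moreover, $\ln[c(2 - c)] = 0$ forces $c = 1$.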

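By definition of the optimal discriminator $D^\star_\theta$, we have

$$
L(\theta, D^\star_\theta) = \sup_{D \in \mathscr{D}_\infty} L(\theta, D) = 2 D_{\mathrm{JS}}(p^\star, p_\theta) - \ln 4, \quad \forall \theta \in \Theta.
$$

Therefore, it makes sense to let the parameter $\theta^\star \in \Theta$ be defined as

$$
L(\theta^\star, D^\star_{\theta^\star}) \le L(\theta, D^\star_\theta), \quad \forall \theta \in \Theta,
$$

or, equivalently,

$$
D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) \le D_{\mathrm{JS}}(p^\star, p_\theta), \quad \forall \theta \in \Theta. \tag{5}
$$

The parameter $\theta^\star$ may be interpreted as the best parameter in $\Theta$ for approaching the unknown density $p^\star$ in terms of Jensen-Shannon divergence, in a context where all possible discriminators are available. In other words, the generator $G_{\theta^\star}$ is the ideal generator, and the density $p_{\theta^\star}$ is the one we would ideally like to use to generate fake samples. Of course, whenever $p^\star \in \mathscr{P}$ (i.e., the target density is in the model), then $p^\star = p_{\theta^\star}$, $D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) = 0$, and $D^\star_{\theta^\star} = 1/2$. This is, however, a very special case, which is of no interest, since in the applications covered by GANs, the data are usually so complex that the hypothesis $p^\star \in \mathscr{P}$ does not hold.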
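The definition of the ideal parameter in (5) can be illustrated by a brute-force search in a toy misspecified model. In the sketch below, the model is a hypothetical one-parameter family (Binomial(2, θ) laws on {0, 1, 2}) and the target is a vector outside this family; both are assumptions made for illustration, and the grid search is only a stand-in for an actual optimization procedure.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two strictly positive vectors."""
    total = 0.0
    for pi, qi in zip(p, q):
        mid = (pi + qi) / 2.0
        total += 0.5 * pi * math.log(pi / mid) + 0.5 * qi * math.log(qi / mid)
    return total


def model_density(theta):
    """Hypothetical model p_theta: Binomial(2, theta) probabilities on {0, 1, 2}."""
    return [(1.0 - theta) ** 2, 2.0 * theta * (1.0 - theta), theta ** 2]


# Target outside the model: no theta gives these three probabilities.
p_star = [0.5, 0.1, 0.4]

# Grid over the open interval (0, 1); the grid resolution is arbitrary.
grid = [k / 1000.0 for k in range(1, 1000)]
values = [js_divergence(p_star, model_density(t)) for t in grid]

best_js = min(values)
theta_star = grid[values.index(best_js)]
```

Because the target is not in the model, the minimal divergence `best_js` stays strictly positive: the ideal generator is the best approximation available, not an exact match.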

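In the general case, our next theorem provides sufficient conditions for the existence and uniqueness of $\theta^\star$. For $P$ and $Q$ probability measures on $E$, we let $\delta(P, Q) = \sqrt{D_{\mathrm{JS}}(P, Q)}$, and recall that $\delta$ is a distance on the set of probability measures on $E$ (Endres and Schindelin, 2003). We let $dP^\star = p^\star \, d\mu$ and, for all $\theta \in \Theta$, $dP_\theta = p_\theta \, d\mu$.

###### Theorem 2.2.

Assume that the model $\{P_\theta\}_{\theta \in \Theta}$ is identifiable, convex, and compact for the metric $\delta$. Assume, in addition, that there exist $0 < m \le M$ such that $m \le p^\star \le M$ and, for all $\theta \in \Theta$, $p_\theta \le M$. Then there exists a unique $\theta^\star \in \Theta$ such that

$$
\{\theta^\star\} = \underset{\theta \in \Theta}{\arg\min} \; L(\theta, D^\star_\theta),
$$

or, equivalently,

$$
\{\theta^\star\} = \underset{\theta \in \Theta}{\arg\min} \; D_{\mathrm{JS}}(p^\star, p_\theta).
$$

###### Proof.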

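Observe that $\mathscr{P} \subset L^1(\mu) \cap L^2(\mu)$ since $0 \le p_\theta \le M$ and $\int p_\theta \, d\mu = 1$. Recall that $L(\theta, D^\star_\theta) = 2 D_{\mathrm{JS}}(p^\star, p_\theta) - \ln 4$. By identifiability of $\{P_\theta\}_{\theta \in \Theta}$, it is enough to prove that there exists a unique density $p_{\theta^\star}$ of $\mathscr{P}$ such that

$$
\{p_{\theta^\star}\} = \underset{p \in \mathscr{P}}{\arg\min} \; D_{\mathrm{JS}}(p^\star, p).
$$

Existence. Since $\{P_\theta\}_{\theta \in \Theta}$ is compact for $\delta$, it is enough to show that the function

$$
\{P_\theta\}_{\theta \in \Theta} \to \mathbb{R}^+, \qquad P \mapsto D_{\mathrm{JS}}(P^\star, P)
$$

is continuous. But this is clear since, for all $P_1, P_2 \in \{P_\theta\}_{\theta \in \Theta}$, $|\delta(P^\star, P_1) - \delta(P^\star, P_2)| \le \delta(P_1, P_2)$ by the triangle inequality. Therefore, $\arg\min_{p \in \mathscr{P}} D_{\mathrm{JS}}(p^\star, p) \ne \emptyset$.

Uniqueness. For $a \in [m, M]$, we consider the function $F_a$ defined by

$$
F_a(x) = a \ln\Bigl(\frac{2a}{a + x}\Bigr) + x \ln\Bigl(\frac{2x}{a + x}\Bigr), \quad x \in [0, M],
$$

with the convention $0 \ln 0 = 0$. Clearly, $F_a''(x) = \frac{a}{x(a + x)} \ge \frac{m}{2M^2}$, which shows that $F_a$ is $\beta$-strongly convex, with $\beta > 0$ independent of $a$. Thus, for all $\lambda \in [0, 1]$, all $x_1, x_2 \in [0, M]$, and $a \in [m, M]$,

$$
F_a(\lambda x_1 + (1 - \lambda) x_2) \le \lambda F_a(x_1) + (1 - \lambda) F_a(x_2) - \frac{\beta}{2} \lambda (1 - \lambda)(x_1 - x_2)^2.
$$

Since $D_{\mathrm{JS}}(p^\star, p) = \frac{1}{2} \int F_{p^\star}(p) \, d\mu$ by definition of the Jensen-Shannon divergence, we obtain, for all $p_1, p_2 \in \mathscr{P}$ with $p_1 \ne p_2$, and for all $\lambda \in (0, 1)$,

$$
\begin{aligned}
D_{\mathrm{JS}}(p^\star, \lambda p_1 + (1 - \lambda) p_2)
&= \frac{1}{2} \int F_{p^\star}(\lambda p_1 + (1 - \lambda) p_2) \, d\mu \\
&\le \lambda D_{\mathrm{JS}}(p^\star, p_1) + (1 - \lambda) D_{\mathrm{JS}}(p^\star, p_2) - \frac{\beta}{4} \lambda (1 - \lambda) \int (p_1 - p_2)^2 \, d\mu \\
&< \lambda D_{\mathrm{JS}}(p^\star, p_1) + (1 - \lambda) D_{\mathrm{JS}}(p^\star, p_2).
\end{aligned}
$$

In the last inequality, we used the fact that $\int (p_1 - p_2)^2 \, d\mu$ is positive and finite since $p_\theta \in L^2(\mu)$ for all $\theta$. We conclude that the function $p \mapsto D_{\mathrm{JS}}(p^\star, p)$ is strictly convex on the convex set $\mathscr{P}$. Therefore, its $\arg\min$ is either the empty set or a singleton. ∎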
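The strong convexity of $F_a$ used in the uniqueness argument can be checked by differentiating twice. The computation below is a sketch for $x \in (0, M]$, assuming $0 < m \le a \le M$ (with $m$ a positive lower bound on $a$); the explicit constant $\beta = m/(2M^2)$ is one admissible choice and is not claimed to be the sharpest one.

```latex
\[
\begin{aligned}
F_a(x)   &= a \ln\frac{2a}{a+x} + x \ln\frac{2x}{a+x}, \\
F_a'(x)  &= -\frac{a}{a+x} + \ln\frac{2x}{a+x} + 1 - \frac{x}{a+x}
          \;=\; \ln\frac{2x}{a+x}, \\
F_a''(x) &= \frac{1}{x} - \frac{1}{a+x} \;=\; \frac{a}{x\,(a+x)}
          \;\ge\; \frac{m}{M \cdot 2M} \;=\; \frac{m}{2M^2}.
\end{aligned}
\]
```

The lower bound uses $a \ge m$ in the numerator and $x \le M$, $a + x \le 2M$ in the denominator, which is why it does not depend on $a$.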

###### Remark 2.1.

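There are simple conditions for the model $\{P_\theta\}_{\theta \in \Theta}$ to be compact for the metric $\delta$. It is for example enough to suppose that $\Theta$ is compact, $\mathscr{P}$ is convex, and

1. For all $x \in E$, the function $\theta \mapsto p_\theta(x)$ is continuous on $\Theta$;

2. One has $\sup_{(\theta, \theta') \in \Theta^2} |p_\theta \ln p_{\theta'}| \in L^1(\mu)$.

Let us quickly check that under these conditions, $\{P_\theta\}_{\theta \in \Theta}$ is compact for the metric $\delta$. Since $\Theta$ is compact, by the sequential characterization of compact sets, it is enough to prove that if $(\theta_n)_n \subset \Theta$ converges to $\theta \in \Theta$, then $D_{\mathrm{JS}}(p_\theta, p_{\theta_n}) \to 0$. But,

$$
D_{\mathrm{JS}}(p_\theta, p_{\theta_n}) = \frac{1}{2} \int \Bigl[ p_\theta \ln\Bigl(\frac{2 p_\theta}{p_\theta + p_{\theta_n}}\Bigr) + p_{\theta_n} \ln\Bigl(\frac{2 p_{\theta_n}}{p_\theta + p_{\theta_n}}\Bigr) \Bigr] d\mu.
$$

By the convexity of $\mathscr{P}$, using conditions 1 and 2, the Lebesgue dominated convergence theorem shows that $D_{\mathrm{JS}}(p_\theta, p_{\theta_n}) \to 0$, whence the result.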

Interpreting the adversarial problem in connection with the optimization program $\inf_{\theta \in \Theta} D_{\mathrm{JS}}(p^\star, p_\theta)$ is a bit misleading, because this is based on the assumption that all possible discriminators are available (and in particular the optimal discriminator $D^\star_\theta$). In the end this means assuming that we know the distribution $p^\star$, which is ultimately not acceptable from a statistical perspective. In practice, the class of discriminators is always restricted to be a parametric family $\mathscr{D} = \{D_\alpha\}_{\alpha \in \Lambda}$, $\Lambda \subset \mathbb{R}^q$, and it is with this class that we have to work. From our point of view, problem (1) is a likelihood-type problem involving two parametric families $\mathscr{G}$ and $\mathscr{D}$, which must be analyzed as such, just as we would do for a classical maximum likelihood approach. In fact, it takes no more than a moment's thought to realize that the key lies in the approximation capabilities of the discriminator class $\mathscr{D}$ with respect to the functions $D^\star_\theta$, $\theta \in \Theta$. This is the issue that we discuss in the next section.

## 3 Approximation properties

In the remainder of the article, we assume that $\theta^\star$ exists, keeping in mind that Theorem 2.2 provides us with precise conditions guaranteeing its existence and its uniqueness. As pointed out earlier, in practice only a parametric class $\mathscr{D} = \{D_\alpha\}_{\alpha \in \Lambda}$, $\Lambda \subset \mathbb{R}^q$, is available, and it is therefore logical to consider the parameter $\bar\theta \in \Theta$ defined by

$$
\sup_{D \in \mathscr{D}} L(\bar\theta, D) \le \sup_{D \in \mathscr{D}} L(\theta, D), \quad \forall \theta \in \Theta.
$$

(We assume for now that $\bar\theta$ exists; sufficient conditions for this existence, relating to compactness of $\Theta$ and regularity of the model $\mathscr{P}$, will be given in the next section.) The density $p_{\bar\theta}$ is thus the best candidate to imitate $p_{\theta^\star}$, given the parametric families of generators $\mathscr{G}$ and discriminators $\mathscr{D}$. The natural question is then: is it possible to quantify the proximity between $p_{\bar\theta}$ and the ideal $p_{\theta^\star}$ via the approximation properties of the class $\mathscr{D}$? In other words, if $\mathscr{D}$ is growing, is it true that $p_{\bar\theta}$ approaches $p_{\theta^\star}$, and in the affirmative, in which sense and at which speed? Theorem 3.1 below provides a first answer to this important question, in terms of the difference $D_{\mathrm{JS}}(p^\star, p_{\bar\theta}) - D_{\mathrm{JS}}(p^\star, p_{\theta^\star})$. To state the result, we will need some assumptions.

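Assumption $(H_0)$ There exists a positive constant $\underline{t} \in (0, 1/2]$ such that

$$
\min(D^\star_\theta, 1 - D^\star_\theta) \ge \underline{t}, \quad \forall \theta \in \Theta.
$$

We note that this assumption implies that, for all $\theta \in \Theta$,

$$
\frac{\underline{t}}{1 - \underline{t}} \, p^\star \le p_\theta \le \frac{1 - \underline{t}}{\underline{t}} \, p^\star.
$$

It is a mild requirement, which implies in particular that for any $\theta$, $p_\theta$ and $p^\star$ have the same support, independent of $\theta$.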
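The two-sided bound on $p_\theta$ follows directly from the definition of the optimal discriminator in (2). As a short worked derivation, valid wherever $p^\star + p_\theta > 0$ and writing $\underline{t}$ for the positive lower bound on $\min(D^\star_\theta, 1 - D^\star_\theta)$:

```latex
\[
\begin{aligned}
D_\theta^\star = \frac{p^\star}{p^\star + p_\theta} \ge \underline{t}
  &\iff (1 - \underline{t})\, p^\star \ge \underline{t}\, p_\theta
  \iff p_\theta \le \frac{1 - \underline{t}}{\underline{t}}\, p^\star, \\
1 - D_\theta^\star = \frac{p_\theta}{p^\star + p_\theta} \ge \underline{t}
  &\iff (1 - \underline{t})\, p_\theta \ge \underline{t}\, p^\star
  \iff p_\theta \ge \frac{\underline{t}}{1 - \underline{t}}\, p^\star .
\end{aligned}
\]
```

In particular, $p_\theta$ vanishes exactly where $p^\star$ does, which is the common-support property.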

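Let $\|\cdot\|_\infty$ be the supremum norm of functions on $E$. Our next condition guarantees that the parametric class $\mathscr{D}$ is rich enough to approach the discriminator $D^\star_{\bar\theta}$.

Assumption $(H_\varepsilon)$ There exist $\varepsilon \in (0, \underline{t})$ and $D \in \mathscr{D}$, a $\bar\theta$-admissible discriminator, such that $\|D - D^\star_{\bar\theta}\|_\infty \le \varepsilon$.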

We are now equipped to state our approximation theorem. For ease of reading, its proof is postponed to Section 5.

###### Theorem 3.1.

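Under Assumptions $(H_0)$ and $(H_\varepsilon)$, there exists a positive constant $c$ (depending only upon $\underline{t}$) such that

$$
0 \le D_{\mathrm{JS}}(p^\star, p_{\bar\theta}) - D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) \le c \varepsilon^2. \tag{6}
$$

This theorem points out that if the class $\mathscr{D}$ is rich enough to approximate the discriminator $D^\star_{\bar\theta}$ in such a way that $\|D - D^\star_{\bar\theta}\|_\infty \le \varepsilon$ for some small $\varepsilon$, then replacing $D_{\mathrm{JS}}(p^\star, p_{\theta^\star})$ by $D_{\mathrm{JS}}(p^\star, p_{\bar\theta})$ has an impact which is not larger than an $O(\varepsilon^2)$ term. It shows in particular that the Jensen-Shannon divergence is a suitable criterion for the problem we are examining.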
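The quadratic dependence on $\varepsilon$ can be made plausible with a small numerical experiment. The sketch below is not the argument of Section 5; it only illustrates, on a finite set with arbitrary illustrative densities, that moving the optimal discriminator by at most $\varepsilon$ in supremum norm lowers the criterion by an amount of order $\varepsilon^2$, since the first-order term vanishes at a maximizer.

```python
import math
from itertools import product


def criterion(p_star, p_theta, d):
    """L(theta, D) on a finite set: sum of p* ln D + p_theta ln(1 - D)."""
    return sum(
        ps * math.log(dx) + pt * math.log(1.0 - dx)
        for ps, pt, dx in zip(p_star, p_theta, d)
    )


p_star = [0.5, 0.3, 0.2]   # illustrative target density
p_theta = [0.2, 0.2, 0.6]  # illustrative model density
d_star = [ps / (ps + pt) for ps, pt in zip(p_star, p_theta)]
best = criterion(p_star, p_theta, d_star)


def worst_gap(eps):
    """Largest loss L(D*) - L(D) over the sup-norm ball of radius eps.

    The loss is convex in D, so its maximum over the box is attained at a
    corner D = D* +/- eps (coordinate-wise); all corners are enumerated.
    """
    gaps = []
    for signs in product((-1.0, 1.0), repeat=len(d_star)):
        d = [ds + s * eps for ds, s in zip(d_star, signs)]
        gaps.append(best - criterion(p_star, p_theta, d))
    return max(gaps)


# eps stays below min(D*, 1 - D*) = 0.25, so D remains inside (0, 1).
epsilons = (0.1, 0.05, 0.025, 0.0125)
ratios = [worst_gap(eps) / eps ** 2 for eps in epsilons]
```

The ratios stay bounded as $\varepsilon$ decreases, which is the behavior one expects from a bound of the form $c \varepsilon^2$.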

## 4 Statistical analysis

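The data-dependent parameter $\hat\theta$ achieves the infimum of the adversarial problem (1). Practically speaking, it is this parameter that will be used in the end for producing fake data, via the associated generator $G_{\hat\theta}$. We first study in Subsection 4.1 the large sample properties of the distribution $p_{\hat\theta}$ via the criterion $D_{\mathrm{JS}}(p^\star, p_{\hat\theta})$, and then state in Subsection 4.2 the almost sure convergence and asymptotic normality of the parameter $\hat\theta$ as the sample size $n$ tends to infinity. Throughout, the parameter sets $\Theta$ and $\Lambda$ are assumed to be compact subsets of $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively. To simplify the analysis, we also assume that $\mu(E) < \infty$.

### 4.1 Asymptotic properties of $D_{\mathrm{JS}}(p^\star, p_{\hat\theta})$

From now on, we assume that we have at hand a parametric family of generators $\mathscr{G} = \{G_\theta\}_{\theta \in \Theta}$, $\Theta \subset \mathbb{R}^p$, and a parametric family of discriminators $\mathscr{D} = \{D_\alpha\}_{\alpha \in \Lambda}$, $\Lambda \subset \mathbb{R}^q$. We recall that the collection of probability densities associated with $\mathscr{G}$ is $\mathscr{P} = \{p_\theta\}_{\theta \in \Theta}$, where $G_\theta(Z) \stackrel{\mathscr{L}}{=} p_\theta \, d\mu$ and $Z$ is some low-dimensional noise random variable. In order to avoid any confusion, for a given discriminator $D = D_\alpha$ we use the notation $\hat L(\theta, \alpha)$ (respectively, $L(\theta, \alpha)$) instead of $\hat L(\theta, D)$ (respectively, $L(\theta, D)$) when useful. So,

$$
\hat L(\theta, \alpha) = \sum_{i=1}^n \ln D_\alpha(X_i) + \sum_{i=1}^n \ln\bigl(1 - D_\alpha \circ G_\theta(Z_i)\bigr),
$$

and

$$
L(\theta, \alpha) = \int \ln(D_\alpha) \, p^\star \, d\mu + \int \ln(1 - D_\alpha) \, p_\theta \, d\mu.
$$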

We will need the following regularity assumptions:

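Assumptions $(H_{\mathrm{reg}})$

1. $(H_D)$ There exists $\kappa \in (0, 1/2)$ such that, for all $\alpha \in \Lambda$, $\kappa \le D_\alpha \le 1 - \kappa$. In addition, the function $(x, \alpha) \mapsto D_\alpha(x)$ is of class $C^1$, with a uniformly bounded differential.

2. $(H_G)$ For all $z \in \mathbb{R}^{d'}$, the function $\theta \mapsto G_\theta(z)$ is of class $C^1$, uniformly bounded, with a uniformly bounded differential.

3. $(H_p)$ For all $x \in E$, the function $\theta \mapsto p_\theta(x)$ is of class $C^1$, uniformly bounded, with a uniformly bounded differential.

Note that under $(H_D)$, all discriminators in $\{D_\alpha\}_{\alpha \in \Lambda}$ are $\theta$-admissible, whatever $\theta$. All of these requirements are classic regularity conditions for statistical models, which imply in particular that the functions $\hat L(\theta, \alpha)$ and $L(\theta, \alpha)$ are continuous. Therefore, the compactness of $\Theta$ guarantees that $\hat\theta$ and $\bar\theta$ exist. Conditions for the existence of $\theta^\star$ are given in Theorem 2.2.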

We know from Theorem 3.1 that if the available class of discriminators $\mathscr{D}$ approaches the optimal discriminator $D^\star_{\bar\theta}$ by a distance not more than $\varepsilon$, then $D_{\mathrm{JS}}(p^\star, p_{\bar\theta}) - D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) = O(\varepsilon^2)$. It is therefore reasonable to expect that, asymptotically, the difference $D_{\mathrm{JS}}(p^\star, p_{\hat\theta}) - D_{\mathrm{JS}}(p^\star, p_{\theta^\star})$ will not be larger than a term proportional to $\varepsilon^2$, in some probabilistic sense. This is precisely the result of Theorem 4.1 below. In fact, most articles to date have focused on the development and analysis of optimization procedures (typically, stochastic-gradient-type algorithms) to compute $\hat\theta$, without really questioning its convergence properties as the data set grows. Although our statistical results are theoretical in nature, we believe that they are complementary to the optimization literature, insofar as they offer guarantees on the validity of the algorithms.

In addition to the regularity hypotheses and Assumption $(H_0)$, we will need the following requirement, which is a stronger version of $(H_\varepsilon)$:

Assumption $(H'_\varepsilon)$ There exists $\varepsilon \in (0, \underline{t})$ such that: for all $\theta \in \Theta$, there exists $D \in \mathscr{D}$, a $\theta$-admissible discriminator, such that $\|D - D^\star_\theta\|_\infty \le \varepsilon$.

We are ready to state our first statistical theorem.

###### Theorem 4.1.

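Under Assumptions $(H_0)$, $(H_{\mathrm{reg}})$, and $(H'_\varepsilon)$, one has

$$
\mathbb{E} D_{\mathrm{JS}}(p^\star, p_{\hat\theta}) - D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) = O\Bigl(\varepsilon^2 + \frac{1}{\sqrt{n}}\Bigr).
$$

###### Proof.

Fix $\varepsilon \in (0, \underline{t})$ as in Assumption $(H'_\varepsilon)$, and choose $\hat D \in \mathscr{D}$, a $\hat\theta$-admissible discriminator, such that $\|\hat D - D^\star_{\hat\theta}\|_\infty \le \varepsilon$. By repeating the arguments of the proof of Theorem 3.1 (with $\hat\theta$ instead of $\bar\theta$), we conclude that there exists a constant $c_1 > 0$ such that

$$
2 D_{\mathrm{JS}}(p^\star, p_{\hat\theta}) \le c_1 \varepsilon^2 + L(\hat\theta, \hat D) + \ln 4 \le c_1 \varepsilon^2 + \sup_{\alpha \in \Lambda} L(\hat\theta, \alpha) + \ln 4.
$$

Therefore,

$$
\begin{aligned}
2 D_{\mathrm{JS}}(p^\star, p_{\hat\theta})
&\le c_1 \varepsilon^2 + \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat L(\theta, \alpha) - L(\theta, \alpha)| + \sup_{\alpha \in \Lambda} \hat L(\hat\theta, \alpha) + \ln 4 \\
&= c_1 \varepsilon^2 + \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat L(\theta, \alpha) - L(\theta, \alpha)| + \inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \hat L(\theta, \alpha) + \ln 4 \quad \text{(by definition of } \hat\theta\text{)} \\
&\le c_1 \varepsilon^2 + 2 \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat L(\theta, \alpha) - L(\theta, \alpha)| + \inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} L(\theta, \alpha) + \ln 4.
\end{aligned}
$$

So,

$$
\begin{aligned}
2 D_{\mathrm{JS}}(p^\star, p_{\hat\theta})
&\le c_1 \varepsilon^2 + 2 \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat L(\theta, \alpha) - L(\theta, \alpha)| + \inf_{\theta \in \Theta} \sup_{D \in \mathscr{D}_\infty} L(\theta, D) + \ln 4 \\
&= c_1 \varepsilon^2 + 2 \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat L(\theta, \alpha) - L(\theta, \alpha)| + L(\theta^\star, D^\star_{\theta^\star}) + \ln 4 \quad \text{(by definition of } \theta^\star\text{)} \\
&= c_1 \varepsilon^2 + 2 D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) + 2 \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat L(\theta, \alpha) - L(\theta, \alpha)|.
\end{aligned}
$$

Thus, letting $c_2 = c_1 / 2$, we have

$$
D_{\mathrm{JS}}(p^\star, p_{\hat\theta}) - D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) \le c_2 \varepsilon^2 + \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat L(\theta, \alpha) - L(\theta, \alpha)|. \tag{7}
$$

Clearly, under Assumptions $(H_D)$, $(H_G)$, and $(H_p)$, the process $(\hat L(\theta, \alpha) - L(\theta, \alpha))_{\theta \in \Theta, \alpha \in \Lambda}$ is subgaussian (e.g., van Handel, 2016, Chapter 5) for the distance $d = \|\cdot\| / \sqrt{n}$, where $\|\cdot\|$ is the standard Euclidean norm on $\mathbb{R}^p \times \mathbb{R}^q$. Let $N(\Theta \times \Lambda, \|\cdot\|, u)$ denote the $u$-covering number of $\Theta \times \Lambda$ for the distance $\|\cdot\|$. Then, by Dudley's inequality (van Handel, 2016, Corollary 5.25),

$$
\mathbb{E} \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat L(\theta, \alpha) - L(\theta, \alpha)| \le \frac{12}{\sqrt{n}} \int_0^\infty \sqrt{\ln N(\Theta \times \Lambda, \|\cdot\|, u)} \, du. \tag{8}
$$

Since $\Theta$ and $\Lambda$ are bounded, there exists $r > 0$ such that $N(\Theta \times \Lambda, \|\cdot\|, u) = 1$ for $u \ge r$ and

$$
N(\Theta \times \Lambda, \|\cdot\|, u) = O\Bigl(\Bigl(\frac{1}{u}\Bigr)^{p+q}\Bigr) \quad \text{for } u < r.
$$

Combining this inequality with (7) and (8), we obtain

$$
\mathbb{E} D_{\mathrm{JS}}(p^\star, p_{\hat\theta}) - D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) \le c_3 \Bigl(\varepsilon^2 + \frac{1}{\sqrt{n}}\Bigr),
$$

for some positive constant $c_3$. The conclusion follows by observing that, by (5),

$$
D_{\mathrm{JS}}(p^\star, p_{\theta^\star}) \le D_{\mathrm{JS}}(p^\star, p_{\hat\theta}).
$$

∎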

Theorem 4.1 is illustrated in Figure 1, which shows the approximate values of $\mathbb{E} D_{\mathrm{JS}}(p^\star, p_{\hat\theta})$. We took for $p^\star$ a centered logistic density with a fixed scale parameter, and let $\mathscr{G}$ and $\mathscr{D}$ be two fully connected neural networks parameterized by weights and offsets. The noise random variable follows a uniform distribution, and the parameters of $\mathscr{G}$ and $\mathscr{D}$ are chosen in a sufficiently large compact set. In order to illustrate the impact of $\varepsilon$ in Theorem 4.1, we fixed the sample size $n$ to a large value and varied the number of layers of the discriminators from 2 to 5, keeping in mind that a larger number of layers results in a smaller $\varepsilon$. To diversify the setting, we also varied the number of layers of the generators from 2 to 3. The expectation was estimated by averaging over 30 repetitions (the number of runs was kept small because of the computational cost). Note that we do not pay attention to the exact value of the constant term $D_{\mathrm{JS}}(p^\star, p_{\theta^\star})$, which is intractable in our setting.