# Smoothness and Stability in GANs

## Abstract

Generative adversarial networks, or GANs, commonly display unstable behavior during training. In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence that is minimized by the GAN and the generator’s architecture. We find that existing GAN variants satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously. In the process, we explain and clarify the need for various existing GAN stabilization techniques, including Lipschitz constraints, gradient penalties, and smooth activation functions.

## 1 Introduction: taming instability with smoothness

Generative adversarial networks (Goodfellow et al., 2014), or GANs, are a powerful class of generative models defined through a minimax game. GANs and their variants have shown impressive performance in synthesizing various types of datasets, especially natural images. Despite these successes, the training of GANs remains quite unstable, and this instability remains difficult to understand theoretically.

Since the introduction of GANs, there have been many techniques proposed to stabilize GAN training, including studies of new generator/discriminator architectures, loss functions, and regularization techniques. Notably, Arjovsky et al. (2017) proposed Wasserstein GAN (WGAN), which in principle avoids instability caused by mismatched generator and data distribution supports. In practice, this is enforced by Lipschitz constraints, which in turn motivated developments like gradient penalties (Gulrajani et al., 2017) and spectral normalization (Miyato et al., 2018). Indeed, these stabilization techniques have proven essential to achieving the latest state-of-the-art results (Karras et al., 2018; Brock et al., 2019).

On the other hand, a solid theoretical understanding of training stability has not been established. Several empirical observations point to an incomplete understanding. For example, why does applying a gradient penalty together with spectral normalization seem to improve performance (Miyato et al., 2018), even though in principle they serve the same purpose? Why does applying only spectral normalization with the Wasserstein loss fail (Miyato, 2018), even though the analysis of Arjovsky et al. (2017) suggests it should be sufficient? Why is applying gradient penalties effective, even outside their original context of the Wasserstein GAN (Fedus et al., 2018)?

In this work, we develop a framework to analyze the stability of GAN training that resolves these apparent contradictions and clarifies the roles of these regularization techniques. Our approach considers the smoothness of the loss function used. In optimization, smoothness is a well-known condition that ensures that gradient descent and its variants become stable (see e.g., Bertsekas (1999)). For example, the following well-known proposition is the starting point of our stability analysis:

###### Proposition 1 (Bertsekas (1999), Proposition 1.2.3).

Suppose $f: \mathbb{R}^p \to \mathbb{R}$ is $L$-smooth and bounded below. Let $x_{k+1} := x_k - \frac{1}{L} \nabla f(x_k)$. Then $\|\nabla f(x_k)\| \to 0$ as $k \to \infty$.

This proposition says that under a smoothness condition on the function, gradient descent with a constant step size approaches stationarity (i.e., the gradient norm approaches zero). This is a rather weak notion of convergence, as it does not guarantee that the iterates converge to a point, and even if the iterates do converge, the limit is a stationary point and not necessarily a minimizer.
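To make the stationarity statement concrete, the following minimal sketch runs gradient descent with the constant step size $1/L$ on a toy function. The test function $f(x) = \sum_i \log\cosh(x_i)$, the starting point, and all names are illustrative assumptions rather than anything used in this paper; the function is $1$-smooth and bounded below by $0$, so the proposition applies with $L = 1$.

```python
import numpy as np

def gradient_descent(grad, x0, L, steps):
    """Iterate x_{k+1} = x_k - (1/L) * grad(x_k) and record the gradient norms."""
    x = np.asarray(x0, dtype=float)
    norms = []
    for _ in range(steps):
        g = grad(x)
        norms.append(float(np.linalg.norm(g)))
        x = x - g / L
    return x, norms

def grad_f(x):
    # Gradient of the assumed toy function f(x) = sum_i log(cosh(x_i)).
    # Its Hessian is diag(1 - tanh(x_i)^2), so f is 1-smooth.
    return np.tanh(x)

x_final, grad_norms = gradient_descent(grad_f, x0=[3.0, -2.0], L=1.0, steps=200)
print(grad_norms[0], grad_norms[-1])
```

The recorded gradient norms shrink toward zero, which is exactly the (weak) notion of convergence that the proposition guarantees; nothing in the sketch certifies that the limit is a minimizer in general.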

Nevertheless, empirically, not even this stationarity is satisfied by GANs, which are known to frequently destabilize and diverge during training. To diagnose this instability, we consider the smoothness of the GAN’s loss function. GANs are typically framed as minimax problems of the form

$$\inf_{\theta} \sup_{\varphi} J(\mu_\theta, \varphi), \tag{1}$$

where $J$ is a loss function that takes a generator distribution $\mu_\theta$ and discriminator $\varphi$, and $\theta \in \mathbb{R}^p$ denotes the parameters of the generator. Unfortunately, the minimax nature of this problem makes stability and convergence difficult to analyze. To make the analysis more tractable, we define $J(\mu) := \sup_{\varphi} J(\mu, \varphi)$, so that (1) becomes simply

$$\inf_{\theta} J(\mu_\theta). \tag{2}$$

This choice corresponds to the common assumption that the discriminator is allowed to reach optimality at every training step. Now, the GAN algorithm can be regarded as simply gradient descent on the function $\theta \mapsto J(\mu_\theta)$, which may be analyzed using Proposition 1. In particular, if this function satisfies the smoothness assumption, then the GAN training should be stable in that it should approach stationarity under the assumption of an optimal discriminator.

In the remainder of this paper, we investigate whether the smoothness assumption is satisfied for various GAN losses. Our analysis answers two questions:

1. Which existing GAN losses, if any, satisfy the smoothness condition in Proposition 1?

2. Are there choices of loss, regularization, or architecture that enforce smoothness in GANs?

As a result of our analysis, our contributions are as follows:

1. We derive sufficient conditions for the GAN algorithm to be stationary under certain assumptions (Section 2). Our conditions relate to the smoothness of the GAN loss used as well as the parameterization of the generator.

2. We show that most common GAN losses do not satisfy all of the smoothness conditions, thereby corroborating their empirical instability.

3. We develop regularization techniques that enforce the smoothness conditions. These regularizers recover common GAN stabilization techniques such as gradient penalties and spectral normalization, thereby placing their use on a firmer theoretical foundation.

4. Our analysis provides several practical insights, suggesting for example the use of smooth activation functions, simultaneous spectral normalization and gradient penalties, and a particular learning rate for the generator.

### 1.1 Related work

Divergence minimization. Our analysis regards the GAN algorithm as minimizing a divergence between the current generator distribution and the desired data distribution, under the assumption of an optimal discriminator at every training step. This perspective originates from the earliest GAN paper, in which Goodfellow et al. (2014) show that the original minimax GAN implicitly minimizes the Jensen–Shannon divergence. Since then, the community has introduced a large number of GAN or GAN-like variants that learn generative models by implicitly minimizing various divergences, including $f$-divergences (Nowozin et al., 2016), Wasserstein distance (Arjovsky et al., 2017), and maximum-mean discrepancy (Li et al., 2015; Unterthiner et al., 2018). Meanwhile, the non-saturating GAN (Goodfellow et al., 2014) has been shown to minimize a certain Kullback–Leibler divergence (Arjovsky and Bottou, 2017). Several more theoretical works consider the topological, geometric, and convexity properties of divergence minimization (Arjovsky and Bottou, 2017; Liu et al., 2017; Bottou et al., 2018; Farnia and Tse, 2018; Chu et al., 2019), perspectives that we draw heavily upon. Sanjabi et al. (2018) also prove smoothness of GAN losses in the specific case of the regularized optimal transport loss. Their assumption for smoothness is entangled in that it involves a composite condition on generators and discriminators, while our analysis addresses them separately.

Other approaches. Even though many analyses, including ours, operate under the assumption of an optimal discriminator, this assumption is unrealistic in practice. Li et al. (2017b) contrast these optimal discriminator dynamics with first-order dynamics, which assume that the generator and discriminator use alternating gradient updates and are what is used computationally. As this approach differs from ours, we only briefly mention some results in this area, which typically rely on game-theoretic notions (Kodali et al., 2017; Grnarova et al., 2018; Oliehoek et al., 2018) or local analysis (Nagarajan and Kolter, 2017; Mescheder et al., 2018). Some of these results rely on continuous dynamics approximations of gradient updates; in contrast, our work focuses on discrete dynamics.

### 1.2 Notation

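Let $\bar{\mathbb{R}} := \mathbb{R} \cup \{\infty, -\infty\}$. We let $\mathcal{P}(X)$ denote the set of all probability measures on a compact set $X \subseteq \mathbb{R}^d$. We let $\mathcal{M}(X)$ and $\mathcal{C}(X)$ denote the dual pair consisting of the set of all finite signed measures on $X$ and the set of all continuous functions $X \to \mathbb{R}$. For any statement $A$, we let $\chi\{A\}$ be $0$ if $A$ is true and $\infty$ if $A$ is false. For a Euclidean vector $x$, its Euclidean norm is denoted by $\|x\|_2$, and the operator norm of a matrix $A$ is denoted by $\|A\|_2$, i.e., $\|A\|_2 = \sup_{\|x\|_2 \le 1} \|Ax\|_2$. A function $f: X \to Y$ between two metric spaces is $\alpha$-Lipschitz if $d_Y(f(x_1), f(x_2)) \le \alpha\, d_X(x_1, x_2)$. A function $f: \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth if its gradients are $\beta$-Lipschitz, that is, for all $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\|_2 \le \beta \|x - y\|_2$.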

## 2 Smoothness of GAN losses

This section presents Theorem 2, which provides concise criteria for the smoothness of GAN losses.

In order to keep our analysis agnostic to the particular GAN used, let $J: \mathcal{M}(X) \to \bar{\mathbb{R}}$ be an arbitrary convex loss function, which takes a distribution over $X$ and outputs a real number. Note that the typical minimax formulation of GANs can be recovered from just the loss function $J$ using convex duality. In particular, recall that the convex conjugate $J^\star: \mathcal{C}(X) \to \bar{\mathbb{R}}$ of $J$ satisfies the following remarkable duality, known as the Fenchel–Moreau theorem:

$$J^\star(\varphi) := \sup_{\mu \in \mathcal{M}(X)} \int \varphi(x)\,d\mu - J(\mu), \qquad J(\mu) = \sup_{\varphi \in \mathcal{C}(X)} \int \varphi(x)\,d\mu - J^\star(\varphi). \tag{3}$$

Based on this duality, minimizing $J$ can be framed as the minimax problem

$$\inf_{\mu \in \mathcal{P}(X)} J(\mu) = \inf_{\mu \in \mathcal{P}(X)} \sup_{\varphi \in \mathcal{C}(X)} \int \varphi(x)\,d\mu - J^\star(\varphi) := \inf_{\mu \in \mathcal{P}(X)} \sup_{\varphi \in \mathcal{C}(X)} J(\mu, \varphi), \tag{4}$$

recovering the well-known adversarial formulation of GANs. We now define the notion of an optimal discriminator for an arbitrary loss function $J$, based on this convex duality:

###### Definition 1 (Optimal discriminator).

Let $J: \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a convex, l.s.c., proper function. An optimal discriminator for a probability distribution $\mu \in \mathcal{P}(X)$ is a continuous function $\Phi_\mu: X \to \mathbb{R}$ that attains the maximum of the second equation in (3), i.e., $J(\mu) = \int \Phi_\mu(x)\,d\mu - J^\star(\Phi_\mu)$.

This definition recovers the optimal discriminators of many existing GAN and GAN-like algorithms (Farnia and Tse, 2018; Chu et al., 2019), most notably those in Table 1. Our analysis will apply to any algorithm in this family of algorithms. See Appendix B for more details on this perspective.

We also formalize the notion of a family of generators:

###### Definition 2 (Family of generators).

A family of generators is a set of pushforward probability measures $\{\mu_\theta = f_{\theta\#}\omega : \theta \in \mathbb{R}^p\}$, where $\omega$ is a fixed probability distribution on $Z$ (the latent variable) and $f_\theta: Z \to X$ is a measurable function (the generator).

Now, in light of Proposition 1, we are interested in the smoothness of the mapping $\theta \mapsto J(\mu_\theta)$, which would guarantee the stationarity of gradient descent on this objective, which in turn implies stationarity of the GAN algorithm under the assumption of an optimal discriminator. The following theorem is our central result, which decomposes the smoothness of $\theta \mapsto J(\mu_\theta)$ into conditions on optimal discriminators and the family of generators.

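###### Theorem 2 (Smoothness decomposition for GANs).

Let $J: \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a convex function whose optimal discriminators $\Phi_\mu: X \to \mathbb{R}$ satisfy the following regularity conditions:

• (D1) $x \mapsto \Phi_\mu(x)$ is $\alpha$-Lipschitz,

• (D2) $x \mapsto \nabla_x \Phi_\mu(x)$ is $\beta_1$-Lipschitz,

• (D3) $\mu \mapsto \nabla_x \Phi_\mu(x)$ is $\beta_2$-Lipschitz w.r.t. the $1$-Wasserstein distance.

Also, let $\mu_\theta$ be a family of generators that satisfies:

• (G1) $\theta \mapsto f_\theta(z)$ is $A$-Lipschitz in expectation for $z \sim \omega$, i.e., $\mathbb{E}_{z \sim \omega}[\|f_\theta(z) - f_{\theta'}(z)\|_2] \le A \|\theta - \theta'\|_2$, and

• (G2) $\theta \mapsto D_\theta f_\theta(z)$ is $B$-Lipschitz in expectation for $z \sim \omega$, i.e., $\mathbb{E}_{z \sim \omega}[\|D_\theta f_\theta(z) - D_{\theta'} f_{\theta'}(z)\|_2] \le B \|\theta - \theta'\|_2$.

Then $\theta \mapsto J(\mu_\theta)$ is $L$-smooth, with $L = \alpha B + A^2(\beta_1 + \beta_2)$.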

Theorem 2 connects the smoothness properties of the loss function $J$ with the smoothness properties of the optimal discriminator $\Phi_\mu$, and once paired with Proposition 1, it suggests a quantitative value $1/L$ for a stable generator learning rate. In order to obtain claims of stability for practically sized learning rates, it is important to tightly bound the relevant constants.

In Sections 4, 5, and 6, we carefully analyze which GAN losses satisfy (D1), (D2), and (D3), and with what constants. We summarize our results in Table 2: none of the listed losses, except for one, satisfies (D1), (D2), and (D3) simultaneously with a finite constant. The MMD-based loss satisfies the three conditions, but its constant for (D1) grows with the data dimension, an unfavorable dependence that forces an unacceptably small learning rate. See the appendix for complete details of each condition. This failure of existing GANs to satisfy the stationarity conditions corroborates the observed instability of GANs.

Theorem 2 decomposes smoothness into conditions on the generator and conditions on the discriminator, allowing a clean separation of concerns. In this paper, we focus on the discriminator conditions (D1), (D2), and (D3) and only provide an extremely simple example of a generator that satisfies (G1) and (G2), in Section 7. Because analysis of the generator conditions may become quite complicated and will vary with the choice of architecture considered (feedforward, convolutional, ResNet, etc.), we leave a detailed analysis of the generator conditions (G1) and (G2) as a promising avenue for future work. Indeed, such analyses may lead to new generator architectures or generator regularization techniques that stabilize GAN training.

## 3 Enforcing smoothness with inf-convolutions

In this section, we present a generic regularization technique that imposes the three conditions sufficient for stable learning on an arbitrary loss function $J$, thereby stabilizing training. In Section 2, we observe that the Wasserstein, IPM, and MMD losses respectively satisfy (D1), (D2), and (D3) individually, but not all of them at the same time. Using techniques from convex analysis, we convert these three GAN losses into three regularizers that, when applied simultaneously, cause the resulting loss to satisfy all three conditions. Here, we only outline the technique; the specifics of each case are deferred to Sections 4, 5, and 6.

We start with an arbitrary base loss function $J$ to be regularized. Next, we take an existing GAN loss that satisfies the desired regularity condition and convert it into a regularizer function $R$. Then, we consider $J \oplus R$, which denotes the inf-convolution defined as

$$(J \oplus R)(\xi) = \inf_{\tilde{\xi} \in \mathcal{M}(X)} J(\tilde{\xi}) + R(\xi - \tilde{\xi}). \tag{5}$$

This new function inherits the regularity of $R$, making it a stable candidate as a GAN loss. Moreover, because the inf-convolution is a commutative operation, we can sequentially apply multiple regularizers $R_1$, $R_2$, and $R_3$ without destroying the added regularity. In particular, if we carefully choose functions $R_1$, $R_2$, and $R_3$, then $J \oplus R_1 \oplus R_2 \oplus R_3$ will satisfy (D1), (D2), and (D3) simultaneously. Moreover, under some technical assumptions, this composite function inherits the original minimizers of $J$, making it a sensible GAN loss:

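###### Proposition (Invariance of minimizers).

Let $R_1$, $R_2$, and $R_3$ be the three regularizers defined by (8), (12), and (19) respectively. Assume that $J: \mathcal{M}(X) \to \bar{\mathbb{R}}$ has a unique minimizer at $\mu_0$ with $J(\mu_0) = 0$, and $J(\mu) \ge c \|\hat{\mu} - \hat{\mu}_0\|_{\mathcal{H}}$ for some $c > 0$. Then the inf-convolution $\tilde{J} := J \oplus R_1 \oplus R_2 \oplus R_3$ has a unique minimizer at $\mu_0$ with $\tilde{J}(\mu_0) = 0$.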

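The duality formulation (4) provides a practical method for minimizing this composite function. We leverage the duality relation $(J \oplus R)^\star = J^\star + R^\star$ and apply (4):

\begin{align}
\inf_{\mu} (J \oplus R_1 \oplus R_2 \oplus R_3)(\mu) &= \inf_{\mu} \sup_{\varphi} \int \varphi\,d\mu - J^\star(\varphi) - R_1^\star(\varphi) - R_2^\star(\varphi) - R_3^\star(\varphi) \tag{6} \\
&= \inf_{\mu} \sup_{\varphi} J(\mu, \varphi) - R_1^\star(\varphi) - R_2^\star(\varphi) - R_3^\star(\varphi). \tag{7}
\end{align}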

This minimax problem can be seen as a GAN whose discriminator objective has three added regularization terms.

The concrete forms of these regularizers are summarized in Table 3. Notably, we observe that we recover standard techniques for stabilizing GANs:

• (D1) is enforced by Lipschitz constraints (i.e., spectral normalization) on the discriminator.

• (D2) is enforced by spectral normalization and a choice of Lipschitz, smooth activation functions for the discriminator.

• (D3) is enforced by gradient penalties on the discriminator.

Our analysis therefore puts these regularization techniques on a firm theoretical foundation (Proposition 1 and Theorem 2) and provides insight into their function.

## 4 Enforcing (D1) with Lipschitz constraints

In this section, we show that enforcing (D1) leads to techniques and notions commonly used to stabilize GANs, including the Wasserstein distance, Lipschitz constraints and spectral normalization. Recall that (D1) demands that the optimal discriminator is Lipschitz:

(D1) $x \mapsto \Phi_\mu(x)$ is $\alpha$-Lipschitz for all $\mu \in \mathcal{P}(X)$, i.e., $|\Phi_\mu(x) - \Phi_\mu(y)| \le \alpha \|x - y\|_2$.

If $\Phi_\mu$ is differentiable, this is equivalent to the optimal discriminator having a gradient with bounded norm. This is a sensible criterion, since a discriminator whose gradient norm is too large may push the generator too hard and destabilize its training.

To check (D1), the following proposition shows that it suffices to check whether $|J(\mu) - J(\nu)| \le \alpha W_1(\mu, \nu)$ for all distributions $\mu, \nu$:

###### Proposition.

(D1) holds if and only if $J$ is $\alpha$-Lipschitz w.r.t. the Wasserstein-1 distance.

Arjovsky et al. (2017) show that this property does not hold for common divergences based on the Kullback–Leibler or Jensen–Shannon divergence, while it does hold for the Wasserstein-1 distance. Indeed, it is this desirable property that motivates their introduction of the Wasserstein GAN. Framed in our context, their result is summarized as follows:

###### Proposition.

The minimax and non-saturating GAN losses do not satisfy (D1) for some $\mu_0$.

###### Proposition.

The Wasserstein GAN loss satisfies (D1) with $\alpha = 1$ for any $\mu_0$.

Our stability analysis therefore deepens the analysis of Arjovsky et al. (2017) and provides an alternative reason that the Wasserstein distance is desirable as a metric: it is part of a sufficient condition that ensures stationarity of gradient descent.

### 4.1 From Wasserstein distance to Lipschitz constraints

Having identified the Wasserstein GAN loss as one that satisfies (D1), we next follow the strategy outlined in Section 3 to convert it into a regularizer for an arbitrary loss function. Towards this, we define the regularizer $R_1$ and compute its convex conjugate $R_1^\star$:

$$R_1(\xi) := \alpha \|\xi\|_{\mathrm{KR}} = \alpha \sup_{\substack{f \in \mathcal{C}(X) \\ \|f\|_{\mathrm{Lip}} \le 1}} \int f\,d\xi, \qquad R_1^\star(\varphi) = \begin{cases} 0 & \|\varphi\|_{\mathrm{Lip}} \le \alpha \\ \infty & \text{otherwise.} \end{cases} \tag{8}$$

This norm is the Kantorovich–Rubinstein norm (KR norm), which extends the Wasserstein-1 distance to $\mathcal{M}(X)$; it holds that $\|\mu - \nu\|_{\mathrm{KR}} = W_1(\mu, \nu)$ for $\mu, \nu \in \mathcal{P}(X)$. Then, its inf-convolution with an arbitrary function inherits the Lipschitz property held by $R_1$:

###### Proposition (Pasch–Hausdorff).

Let $J: \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a function, and define $\tilde{J} := J \oplus R_1$. Then $\tilde{J}$ is $\alpha$-Lipschitz w.r.t. the distance induced by the KR norm, and hence the Wasserstein-1 distance when restricted to $\mathcal{P}(X)$.

By this proposition and the characterization of (D1) above, we now obtain a transformed loss function $\tilde{J}$ that automatically satisfies (D1). This function is a generalization of the Pasch–Hausdorff envelope (see Chapter 9 in Rockafellar and Wets (1998)), also known as Lipschitz regularization or the McShane–Whitney extension (McShane, 1934; Whitney, 1934; Kirszbraun, 1934; Hiriart-Urruty, 1980).

The convex conjugate computation in (8) shows that $\tilde{J}$ can be minimized in practice by imposing Lipschitz constraints on discriminators. Indeed, by (4),

\begin{align}
\inf_{\mu} (J \oplus \alpha \|\cdot\|_{\mathrm{KR}})(\mu) &= \inf_{\mu} \sup_{\varphi} \mathbb{E}_{x \sim \mu}[\varphi(x)] - J^\star(\varphi) - \chi\{\|\varphi\|_{\mathrm{Lip}} \le \alpha\} \tag{9} \\
&= \inf_{\mu} \sup_{\varphi:\ \|\varphi\|_{\mathrm{Lip}} \le \alpha} J(\mu, \varphi). \tag{10}
\end{align}

Farnia and Tse (2018) consider this loss in the special case of an $f$-GAN; they showed that minimizing the transformed loss corresponds to training an $f$-GAN normally but constraining the discriminator to be Lipschitz. We show that this technique is in fact generic for any $J$: minimizing the transformed loss can be achieved by training the GAN as normal, but imposing a Lipschitz constraint on the discriminator.

Our analysis therefore justifies the use of Lipschitz constraints, such as spectral normalization (Miyato et al., 2018) and weight clipping (Arjovsky and Bottou, 2017), for general GAN losses. However, Theorem 2 also suggests that applying only Lipschitz constraints may not be enough to stabilize GANs, as (D1) alone does not ensure that the GAN objective is smooth.
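As an illustration of how a Lipschitz constraint can be imposed on a single linear layer, the sketch below estimates the largest singular value of a weight matrix by power iteration and rescales the matrix to spectral norm one. It follows the general idea of spectral normalization but is not the implementation of Miyato et al. (2018); the function name, iteration count, and random test matrix are assumptions made for the example.

```python
import numpy as np

def spectral_normalize(W, n_iters=200, seed=0):
    """Rescale W to have spectral norm ~1 using power iteration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12   # approximate right singular vector
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12   # approximate left singular vector
    sigma = float(u @ W @ v)             # estimate of the largest singular value
    return W / sigma, sigma

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 5))
W_sn, sigma_est = spectral_normalize(W)
print(sigma_est, np.linalg.norm(W_sn, 2))
```

A linear map with spectral norm at most one is $1$-Lipschitz in the Euclidean norm, so composing such layers with $1$-Lipschitz activations and a final scaling by $\alpha$ gives an $\alpha$-Lipschitz discriminator, which is the constraint appearing in (10).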

## 5 Enforcing (D2) with discriminator smoothness

(D2) demands that the optimal discriminator is smooth:

(D2) $x \mapsto \nabla \Phi_\mu(x)$ is $\beta_1$-Lipschitz for all $\mu \in \mathcal{P}(X)$, i.e., $\|\nabla \Phi_\mu(x) - \nabla \Phi_\mu(y)\|_2 \le \beta_1 \|x - y\|_2$.

Intuitively, this says that for a fixed generator $\mu$, the optimal discriminator should not provide gradients that change too much spatially.

Although the Wasserstein GAN loss satisfies (D1), we see that it, along with the minimax GAN and the non-saturating GAN, does not satisfy (D2):

###### Proposition.

The Wasserstein, minimax, and non-saturating GAN losses do not satisfy (D2) for some $\mu_0$.

We now construct a loss that by definition satisfies (D2). Let $\mathcal{S}$ be the class of $1$-smooth functions, that is, functions $f$ for which $\|\nabla f(x) - \nabla f(y)\|_2 \le \|x - y\|_2$, and consider the integral probability metric (IPM) (Müller, 1997) w.r.t. $\mathcal{S}$, defined by

$$\mathrm{IPM}_{\mathcal{S}}(\mu, \nu) := \sup_{f \in \mathcal{S}} \int f\,d\mu - \int f\,d\nu. \tag{11}$$

The optimal discriminator for the loss $\mathrm{IPM}_{\mathcal{S}}(\mu, \mu_0)$ is the function that maximizes the supremum in the definition. This function by definition belongs to $\mathcal{S}$ and therefore is $1$-smooth. Hence, this IPM loss satisfies (D2) with $\beta_1 = 1$ by construction.

### 5.1 From integral probability metric to smooth discriminators

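Having identified the IPM-based loss as one that satisfies (D2), we next follow the strategy outlined in Section 3 to convert it into a regularizer for an arbitrary loss function. To do so, we define a regularizer $R_2$ and compute its convex conjugate $R_2^\star$:

$$R_2(\xi) := \beta_1 \|\xi\|_{\mathcal{S}^*} = \beta_1 \sup_{f \in \mathcal{S}} \int f\,d\xi, \qquad R_2^\star(\varphi) = \begin{cases} 0 & \varphi \in \beta_1 \mathcal{S} \\ \infty & \text{otherwise.} \end{cases} \tag{12}$$

The norm $\|\cdot\|_{\mathcal{S}^*}$ is the dual norm to $\|\cdot\|_{\mathcal{S}}$, which extends the IPM to signed measures; it holds that $\mathrm{IPM}_{\mathcal{S}}(\mu, \nu) = \|\mu - \nu\|_{\mathcal{S}^*}$ for $\mu, \nu \in \mathcal{P}(X)$. Similar to the situation in the previous section, inf-convolution preserves the smoothness property of $R_2$:

###### Proposition.

Let $J: \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a convex, proper, lower semicontinuous function, and define $\tilde{J} := J \oplus R_2$. Then the optimal discriminator for $\tilde{J}$ is $\beta_1$-smooth.

Applying (4) and (12), we see that we can minimize this transformed loss function by restricting the family of discriminators to only $\beta_1$-smooth discriminators:

\begin{align}
\inf_{\mu} (J \oplus \beta_1 \|\cdot\|_{\mathcal{S}^*})(\mu) &= \inf_{\mu} \sup_{\varphi} \mathbb{E}_{x \sim \mu}[\varphi(x)] - J^\star(\varphi) - \chi\{\varphi \in \beta_1 \mathcal{S}\} \tag{13} \\
&= \inf_{\mu} \sup_{\varphi \in \beta_1 \mathcal{S}} J(\mu, \varphi). \tag{14}
\end{align}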

In practice, we can enforce this by applying spectral normalization (Miyato et al., 2018) and using a Lipschitz, smooth activation function such as ELU (Clevert et al., 2016) or sigmoid.

###### Proposition.

Let $f: \mathbb{R}^d \to \mathbb{R}$ be a neural network consisting of $k$ layers whose linear transformations have spectral norm $1$ and whose activation functions are $1$-Lipschitz and $1$-smooth. Then $f$ is $k$-smooth.
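The role of the activation function in this proposition can be checked numerically. The sketch below estimates, on a grid, the Lipschitz constant of the ELU activation and of its derivative, and contrasts this with ReLU, whose derivative jumps at the origin. The grid, the finite-difference estimates, and the function names are assumptions made for illustration; a grid check is evidence, not a proof.

```python
import numpy as np

def elu_grad(x):
    # Derivative of ELU(x) = x for x > 0 and exp(x) - 1 otherwise.
    return np.where(x > 0, 1.0, np.exp(np.minimum(x, 0.0)))

def relu_grad(x):
    # Derivative of ReLU (taken to be 0 at the origin).
    return (x > 0).astype(float)

xs = np.linspace(-10.0, 10.0, 20001)
h = np.diff(xs)

# Largest slope of the activation itself (Lipschitz constant estimate).
elu_lipschitz = float(np.max(np.abs(elu_grad(xs))))

# Largest difference quotient of the derivative (smoothness constant estimate).
elu_smoothness = float(np.max(np.abs(np.diff(elu_grad(xs))) / h))
relu_smoothness = float(np.max(np.abs(np.diff(relu_grad(xs))) / h))

print(elu_lipschitz, elu_smoothness, relu_smoothness)
```

On this grid the ELU estimates stay at or below one, consistent with ELU being $1$-Lipschitz and $1$-smooth, while the ReLU estimate scales like the inverse grid spacing, reflecting that ReLU has no finite smoothness constant.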

## 6 Enforcing (D3) with gradient penalties

(D3) is the following smoothness condition:

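(D3) $\mu \mapsto \nabla \Phi_\mu(x)$ is $\beta_2$-Lipschitz for any $x \in X$, i.e., $\|\nabla \Phi_\mu(x) - \nabla \Phi_\nu(x)\|_2 \le \beta_2 W_1(\mu, \nu)$.

(D3) requires that the gradients of the optimal discriminator do not change too rapidly in response to changes in $\mu$. Indeed, if the discriminator’s gradients are too sensitive to changes in the generator, the generator may not be able to accurately follow those gradients as it updates itself using a finite step size. In finite-dimensional optimization of a function $f: \mathbb{R}^d \to \mathbb{R}$, this condition is analogous to $f$ having a Lipschitz gradient.

We now present an equivalent characterization of (D3) that is easier to check in practice. We define the Bregman divergence of a convex function $J$ by

$$D_J(\nu, \mu) := J(\nu) - J(\mu) - \int \Phi_\mu(x)\,d(\nu - \mu), \tag{15}$$

where $\Phi_\mu$ is the optimal discriminator for $J$ at $\mu$. Then, (D3) is characterized in terms of the Bregman divergence and the KR norm as follows:

###### Proposition.

Let $J: \mathcal{M}(X) \to \bar{\mathbb{R}}$ be a convex function. Then $J$ satisfies (D3) if and only if $D_J(\nu, \mu) \le \frac{\beta_2}{2} \|\mu - \nu\|_{\mathrm{KR}}^2$ for all $\mu, \nu \in \mathcal{M}(X)$.

It is straightforward to compute the Bregman divergence corresponding to several popular GANs:

\begin{align}
D_{D_{\mathrm{JS}}(\cdot \,\|\, \mu_0)}(\nu, \mu) &= D_{\mathrm{KL}}(\tfrac{1}{2}\nu + \tfrac{1}{2}\mu_0 \,\|\, \tfrac{1}{2}\mu + \tfrac{1}{2}\mu_0) + \tfrac{1}{2} D_{\mathrm{KL}}(\nu \,\|\, \mu), \tag{16} \\
D_{D_{\mathrm{KL}}(\frac{1}{2}\cdot + \frac{1}{2}\mu_0 \,\|\, \mu_0)}(\nu, \mu) &= D_{\mathrm{KL}}(\tfrac{1}{2}\nu + \tfrac{1}{2}\mu_0 \,\|\, \tfrac{1}{2}\mu + \tfrac{1}{2}\mu_0), \tag{17} \\
D_{\frac{1}{2}\mathrm{MMD}^2(\cdot, \mu_0)}(\nu, \mu) &= \tfrac{1}{2} \mathrm{MMD}^2(\nu, \mu). \tag{18}
\end{align}

The first two Bregman divergences are not bounded above by $\frac{\beta_2}{2} \|\mu - \nu\|_{\mathrm{KR}}^2$ for reasons similar to those discussed in Section 4, and hence:

###### Proposition.

The minimax and non-saturating GAN losses do not satisfy (D3) for some $\mu_0$.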

Even so, the Bregman divergence for the non-saturating loss is always less than that of the minimax GAN, suggesting that the non-saturating loss should be stable in more situations than the minimax GAN. On the other hand, the MMD-based loss (Li et al., 2015) does satisfy (D3) when its kernel is the Gaussian kernel $K(x, y) = e^{-\pi \|x - y\|^2}$:

###### Proposition.

The MMD loss with Gaussian kernel satisfies (D3) with $\beta_2 = 2\pi$ for all $\mu_0$.
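For readers who want to experiment with the MMD-based loss, the following sketch computes a plug-in (biased) estimate of the squared MMD between two samples. The kernel normalization $K(x, y) = e^{-\pi\|x - y\|^2}$ is assumed here to match the $(4\pi)^{-k}$ coefficients in the series (21); the sample sizes, the shift, and the function names are likewise illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y):
    """Gram matrix of K(x, y) = exp(-pi * ||x - y||^2) (assumed normalization)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-np.pi * np.maximum(sq_dists, 0.0))

def mmd_squared(X, Y):
    """Biased (V-statistic) estimate of MMD^2 between samples X and Y."""
    return float(gaussian_kernel(X, X).mean()
                 + gaussian_kernel(Y, Y).mean()
                 - 2.0 * gaussian_kernel(X, Y).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))          # sample standing in for the generator
Y = rng.standard_normal((50, 2)) + 1.0    # shifted sample standing in for the data
print(mmd_squared(X, X), mmd_squared(X, Y))
```

The estimate equals the squared RKHS distance between the two empirical mean embeddings, so it is nonnegative and vanishes when both samples coincide.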

### 6.1 From maximum mean discrepancy to gradient penalties

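Having identified the MMD-based loss as one that satisfies (D3), we next follow the strategy outlined in Section 3 to convert it into a regularizer for an arbitrary loss function. To do so, we define the regularizer $R_3$ and compute its convex conjugate $R_3^\star$:

$$R_3(\xi) := \frac{\beta_2}{4\pi} \|\hat{\xi}\|_{\mathcal{H}}^2, \qquad R_3^\star(\varphi) = \frac{\pi}{\beta_2} \|\varphi\|_{\mathcal{H}}^2. \tag{19}$$

The norm $\|\cdot\|_{\mathcal{H}}$ is the norm of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with Gaussian kernel; this norm extends the MMD to signed measures, as it holds that $\mathrm{MMD}(\mu, \nu) = \|\hat{\mu} - \hat{\nu}\|_{\mathcal{H}}$ for $\mu, \nu \in \mathcal{P}(X)$. Here, $\hat{\xi}$ denotes the mean embedding of a signed measure $\xi$; we also adopt the convention that $\|\varphi\|_{\mathcal{H}} = \infty$ if $\varphi \notin \mathcal{H}$. Similar to the situation in the previous sections, inf-convolution preserves the smoothness property of $R_3$: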

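###### Proposition (Moreau–Yosida regularization).

Suppose $J: \mathcal{M}(X) \to \bar{\mathbb{R}}$ is convex, and define $\tilde{J} := J \oplus R_3$. Then $\tilde{J}$ is convex, and $D_{\tilde{J}}(\nu, \mu) \le \frac{\beta_2}{4\pi} \|\hat{\nu} - \hat{\mu}\|_{\mathcal{H}}^2$.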

By the proposition above, this transformed loss function satisfies (D3), having inherited the regularity properties of the squared MMD. This transformed function is a generalization of Moreau–Yosida regularization or the Moreau envelope (see Chapter 1 in Rockafellar and Wets (1998)). It is well-known that in the case of a function $f: \mathbb{R}^d \to \mathbb{R}$, this regularization results in a function with Lipschitz gradients, so it is unsurprising that this property carries over to the infinite-dimensional case.

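Applying (4) and (19), we see that the transformed loss function can be minimized as a GAN by implementing an RKHS squared norm penalty on the discriminator:

$$\inf_{\mu} \left(J \oplus \tfrac{\beta_2}{4\pi} \|\cdot\|_{\mathcal{H}}^2\right)(\mu) = \inf_{\mu} \sup_{\varphi} \mathbb{E}_{x \sim \mu}[\varphi(x)] - J^\star(\varphi) - \frac{\pi}{\beta_2} \|\varphi\|_{\mathcal{H}}^2. \tag{20}$$

Computationally, the RKHS norm is difficult to evaluate. We propose taking advantage of the following infinite series representation of $\|f\|_{\mathcal{H}}^2$ in terms of the derivatives of $f$ (Fasshauer and Ye, 2011; Novak et al., 2018):

###### Proposition.

Let $\mathcal{H}$ be an RKHS with the Gaussian kernel $K(x, y) = e^{-\pi \|x - y\|^2}$. Then for $f \in \mathcal{H}$,

\begin{align}
\|f\|_{\mathcal{H}}^2 &= \sum_{k=0}^{\infty} (4\pi)^{-k} \sum_{k_1 + \cdots + k_d = k} \frac{1}{\prod_{i=1}^{d} k_i!} \left\|\partial_{x_1}^{k_1} \cdots \partial_{x_d}^{k_d} f\right\|_{L^2(\mathbb{R}^d)}^2 \tag{21} \\
&= \|f\|_{L^2(\mathbb{R}^d)}^2 + \frac{1}{4\pi} \|\nabla f\|_{L^2(\mathbb{R}^d)}^2 + \frac{1}{16\pi^2} \|\nabla^2 f\|_{L^2(\mathbb{R}^d)}^2 + \text{other terms}. \tag{22}
\end{align}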

In an ideal world, we would use this expression as a penalty on the discriminator to enforce (D3). Of course, as an infinite series, this formulation is computationally impractical. However, the first two terms are very close to common GAN techniques like gradient penalties (Gulrajani et al., 2017) and penalizing the output of the discriminator (Karras et al., 2018). We therefore interpret these common practices as partially applying the penalty given by the RKHS norm squared, approximately enforcing (D3). We view the choice of only using the leading terms as a disadvantageous but practical necessity.

Interestingly, according to our analysis, gradient penalties and spectral normalization are not interchangeable, even though both techniques were designed to constrain the Lipschitz constant of the discriminator. Instead, our analysis suggests that they serve different purposes: gradient penalties enforce the variational smoothness (D3), while spectral normalization enforces Lipschitz continuity (D1). This demystifies the puzzling observation of Miyato (2018) that GANs using only spectral normalization with a WGAN loss do not seem to train well; it also explains why using both spectral normalization and a gradient penalty is a reasonable strategy. It also motivates the use of gradient penalties applied to losses other than the Wasserstein loss (Fedus et al., 2018).

## 7 Verifying the theoretical learning rate

In this section, we empirically test the theoretical learning rate given by Proposition 1 and Theorem 2 as well as our regularization scheme (7) based on inf-convolutions. We approximately implement our composite regularization scheme (7) on a trivial base loss of $J(\mu) = \chi\{\mu = \mu_0\}$ by alternating stochastic gradient steps on

$$\inf_{\mu} \sup_{\varphi} \mathbb{E}_{x \sim \mu}[\varphi(x)] - \mathbb{E}_{x \sim \mu_0}[\varphi(x)] - \frac{\pi}{\beta_2} \mathbb{E}_{x \sim \tilde{\mu}}\left[\varphi(x)^2 + \frac{1}{4\pi} \|\nabla \varphi(x)\|^2\right], \tag{23}$$

where $\tilde{\mu}$ is a random interpolate between samples from $\mu$ and $\mu_0$, as used in Gulrajani et al. (2017). The regularization term is a truncation of the series for the squared RKHS norm (22) and approximately enforces (D3). The discriminator is a 7-layer convolutional neural network with spectral normalization and ELU activations, an architecture that enforces (D1) and (D2). We include a final scalar multiplication by $\alpha$ so that by the proposition on smooth activations in Section 5.1, $\beta_1 = 7\alpha$. We take two discriminator steps for every generator step, to better approximate our assumption of an optimal discriminator.
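The regularization term in (23) can be sketched in a few lines. The example below evaluates the truncated penalty $\frac{\pi}{\beta_2}\,\mathbb{E}_{x \sim \tilde{\mu}}[\varphi(x)^2 + \frac{1}{4\pi}\|\nabla\varphi(x)\|^2]$ on random interpolates. The toy discriminator $\varphi(x) = \tanh(w^\top x)$ with an analytic gradient, the sample shapes, and every function name are assumptions chosen to keep the sketch self-contained; the experiment in this section uses a convolutional network instead.

```python
import numpy as np

def interpolate(x_fake, x_real, rng):
    """Random convex combinations of paired generator and data samples."""
    t = rng.uniform(size=(x_fake.shape[0], 1))
    return t * x_fake + (1.0 - t) * x_real

def phi(x, w):
    # Toy discriminator (assumption): phi(x) = tanh(<w, x>).
    return np.tanh(x @ w)

def grad_phi(x, w):
    # Analytic gradient of the toy discriminator with respect to x.
    return (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]

def truncated_rkhs_penalty(x, w, beta2):
    """(pi / beta2) * mean(phi^2 + |grad phi|^2 / (4 pi)), the penalty in (23)."""
    values_sq = phi(x, w) ** 2
    grad_sq = np.sum(grad_phi(x, w) ** 2, axis=1)
    return (np.pi / beta2) * float(np.mean(values_sq + grad_sq / (4.0 * np.pi)))

rng = np.random.default_rng(0)
x_fake = rng.standard_normal((32, 4))
x_real = rng.standard_normal((32, 4)) + 0.5
x_tilde = interpolate(x_fake, x_real, rng)
w = rng.standard_normal(4)
print(truncated_rkhs_penalty(x_tilde, w, beta2=2.0))
```

In a training loop this quantity would be subtracted from the discriminator objective, so that both large outputs and large input gradients are discouraged.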

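For the generator, we use an extremely simple particle-based generator which satisfies (G1) and (G2), in order to minimize the number of confounding factors in our experiment. Let $\omega$ be the discrete uniform distribution on $Z = \{1, \ldots, N\}$. For an $N \times d$ matrix $\theta$ and $z \in Z$, define $f_\theta(z) := \theta_z$, so that $f_\theta(z)$ is the $z$th row of $\theta$. The particle generator satisfies (G1) with $A = 1/\sqrt{N}$, since

$$\mathbb{E}_z[\|f_\theta(z) - f_{\theta'}(z)\|_2] = \frac{1}{N} \sum_{z=1}^{N} \|\theta_z - \theta'_z\|_2 \le \frac{1}{\sqrt{N}} \|\theta - \theta'\|_F, \tag{24}$$

and it satisfies (G2) with $B = 0$, since $D_\theta f_\theta(z)$ is constant w.r.t. $\theta$. With this setup, Theorem 2 suggests a theoretical learning rate of

$$\gamma_0 = \frac{1}{L} = \frac{1}{\alpha B + A^2(\beta_1 + \beta_2)} = \frac{N}{7\alpha + \beta_2}. \tag{25}$$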
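The arithmetic behind (24) and (25) can be checked directly. The sketch below evaluates the smoothness constant $L = \alpha B + A^2(\beta_1 + \beta_2)$ for the particle generator, using $A = 1/\sqrt{N}$, $B = 0$, and $\beta_1 = 7\alpha$ as stated above, and numerically verifies the inequality in (24) on random particle matrices. The specific values of $\alpha$, $\beta_2$, $N$, and $d$ are arbitrary assumptions for illustration.

```python
import numpy as np

def smoothness_constant(alpha, beta1, beta2, A, B):
    """L = alpha * B + A^2 * (beta1 + beta2), as in the smoothness decomposition."""
    return alpha * B + A**2 * (beta1 + beta2)

def particle_learning_rate(alpha, beta2, N, n_layers=7):
    """gamma_0 = 1 / L for the particle generator with an n_layers-layer discriminator."""
    A = 1.0 / np.sqrt(N)      # (G1) constant of the particle generator
    B = 0.0                   # (G2) constant: the Jacobian does not depend on theta
    beta1 = n_layers * alpha  # smoothness of the scaled discriminator
    return 1.0 / smoothness_constant(alpha, beta1, beta2, A, B)

# Numerical check of (24) on random particle matrices (illustrative values).
rng = np.random.default_rng(0)
N, d = 64, 3
theta = rng.standard_normal((N, d))
theta_prime = rng.standard_normal((N, d))
lhs = float(np.mean(np.linalg.norm(theta - theta_prime, axis=1)))
rhs = float(np.linalg.norm(theta - theta_prime) / np.sqrt(N))  # Frobenius norm

gamma0 = particle_learning_rate(alpha=1.0, beta2=2.0, N=N)
print(lhs <= rhs, gamma0)
```

With these assumed values, `gamma0` equals $N / (7\alpha + \beta_2) = 64/9$, matching the closed form in (25).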

We randomly generated hyperparameter settings for the Lipschitz constant $\alpha$, the smoothness constant $\beta_2$, the number of particles $N$, and the learning rate $\gamma$. We trained each model for 100,000 steps on CIFAR-10 and evaluated each model using the Fréchet Inception Distance (FID) of Heusel et al. (2017). We hypothesize that stability is correlated with image quality; Figure 1 plots the FID for each hyperparameter setting in terms of the ratio $\gamma / \gamma_0$ of the true learning rate $\gamma$ and the theoretically motivated learning rate $\gamma_0$. We find that the best FID scores are obtained in the region where $\gamma / \gamma_0$ is between 1 and 1000. For learning rates below this region, we observe that the convergence is too slow to make reasonable progress on the objective, whereas as the learning rate grows beyond it, we observe a steady increase in FID, signalling unstable behavior. It also makes sense that learning rates slightly above the theoretical rate produce good results, since our theoretical learning rate is a conservative lower bound. Note that our intention is to test our theory, not to generate good images, which is difficult due to our weak choice of generator. Overall, this experiment shows that our theory and regularization scheme are sensible.

## 8 Future work

Inexact gradient descent. In this paper, we employed several assumptions in order to regard the GAN algorithm as gradient descent. However, real-world GAN algorithms must be treated as “inexact” descent algorithms. As such, future work includes: (i) relaxing the optimal discriminator assumption (cf. Sanjabi et al. (2018)) or providing a stability result for discrete simultaneous gradient descent (cf. continuous time analysis in Nagarajan and Kolter (2017); Mescheder et al. (2018)), (ii) addressing stochastic approximations of gradients (i.e., SGD), and (iii) providing error bounds for the truncated gradient penalty used in (23).

Generator architectures. Another important direction of research is to seek more powerful generator architectures that satisfy our smoothness assumptions (G1) and (G2). In practice, generators are often implemented as deep neural networks, and involve some specific architectures such as deconvolution layers (Radford et al., 2015) and residual blocks (e.g., Gulrajani et al. (2017); Miyato et al. (2018)). In this paper, we did not provide results on the smoothness of general classes of generators, since our focus is to analyze stability properties influenced by the choice of loss function (and therefore optimal discriminators). However, our conditions (G1) and (G2) shed light on how to obtain smoothly parameterized neural networks, which is left for future work.

#### Acknowledgments

We would like to thank Kohei Hayashi, Katsuhiko Ishiguro, Masanori Koyama, Shin-ichi Maeda, Takeru Miyato, Masaki Watanabe, and Shoichiro Yamaguchi for helpful discussions.

## Appendix A Inf-convolution in $\mathbb{R}^d$

To gain intuition on the inf-convolution, we present a finite-dimensional analogue of the techniques in Section 3. For simplicity of presentation, we will omit any regularity conditions (e.g., lower semicontinuity). We refer readers to Chapter 12 of Bauschke and Combettes (2011) for a detailed introduction.

Let $J$ and $R$ be convex functions on $\mathbb{R}^d$. The inf-convolution of $J$ and $R$ is the function $J \oplus R$ defined as

$$
(J \oplus R)(x) := \inf_{z \in \mathbb{R}^d} J(z) + R(x - z).
$$

The inf-convolution is often called the epigraphic sum since the epigraph of $J \oplus R$ coincides with the Minkowski sum of the epigraphs of $J$ and $R$, as Figure 2 illustrates. The inf-convolution is an associative and commutative operation; that is, it is always true that $(J_1 \oplus J_2) \oplus J_3 = J_1 \oplus (J_2 \oplus J_3)$ and $J_1 \oplus J_2 = J_2 \oplus J_1$.
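As a worked illustration (our own example, not taken from the cited reference), take $d = 1$ and the quadratics $J(z) = \tfrac{a}{2} z^2$ and $R(z) = \tfrac{b}{2} z^2$ with $a, b > 0$. Setting the derivative of $J(z) + R(x - z)$ with respect to $z$ to zero gives $z^\ast = \tfrac{b}{a+b}\,x$, so

$$
(J \oplus R)(x) = \frac{a}{2}\Big(\frac{b\,x}{a+b}\Big)^2 + \frac{b}{2}\Big(\frac{a\,x}{a+b}\Big)^2 = \frac{1}{2}\,\frac{ab}{a+b}\,x^2 .
$$

The result is symmetric in $a$ and $b$, as commutativity requires, and its curvature $\tfrac{ab}{a+b}$ is smaller than both $a$ and $b$.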

There are two important special cases of inf-convolutions. The first is the Pasch–Hausdorff envelope $J_\alpha$, which is the inf-convolution between $J$ and $\alpha\|\cdot\|_2$ ($\alpha > 0$). It is known that $J_\alpha$ becomes $\alpha$-Lipschitz. The second important example is the Moreau envelope $J^\beta$, i.e., the inf-convolution with the quadratic regularizer $\tfrac{1}{2\beta}\|\cdot\|_2^2$. The Moreau envelope is always differentiable, and the gradient of $J^\beta$ is $\tfrac{1}{\beta}$-Lipschitz (thus $J^\beta$ is $\tfrac{1}{\beta}$-smooth).
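The two envelopes can be sketched numerically in one dimension. The sketch below assumes the regularizers $\alpha|\cdot|$ and $\tfrac{1}{2\beta}|\cdot|^2$, the parametrization used for the combined regularized objective later in this appendix. The helper `inf_conv`, the grid, the two test functions, and the constants are illustrative choices of ours, not part of the paper's experiments.

```python
import numpy as np

def inf_conv(J_vals, R, grid):
    """Grid approximation of (J (+) R)(x) = min_z J(z) + R(x - z)."""
    diff = grid[:, None] - grid[None, :]      # diff[i, j] = x_i - z_j
    return np.min(J_vals[None, :] + R(diff), axis=1)

grid = np.linspace(-4.0, 4.0, 801)            # step 0.01
alpha, beta = 1.5, 2.0                        # illustrative constants

# Pasch-Hausdorff envelope of J(x) = x^2, whose slope is unbounded.
ph = inf_conv(grid**2, lambda u: alpha * np.abs(u), grid)
max_slope = np.max(np.abs(np.diff(ph)) / np.diff(grid))

# Moreau envelope of J(x) = |x|, which has a kink at 0.
moreau = inf_conv(np.abs(grid), lambda u: u**2 / (2.0 * beta), grid)
# Closed form (the Huber function); its gradient clip(x / beta, -1, 1)
# is (1 / beta)-Lipschitz.
huber = np.where(np.abs(grid) <= beta,
                 grid**2 / (2.0 * beta),
                 np.abs(grid) - beta / 2.0)

print("largest slope of the Pasch-Hausdorff envelope:", max_slope)
print("max deviation of the Moreau envelope from Huber:",
      np.max(np.abs(moreau - huber)))
```

In this example the largest slope of the first envelope does not exceed `alpha`, the second envelope agrees with the Huber function, and both envelopes keep the minimizer at the origin.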

It is worth noting that the set of minimizers does not change after these two operations. More generally, we have the following result:

###### Proposition 1.

Let be proper and lower semicontinuous functions with and . Suppose and for some increasing function . Then, = and .

To sum up, given a function $J$, we can always construct a regularized alternative $J^\beta_\alpha$ that is $\alpha$-Lipschitz and $\tfrac{1}{\beta}$-smooth and has the same minimizers as $J$.

The next question is how to implement the inf-convolution in GAN-like optimization problems. For this, it is convenient to consider the convex conjugate. Recall that the Fenchel–Moreau theorem says that there is a duality between a convex function $J$ and its convex conjugate $J^\star$, namely $J^\star(z) = \sup_x \langle x, z \rangle - J(x)$ and $J(x) = \sup_z \langle x, z \rangle - J^\star(z)$. The important property is that the convex conjugate of the inf-convolution is the sum of the convex conjugates; that is, we always have

$$
(J \oplus R)^\star(z) = J^\star(z) + R^\star(z).
$$

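This property can be useful for implementing the regularized objective as follows. First, we can check that the convex conjugates of the norm and the squared norm are given as $(\alpha\|\cdot\|_2)^\star = \chi_{\{\|\cdot\|_2 \le \alpha\}}$ and $(\tfrac{1}{2\beta}\|\cdot\|_2^2)^\star = \tfrac{\beta}{2}\|\cdot\|_2^2$, where $\chi_C$ denotes the convex indicator of a set $C$ (zero on $C$ and $+\infty$ elsewhere). Hence, we have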

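$$
J^\beta_\alpha(x) := \Big(J \oplus \alpha\|\cdot\|_2 \oplus \tfrac{1}{2\beta}\|\cdot\|_2^2\Big)(x) = \sup_{z:\ \|z\|_2 \le \alpha} \langle x, z \rangle - J^\star(z) - \tfrac{\beta}{2}\|z\|_2^2,
$$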

which means that minimizing $J^\beta_\alpha$ can be recast as a min-max problem with norm clipping and $\ell_2$-regularization on the dual variable $z$.
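The primal and dual forms can be compared numerically. The following is a minimal one-dimensional sketch, assuming the example $J(x) = \tfrac{1}{2}x^2$ (so that $J^\star(z) = \tfrac{1}{2}z^2$). The primal side applies two grid inf-convolutions, with $\alpha|\cdot|$ and then $\tfrac{1}{2\beta}|\cdot|^2$, while the dual side maximizes over a clipped dual variable with a quadratic penalty. The helper name, grids, and constants are our own illustrative choices.

```python
import numpy as np

def inf_conv(f_vals, R, grid):
    """Grid approximation of (f (+) R)(x) = min_z f(z) + R(x - z)."""
    diff = grid[:, None] - grid[None, :]
    return np.min(f_vals[None, :] + R(diff), axis=1)

alpha, beta = 1.5, 2.0                        # illustrative constants
x = np.linspace(-6.0, 6.0, 601)

# Primal side: J (+) alpha|.| (+) (1 / (2 beta)) |.|^2 with J(x) = x^2 / 2.
primal = inf_conv(0.5 * x**2, lambda u: alpha * np.abs(u), x)
primal = inf_conv(primal, lambda u: u**2 / (2.0 * beta), x)

# Dual side: sup over |z| <= alpha of x z - J*(z) - (beta / 2) z^2.
z = np.linspace(-alpha, alpha, 1501)          # clipping enters as the range of z
J_star = 0.5 * z**2
dual = np.max(x[:, None] * z[None, :] - J_star[None, :]
              - 0.5 * beta * z[None, :]**2, axis=1)

gap = np.max(np.abs(primal - dual))
print("largest primal/dual discrepancy on the grid:", gap)
```

The two sides agree up to grid discretization error, which illustrates how the Lipschitz and smoothness regularizers on the primal side become a constraint and a penalty on the dual variable.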

## Appendix B Common GAN losses

For completeness and clarity, we explicitly write out the expressions for the losses listed in Table 1. For more detailed computations of optimal discriminators, see Chu et al. (2019); for more details on the convex duality interpretation, see Farnia and Tse (2018).

Minimax GAN. Goodfellow et al. (2014) originally proposed the minimax GAN and showed that its corresponding loss function is the Jensen–Shannon divergence, defined as

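$$
J(\mu) := D_{\mathrm{JS}}(\mu \,\|\, \mu_0) := \tfrac{1}{2} D_{\mathrm{KL}}\big(\mu \,\|\, \tfrac{1}{2}\mu + \tfrac{1}{2}\mu_0\big) + \tfrac{1}{2} D_{\mathrm{KL}}\big(\mu_0 \,\|\, \tfrac{1}{2}\mu + \tfrac{1}{2}\mu_0\big),
$$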

where $\mu_0$ is a fixed probability measure (usually the empirical measure of the data), and $D_{\mathrm{KL}}(\mu \,\|\, \nu)$ is the Kullback–Leibler divergence between $\mu$ and $\nu$. The optimal discriminator in the sense of Definition 1 is given as

$$
\Phi_\mu(x) = \tfrac{1}{2} \log \frac{d\mu}{d(\mu + \mu_0)}(x),
$$

where $\frac{d\mu}{d(\mu + \mu_0)}$ is the Radon–Nikodym derivative. If $\mu$ and $\mu_0$ have densities $\mu(x)$ and $\mu_0(x)$, then

$$
\frac{d\mu}{d(\mu + \mu_0)}(x) = \frac{\mu(x)}{\mu(x) + \mu_0(x)},
$$

so our optimal discriminator matches that of Goodfellow et al. (2014) up to a constant factor and logarithm. To recover the minimax formulation, the convex duality (4) yields:

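$$
\begin{aligned}
\inf_\mu D_{\mathrm{JS}}(\mu, \mu_0)
&= \inf_\mu \sup_\varphi \mathbb{E}_{x \sim \mu}[\varphi(x)] - \underbrace{\Big(-\tfrac{1}{2}\mathbb{E}_{x \sim \mu_0}\big[\log\big(1 - e^{2\varphi(x) + \log 2}\big)\big] - \tfrac{1}{2}\log 2\Big)}_{(D_{\mathrm{JS}}(\cdot, \mu_0))^\star(\varphi)} \\
&= \inf_\mu \sup_D \tfrac{1}{2}\mathbb{E}_{x \sim \mu}[\log(1 - D(x))] + \tfrac{1}{2}\mathbb{E}_{x \sim \mu_0}[\log D(x)],
\end{aligned}
$$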

using the substitution $\varphi = \tfrac{1}{2}\log(1 - D) - \tfrac{1}{2}\log 2$.
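The relationship between the minimax objective, the Jensen–Shannon divergence, and the optimal discriminator can be checked on discrete distributions. The sketch below is our own illustration with randomly drawn probability vectors. It uses the classical result of Goodfellow et al. (2014) that the inner maximum is attained at $D^\ast = \mu_0 / (\mu + \mu_0)$, where it equals $D_{\mathrm{JS}}(\mu \,\|\, \mu_0) - \log 2$; the additive constant does not depend on $\mu$ and so does not affect the minimization.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence as defined in the text."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(D, mu, mu0):
    """(1/2) E_mu[log(1 - D)] + (1/2) E_mu0[log D] for a discriminator D in (0, 1)."""
    return float(0.5 * np.sum(mu * np.log(1.0 - D)) + 0.5 * np.sum(mu0 * np.log(D)))

rng = np.random.default_rng(0)
mu = rng.dirichlet(2.0 * np.ones(6))          # hypothetical generator distribution
mu0 = rng.dirichlet(2.0 * np.ones(6))         # hypothetical data distribution

D_opt = mu0 / (mu + mu0)
v_opt = gan_value(D_opt, mu, mu0)
v_rand = max(gan_value(rng.uniform(0.01, 0.99, size=6), mu, mu0)
             for _ in range(1000))

# Optimal discriminator from the text, with dmu/d(mu + mu0) = mu / (mu + mu0).
phi_opt = 0.5 * np.log(mu / (mu + mu0))

print(v_opt, js(mu, mu0) - np.log(2.0), v_rand)
```

No randomly drawn discriminator exceeds the value at `D_opt`, and `phi_opt` coincides with $\tfrac{1}{2}\log(1 - D^\ast)$, which is the sense in which the two discriminators agree up to a constant factor and a logarithm.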

Non-saturating GAN. Goodfellow et al. (2014) also proposed the heuristic non-saturating GAN. Theorem 2.5 of Arjovsky and Bottou (2017) shows that the loss function minimized is

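$$
J(\mu) := D_{\mathrm{KL}}\big(\tfrac{1}{2}\mu + \tfrac{1}{2}\mu_0 \,\|\, \mu_0\big) = \tfrac{1}{2} D_{\mathrm{KL}}(\mu \,\|\, \mu_0) - D_{\mathrm{JS}}(\mu \,\|\, \mu_0).
$$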

The optimal discriminator is

$$
\Phi_\mu(x) = -\tfrac{1}{2} \log \frac{d\mu_0}{d(\mu + \mu_0)}(x).
$$
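Both statements about the non-saturating loss can be sanity-checked on discrete distributions. The sketch below, our own illustration with randomly drawn probability vectors, verifies the identity between the two expressions for $J$ and compares a finite-difference directional derivative of $J$ at $\mu$ with the pairing of the stated optimal discriminator against the perturbation direction.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between strictly positive vectors."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def J(mu, mu0):
    """Non-saturating loss: KL of the mixture (mu + mu0) / 2 from mu0."""
    return kl(0.5 * mu + 0.5 * mu0, mu0)

rng = np.random.default_rng(0)
mu, nu, mu0 = (rng.dirichlet(2.0 * np.ones(5)) for _ in range(3))

# Identity between the two expressions for J.
lhs = J(mu, mu0)
rhs = 0.5 * kl(mu, mu0) - js(mu, mu0)

# Optimal discriminator, with dmu0/d(mu + mu0) = mu0 / (mu + mu0).
phi = -0.5 * np.log(mu0 / (mu + mu0))

# Central finite difference of J along the direction h = nu - mu.
h = nu - mu
eps = 1e-5
finite_diff = (J(mu + eps * h, mu0) - J(mu - eps * h, mu0)) / (2.0 * eps)
pairing = float(np.sum(phi * h))

print(lhs, rhs, finite_diff, pairing)
```

The agreement between `finite_diff` and `pairing` reflects the derivative-like role that optimal discriminators play in this framework.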

Wasserstein GAN. Arjovsky et al. (2017) proposed the Wasserstein GAN, which minimizes the Wasserstein-1 distance between the input $\mu$ and a fixed measure $\mu_0$:

$$
J(\mu) := W_1(\mu, \mu_0) := \inf_\pi \mathbb{E}_{(x, y) \sim \pi}\big[\|x - y\|\big],
$$

where the infimum is taken over all couplings $\pi$, i.e., probability distributions over $X \times X$ whose marginals are $\mu$ and $\mu_0$ respectively. The optimal discriminator is called the Kantorovich potential in the optimal transport literature (Villani, 2009). The convex duality recovers the Wasserstein GAN:

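$$
\begin{aligned}
\inf_\mu W_1(\mu, \mu_0)
&= \inf_\mu \sup_\varphi \mathbb{E}_{x \sim \mu}[\varphi(x)] - \underbrace{\Big(\mathbb{E}_{x \sim \mu_0}[\varphi(x)] + \chi_{\{\|\varphi\|_{\mathrm{Lip}} \le 1\}}\Big)}_{(W_1(\cdot, \mu_0))^\star(\varphi)} \\
&= \inf_\mu \sup_{\|\varphi\|_{\mathrm{Lip}} \le 1} \mathbb{E}_{x \sim \mu}[\varphi(x)] - \mathbb{E}_{x \sim \mu_0}[\varphi(x)],
\end{aligned}
$$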

an expression of Kantorovich–Rubinstein duality. The Lipschitz constraint on the discriminator is typically enforced by spectral normalization (Miyato et al., 2018), less frequently by weight clipping (Arjovsky et al., 2017), or heuristically by gradient penalties (Gulrajani et al., 2017) (although this work shows that gradient penalties may serve a different purpose altogether).
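Kantorovich–Rubinstein duality can be observed directly for discrete measures on the real line. The sketch below is our own illustration and is restricted to one dimension: it computes the primal transport cost with the monotone (sorted) coupling, which is optimal in 1D, and the dual value with a piecewise-linear potential whose slopes are $\pm 1$, chosen from the sign of the difference of cumulative distribution functions. The function names and test data are hypothetical.

```python
import numpy as np

def w1_primal(x, p, q):
    """Cost of the monotone coupling of weights p and q on sorted points x."""
    p, q = p.copy(), q.copy()
    i, j, cost = 0, 0, 0.0
    while i < len(x) and j < len(x):
        moved = min(p[i], q[j])               # mass transported from x[i] to x[j]
        cost += moved * abs(x[i] - x[j])
        p[i] -= moved
        q[j] -= moved
        if p[i] <= 0.0:
            i += 1
        if q[j] <= 0.0:
            j += 1
    return cost

def w1_dual(x, p, q):
    """Value E_p[phi] - E_q[phi] for a 1-Lipschitz potential phi, and phi itself."""
    cdf_gap = np.cumsum(p - q)[:-1]           # F_p - F_q between consecutive points
    slopes = -np.sign(cdf_gap)                # slopes in {-1, 0, 1}
    phi = np.concatenate(([0.0], np.cumsum(slopes * np.diff(x))))
    return float(np.sum(phi * (p - q))), phi

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3.0, 3.0, size=8))
mu = rng.dirichlet(np.ones(8))
mu0 = rng.dirichlet(np.ones(8))

primal = w1_primal(x, mu, mu0)
dual, phi = w1_dual(x, mu, mu0)
linear = float(np.sum(x * (mu - mu0)))        # another 1-Lipschitz choice, phi(x) = x

print(primal, dual, linear)
```

The primal and dual values coincide, and the linear potential, which is also 1-Lipschitz, attains at most the same value.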

Maximum mean discrepancy. Given a positive definite kernel $K$, the maximum mean discrepancy (MMD, Gretton et al. (2012)) between $\mu$ and $\nu$ is defined by

$$
J(\mu) := \tfrac{1}{2}\mathrm{MMD}_K^2(\mu, \nu) := \tfrac{1}{2} \int K(x, y)\,(\mu - \nu)(dx)\,(\mu - \nu)(dy),
$$

where is the reproducing kernel Hilbert space (RKHS) for . The generative moment-matching network (GMMN, Li et al. (2015)) and the Coulomb GAN (Unterthiner et al., 2018) use the squared MMD as the loss function. The optimal discriminator in this case is

$$
\Phi_\mu(x) = \mathbb{E}_{y \sim \mu}[K(x, y)] - \mathbb{E}_{y \sim \mu_0}[K(x, y)],
$$

which, in contrast to other GANs, may be approximated by simple Monte Carlo rather than by solving an auxiliary optimization problem.
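A minimal Monte Carlo sketch of this optimal discriminator follows. It assumes a Gaussian kernel, synthetic two-dimensional samples, and helper names of our own choosing; none of these are specified by the text. For empirical measures, integrating the estimated discriminator against the difference of the two measures recovers the squared MMD exactly, which the last lines check.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def witness(x, gen, data):
    """Monte Carlo estimate of E_{y~mu} K(x, y) - E_{y~mu0} K(x, y)."""
    return (gaussian_kernel(x, gen).mean(axis=1)
            - gaussian_kernel(x, data).mean(axis=1))

def half_mmd2(gen, data):
    """(1/2) MMD^2 between the two empirical measures (biased V-statistic)."""
    return 0.5 * (gaussian_kernel(gen, gen).mean()
                  - 2.0 * gaussian_kernel(gen, data).mean()
                  + gaussian_kernel(data, data).mean())

rng = np.random.default_rng(0)
gen = rng.normal(0.5, 1.0, size=(200, 2))     # stand-in for generator samples
data = rng.normal(0.0, 1.0, size=(300, 2))    # stand-in for data samples

loss = half_mmd2(gen, data)
# (1/2) * integral of the witness against (mu - mu0) equals (1/2) MMD^2.
pairing = 0.5 * (witness(gen, gen, data).mean() - witness(data, gen, data).mean())

print(loss, pairing)
```

No inner optimization loop is needed: the discriminator is evaluated in closed form from kernel averages over the two sample sets.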

Note that MMD-GANs (Li et al., 2017a; Arbel et al., 2018) minimize a modified version of the MMD, the Optimized MMD (Sriperumbudur et al., 2009; Arbel et al., 2018). These MMD-GANs are adversarial in a way that does not arise from convex duality, so our theory currently does not apply to these GANs.

Integral probability metrics. An integral probability metric (Müller, 1997) is defined by

$$
J(\mu) := \mathrm{IPM}_{\mathcal{F}}(\mu, \mu_0) := \sup_{f \in \mathcal{F}} \int f\,d\mu - \int f\,d\mu_0,
$$

where $\mathcal{F}$ is a class of functions. The optimal discriminator is the function that attains the supremum in the definition. The Wasserstein distance may be thought of as an IPM with $\mathcal{F}$ containing all $1$-Lipschitz functions. The MMD may be thought of as an IPM with $\mathcal{F}$ containing all functions with RKHS norm at most $1$, but no GANs based on MMD are actually trained this way, as it is difficult to constrain the discriminator to such functions.

## Appendix C Optimal discriminators are functional derivatives

Let $J$ be a convex function. Recall the definition of the optimal discriminator (Definition 1):

$$
\Phi_\mu \in \operatorname*{argmax}_{\varphi \in C(X)} \int \varphi\,d\mu - J^\star(\varphi).
$$

This definition can be understood as an infinite-dimensional analogue of subgradients. In fact, in finite-dimensional convex analysis, $z$ is a subgradient of $J$ at $x$ if and only if it can be written as $z \in \operatorname{argmax}_{z'} \langle x, z' \rangle - J^\star(z')$. The calculus of subgradients shares many properties with the standard calculus of derivatives, such as chain rules (Rockafellar and Wets, 1998). This motivates us to investigate derivative-like features of optimal discriminators.
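The finite-dimensional statement can be illustrated numerically. The sketch below is our own example in one dimension, using the convex function $\tfrac{1}{2}x^2 + |x|$: it computes the convex conjugate on a grid and then maximizes the dual objective at a point where the function is differentiable and at its kink. The grids and helper name are hypothetical choices.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)
z = np.linspace(-3.0, 3.0, 601)
J = 0.5 * x**2 + np.abs(x)                    # example convex function, kink at 0

# Convex conjugate on the z grid: J*(z) = max_x x z - J(x).
J_star = np.max(z[:, None] * x[None, :] - J[None, :], axis=1)

def argmax_dual(x0):
    """Grid point z maximizing x0 * z - J*(z)."""
    return z[np.argmax(x0 * z - J_star)]

z_smooth = argmax_dual(1.2)                   # the derivative of J at 1.2 is 2.2
z_kink = argmax_dual(0.0)                     # the subdifferential at 0 is [-1, 1]

print(z_smooth, z_kink)
```

At the differentiable point the maximizer matches the ordinary derivative, while at the kink the maximizer is not unique and any value in $[-1, 1]$ may be returned, mirroring the set-valued nature of subgradients.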

We introduce the functional derivative, also known as the von Mises influence function:

###### Definition 3 (Functional derivative).

Let be a function of probability measures. We say that a continuous function is a functional derivative of at if