# Training Generative Adversarial Networks by Solving Ordinary Differential Equations

## Abstract

The instability of Generative Adversarial Network (GAN) training has frequently been attributed to gradient descent. Consequently, recent methods have aimed to tailor the models and training procedures to stabilise the discrete updates. In contrast, we study the continuous-time dynamics induced by GAN training. Both theory and toy experiments suggest that these dynamics are in fact surprisingly stable. From this perspective, we hypothesise that instabilities in training GANs arise from the integration error in discretising the continuous dynamics. We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training – when combined with a regulariser that controls the integration error. Our approach represents a radical departure from previous methods which typically use adaptive optimisation and stabilisation techniques that constrain the functional space (e.g. Spectral Normalisation). Evaluation on CIFAR-10 and ImageNet shows that our method outperforms several strong baselines, demonstrating its efficacy.

## 1 Introduction

The training of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) has seen significant advances over the past several years. Most recently, GAN-based methods have demonstrated the ability to generate images with high fidelity and realism, as in the work of Brock et al. (2018) and Karras et al. (2019). Despite this remarkable progress, there remain many questions regarding the instability of training GANs and their convergence properties.

In this work, we attempt to extend the understanding of GANs by offering a different perspective. We study the continuous-time dynamics induced by gradient descent on the GAN objective for commonly used losses. We find that under mild assumptions, the dynamics should converge in the vicinity of a differential Nash equilibrium, and that the rate of convergence is independent of the rotational part of the dynamics if we can follow the dynamics exactly. We thus hypothesise that the instability in training GANs arises from discretisation of the continuous dynamics, and we should focus on accurate integration to impart stability.

Consistent with this hypothesis, we demonstrate that we can use standard methods for solving ordinary differential equations (ODEs) – such as Runge-Kutta – to solve for GAN parameters. In particular, we observe that more accurate time integration of the ODE yields better convergence as well as better performance overall; a result that is perhaps surprising given that the integrators have to use noisy gradient estimates. We find that the main ingredient we need for stable GAN training is to avoid large integration errors, and a simple regulariser on the generator gradients is sufficient to achieve this. This alleviates the need for hard constraints on the functional space of the discriminator (e.g. spectral normalisation (Miyato et al., 2018)) and enables GAN training without advanced optimisation techniques.

Overall, the contributions of this paper are as follows:

• We present a novel and practical view that frames GAN training as solving ODEs.

• We design a regulariser on the gradients to improve numerical integration of the ODE.

• We show that higher-order ODE solvers lead to better convergence for GANs. Surprisingly, our algorithm (ODE-GAN) can train GANs to competitive levels without any adaptive optimiser (e.g. Adam (Kingma and Ba, 2015)) or explicit functional constraints (e.g. Spectral Normalisation).

## 2 Background and Notation

We study the GAN objective, which is often described as a two-player min-max game with a Nash equilibrium at a saddle point, (θ*, ϕ*), of the objective function

J(θ, ϕ) = E_{x∼p(x)}[log(D(x; θ))] + E_{z∼p(z)}[log(1 − D(G(z; ϕ); θ))],   (1)

where we denote the states of the discriminator and the generator respectively by their parameters θ and ϕ. Following the convention of Goodfellow et al. (2014), we use x to denote a data sample, and G(z; ϕ) for the output of the generator (obtained by transforming a sample z from a noise source p(z)). Further, E_{x∼p(x)}[·] stands for the expected value of a function given the distribution p(x). In practice, the problem is often transformed into one where the objective function is asymmetric (e.g., the generator's objective is changed to the non-saturating loss ℓG = −E_{z∼p(z)}[log D(G(z; ϕ); θ)]). We can describe this more general setting, which we focus on here, by using

ℓ(θ, ϕ) = [ℓD(θ, ϕ), ℓG(θ, ϕ)],   (2)

to denote the loss vector of the discriminator-generator pair, and considering minimisation of ℓD wrt. θ and of ℓG wrt. ϕ. This setting is often referred to as a general-sum game. The original min-max problem from Eq. (1) can be captured in this notation by setting ℓD = −J and ℓG = J, and other commonly used variations (such as the non-saturating loss (Goodfellow et al., 2014)) can also be accommodated.

## 3 Ordinary Differential Equations Induced by GANs

Here we derive the continuous form of GANs’ training dynamics by considering the limit of infinitesimal time steps. Throughout, we consider a general-sum game with the loss pair as described in Section 2.

### 3.1 Continuous Dynamics for GAN Training

Given the losses from Eq. (2), the evolution of the parameters (θ, ϕ) under simultaneous gradient descent (GD) is given by the following updates at iteration k:

θ_{k+1} = θ_k − α (∂ℓD/∂θ)(θ_k, ϕ_k) Δt,
ϕ_{k+1} = ϕ_k − β (∂ℓG/∂ϕ)(θ_k, ϕ_k) Δt,   (3)

where α and β are optional scaling factors; αΔt and βΔt then correspond to the learning rates in the discrete case. Previous work has analysed the dynamics in this discrete case (mostly focusing on the min-max game), see e.g. Mescheder et al. (2017). In contrast, we consider arbitrary loss pairs under the continuous dynamics induced by gradient descent. That is, we consider θ and ϕ to have explicit dependence on continuous time t, writing θ(t) and ϕ(t). This perspective has been taken for min-max games in Nagarajan and Kolter (2017). Then, as Δt → 0, Eq. (3) yields a dynamical system described by the ordinary differential equation

(dθ/dt, dϕ/dt) = v(θ, ϕ),   (4)

where v(θ, ϕ) = −(α ∂ℓD/∂θ, β ∂ℓG/∂ϕ). This is also known as infinitesimal gradient descent (Singh et al., 2000).
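As a concrete illustration, the discrete updates of Eq. (3) are exactly an Euler discretisation of this vector field. Below is a minimal sketch on a toy bilinear game; the losses and step size are illustrative choices, not taken from the paper:

```python
import numpy as np

# Toy bilinear game (illustrative): l_D = -theta*phi, l_G = theta*phi.
def v(theta, phi, alpha=1.0, beta=1.0):
    """Velocity field of Eq. (4): v = -(alpha*dl_D/dtheta, beta*dl_G/dphi)."""
    dlD_dtheta = -phi
    dlG_dphi = theta
    return np.array([-alpha * dlD_dtheta, -beta * dlG_dphi])

def simultaneous_gd_step(theta, phi, dt):
    """One step of Eq. (3), i.e. Euler integration of the ODE in Eq. (4)."""
    vel = v(theta, phi)
    return theta + dt * vel[0], phi + dt * vel[1]

theta, phi = 1.0, 0.0
for _ in range(10):
    theta, phi = simultaneous_gd_step(theta, phi, dt=0.1)
# On this rotational game, each Euler step scales the norm by sqrt(1 + dt^2),
# so the discretised trajectory slowly spirals outward.
```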

Perhaps surprisingly, if we view the problem of training GANs from this perspective we can make the following observation. Assuming we track the dynamical system exactly – and the gradient vector field is bounded – then in the vicinity of a differential Nash equilibrium (θ*, ϕ*), the solution (θ(t), ϕ(t)) converges to this point at a rate independent of the frequency of rotation with respect to the vector field v.

There are two direct consequences from this observation: First, changing the rate of rotation of the vector field does not change the rate of convergence. Second, if the velocity field has no subspace which is purely rotational, then it suggests that in principle GAN training can be reduced to the problem of accurate time integration. Thus if the dynamics are attracted towards a differential Nash they should converge (though reaching this regime depends on initial conditions).

### 3.2 Convergence of GAN Training under Continuous Dynamics

We next show that, under mild assumptions, close to a differential Nash equilibrium, the continuous time dynamics from Eq. (4) converge to the equilibrium point for GANs under commonly used losses. We further clarify that they locally converge even under a strong rotational field as in Fig. 1, a problem studied in recent GAN literature (see also e.g. Balduzzi et al. (2018); Gemp and Mahadevan (2018) and Mescheder et al. (2017)).

#### Analysis of Linearised Dynamics for GANs in the Vicinity of Nash Equilibria

We here show that, in the vicinity of a differential Nash equilibrium, the dynamics converge unless the field has a purely rotational subspace. We prove convergence under this assumption for the continuous dynamics induced by GANs using the cross-entropy or the non-saturated loss. This analysis has strong connections to work by Nagarajan and Kolter (2017), where local convergence was analysed in a more restricted setting for min-max games.

Let us consider the dynamics close to a local Nash equilibrium (θ*, ϕ*), where v(θ*, ϕ*) = 0 by definition. We denote the increment as (δθ, δϕ) = (θ − θ*, ϕ − ϕ*); then, to first order, d(δθ, δϕ)/dt = −H(δθ, δϕ), where H is the Jacobian of the game's gradients. This Jacobian has the following form for GANs:

H = ( ∂²ℓD/∂θ²    ∂²ℓD/∂ϕ∂θ )
    ( ∂²ℓG/∂θ∂ϕ   ∂²ℓG/∂ϕ²  ) evaluated at (θ*, ϕ*).   (5)

For the cross-entropy loss, the non-saturated loss, and the Wasserstein loss we observe that at the local Nash, ∂²ℓD/∂ϕ∂θ = −(∂²ℓG/∂θ∂ϕ)ᵀ (please see Appendix B for a derivation). Consequently, for GANs, we consider the linearised dynamics of the following form:

( dθ/dt )       ( A    Bᵀ ) ( θ )
( dϕ/dt )  = −  ( −B   C  ) ( ϕ ),   (6)

where A = ∂²ℓD/∂θ², B = ∂²ℓG/∂θ∂ϕ and C = ∂²ℓG/∂ϕ² denote the blocks of H.

###### Lemma 3.1.

Given a linearised vector field of the form shown in Eq. (6), where either A ≻ 0 and C ⪰ 0, or A ⪰ 0 and C ≻ 0, and B is full rank, following the dynamics will always converge to the equilibrium (θ, ϕ) = (0, 0).

###### Proof.

The proof requires three parts: we first show that H of this form must be invertible (Lemma A.1); we then show, as a result (Lemma A.2), that the real parts of the eigenvalues of H must be strictly greater than 0. The third part follows from the previous Lemmas: as the eigenvalues λᵢ of H satisfy Re(λᵢ) > 0, the solution of the linear system converges at least at rate e^{−minᵢ Re(λᵢ) t}. Consequently, as t → ∞, (θ(t), ϕ(t)) → (0, 0). ∎

We observe A ⪰ 0 for the cross-entropy loss and the non-saturating loss if we train with piece-wise linear activation functions – for details see Appendix C. Thus for these losses, if the conditions for convergence hold (i.e. under the assumption that one of A or C is positive definite and the other positive semi-definite), we should converge to the differential Nash equilibrium in its vicinity if we follow the dynamics exactly.

###### Remark.

We note that if we consider a non-hyperbolic fixed point, namely one where the eigenvalues of the Jacobian of the dynamics have no real parts, then following the dynamics will only rotate around the fixed point. An example with such a non-hyperbolic fixed point is the bilinear game ℓD(θ, ϕ) = bθϕ, ℓG(θ, ϕ) = −bθϕ, for any b ≠ 0. However, we argue that this special case is not applicable for neural network based GANs. To see this, note that the discriminator loss on the real examples is totally independent of the generator. Consequently, in this case we should consider A ⪰ 0 with A ≠ 0.

#### Convergence Under a Strong Rotational Field

As an illustrative example of Lemma 3.1, and to verify the importance of accurate time integration of the dynamics, we provide a simple toy example with a strong rotational vector field. Consider a two-player game where the loss functions are given by:

ℓD(θ, ϕ) = (ε/2)θ² − θϕ,   ℓG(θ, ϕ) = θϕ.

This game maps directly to the linearised form from above (with A = ε, B = 1 and C = 0, so A ≻ 0 and B is full rank for ε > 0, as in Lemma 3.1). When ε > 0, the Nash is at (θ, ϕ) = (0, 0). The vector field of the system is

( dθ/dt )       ( ε   −1 ) ( θ )
( dϕ/dt )  = −  ( 1    0 ) ( ϕ ).   (7)

When 0 < ε < 2, it has the analytical solution

θ(t) = e^{−εt/2}(a₀ cos(ωt) + b₀ sin(ωt)),   (8)
ϕ(t) = e^{−εt/2}(a₁ cos(ωt) + b₁ sin(ωt)),   (9)

where ω = √(1 − ε²/4), and the coefficients a₀, b₀, a₁, b₁ can be determined from the initial conditions. Thus the dynamics converge to the Nash as t → ∞, independent of the initial conditions. In Fig. 1 we compare a first order numerical integrator (Euler's method) against a second order integrator (two-stage Runge-Kutta, also known as Heun's method) for solving the dynamical system numerically when ε is small. With a moderately large step size over 200 timesteps, Euler's method diverges while RK2 converges.
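This contrast is easy to reproduce numerically. The sketch below integrates the vector field of Eq. (7) with both methods; the values ε = 0.1 and h = 0.5 are illustrative choices (the paper's exact settings are not reproduced here):

```python
import numpy as np

EPS = 0.1  # illustrative epsilon; any small positive value shows the effect

def v(x):
    """Vector field of Eq. (7), x = (theta, phi)."""
    theta, phi = x
    return np.array([-EPS * theta + phi, -theta])

def euler_step(x, h):
    return x + h * v(x)

def heun_step(x, h):
    """Two-stage Runge-Kutta (RK2): predict with Euler, correct by averaging."""
    x_tilde = x + h * v(x)
    return x + 0.5 * h * (v(x) + v(x_tilde))

h = 0.5
x_euler = x_heun = np.array([1.0, 1.0])
for _ in range(200):
    x_euler = euler_step(x_euler, h)
    x_heun = heun_step(x_heun, h)
# Euler spirals outward (diverges); Heun decays towards the Nash at (0, 0).
```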

## 4 ODE-GANs

In this section we outline our practical algorithm for applying standard ODE solvers to GAN training; we call the resulting model an ODE-GAN. With few exceptions (such as the exponential function), most ODEs cannot be solved in closed-form and instead rely on numerical solvers that integrate the ODEs at discrete steps. To accomplish this, an ODE solver approximates the ODE’s solution as a cumulative sum in small increments.

We denote the following to be an update using an ODE solver, which takes the velocity function v, the current parameter state (θ_k, ϕ_k) and a step size h as input:

(θ_{k+1}, ϕ_{k+1}) = ODEStep(θ_k, ϕ_k, v, h).   (10)

Note that v is the velocity field defined in Eq. (4). For an Euler integrator the method would simply compute (θ_{k+1}, ϕ_{k+1}) = (θ_k, ϕ_k) + h v(θ_k, ϕ_k), which is equivalent to simultaneous gradient descent with step size h. However, as we will outline below, higher-order methods such as the fourth-order Runge-Kutta method can also be used. After the update step is computed, we add a small amount of regularisation to further control the truncation error of the numerical integrator. A listing of the full procedure is given in Algorithm 1.

### 4.1 Numerical Solvers for ODEs

We consider several classical ODE solvers for the experiments in this paper, although any ODE solver may be used with our method.

Different Orders of Numerical Integration We experimented with a range of solvers of different orders. The order controls how the truncation error changes with respect to the step size h: the errors of first order methods reduce linearly with decreasing h, while the errors of second order methods decrease quadratically. The ODE solvers considered in this paper are: the first order Euler's method, a second order Runge-Kutta method – Heun's method – (RK2), and the fourth order Runge-Kutta method (RK4). For details on the explicit updates to the GAN parameters applied by each of the methods we refer to Appendix E. Computational costs for calculating v grow with higher-order integrators; in our implementation, the most expensive solver considered (RK4) was only moderately slower (in wall-clock time) than standard GAN training.

Connections to Existing Methods for Stabilizing GANs We further observe (derivation in Appendix F) that existing methods such as Consensus Optimization (Mescheder et al., 2017), Extragradient (Korpelevich, 1976; Chavdarova et al., 2019) and Symplectic Gradient Adjustment (Balduzzi et al., 2018) can be seen as approximating higher order integrators.

### 4.2 Practical Considerations for Stable Integration of GANs

To guarantee stable integration, two issues need to be considered: exploding gradients and the noise from mini-batch gradient estimation.

Gradient Regularisation. When using deep neural networks in the GAN setting, gradients are not a priori guaranteed to be bounded (as required for ODE integration). In particular, looking at the form of v we can make two observations. First, the discriminator's gradient is grounded by real data and we found it to not explode in practice. Second, the generator's gradient can explode easily, depending on the discriminator's functional form during learning. This prevents us from using even moderate step sizes for integration. However, in contrast to solving a standard ODE, the dynamics in our ODE are given by learned functions. Thus we can control the gradient magnitude to some extent. To this end, we found that a simple mitigation is to regularise the discriminator such that the gradients are bounded. In particular we use the regulariser

R(θ) = λ ‖∂ℓG/∂ϕ‖²,   (11)

whose gradient wrt. the discriminator parameters is well-defined in the GAN setting. Importantly, since this regulariser vanishes as we approach a differential Nash equilibrium, it does not change the parameters at the equilibrium. Empirically, we found that this regulariser can efficiently control integration errors (see Appendix D for details).

This regulariser is similar to the one proposed by Nagarajan and Kolter (2017). While they suggest adjusting the updates to the generator, we found this did not control the gradient norm well, as our experiments suggest that it is the gradient of the loss wrt. the generator parameters that explodes in practice. It also shares similarities with the penalty term from Gulrajani et al. (2017); however, we regularise the gradient with respect to the parameters rather than the input.
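To make Eq. (11) concrete, here is a minimal sketch on the toy game ℓG = θϕ, where ∂ℓG/∂ϕ = θ and hence R(θ) = λθ². The finite-difference check stands in for the automatic differentiation a real implementation would use; all names and values are illustrative:

```python
# Gradient penalty of Eq. (11) on the toy game l_G(theta, phi) = theta * phi,
# where dl_G/dphi = theta and hence R(theta) = lam * theta**2.
lam = 0.01  # illustrative regularisation weight

def grad_lG_phi(theta, phi):
    """Gradient of the generator loss wrt. phi (analytic for the toy game)."""
    return theta

def reg_grad_theta(theta, phi, eps=1e-5):
    """Finite-difference estimate of d/dtheta [lam * (dl_G/dphi)**2]."""
    r_plus = lam * grad_lG_phi(theta + eps, phi) ** 2
    r_minus = lam * grad_lG_phi(theta - eps, phi) ** 2
    return (r_plus - r_minus) / (2 * eps)

# Analytic gradient of R is 2 * lam * theta; central differences are exact
# for this quadratic penalty up to floating-point error.
g = reg_grad_theta(0.5, -0.3)
```

Note that the gradient of R is taken wrt. the discriminator parameters θ, so the penalty pushes the discriminator towards regions where the generator's gradient stays bounded.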

A Note on Using Approximate Gradients. Practically, when numerically integrating the GAN equations, we resort to Monte-Carlo approximation of the expectations in v. As is common, we use a mini-batch of samples from a fixed dataset to approximate expectations involving p(x), and use the same number of random samples from p(z) to approximate the respective expectations. Calculating the partial derivatives based on these Monte-Carlo estimates leads to approximate gradients, i.e. we use a noisy estimate of v, where the perturbation is introduced purely by sampling. We note that the continuous GAN dynamics themselves are not stochastic (i.e. they do not form a stochastic differential equation); noise arises only from approximation errors, which decrease with more samples. While one could expect the noise to affect integration quality, we empirically observed competitive results with integrators of order greater than one.

## 5 Experiments

We evaluate ODE-GANs on data from a mixture of Gaussians, CIFAR-10 (Krizhevsky et al., 2009) (unconditional), and ImageNet (Deng et al., 2009) (conditional). All experiments use the non-saturated loss ℓG = −E_{z∼p(z)}[log D(G(z; ϕ); θ)], which is also covered by our theory. Remarkably, our experiments suggest that simple Runge-Kutta integration with the regularisation is competitive (in terms of FID and IS) with common methods for training GANs, while simultaneously keeping the discriminator and generator losses close to the true Nash payoff values at convergence – log 4 for the discriminator and log 2 for the generator. Experiments are performed without Adam (Kingma and Ba, 2015) unless otherwise specified, and we evaluate IS and FID on 50k images. We also emphasise inspection of the behaviour of training at convergence, and not just the best scores over training, to determine whether a stable fixed point is found. Code is available at https://github.com/deepmind/deepmind-research/tree/master/ode_gan.

### 5.1 Mixture of Gaussians

We compare training with different integrators for a mixture of Gaussians (Fig. 2). We use a two-layer MLP (25 units) with ReLU activations, a latent dimension of 32 and a batch size of 512. This low dimensional problem is solvable both with Euler's method and with Runge-Kutta when using gradient regularisation – without regularisation, gradients grow large and integration is harmed; see Appendix H.3. In Fig. 2, we see that both the discriminator and the generator converge to the Nash payoff values, as shown by their corresponding losses; at the same time, all the modes of the mixture were recovered. As expected, convergence using Euler's method is slower than with RK4.

### 5.2 CIFAR-10 and ImageNet

We use 50K samples to evaluate IS/FID. Unless otherwise specified, we use the DCGAN architecture from Radford et al. (2015) for the CIFAR-10 dataset and ResNet-based GANs (Gulrajani et al., 2017) for ImageNet.

#### Different Orders of ODE Solvers

As shown in Fig. 3, we find that moving from a first order to a second order integrator can significantly improve training convergence, but going past second order yields diminishing returns. We further observe that higher order methods allow for much larger step sizes: Euler's method becomes unstable at larger step sizes while Heun's method (RK2) and RK4 do not. On the other hand, if we increase the regularisation weight, the performance gap between Euler and RK2 is reduced (results for higher regularisation are tabulated in Table 3 in the appendix). This hints at an implicit effect of the regulariser on the truncation error of the integrator.

The regulariser controls the truncation error by penalising large gradient magnitudes (see Appendix D). We illustrate this with an embedded method (the Fehlberg method, comparing errors between the 3rd and 2nd order estimates), which tracks the integration error over the course of training. We observe that larger λ leads to smaller error with RK4 (see appendix Fig. 11). We depict the effects of using different λ values in Fig. 5.

#### Loss Profiles for the Discriminator and Generator

Our experiments reveal that using more accurate ODE solvers results in loss profiles that differ significantly from the curves observed in standard GAN training, as shown in Fig. 5. Strikingly, we find that the discriminator loss and the generator loss stay very close to the values at a Nash equilibrium, which are log 4 for the discriminator and log 2 for the generator (shown by red lines in Fig. 5; see Goodfellow et al. (2014)). In contrast, the discriminator dominates the game when using the Adam optimiser, evidenced by a continuously decreasing discriminator loss, while the generator loss increases during training. This imbalance correlates with the well-known phenomenon of worsening FID and IS in late stages of training (we show this in Table 3; see also e.g. Arjovsky and Bottou (2017)).

#### Comparison to Standard GAN training

This section compares using ODE solvers versus standard methods for GAN training. Our results challenge the widely-held view that adaptive optimisers are necessary for training GANs, as revealed in Fig. 7. Moreover, the often observed degrading performance towards the end of training disappears with improved integration. To our knowledge, this is the first time that competitive results for GAN training have been demonstrated for image generation without adaptive optimisers. We also compare ODE-GAN with SN-GAN in Fig. 7 (re-trained and tuned using our code for fair comparison) and find that ODE-GAN can improve significantly upon SN-GAN for both IS and FID. Comparisons to more baselines are listed in Table 1. For the DCGAN architecture, ODE-GAN (RK4) achieves 17.66 FID as well as 7.97 IS; these best scores are remarkably close to the scores we observe at the end of training.

ODE solvers and adaptive optimisers can also be combined. We considered this via the following approach (listed as ODE-GAN(RK4+Adam) in tables): we use the adaptive learning rates computed by Adam to scale the gradients used in the ODE solver. Table 1 shows that this combination reaches similar best IS/FID, but then deteriorates. This observation suggests that Adam can efficiently accelerate training, but the convergence properties may be lost due to the modified gradients – analysing these interactions further is an interesting avenue for future work. For a more detailed comparison of RK4 and Adam, see Table 3 in the appendix.

In Tables 1 and 2, we present results to test ODE-GANs at a larger scale. For CIFAR-10, we experiment with the ResNet architecture from Gulrajani et al. (2017) and report the results for baselines using the same model architecture. ODE-GAN achieves 11.85 in FID and 8.61 in IS. Consistent with the behaviour we see with DCGAN, ODE-GAN (RK4) results in stable performance throughout training. Additionally, we trained a conditional model on ImageNet 128 × 128 with the ResNet used in SNGAN without Spectral Normalisation (for further details see Appendices G and H.1). ODE-GAN achieves 26.16 in FID and 38.71 in IS. We have also trained a larger ResNet on ImageNet (see Appendix G), where we obtain 22.29 for FID and 46.17 for IS using ODE-GAN (see Table 2). Consistent with all previous experiments, we find that the performance is stable and does not degrade over the course of training: see Fig. 8 and Tables 1 and 2.

## 6 Discussion and Relation to Existing Work

Our work explores higher-order approximations of the continuous dynamics induced by GAN training. We show that improved convergence and stability can be achieved by faithfully following the vector field of the adversarial game – without the necessity for more involved techniques to stabilise training, such as those considered in Salimans et al. (2016); Balduzzi et al. (2018); Gulrajani et al. (2017); Miyato et al. (2018). Our empirical results thus support the hypothesis that, at least locally, the GAN game is not inherently unstable. Rather, the discretisation of GANs’ continuous dynamics, yielding inaccurate time integration, causes instability. On the empirical side, we demonstrated for training on CIFAR-10 and ImageNet that both Adam (Kingma and Ba, 2015) and spectral normalisation (Miyato et al., 2018), two of the most popular techniques, may harm convergence, and that they are not necessary when higher-order ODE solvers are available.

The dynamical systems perspective has been employed for analysing GANs in previous works (Nagarajan and Kolter, 2017; Wang et al., 2019; Mescheder et al., 2017; Balduzzi et al., 2018; Gemp and Mahadevan, 2018). They mainly consider simultaneous gradient descent to analyse the discretised dynamics. In contrast, we study the link between GANs and their underlying continuous time dynamics which prompts us to use higher-order integrators in our experiments. Others made related connections: for example, using a second order ODE integrator was also considered in a simple 1-D case for GANs in Gemp and Mahadevan (2018), and Nagarajan and Kolter (2017) also analysed the continuous dynamics in a more restrictive setting – in a min-max game around the optimal solution. We hope that our paper can encourage more work in the direction of this connection (Wang et al., 2019), and adds to the valuable body of work on analysing GAN training convergence (Wang et al., 2020; Fiez et al., 2019; Mescheder et al., 2018).

Lastly, it is worth noting that viewing traditionally discrete systems through the lens of continuous dynamics has recently attracted attention in other parts of machine learning. For example, the Neural ODE (Chen et al., 2018) interprets the layered processing of residual neural networks (He et al., 2016) as Euler integration of a continuous system. Similarly, we hope our work can contribute towards establishing a bridge for utilising tools from dynamical systems for generative modelling.

This work offers a perspective on training generative adversarial networks through the lens of solving an ordinary differential equation. As such it helps us connect an important part of current studies in machine learning (generative modelling) to an old and well studied field of research (integration of dynamical systems).

Making this connection more rigorous over time could help us understand how to better model natural phenomena, see e.g. Zoufal et al. (2019) and Casert et al. (2020) for recent steps in this direction. Further, tools developed for the analysis of dynamical systems could potentially help reveal in what form exploitable patterns exist in the models we are developing – or their dynamics – and as a result contribute to the goal of learning robust and fair representations (Gowal et al., 2019).

The techniques proposed in this paper make training of GAN models more stable. This may result in making it easier for non-experts to train such models for beneficial applications like creating realistic images or audio for assistive technologies (e.g. for the speech-impaired, or technology for restoration of historic text sources). On the other hand, the technique could also be used to train models for nefarious applications, such as forging images and videos (often colloquially referred to as "DeepFakes"). There are some research projects aiming to mitigate this issue, one example being the DeepFakes Detection Challenge (Dolhansky et al., 2019).

## Acknowledgements

We would especially like to thank David Balduzzi for insightful initial discussions and Ian Gemp for careful reading of our paper and feedback on the work.

## Supplementary

We present details of the proofs from the main paper in Sections A-C. We include an analysis of the effects of regularisation on the truncation error (Section D). Update rules for the ODE solvers considered in the main paper are presented in Section E. The connections between our method and Consensus optimisation, SGA and extragradient are reported in Section F. Further details of experiments and additional experimental results are given in Sections G-H. Image samples are shown in Section I.

## Appendix A Real Parts of the Eigenvalues are Positive

###### Definition A.1.

The strategy (θ*, ϕ*) is a differential Nash equilibrium if at this point the first order derivatives satisfy ∂ℓD/∂θ = 0 and ∂ℓG/∂ϕ = 0, and the second order derivatives satisfy ∂²ℓD/∂θ² ≻ 0 and ∂²ℓG/∂ϕ² ≻ 0 (see Ratliff et al. [2016]).

###### Lemma A.1.

For a matrix H of the form

H = ( A    Bᵀ )
    ( −B   C  ).

Given B is full rank, if either A ≻ 0 and C ⪰ 0, or A ⪰ 0 and C ≻ 0, then H is invertible.

###### Proof.

Let us assume that A ≻ 0 and C ⪰ 0 (the case A ⪰ 0, C ≻ 0 is analogous); then, since B is full rank, we note C + BA⁻¹Bᵀ ≻ 0. Thus by definition A and C + BA⁻¹Bᵀ are invertible as they have no zero eigenvalues. The Schur complement decomposition of H is given by

H = ( I       0 ) ( A    0            ) ( I   A⁻¹Bᵀ )
    ( −BA⁻¹   I ) ( 0    C + BA⁻¹Bᵀ ) ( 0   I      ).

Each matrix in this decomposition is invertible, thus H is invertible. ∎

###### Lemma A.2.

For a matrix H of the form

H = ( A    Bᵀ )
    ( −B   C  ),

if either A ≻ 0 and C ⪰ 0 or vice versa, and B is full rank, the real parts of the eigenvalues of H are strictly positive.

###### Proof.

We assume A ≻ 0 and C ⪰ 0. The proof for A ⪰ 0 and C ≻ 0 is analogous.

To see that the real parts of the eigenvalues are positive, we first note that the matrix H satisfies the following:

H = ( A  Bᵀ ; −B  C )  ⇒  xᵀHx ≥ 0 for all x.

To see this more explicitly, note

[xᵀ, yᵀ] H [x; y] = xᵀAx + yᵀCy ≥ 0   ∀ x ∈ ℝᴺ, y ∈ ℝᴹ,

where N and M are the dimensions of θ and ϕ respectively (the cross terms xᵀBᵀy and −yᵀBx are equal scalars of opposite sign and cancel). From this, we can derive the following property about an eigenvalue λ = α + iβ of H with corresponding eigenvector v = u_r + iu_c.

(H − λI)v = 0   (12)
⇒ (H − (α + iβ)I)(u_r + iu_c) = 0   (13)

Thus the following conditions hold:

(H − αI)u_r + βu_c = 0   (14)
(H − αI)u_c − βu_r = 0.   (15)

Thus we can multiply the top equation by u_rᵀ and the bottom equation by u_cᵀ, and add them, to retrieve the following:

α = (u_rᵀHu_r + u_cᵀHu_c) / (u_rᵀu_r + u_cᵀu_c) ≥ 0.

To see that α is strictly above zero, we note that we can split the eigenvector into the components corresponding to θ and ϕ:

u_r + iu_c = ( u_r⁰ + iu_c⁰ ; u_r¹ + iu_c¹ ).

Thus α can now be rewritten as the following:

α = ((u_r⁰)ᵀA u_r⁰ + (u_r¹)ᵀC u_r¹ + (u_c⁰)ᵀA u_c⁰ + (u_c¹)ᵀC u_c¹) / (u_rᵀu_r + u_cᵀu_c).

Since A ≻ 0, for this to be zero the eigenvector must have u_r⁰ = u_c⁰ = 0, i.e.

u_r + iu_c = ( 0 ; u_r¹ + iu_c¹ ).

If this is the form of an eigenvector of H then we get the following:

( A  Bᵀ ; −B  C ) ( 0 ; u ) = ( Bᵀu ; Cu ) = λ ( 0 ; u ),

where u = u_r¹ + iu_c¹. One condition which is needed is that Bᵀu = 0. Another condition needed for this to be an eigenvector is that λ is an eigenvalue of C. However, since C is real and symmetric, this would mean that the eigenvalue is real. If this eigenvalue were thus at λ = α = 0, the matrix H would not be invertible, contradicting Lemma A.1. Thus α > 0, concluding the proof. ∎

Linear Analysis: The last part of the proof of Lemma 3.1 can be seen using linear dynamical systems analysis. Namely, the dynamics are given by dx/dt = −Hx, where x = (θ, ϕ) and H is a diagonalisable matrix with eigenvalues λ₁, …, λₙ. Then we can decompose the initial condition as x(0) = Σᵢ cᵢwᵢ, where wᵢ is the eigenvector corresponding to eigenvalue λᵢ. The solution can be derived as x(t) = Σᵢ cᵢ e^{−λᵢt} wᵢ.
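This closed-form solution is easy to check numerically against direct integration. The sketch below uses an illustrative 2×2 matrix with the block structure ( A  Bᵀ ; −B  C ), A, C > 0 and B full rank; the numbers are assumptions for demonstration only:

```python
import numpy as np

# Illustrative H with structure (A, B^T; -B, C): A = 0.5, C = 0.25, B = 1.
H = np.array([[0.5, 1.0],
              [-1.0, 0.25]])
x0 = np.array([1.0, -2.0])

# Eigendecomposition: x(t) = sum_i c_i exp(-lambda_i t) w_i.
lam, W = np.linalg.eig(H)                 # eigenvalues have positive real part
c = np.linalg.solve(W, x0.astype(complex))  # coefficients in the eigenbasis

t_final = 3.0
x_eig = (W @ (c * np.exp(-lam * t_final))).real

# Reference: fine-grained Euler integration of dx/dt = -H x.
x_num, dt = x0.copy(), 1e-4
for _ in range(int(t_final / dt)):
    x_num = x_num - dt * (H @ x_num)
# The two solutions agree, and both decay since Re(lambda_i) > 0.
```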

## Appendix B Off-Diagonal Elements are Opposites at the Nash

Here we show that the off-diagonal elements of the Hessian with respect to the Wasserstein, cross-entropy and non-saturating losses are opposites, i.e. ∂²ℓD/∂ϕ∂θ = −(∂²ℓG/∂θ∂ϕ)ᵀ. For the cross-entropy and Wasserstein losses, this property holds for zero-sum games by definition. Thus we only show this for the non-saturating loss.

For the non-saturating loss, the objectives for the discriminator and the generator are given by the following:

ℓD(θ, ϕ) = −∫_x p_d(x) log(D(x; θ)) dx − ∫_z p(z) log(1 − D(G(z; ϕ); θ)) dz   (16)
ℓG(θ, ϕ) = −∫_z p(z) log(D(G(z; ϕ); θ)) dz.   (17)

We transform the generated samples with x = G(z; ϕ), where x is now drawn from the probability distribution p_g(x; ϕ). We can rewrite these losses as the following:

ℓD(θ, ϕ) = −∫_x (p_d(x) log(D(x; θ)) + p_g(x; ϕ) log(1 − D(x; θ))) dx   (18)
ℓG(θ, ϕ) = −∫_x p_g(x; ϕ) log(D(x; θ)) dx.   (19)

Note that ∂²ℓD/∂ϕ∂θ = −(∂²ℓG/∂θ∂ϕ)ᵀ if and only if the following condition is true:

∂²(ℓD + ℓG)/∂θ∂ϕ = 0.

For the non-saturating loss this becomes the following:

∂²(ℓD + ℓG)/∂θ∂ϕ = −∫_x (∂D(x; θ)/∂θ)(∂p_g(x; ϕ)/∂ϕ) (1/D(x; θ) − 1/(1 − D(x; θ))) dx.   (20)

At the global Nash, we know that p_g(x; ϕ) = p_d(x) and D(x; θ) = 1/2 for all x, so 1/D − 1/(1 − D) = 0 and thus this is identically zero.

## Appendix C Hessian with Respect to the Discriminator's Parameters is Positive Semi-Definite for Piecewise-Linear Activation Functions

In this section, we show that when we use piecewise-linear activation functions such as ReLU or LeakyReLU, in the case of the cross-entropy loss or the non-saturating loss (as they share the same discriminator loss), the Hessian wrt. the discriminator's parameters is positive semi-definite. Here we also make the assumption that we are never at a point where the piece-wise function switches state (as the curvature is not defined at these points). Thus we note

∂²D/∂θ² = 0.

For the cross-entropy and non-saturating losses, the discriminator network often outputs the logit D(x; θ) of a sigmoid function σ(·). The Hessian with respect to the parameters is then given by:

$$\ell_D(\theta,\phi) = -\int_x p_D(x)\log(\sigma(D(x;\theta))) + p_G(x)\log(1-\sigma(D(x;\theta)))\,dx$$
$$\frac{\partial \ell_D}{\partial\theta} = -\int_x \big(p_D(x)(1-\sigma(D(x;\theta))) - p_G(x)\,\sigma(D(x;\theta))\big)\frac{\partial D(x;\theta)}{\partial\theta}\,dx$$
$$\Rightarrow\ \frac{\partial^2 \ell_D}{\partial\theta^2} = \int_x (p_D(x)+p_G(x))\,\sigma(D(x;\theta))(1-\sigma(D(x;\theta)))\,\frac{\partial D(x;\theta)}{\partial\theta}\frac{\partial D(x;\theta)}{\partial\theta}^{T}dx \succeq 0. \tag{21}$$
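The positive semi-definiteness in Eq. (21) can be checked numerically. The sketch below is our own toy construction, not the paper’s setup: it uses a linear discriminator $D(x;\theta)=\theta^T x$, for which $\partial^2 D/\partial\theta^2 = 0$ holds exactly (the same property ReLU networks have away from their kinks), and compares the analytic Hessian against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Linear discriminator D(x; theta) = theta^T x, so d^2 D / d theta^2 = 0.
d = 3
theta = rng.normal(size=d)
x_real = rng.normal(size=(8, d))          # samples standing in for p_D
x_fake = rng.normal(size=(8, d)) + 1.0    # samples standing in for p_G

def loss(th):
    # Monte-Carlo version of the cross-entropy discriminator loss above.
    return (-np.mean(np.log(sigmoid(x_real @ th)))
            - np.mean(np.log(1.0 - sigmoid(x_fake @ th))))

# Analytic Hessian from Eq. (21): a weighted sum of outer products g g^T
# with non-negative weights sigma(D)(1 - sigma(D)), hence PSD by construction.
xs = np.concatenate([x_real, x_fake])
s = sigmoid(xs @ theta)
H = (xs * (s * (1.0 - s) / 8.0)[:, None]).T @ xs

# Finite-difference Hessian of the loss as an independent check.
eps = 1e-4
I = np.eye(d)
H_fd = np.array([[(loss(theta + eps*I[i] + eps*I[j]) - loss(theta + eps*I[i])
                   - loss(theta + eps*I[j]) + loss(theta)) / eps**2
                  for j in range(d)] for i in range(d)])

print(np.linalg.eigvalsh(H))   # eigenvalues non-negative (up to round-off)
```

For a nonlinear piecewise-linear network the same argument applies piece by piece, away from the kinks.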

## Appendix D Gradient Norm Regularisation and Effects on Integration Error

Here we provide intuition for why gradient-norm regularisation might be needed after applying the ODE solver. We start by considering how we can approximate the truncation error of Euler’s method with step size $h$.

The update rule of Euler’s method is given by

$$y_{k+1} = y_k + h\,v(y_k).$$

Note here $\dot{y} = v(y)$. The truncation error is approximated by comparing this update to the one obtained when we halve the step size and take two update steps:

$$\tilde{y}_{k+1} = y_k + \frac{h}{2}v(y_k) + \frac{h}{2}v\!\left(y_k + \frac{h}{2}v(y_k)\right) \tag{22}$$
$$= y_k + h\,v(y_k) + \frac{h^2}{4}\frac{\partial v}{\partial y}(y_k)\,v(y_k) + O(h^3) \tag{23}$$
$$= y_{k+1} + \frac{h^2}{4}\frac{\partial v}{\partial y}(y_k)\,v(y_k) + O(h^3) \tag{24}$$
$$\Rightarrow\ \tau_{k+1} = \tilde{y}_{k+1} - y_{k+1} \sim \frac{h^2}{4}\frac{\partial v}{\partial y}(y_k)\,v(y_k) = O(\|v\|). \tag{25}$$

Here we see that the truncation error is linear with respect to the magnitude of the gradient. Thus we need the magnitude of the gradient to be bounded in order for the truncation error to be small.
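This scaling can be seen in a toy example. The sketch below (names and the toy field are ours, not from the paper) estimates the Euler truncation error by step halving, exactly as in Eqs. (22)–(25), for a rotational field whose Jacobian is constant, so the error should scale linearly with $\|v\|$:

```python
import numpy as np

# Toy rotational field v(y) = (-y2, y1) with constant Jacobian. Scaling y
# scales |v(y)| while leaving dv/dy unchanged, so by Eq. (25) the Euler
# truncation error should grow linearly with the gradient magnitude |v|.
def v(y):
    return np.array([-y[1], y[0]])

def truncation(y, h):
    full = y + h * v(y)                     # one Euler step of size h
    half = y + (h / 2) * v(y)               # two Euler steps of size h/2
    half = half + (h / 2) * v(half)
    return np.linalg.norm(half - full)      # ~ (h^2/4) |(dv/dy) v|

h = 0.1
e1 = truncation(np.array([1.0, 0.0]), h)
e10 = truncation(np.array([10.0, 0.0]), h)
print(e10 / e1)   # ≈ 10.0: truncation error scales linearly with |v|
```

Bounding the gradient magnitude (e.g. via the regulariser used in the paper) therefore directly bounds this error estimate.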

## Appendix E Numerical Integration Update Steps

##### Euler’s Method:
$$\begin{pmatrix}\theta_{k+1}\\ \phi_{k+1}\end{pmatrix} = \begin{pmatrix}\theta_k\\ \phi_k\end{pmatrix} + h\,v(\theta_k,\phi_k) \tag{26}$$
##### Heun’s Method (RK2):
$$\begin{pmatrix}\tilde{\theta}_k\\ \tilde{\phi}_k\end{pmatrix} = \begin{pmatrix}\theta_k\\ \phi_k\end{pmatrix} + h\,v(\theta_k,\phi_k), \qquad \begin{pmatrix}\theta_{k+1}\\ \phi_{k+1}\end{pmatrix} = \begin{pmatrix}\theta_k\\ \phi_k\end{pmatrix} + \frac{h}{2}\big(v(\theta_k,\phi_k) + v(\tilde{\theta}_k,\tilde{\phi}_k)\big) \tag{27}$$
##### Runge Kutta 4 (RK4):
$$v_1 = v(\theta_k,\phi_k), \quad v_2 = v\!\big(\theta_k + \tfrac{h}{2}(v_1)_\theta,\ \phi_k + \tfrac{h}{2}(v_1)_\phi\big), \quad v_3 = v\!\big(\theta_k + \tfrac{h}{2}(v_2)_\theta,\ \phi_k + \tfrac{h}{2}(v_2)_\phi\big), \quad v_4 = v\!\big(\theta_k + h(v_3)_\theta,\ \phi_k + h(v_3)_\phi\big),$$
$$\begin{pmatrix}\theta_{k+1}\\ \phi_{k+1}\end{pmatrix} = \begin{pmatrix}\theta_k\\ \phi_k\end{pmatrix} + \frac{h}{6}\big(v_1 + 2v_2 + 2v_3 + v_4\big) \tag{28}$$

Note that $(v_i)_\theta$ denotes the element of the vector field corresponding to $\theta$, and similarly for $(v_i)_\phi$.
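A minimal sketch of these update steps, treating the joint parameters $(\theta, \phi)$ as one flat vector (function names and the toy game below are ours, for illustration):

```python
import numpy as np

# Generic one-step integrators for the joint state y = (theta, phi),
# following Eqs. (26)-(28); `v` maps the concatenated parameters to the
# training vector field.
def euler_step(v, y, h):
    return y + h * v(y)

def heun_step(v, y, h):                       # RK2
    y_tilde = y + h * v(y)
    return y + (h / 2) * (v(y) + v(y_tilde))

def rk4_step(v, y, h):
    v1 = v(y)
    v2 = v(y + (h / 2) * v1)
    v3 = v(y + (h / 2) * v2)
    v4 = v(y + h * v3)
    return y + (h / 6) * (v1 + 2 * v2 + 2 * v3 + v4)

# Sanity check on the bilinear game min_theta max_phi (theta * phi), whose
# continuous dynamics v = (-phi, theta) rotate at constant radius.
v = lambda y: np.array([-y[1], y[0]])
y = np.array([1.0, 0.0])
for _ in range(100):
    y = rk4_step(v, y, 0.1)
print(np.linalg.norm(y))   # ≈ 1.0: RK4 nearly preserves the radius,
                           # whereas Euler would spiral outwards
```

This mirrors the paper’s observation that higher-order solvers track the stable continuous dynamics far more faithfully than plain gradient descent.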

## Appendix F Existing Methods which Approximate Second-order ODE Solvers

Here we show that previous methods such as consensus optimisation [Mescheder et al., 2017], SGA [Balduzzi et al., 2018] or Crossing-the-Curl [Gemp and Mahadevan, 2018], and extragradient methods [Korpelevich, 1976, Chavdarova et al., 2019] approximate second-order ODE solvers.⁵

To see this, let’s consider a “second-order” ODE solver of the following form:

$$\begin{pmatrix}\tilde{\theta}_k\\ \tilde{\phi}_k\end{pmatrix} = \begin{pmatrix}\theta_k\\ \phi_k\end{pmatrix} + \gamma\,v(\theta_k,\phi_k) \tag{29}$$
$$\begin{pmatrix}\theta_{k+1}\\ \phi_{k+1}\end{pmatrix} = \begin{pmatrix}\theta_k\\ \phi_k\end{pmatrix} + \frac{h}{2}\big(a\,v(\theta_k,\phi_k) + b\,v(\tilde{\theta}_k,\tilde{\phi}_k)\big) \tag{30}$$

where $\gamma$ is the extrapolation step size, $h$ is the update step size, and $a$ and $b$ scale the contribution of each term.

Extra-gradient: Note that when $a = 0$, $b = 2$ and $\gamma = h$, this is the extra-gradient method by definition [Korpelevich, 1976].

Consensus Optimisation: Note that when $a = 1$, $b = 1$ and $\gamma = h$, we get Heun’s method (RK2). For consensus optimisation, we only need $\gamma \neq 0$. If we perform a Taylor expansion on Eq. (30) (with $a + b = 2$):

$$\begin{pmatrix}\theta_{k+1}\\ \phi_{k+1}\end{pmatrix} = \begin{pmatrix}\theta_k\\ \phi_k\end{pmatrix} + h\,v(\theta_k,\phi_k) + \frac{h b\,\gamma}{2}\big((\nabla v)^T v(\theta_k,\phi_k) + O(\gamma\,|v|^2)\big). \tag{31}$$

With this, we see that the correction term $(\nabla v)^T v = \nabla \tfrac{1}{2}\|v\|^2$ matches the gradient adjustment used by consensus optimisation, i.e. this update approximates consensus optimisation [Mescheder et al., 2017].

SGA/Crossing the Curl: To see how we can approximate one part of SGA [Balduzzi et al., 2018]/Crossing-the-Curl [Gemp and Mahadevan, 2018], the update is now given by:

$$\theta_{k+1} = \theta_k + \frac{h}{2}\,v_\theta(\theta_k, \tilde{\phi}_k), \qquad \phi_{k+1} = \phi_k + \frac{h}{2}\,v_\phi(\tilde{\theta}_k, \phi_k) \tag{32}$$

where $v_\theta$ denotes the element of the vector field corresponding to $\theta$ and similarly for $v_\phi$, and $\tilde{\theta}_k$, $\tilde{\phi}_k$ are given in Eq. (29). The Taylor expansion of this update contains only the off-diagonal block elements of the matrix $\nabla v$. More explicitly, this is written out as the following:

$$\begin{pmatrix}\theta_{k+1}\\ \phi_{k+1}\end{pmatrix} = \begin{pmatrix}\theta_k\\ \phi_k\end{pmatrix} + \frac{h}{2}\begin{pmatrix}v_\theta + \gamma\,\dfrac{\partial v_\theta}{\partial\phi}\,v_\phi\\[6pt] v_\phi + \gamma\,\dfrac{\partial v_\phi}{\partial\theta}\,v_\theta\end{pmatrix} + O(h\gamma^2) \tag{33}$$

All of these algorithms are known to improve the convergence of GAN training, and we hypothesise that their effects are also related to improved numerical integration.
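The family in Eqs. (29)–(30) can be written as a small helper; the assertions below check that the stated $(a, b)$ choices recover the extra-gradient update and Heun’s method on a toy bilinear game (all names here are illustrative, not from the paper):

```python
import numpy as np

# Two-evaluation solver family of Eqs. (29)-(30): extrapolate, then take a
# weighted combination of the two vector-field evaluations.
def two_step(v, y, h, gamma, a, b):
    y_tilde = y + gamma * v(y)                        # Eq. (29)
    return y + (h / 2) * (a * v(y) + b * v(y_tilde))  # Eq. (30)

v = lambda y: np.array([-y[1], y[0]])    # rotational field of min-max theta*phi
y0, h = np.array([1.0, 0.0]), 0.1

# a=0, b=2, gamma=h: the extra-gradient update y + h v(y_tilde).
eg = two_step(v, y0, h, gamma=h, a=0.0, b=2.0)
assert np.allclose(eg, y0 + h * v(y0 + h * v(y0)))

# a=1, b=1, gamma=h: Heun's method (RK2).
heun = two_step(v, y0, h, gamma=h, a=1.0, b=1.0)
y_tilde = y0 + h * v(y0)
assert np.allclose(heun, y0 + (h / 2) * (v(y0) + v(y_tilde)))
```

Seen this way, these stabilisation methods differ mainly in how they weight the extrapolated evaluation, i.e. in how accurately they integrate the underlying ODE.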

## Appendix G Experimental Setup

We use two GAN architectures for image generation on CIFAR-10: the DCGAN [Radford et al., 2015] as modified by Miyato et al. [2018], and the ResNet [He et al., 2016] from Gulrajani et al. [2017] with additional parameters from Miyato et al. [2018] but with spectral normalisation removed. For conditional ImageNet generation we use a similar ResNet architecture from Miyato et al. [2018] with conditional batch-normalisation [Dumoulin et al., 2017] and projection [Miyato and Koyama, 2018], again with spectral normalisation removed. We also consider another ResNet in which the size of the hidden layers is increased relative to the previous ResNet and the latent size is increased from 128 to 256; we denote this ResNet (large).

### g.1 Hyperparameters for CIFAR-10 (DC-GAN)

For Euler, RK2 and RK4 integration, we first set for 500 steps. Then we go to till 400k steps and then we decrease the learning rate by half. For Euler integration we found that will be unstable, so we use . We train using batch size 64. The regularisation weight used is . For the Adam optimiser, we use for the generator and for the discriminator, with .

### g.2 Hyperparameters for CIFAR-10 (ResNet)

For RK4 first we set for the first 500 steps. Then we go to till 400k steps and then we decrease the learning rate by . We train with batch size 64. The regularisation weight used is . For the Adam optimiser, we use for both the generator and the discriminator, with .

### g.3 Hyperparameters for ImageNet (ResNet)

The same hyperparameters are used for ResNet and ResNet (large). For RK4 first we set for the first 15k steps, then we go to . We train with batch size 256. The regularisation weight used is .

## Appendix H Additional Experiments

Here we show experiments on ImageNet; ablation studies using different orders of numerical integrators; the effects of gradient regularisation; as well as experiments on combining the RK4 ODE solver with the Adam optimiser.

### h.1 Conditional ImageNet Generation

Fig. 8 shows that ODE-GAN can significantly improve upon SNGAN with respect to both Inception Score (IS) and Fréchet Inception Distance (FID). As in Miyato et al. [2018], we use conditional batch-normalisation [Dumoulin et al., 2017] and projection [Miyato and Koyama, 2018]. Similarly to what we observed on CIFAR-10, performance degrades over the course of training with SNGAN whereas ODE-GAN continues improving. We also note that as we increase the architecture’s size (i.e. increasing the size of the hidden layers and the latent size to 256), the IS and FID we obtain using SNGAN get worse, whereas for ODE-GAN we see an improvement in both; see Table 2 and Fig. 8. We note that our algorithm seems more prone to producing NaNs during training for conditional models, something we would like to understand further in future work.

### h.3 Supplementary: Effects of Regularisation

Gradient regularisation allows us to control the magnitude of the gradient. We hypothesise that this helps us control integration errors. Holding the step size constant, we observe that decreasing the regularisation weight leads to increased gradient norms and integration errors (Fig. 11), causing divergence. This is shown explicitly in Fig. 9, where we show that, with a low regularisation weight, the losses for the discriminator and the generator start oscillating heavily around the point where the gradient norm rapidly increases.

##### Gradient Regularisation vs SN and Truncation Error Analysis

We find that our regularisation (Grad Reg) outperforms spectral normalisation (SN) as measured by FID and IS (Fig. 11). Meanwhile, Fig. 11 depicts the integration error (from Fehlberg’s method) over the course of training. As is visible, heavier regularisation leads to smaller integration errors.

## Appendix I ODE-GAN Samples

### Footnotes

1. For brevity, we set in our analysis.
2. We refer to Ratliff et al. (2016) for an overview on local and differential Nash equilibria.
3. Namely, the magnitude of the imaginary eigenvalues of the Hessian does not affect the rate of convergence.
4. We have observed large gradients when the generator samples are poor, where the discriminator may perfectly distinguish the generated samples from data. This potentially sharp decision boundary can drive the magnitude of the generator’s gradients to infinity.
5. Note that methods such as consensus optimisation or SGA/Crossing-the-Curl are methods which address the rotational aspects of the gradient vector field.

### References

1. Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
2. David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
3. Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
4. Corneel Casert, Kyle Mills, Tom Vieijra, Jan Ryckebusch, and Isaac Tamblyn. Optical lattice experiments at unobserved conditions and scales through generative adversarial deep learning. arXiv preprint arXiv:2002.07055, 2020.
5. Tatjana Chavdarova, Gauthier Gidel, François Fleuret, and Simon Lacoste-Julien. Reducing noise in gan training with variance reduced extragradient. In Advances in Neural Information Processing Systems, pages 391–401, 2019.
6. Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583, 2018.
7. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
8. Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) preview dataset. arXiv preprint arXiv:1910.08854, 2019.
9. Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017.
10. Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. Convergence of learning dynamics in stackelberg games. arXiv preprint arXiv:1906.01217, 2019.
11. Ian Gemp and Sridhar Mahadevan. Global convergence to the equilibrium of GANs using variational inequalities. arXiv preprint arXiv:1808.01531, 2018.
12. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
13. Sven Gowal, Chongli Qin, Po-Sen Huang, Taylan Cemgil, Krishnamurthy Dvijotham, Timothy Mann, and Pushmeet Kohli. Achieving robustness in the wild via adversarial mixing with disentangled representations. arXiv preprint arXiv:1912.03192, 2019.
14. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
15. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
16. Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
17. Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
18. GM Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
19. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
20. Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, pages 1825–1835, 2017.
21. Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, pages 3478–3487, 2018.
22. Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
23. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
24. Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In Advances in neural information processing systems, pages 5585–5595, 2017.
25. Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
26. Lillian J Ratliff, Samuel A Burden, and S Shankar Sastry. On the characterization of local nash equilibria in continuous games. IEEE Transactions on Automatic Control, 61(8):2301–2307, 2016.
27. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
28. Satinder P Singh, Michael J Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In UAI, pages 541–548, 2000.
29. Dávid Terjék. Adversarial lipschitz regularization. In International Conference on Learning Representations, 2020.
30. Chuang Wang, Hong Hu, and Yue Lu. A solvable high-dimensional model of gan. In Advances in Neural Information Processing Systems, pages 13759–13768, 2019.
31. Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. In International Conference on Learning Representations, 2020.
32. Christa Zoufal, Aurélien Lucchi, and Stefan Woerner. Quantum generative adversarial networks for learning and loading random distributions. npj Quantum Information, 5(1):1–9, 2019.