# ODE Analysis of Stochastic Gradient Methods with Optimism and Anchoring for Minimax Problems and GANs

Ernest K. Ryu   Kun Yuan   Wotao Yin
Department of Mathematics, UCLA
Electrical and Computer Engineering, UCLA
DAMO Academy, Alibaba US
eryu@math.ucla.edu, kunyuan@ucla.edu, wotao.yin@alibaba-inc.com

Preprint. Under review.

###### Abstract

Despite remarkable empirical success, the training dynamics of generative adversarial networks (GANs), which involve solving a minimax game using stochastic gradients, are still poorly understood. In this work, we analyze last-iterate convergence of simultaneous gradient descent (simGD) and its variants under the assumption of convex-concavity, guided by a continuous-time analysis with differential equations. First, we show that simGD, as is, converges with stochastic subgradients under strict convexity in the primal variable. Second, we generalize optimistic simGD to accommodate an optimism rate separate from the learning rate and show its convergence with full gradients. Finally, we present anchored simGD, a new method, and show convergence with stochastic subgradients.

## 1 Introduction

Training generative adversarial networks (GANs) [19], which requires solving a minimax game using stochastic gradients, is known to be difficult. Despite the remarkable empirical success of GANs, understanding the global training dynamics, both empirically and theoretically, is considered a major open problem [18, 54, 39, 37, 48].

The local training dynamics of GANs are understood reasonably well. Several works have analyzed convergence assuming the loss functions have linear gradients and the training uses full (deterministic) gradients. Although the linear gradient assumption is reasonable for local analysis (even though the loss functions may not be continuously differentiable due to ReLU activation functions), such results say very little about global convergence. Although the full gradient assumption is reasonable when the learning rate is small, such results say very little about how the randomness affects the training.

This work investigates global convergence of simultaneous gradient descent (simGD) and its variants for zero-sum games with a convex-concave cost using stochastic subgradients. We specifically study convergence of the last iterates rather than the averaged iterates.

Section 2 presents convergence of simGD with stochastic subgradients under strict convexity in the primal variable. The goal is to establish a minimal sufficient condition of global convergence for simGD without modifications. Section 3 presents a generalization of optimistic simGD [8], which allows an optimism rate separate from the learning rate. We prove the generalized optimistic simGD using full gradients converges, and experimentally demonstrate that the optimism rate must be tuned separately from the learning rate when using stochastic gradients. However, it is unclear whether optimistic simGD is theoretically compatible with stochastic gradients. Section 4 presents anchored simGD, a new method, and presents its convergence with stochastic subgradients. The presentation and analyses of Sections 2, 3, and 4 are guided by continuous-time first-order ordinary differential equations (ODE). In particular, we interpret optimism and anchoring as discretizations of certain regularized dynamics. Section 5 experimentally demonstrates the benefit of optimism and anchoring for training GANs in some setups.

##### Prior work.

There are several independent directions for improving the training of GANs such as designing better architectures, choosing good loss functions, or adding appropriate regularizers [54, 2, 60, 1, 20, 64, 59, 37, 38, 41]. In this work, we accept these factors as a given and focus on how to train (optimize) the model effectively.

Optimism is a simple modification to remedy the cycling behavior of simGD, which can occur even under the bilinear convex-concave setup [8, 9, 10, 36, 17, 28, 42, 50]. These prior works assume the gradients are linear and use full gradients. Although the recent name ‘optimism’ originates from its use in online optimization [6, 55, 56, 62], the idea dates back to Popov’s work in the 1980s [53] and has been studied independently in the mathematical programming community [35, 32, 34, 33, 7].

The classical literature analyzes convergence of the Polyak-averaged iterates (which assign less weight to newer iterates) when solving convex-concave saddle point problems using stochastic subgradients [47, 46, 24, 17]. For GANs, however, last iterates or exponentially averaged iterates [66] (which assign more weight to newer iterates) are used in practice. Therefore, the classical work using Polyak averaging does not fully explain the empirical success of GANs.

The classical techniques used for the analyses of this work, the stochastic approximation technique [12, 22], ideas from control theory [22, 45], ideas from variational inequalities and monotone operator theory [16, 17], and continuous-time ODE analysis [22, 7], have been utilized for analyzing GANs.

Finally, we point out that the results of this work are broadly applicable beyond GANs since minimax game formulations are also used in other areas of machine learning such as actor-critic models [51] and domain adversarial networks [14, 13, 15, 31].

## 2 Stochastic simultaneous subgradient descent

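Consider the cost function $L:\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$ and the minimax game $\min_x\max_u L(x,u)$. We say $(x^\star,u^\star)\in\mathbb{R}^m\times\mathbb{R}^n$ is a solution to the minimax game or a saddle point of $L$ if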

 L(x⋆,u)≤L(x⋆,u⋆)≤L(x,u⋆),∀x∈Rm,u∈Rn.

We assume

 L is convex-concave and has a saddle point. (A0)

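By convex-concave, we mean $L(x,u)$ is a convex function in $x$ for fixed $u$ and a concave function in $u$ for fixed $x$. Define

$$G(x,u)=\begin{bmatrix}\partial_xL(x,u)\\ \partial_u(-L(x,u))\end{bmatrix},$$

where $\partial_x$ and $\partial_u$ respectively denote the convex subdifferential with respect to $x$ and $u$. For simplicity, write $z=(x,u)\in\mathbb{R}^{m+n}$ and $G(z)=G(x,u)$. Note that $0\in G(z)$ if and only if $z$ is a saddle point. Since $L$ is convex-concave, the operator $G$ is monotone [58]:

$$(g_1-g_2)^T(z_1-z_2)\ge0\qquad\forall\,g_1\in G(z_1),\ g_2\in G(z_2),\ z_1,z_2\in\mathbb{R}^{m+n}.\tag{1}$$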

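Let $g(z;\omega)$ be a stochastic subgradient oracle, i.e., $\mathbb{E}_\omega g(z;\omega)\in G(z)$ for all $z\in\mathbb{R}^{m+n}$, where $\omega$ is a random variable. Consider Simultaneous Stochastic Sub-Gradient Descent

$$z_{k+1}=z_k-\alpha_kg(z_k;\omega_k)\tag{SSSGD}$$

for $k=0,1,\dots$, where $z_0\in\mathbb{R}^{m+n}$ is a starting point, $\alpha_0,\alpha_1,\dots$ are positive learning rates, and $\omega_0,\omega_1,\dots$ are IID random variables. (We read SSSGD as “triple-SGD”.) In this section, we establish convergence of SSSGD when $L(x,u)$ is strictly convex in $x$.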

### 2.1 Continuous-time illustration

To understand the asymptotic dynamics of the stochastic discrete-time system, we consider a corresponding deterministic continuous-time system. For simplicity, assume $G$ is single-valued and smooth. Consider

$$\dot z(t)=-g(t),\qquad g(t)=G(z(t))$$

with an initial value $z(0)=z_0$. (We introduce $g(t)$ for notational simplicity.) Let $z^\star$ be a saddle point, i.e., $G(z^\star)=0$. Then $z(t)$ does not move away from $z^\star$:

$$\frac{d}{dt}\frac12\|z(t)-z^\star\|^2=-g(t)^T(z(t)-z^\star)\le0,$$

where we used (1). However, there is no mechanism forcing $z(t)$ to converge to a solution.

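Consider the two examples $L_0(x,u)=xu$ and $L_\rho(x,u)=(\rho/2)x^2+xu$ with

$$G_0(x,u)=\begin{bmatrix}0&1\\-1&0\end{bmatrix}\begin{bmatrix}x\\u\end{bmatrix},\qquad G_\rho(x,u)=\begin{bmatrix}\rho&1\\-1&0\end{bmatrix}\begin{bmatrix}x\\u\end{bmatrix}\tag{2}$$

where $x\in\mathbb{R}$ and $u\in\mathbb{R}$ and $\rho>0$. Note that $L_0$ is the canonical counterexample that also arises as the Dirac-GAN [37]. See Figure 1.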

The classical LaSalle–Krasovskii invariance principle [26, 27] states that (paraphrased) if $z_\infty$ is a limit point of $z(t)$, then the dynamics starting at $z_\infty$ will have a constant distance to $z^\star$. On the left of Figure 1, we can see that $\|z(t)-z^\star\|$ is constant as $\frac{d}{dt}\frac12\|z(t)-z^\star\|^2=0$ for all $t$. On the right of Figure 1, we can see that although $\frac{d}{dt}\frac12\|z(t)-z^\star\|^2=0$ when $x(t)=0$ for $L_\rho$ (the dotted line), this zero derivative is temporary as $z(t)$ will soon move past the dotted line. Therefore, $z(t)$ can maintain a constant distance to $z^\star$ only if it starts at $0$, and $0$ is the only limit point of $z(t)$.
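The contrast between the two examples of (2) can be checked numerically. The following is a minimal sketch rather than part of the analysis: it integrates the continuous-time dynamics with a classical Runge–Kutta scheme. The helper name `flow`, the step size, the time horizon, the starting point, and the value of `rho` are all illustrative assumptions.

```python
import numpy as np

def flow(G, z0, t_end, h=1e-2):
    """Integrate z'(t) = -G z(t) with classical RK4 and return z(t_end).

    Hypothetical helper for illustration; G is a 2x2 matrix as in (2).
    """
    z = np.asarray(z0, dtype=float)
    f = lambda v: -G @ v
    for _ in range(int(round(t_end / h))):
        k1 = f(z)
        k2 = f(z + 0.5 * h * k1)
        k3 = f(z + 0.5 * h * k2)
        k4 = f(z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

rho = 0.5  # assumed value; any rho > 0 makes the cost strictly convex in x
G0 = np.array([[0.0, 1.0], [-1.0, 0.0]])
Grho = np.array([[rho, 1.0], [-1.0, 0.0]])
z0 = np.array([1.0, 1.0])  # the saddle point of both examples is the origin

dist_G0 = np.linalg.norm(flow(G0, z0, 20.0))      # distance to the saddle point
dist_Grho = np.linalg.norm(flow(Grho, z0, 20.0))
print(dist_G0, dist_Grho)
```

Under these assumptions the bilinear flow keeps its initial distance to the origin, while the flow with the strictly convex term contracts toward it.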

### 2.2 Discrete-time convergence analysis

Consider the further assumptions

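$$\sum_{k=0}^\infty\alpha_k=\infty,\qquad\sum_{k=0}^\infty\alpha_k^2<\infty\tag{A1}$$

$$\mathbb{E}_{\omega_1,\omega_2}\|g(z_1;\omega_1)-g(z_2;\omega_2)\|^2\le R_1^2\|z_1-z_2\|^2+R_2^2\qquad\forall\,z_1,z_2\in\mathbb{R}^{m+n},\tag{A2}$$

where $\omega_1$ and $\omega_2$ are independent random variables and $R_1\ge0$ and $R_2\ge0$. These assumptions are standard in the sense that analogous assumptions are used in convex minimization to establish almost sure convergence of stochastic gradient descent.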

###### Theorem 1.

Assume (A0), (A1), and (A2). Furthermore, assume is strictly convex in for all . Then SSSGD converges in the sense of where is a saddle point of .

We can alternatively assume is strictly concave in for all and obtain the same result.
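As an informal illustration of the setting of Theorem 1, the sketch below runs SSSGD on the strictly convex example $G_\rho$ of (2) with additive Gaussian noise. The oracle, the noise level `sigma`, the step-size exponent, the seed, and the iteration count are assumptions made for this example only; additive noise of this kind satisfies (A2) with $R_1=\|G_\rho\|$ and $R_2^2=4\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
rho, sigma = 1.0, 0.1            # assumed curvature and noise level
G = np.array([[rho, 1.0], [-1.0, 0.0]])  # G_rho from (2); saddle point at 0

def oracle(z):
    """Unbiased stochastic oracle: exact G(z) plus zero-mean Gaussian noise."""
    return G @ z + sigma * rng.standard_normal(2)

z = np.array([1.0, 1.0])
num_steps = 100_000
for k in range(num_steps):
    # Step sizes with divergent sum and convergent sum of squares, as in (A1).
    alpha_k = 1.0 / (k + 1) ** 0.75
    z = z - alpha_k * oracle(z)

final_dist = np.linalg.norm(z)   # distance of the last iterate to the saddle point
print(final_dist)
```

A single finite run cannot verify almost sure convergence; it only shows the last iterate ending up close to the saddle point for this particular seed.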

The proof uses the stochastic approximation technique of [12]. We show that the discrete-time process converges (in an appropriate topology) to continuous-time trajectories satisfying a differential inclusion and use the LaSalle–Krasovskii invariance principle to argue that limit points are solutions.

##### Related prior work.

Theorem 3.1 of [36] proves a similar convergence result under the stronger assumption of strict convex-concavity in both and for the more general mirror descent setup.

## 3 Simultaneous GD with optimism

Consider the setup where $L$ is continuously differentiable and we access full (deterministic) gradients

$$G(x,u)=\begin{bmatrix}\nabla_xL(x,u)\\-\nabla_uL(x,u)\end{bmatrix}.$$

Consider optimistic simultaneous gradient descent

$$z_{k+1}=z_k-\alpha G(z_k)-\beta(G(z_k)-G(z_{k-1}))\tag{SimGD-O}$$

for $k\ge0$, where $z_0$ is a starting point, $z_{-1}=z_0$, $\alpha>0$ is the learning rate, and $\beta>0$ is the optimism rate. Optimism is a modification to simGD that remedies the cycling behavior; for the bilinear example of (2), simGD (case $\beta=0$) diverges while SimGD-O with appropriate $\beta>0$ converges. In this section, we provide a continuous-time interpretation of SimGD-O as a regularized dynamics and establish convergence for the deterministic setup.
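The effect of the optimism term on the bilinear example of (2) can be reproduced with a few lines of code. This is a minimal sketch: the function name `simgd_o`, the initialization $z_{-1}=z_0$, and the values of the learning rate, optimism rate, and iteration count are illustrative assumptions and are not taken from any theorem.

```python
import numpy as np

G0 = np.array([[0.0, 1.0], [-1.0, 0.0]])  # bilinear example from (2)

def simgd_o(alpha, beta, z0, num_steps):
    """Run SimGD-O on G(z) = G0 z with z_{-1} = z_0; return the last iterate."""
    z_prev = np.asarray(z0, dtype=float)
    z = z_prev.copy()
    for _ in range(num_steps):
        g, g_prev = G0 @ z, G0 @ z_prev
        # Gradient step plus the optimism correction beta * (g_k - g_{k-1}).
        z_prev, z = z, z - alpha * g - beta * (g - g_prev)
    return z

z0 = np.array([1.0, 1.0])
norm_plain = np.linalg.norm(simgd_o(0.1, 0.0, z0, 2000))  # beta = 0: plain simGD
norm_opt = np.linalg.norm(simgd_o(0.1, 0.1, z0, 2000))    # beta > 0: optimism
print(norm_plain, norm_opt)
```

With these settings the plain iteration spirals outward while the optimistic iteration contracts toward the saddle point at the origin.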

### 3.1 Continuous-time illustration

Consider the regularized continuous-time dynamics

 ˙ζ(t)=−αGβ(ζ(t)),

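where $G_\beta$ is the Moreau–Yosida regularization of $G$. With a change of variables we get

$$\dot z(t)=-\alpha g(t)-\beta\dot g(t),\qquad g(t)=G(z(t)),$$

and the discretization $\dot z(t)\approx z_{k+1}-z_k$ and $\dot g(t)\approx G(z_k)-G(z_{k-1})$ yields SimGD-O.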

We further explain. The Moreau–Yosida [43, 67] regularization of $G$ with parameter $\beta>0$ is

$$G_\beta=\beta^{-1}\left(I-(I+\beta G)^{-1}\right).$$

To clarify, $I$ is the identity mapping and $(I+\beta G)^{-1}$ is the inverse (as a function) of $I+\beta G$, which is well-defined by Minty’s theorem [40]. It is straightforward to verify that $G_\beta(z)=0$ if and only if $G(z)=0$, i.e., $G_\beta$ and $G$ share the same equilibrium points. For small $\beta$, we can think of $G_\beta$ as an approximation of $G$ that is better-behaved. Specifically, $G$ is merely monotone (satisfies (1)), but $G_\beta$ is furthermore $\beta$-cocoercive, i.e.,

$$(G_\beta(z_1)-G_\beta(z_2))^T(z_1-z_2)\ge\beta\|G_\beta(z_1)-G_\beta(z_2)\|^2\qquad\forall\,z_1,z_2\in\mathbb{R}^{m+n}.\tag{3}$$
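For a linear operator, the Moreau–Yosida regularization reduces to matrix algebra, which makes (3) easy to check numerically. The sketch below does this for the bilinear operator $G_0$ of (2), which is monotone but not cocoercive since $z^TG_0z=0$ for all $z$. The value of `beta`, the random test points, and the seed are assumptions for illustration; for this particular operator a direct calculation gives $G_\beta=(\beta I+G_0)/(1+\beta^2)$.

```python
import numpy as np

beta = 0.3  # assumed regularization parameter
G0 = np.array([[0.0, 1.0], [-1.0, 0.0]])  # monotone but not cocoercive
I = np.eye(2)

# For a linear operator the resolvent (I + beta G)^{-1} is a matrix inverse.
G_beta = (I - np.linalg.inv(I + beta * G0)) / beta

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    z1, z2 = rng.standard_normal(2), rng.standard_normal(2)
    d = G_beta @ (z1 - z2)
    # Cocoercivity gap of (3): inner product minus beta times squared norm.
    gaps.append(d @ (z1 - z2) - beta * (d @ d))
min_gap = min(gaps)  # nonnegative up to rounding error if (3) holds
print(min_gap)
```

In this example (3) holds with equality, so the computed gaps are zero up to floating-point rounding.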

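We reparameterize the dynamics with $z(t)=(I+\beta G)^{-1}(\zeta(t))$ and $g(t)=G(z(t))$ to get $\zeta(t)=z(t)+\beta g(t)$ and

$$\dot z(t)+\beta\dot g(t)=\dot\zeta(t)=-\frac{\alpha}{\beta}(\zeta(t)-z(t))=-\alpha g(t).$$

This gives us $\dot z(t)=-\alpha g(t)-\beta\dot g(t)$.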

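We now investigate convergence. Let $z^\star$ satisfy $G(z^\star)=0$ (and therefore $G_\beta(z^\star)=0$). Then

$$\begin{aligned}\frac{d}{dt}\frac12\|\zeta(t)-z^\star\|^2&=(\zeta(t)-z^\star)^T\dot\zeta(t)=-\alpha(\zeta(t)-z^\star)^TG_\beta(\zeta(t))\\&\le-\alpha\beta\|G_\beta(\zeta(t))\|^2,\end{aligned}$$

where we use cocoercivity, (3). This translates to

$$\frac{d}{dt}\frac12\|z(t)+\beta g(t)-z^\star\|^2\le-\alpha\beta\|g(t)\|^2.\tag{4}$$

The quantity $\|g(t)\|^2$ is nonincreasing since

$$\begin{aligned}\frac{d}{dt}\frac12\|g(t)\|^2&=-\frac1\alpha\dot\zeta(t)^T\dot g(t)=-\frac1\alpha\lim_{h\to0}\frac{1}{h^2}(\zeta(t+h)-\zeta(t))^T(G_\beta(\zeta(t+h))-G_\beta(\zeta(t)))\\&\le-\frac\beta\alpha\lim_{h\to0}\frac{1}{h^2}\|G_\beta(\zeta(t+h))-G_\beta(\zeta(t))\|^2=-\frac\beta\alpha\|\dot g(t)\|^2\le0,\end{aligned}$$

where we use $g(t)=G_\beta(\zeta(t))$ and cocoercivity, (3). Finally, integrating (4) on both sides gives us

$$\frac12\|z(t)+\beta g(t)-z^\star\|^2-\frac12\|z(0)+\beta g(0)-z^\star\|^2\le-\alpha\beta\int_0^t\|g(s)\|^2\,ds\le-\alpha\beta t\|g(t)\|^2$$

and therefore

$$\|g(t)\|^2\le\frac{1}{2\alpha\beta t}\|z(0)+\beta g(0)-z^\star\|^2.$$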

This analysis was inspired by [3, 7]: Attouch et al. [3] studied continuous-time dynamics with Moreau–Yosida regularization and Csetnek et al. [7] interpreted a forward-backward-forward-type method as a discretization of continuous-time dynamics with the Douglas–Rachford operator.

##### Other interpretations of optimism.

Daskalakis et al. interpret optimism as augmenting “follow the regularized leader” with the (optimistic) prediction that the next gradient will be the same as the current gradient in the online learning setup [8]. Peng et al. interpret optimism as “centripetal acceleration” [50] but do not provide a formal analysis with differential equations.

### 3.2 Discrete-time convergence analysis

The discrete-time method SimGD-O converges under the assumption

 L is differentiable and ∇L is R-Lipschitz continuous. (A3)
###### Theorem 2.

Assume (A0) and (A3). If and , then SimGD-O converges in the sense of

 mini=0,…,k∥G(zk)∥2≤1+β2α2R2α2(β−1/2−2β2αR)k∥z0+βG(z0)−z⋆∥2.

Furthermore, , where is a saddle point of .

The proof can be considered a discretization of the continuous-time analysis.

##### Related prior work.

Peng et al. [50] show convergence of simGD-O for and bilinear . Malitsky et al. [34, 7] show convergence of simGD-O when and convex-concave . Theorem 2 establishes convergence and is convex-concave.

### 3.3 Difficulty with stochastic gradients

Training in machine learning usually relies on stochastic gradients rather than full gradients. We can consider a stochastic variation of SimGD-O:

 zk+1 =zk−αkg(zk;ωk)−βk(g(zk;ωk)−g(zk−1;ωk−1)) (SimGD-OS)

with learning rate and optimism rate .

Figure 2 presents experiments of SimGD-OS on a simple bilinear problem. The choice where does not lead to convergence. Discretizing with a diminishing step leads to the choice and , but this choice as well does not converge. Rather, both and must be diminishing and , i.e., must diminish faster than for convergence. Rather, it is necessary to tune and separately as in Theorem 2 to obtain convergence and dynamics appear to be sensitive to the choice of and . One explanation of this difficulty is that the finite difference approximation is unreliable when using stochastic gradients.

Whether the observed convergence holds generally in the nonlinear convex-concave setup and whether optimism is compatible with subgradients is unclear. This motivates anchoring of the following section which is provably compatible with stochastic subgradients.

##### Related prior work.

Gidel et al. [17] show that averaged iterates of SimGD-OS converge if iterates are projected onto a compact set. Mertikopoulos et al. [36] show almost sure convergence of SimGD-OS under strict convex-concavity. However, such analyses do not provide a compelling reason to use optimism since SimGD without optimism already converges under these setups.

## 4 Simultaneous GD with anchoring

Consider the setup of Section 3. We propose Anchored Simultaneous Gradient Descent

$$z_{k+1}=z_k-\frac{1-p}{(k+1)^p}G(z_k)+\frac{(1-p)\gamma}{k+1}(z_0-z_k)\tag{SimGD-A}$$

for , where is a starting point, , and is the anchor rate. The last term, the anchoring term, was inspired by Halpern’s method [21, 65, 29] and the James–Stein estimator [61, 23]. In this section, we provide a continuous-time illustration of SimGD-A and establish convergence for both the deterministic and stochastic setups.
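To see the anchoring term at work, the sketch below applies the update with step size $(1-p)/(k+1)^p$ and anchor weight $(1-p)\gamma/(k+1)$ to the bilinear example of (2). The function name `simgd_a`, the values $p=0.6$ and $\gamma=1$, the starting point, and the iteration count are assumptions chosen for illustration, not constants from a theorem. Setting `gamma=0` removes the anchor and leaves simGD with the same diminishing steps.

```python
import numpy as np

G0 = np.array([[0.0, 1.0], [-1.0, 0.0]])  # bilinear example from (2)

def simgd_a(z0, p, gamma, num_steps):
    """Anchored simGD on G(z) = G0 z; gamma = 0 removes the anchoring term."""
    z0 = np.asarray(z0, dtype=float)
    z = z0.copy()
    for k in range(num_steps):
        step = (1.0 - p) / (k + 1) ** p        # diminishing step size
        anchor = (1.0 - p) * gamma / (k + 1)   # diminishing pull toward z0
        z = z - step * (G0 @ z) + anchor * (z0 - z)
    return z

z0 = np.array([1.0, 1.0])
norm_anchored = np.linalg.norm(simgd_a(z0, p=0.6, gamma=1.0, num_steps=100_000))
norm_plain = np.linalg.norm(simgd_a(z0, p=0.6, gamma=0.0, num_steps=100_000))
print(norm_anchored, norm_plain)
```

Without the anchor, every step on this example multiplies the norm by a factor of at least one, so the iterate never approaches the saddle point; with the anchor, the last iterate slowly moves toward it.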

### 4.1 Continuous-time illustration

Consider the continuous-time dynamics

$$\dot z(t)=-g(t)+\frac{\gamma}{t}(z_0-z(t)),$$

for , where and . We obtain SimGD-A by discretizing this ODE with diminishing steps .

Define . Then

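$$0\le\frac{1}{h^2}\langle z(t+h)-z(t),g(t+h)-g(t)\rangle\to\langle\dot z(t),\dot g(t)\rangle\quad\text{as }h\to0.$$

Using this, we have

$$\begin{aligned}\frac{d}{dt}\frac12\|\dot z(t)\|^2&=-\left\langle\dot z(t),\dot g(t)+\frac{\gamma}{t}\dot z(t)+\frac{\gamma}{t^2}(z_0-z(t))\right\rangle\\&=-\langle\dot z(t),\dot g(t)\rangle-\frac{\gamma}{t}\|\dot z(t)\|^2+\frac{\gamma}{t^2}\langle z(t)-z_0,\dot z(t)\rangle\\&\le-\frac{\gamma}{t}\|\dot z(t)\|^2+\frac{\gamma}{t^2}\langle z(t)-z_0,\dot z(t)\rangle.\end{aligned}$$

Using $\gamma\ge1$, we have

$$\frac{d}{dt}\frac12\|\dot z(t)\|^2+\frac1t\|\dot z(t)\|^2\le\frac{\gamma}{t^2}\langle z(t)-z_0,\dot z(t)\rangle.$$

Multiplying by $t^2$ and integrating both sides gives us

$$\frac{t^2}{2}\|\dot z(t)\|^2\le\frac\gamma2\|z(t)-z_0\|^2.$$

Substituting $t\dot z(t)=-tg(t)+\gamma(z_0-z(t))$ and reorganizing, we get

$$\frac{t^2}{2}\|g(t)\|^2-\gamma t\langle g(t),z_0-z(t)\rangle+\frac{\gamma^2}{2}\|z(t)-z_0\|^2\le\frac\gamma2\|z(t)-z_0\|^2.$$

Using $\gamma\ge1$, the monotonicity inequality, and Young’s inequality, we get

$$\|g(t)\|^2\le\frac{2\gamma}{t}\langle g(t),z_0-z(t)\rangle\le\frac{2\gamma}{t}\langle g(t),z_0-z^\star\rangle\le\frac12\|g(t)\|^2+\frac{2\gamma^2}{t^2}\|z_0-z^\star\|^2$$

and conclude

$$\|g(t)\|^2\le\frac{4\gamma^2}{t^2}\|z_0-z^\star\|^2.$$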

Interestingly, anchoring leads to a faster rate $O(1/t^2)$ compared to the rate $O(1/t)$ of optimism in continuous time. The discretized method, however, is not faster than $O(1/k)$.
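As a worked example (our own illustration, assuming $\gamma=1$ and the bilinear operator $G_0$ of (2)), the anchored dynamics can be solved in closed form. Identify $z=(x,u)$ with the complex number $c=x+iu$, so that $G_0$ acts as multiplication by $-i$ and the saddle point is $c^\star=0$. The anchored dynamics then read $\dot c(t)=ic(t)+\frac1t(c_0-c(t))$, and

$$\begin{aligned}\frac{d}{dt}\big(t\,c(t)\big)&=c(t)+t\,\dot c(t)=i\,\big(t\,c(t)\big)+c_0,\\ t\,c(t)&=\frac{e^{it}-1}{i}\,c_0,\\ \|g(t)\|^2=|c(t)|^2&=\frac{4\sin^2(t/2)}{t^2}\,|c_0|^2\le\frac{4}{t^2}\|z_0-z^\star\|^2.\end{aligned}$$

In this example the bound on $\|g(t)\|^2$ derived above is attained whenever $\sin^2(t/2)=1$, so the $1/t^2$ dependence cannot be improved for these dynamics.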

### 4.2 Discrete-time convergence analysis and compatibility with stochastic subgradients

###### Theorem 3.

Assume (A0) and (A3). If and , then SimGD-A converges in the sense of

 ∥G(zk)∥2≤Ck2−2p+O(1k)

for for some .

The constant is computable, although it is complicated. The proof can be considered a discretization of the continuous-time analysis.

Consider the setup of Section 2. We propose Anchored Simultaneous Stochastic SubGradient Descent

$$z_{k+1}=z_k-\frac{1-p}{(k+1)^p}g(z_k;\omega_k)+\frac{(1-p)\gamma}{(k+1)^{1-\varepsilon}}(z_0-z_k)\tag{SSSGD-A}$$

###### Theorem 4.

Assume (A0) and (A2). If , , and , then SSSGD-A converges in the sense of , where is a saddle point.

To the best of our knowledge, Theorem 4 is the first result establishing last-iterate convergence for convex-concave cost functions using stochastic subgradients.

## 5 Experiments

In this section, we experimentally demonstrate the effectiveness of optimism and anchoring for training GANs. We train Wasserstein-GANs [2] with gradient penalty [20] on the MNIST and CIFAR-10 datasets and plot the Fréchet Inception Distance (FID) [22, 30]. The experiments were implemented in PyTorch [49]. We combine Adam with optimism and anchoring (described precisely in Appendix F) and compare it against the baseline Adam optimizer [25]. The generator and discriminator architectures and the hyperparameters are described in Appendix F. For optimistic and anchored Adam, we roughly tune the optimism and anchor rates and show the curve corresponding to the best parameter choice. Figure 3 shows an ensemble of samples generated at the end of the training period.

Figure 4 shows that the MNIST setup benefits from anchoring but not from optimism, while the CIFAR-10 setup benefits from optimism but not from anchoring. We leave comparing the effects of optimism and anchoring in practical GAN training (where the cost function is not convex-concave) as a topic of future work.

## 6 Conclusion

In this work, we analyzed the convergence of SSSGD, Optimistic simGD, and Anchored SSSGD. Under the assumption that the cost is convex-concave, Anchored SSSGD provably converges under the most general setup. Through experiments, we showed that practical GAN training benefits from optimism and anchoring in some (but not all) setups.

Generalizing these results to accommodate projections and proximal operators, analogous to projected and proximal gradient methods, is an interesting direction of future work. Weight clipping [2] and spectral normalization [41] are instances where projections are used in training GANs.

#### Acknowledgments

We thank Yura Malitsky and Matthew Tam for the discussion on the reflection mechanism, which inspired our work of Section 3. We thank Adrien Taylor who brought to our attention recent work on the Halpern iteration, which inspired our work of Section 4. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This work was partially supported by AFOSR MURI FA9550-18-10502, NSF DMS-1720237, and ONR N0001417121.

## References

• [1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. ICLR, 2017.
• [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. ICML, pages 214–223, 2017.
• [3] H. Attouch and J. Peypouquet. Convergence of inertial dynamics and proximal algorithms governed by maximally monotone operators. Math. Program., 174(1):391–432, 2019.
• [4] J. P. Aubin and A. Cellina. Differential Inclusions: Set-Valued Maps and Viability Theory. Springer-Verlag, 1984.
• [5] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, 2nd edition, 2017.
• [6] C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. COLT, 2012.
• [7] E. R. Csetnek, Y. Malitsky, and M. K. Tam. Shadow Douglas–Rachford splitting for monotone inclusions. arXiv:1903.03393, 2019.
• [8] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. ICLR, 2018.
• [9] C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization. NeurIPS, 2018.
• [10] C. Daskalakis and I. Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. ITCS, 2019.
• [11] A. Dembo. Lecture notes on probability theory: Stanford statistics 310. Accessed: 2019-05-10.
• [12] J. Duchi and F. Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM J. Optim., 28(4):3229–3259, 2018.
• [13] H. Edwards and A. Storkey. Censoring representations with an adversary. ICLR, 2016.
• [14] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. ICML, 2015.
• [15] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
• [16] I. Gemp and S. Mahadevan. Global convergence to the equilibrium of GANs using variational inequalities. arXiv:1808.01531, 2018.
• [17] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. ICLR, 2019.
• [18] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv:1701.00160, 2016.
• [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NeurIPS, 2014.
• [20] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. NeurIPS, 2017.
• [21] B. Halpern. Fixed points of nonexpanding maps. Bull. Amer. Math. Soc., 73(6):957–961, 1967.
• [22] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017.
• [23] W. James and C. Stein. Estimation with quadratic loss. In J. Neyman, editor, Proc. Fourth Berkeley Symp. Math. Statist. Prob, volume 1, pages 361–379, 1961.
• [24] A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst., 1(1):17–58, 2011.
• [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
• [26] N. N. Krasovskii. Some Problems in the Theory of Motion Stability. Fizmatgiz, Moscow, 1959.
• [27] J. LaSalle. Some extensions of Liapunov’s second method. IRE Trans. Circuit Theory, 7(4):520–527, 1960.
• [28] T. Liang and J. Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. AISTATS, 2019.
• [29] F. Lieder. On the convergence rate of the Halpern-iteration. Optimization Online:2017-11-6336, 2017.
• [30] M. Lucic, K. Kurach, M. Michalski, O. Bousquet, and S. Gelly. Are GANs created equal? A large-scale study. NeurIPS, 2018.
• [31] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. ICLR, 2016.
• [32] Y. Malitsky. Projected reflected gradient methods for monotone variational inequalities. SIAM J. Optim., 25(1):502–520, 2015.
• [33] Y. Malitsky. Golden ratio algorithms for variational inequalities. arXiv:1803.08832, 2018.
• [34] Y. Malitsky and M. K. Tam. A forward-backward splitting method for monotone inclusions without cocoercivity. arXiv:1808.04162, 2018.
• [35] Y. V. Malitsky and V. V. Semenov. An extragradient algorithm for monotone variational inequalities. Cybern. Syst. Anal., 50(2):271–277, 2014.
• [36] P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. ICLR, 2019.
• [37] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? ICML, 2018.
• [38] L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. NeurIPS, 2017.
• [39] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. ICLR, 2017.
• [40] G. J. Minty. Monotone (nonlinear) operators in Hilbert space. Duke Math. J., 29(3):341–346, 1962.
• [41] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. ICLR, 2018.
• [42] A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv:1901.08511, 2019.
• [43] J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.
• [44] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. NeurIPS, 2011.
• [45] V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. NeurIPS, 2017.
• [46] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
• [47] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
• [48] A. Odena. Open questions about generative adversarial networks (online article). Accessed: 2019-05-10.
• [49] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. NeurIPS Autodiff Workshop, 2017.
• [50] W. Peng, Y. Dai, H. Zhang, and L. Cheng. Training GANs with centripetal acceleration. arXiv:1902.08949, 2019.
• [51] D. Pfau and O. Vinyals. Connecting generative adversarial networks and actor-critic methods. NeurIPS Workshop on Adversarial Training, 2016.
• [52] B. T. Polyak. Introduction to optimization. Optimization Software, 1987.
• [53] L. D. Popov. A modification of the Arrow–Hurwicz method for search of saddle points. Mat. Zametki, 28(5):777–784, 1980.
• [54] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, 2016.
• [55] A. Rakhlin and K. Sridharan. Online learning with predictable sequences. COLT, 2013.
• [56] A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. NeurIPS, 2013.
• [57] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Jagdish S. Rustagi, editor, Optimizing Methods in Statistics, pages 233–257. Academic Press, 1971.
• [58] R. T. Rockafellar. Monotone operators associated with saddle-functions and minimax problems. In F. E. Browder, editor, Nonlinear Functional Analysis, Part 1, volume 18 of Proceedings of Symposia in Pure Mathematics, pages 241–250. American Mathematical Society, 1970.
• [59] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. NeurIPS, 2017.
• [60] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. ICLR, 2017.
• [61] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In J. Neyman, editor, Proc. Third Berkeley Symp. Math. Statist. Prob., volume 1, pages 197–206, 1956.
• [62] V. Syrgkanis, A. Agarwal, H. Luo, and R. E. Schapire. Fast convergence of regularized learning in games. NeurIPS, 2015.
• [63] A. Taylor and F. Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. COLT, 2019.
• [64] X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. ICLR, 2018.
• [65] R. Wittmann. Approximation of fixed points of nonexpansive mappings. Arch. Math., 58(5):486–491, 1992.
• [66] Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectiveness of averaging in GAN training. ICLR, 2019.
• [67] K. Yosida. On the differentiability and the representation of one-parameter semi-group of linear operators. J. Math. Soc. Japan, 1(1):15–21, 1948.

## Appendix A Notation and preliminaries

Write $\mathbb{R}_+$ to denote the set of nonnegative real numbers and $\langle\cdot,\cdot\rangle$ to denote inner product, i.e., $\langle u,v\rangle=u^Tv$ for $u,v\in\mathbb{R}^d$.

We say is a point-to-set mapping on if maps points of to subsets of . For notational simplicity, we write

 ⟨A(x)−A(y),x−y⟩={⟨u−v,x−y⟩|u∈A(x),v∈A(y)}.

Using this notation, we define monotonicity of with

 ⟨A(x)−A(y),x−y⟩≥0∀x,y∈Rd,

where the inequality requires every member of the set to be nonnegative. We say a monotone operator is maximal if there is no other monotone operator such that the containment

 {(x,u)|u∈A(x)}⊂{(x,u)|u∈B(x)}

is proper. If is convex-concave, then the subdifferential operator

 G(x,u)=[∂xL(x,u)∂u(−L)(x,u)]

is maximal monotone [58]. By [5, Proposition 20.36], is closed-convex for any . By [5, Proposition 20.38(iii)] maximal monotone operators are upper semicontinuous in the sense that if is maximal monotone, then for and imply . (In other words, the graph of is closed.) Define , which is the set of saddle-points or equilibrium points. When is maximal monotone, is a closed convex set. Write

 PZer(G)(z0)=argminz∈Zer(G)∥z−z0∥

for the projection onto .

Write for the space of -valued continuous functions on . For , we say in if uniformly on bounded intervals, i.e., for all , we have

 limn→∞supt∈[0,T]∥fn(t)−f(t)∥=0.

In other words, we consider the topology of uniform convergence on compact sets.

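We rely on the following inequalities, which hold for any $a,b\in\mathbb{R}^d$ and $\varepsilon>0$.

$$\|a+b\|^2\le2\|a\|^2+2\|b\|^2\tag{5}$$

$$\langle a,b\rangle\le\frac{1}{2\varepsilon}\|a\|^2+\frac{\varepsilon}{2}\|b\|^2.\tag{6}$$

In particular, (6) is called Young’s inequality.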

###### Lemma 1 (Theorem 5.3.33 [11]).

Let be a martingale such that

 E[∥mk∥2]<∞

for all and

 ∞∑k=0E[∥mk+1−mk∥2|Fk]<∞

then converges almost surely to a limit.

###### Lemma 2 (Robbins–Siegmund [57]).

Let , , , and be nonnegative -measurable random sequences satisfying

 EkVk+1≤(1+βk)Vk−Sk+Uk.

If, furthermore,

 ∞∑k=1βk<∞,∞∑k=1Uk<∞

holds almost surely, then

 Vk→V∞,Sk→0

almost surely, where is a random limit.

Define

 ~G(z)=Eωg(z;ω)∈G(z).

Note that is possible even if when is not continuously differentiable.

###### Lemma 3.

Under Assumptions (A0) and (A2), we have

 Eω∥g(z;ω)∥2≤R23∥z−z⋆∥2+R24

for some and .

###### Proof.

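Let $z^\star$ be a saddle point, which exists by Assumption (A0). Let $\omega$ and $\omega'$ be independent and identically distributed. Then

$$\begin{aligned}\mathbb{E}_\omega\|g(z;\omega)\|^2&\le\mathbb{E}_\omega\|g(z;\omega)\|^2+\mathbb{E}_{\omega'}\|g(z^\star;\omega')-\tilde G(z^\star)\|^2\\&=\mathbb{E}_{\omega,\omega'}\|g(z;\omega)-g(z^\star;\omega')+\tilde G(z^\star)\|^2\\&\le\mathbb{E}_{\omega,\omega'}2\|g(z;\omega)-g(z^\star;\omega')\|^2+2\|\tilde G(z^\star)\|^2\\&\le2R_1^2\|z-z^\star\|^2+2R_2^2+2\|\tilde G(z^\star)\|^2\end{aligned}$$

where we use the fact that $g(z^\star;\omega')-\tilde G(z^\star)$ is a zero-mean random variable independent of $\omega$, Assumption (A2), and (5). The stated result holds with $R_3^2=2R_1^2$ and $R_4^2=2R_2^2+2\|\tilde G(z^\star)\|^2$. ∎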

## Appendix B Proof of Theorem 1

Consider the differential inclusion

 ˙z(t)∈−G(z(t)) (7)

with the initial condition . We say satisfies (7) if there is a Lebesgue integrable such that

 z(t)=z0+∫t0ζ(s)ds,ζ(t)∈−G(z(t)),∀t≥0. (8)
###### Lemma 4 (Theorem 5.2.1 [4]).

If is maximal monotone, the solution to (7) exists and is unique. Furthermore, is -Lipschitz continuous for all .

Write and call the time evolution operator. In other words, maps the initial condition of the differential inclusion to the point at time .

###### Lemma 5 (LaSalle–Krasovskii).

If satisfies (7), then as and .

The following proof can be considered an adaptation of the LaSalle–Krasovskii invariance principle [26, 27] to the setup of differential inclusions. The standard result applies to differential equations.

###### Proof.

Consider any , which exists by Assumption (A0). Since is absolutely continuous, so is , and we have

 ddt12∥z(t)−z⋆∥2=⟨ζ(t),z(t)−z⋆⟩≤0

for almost all , where is as defined in (8) and the inequality follows from (1), monotonicity of . Therefore, is a nonincreasing function of . Therefore is bounded and

 limt→∞∥z(t)−z⋆∥=χ

for some limit since nonincreasing lower-bounded sequences have limits.

Let such that , i.e., is a limit point of . Then, . Since (with fixed ) is continuous by Lemma 4, we have

 limk→∞ϕs+tk(z0)=ϕs(ϕtk(z(0)))→ϕs(z∞)

for all . This means is also a limit point of and

 ∥ϕs(z∞)−z⋆∥=χ

for all . Therefore

 0=dds∥ϕs(z∞)−z⋆∥2∈−⟨G(ϕs(z∞)),ϕs(z∞)−z⋆⟩ (9)

for all .

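Write $z_\infty=(x_\infty,u_\infty)$ and let $z^\star=(x^\star,u^\star)$. If $x_\infty\ne x^\star$, then

$$\langle G(x_\infty,u_\infty),(x_\infty,u_\infty)-(x^\star,u^\star)\rangle>0$$

by strict convexity. In light of (9), we conclude $x_\infty=x^\star$.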

Write . Then

 0 ∈⟨G(ϕs(z∞)),ϕs(z∞)−z⋆⟩ =⟨∂u(−L)(x⋆,ϕus(z∞))−∂u(−L)(x⋆,ϕs(z⋆)),ϕus(z∞)−u⋆⟩ ≥−L(x⋆,ϕus(z∞))+L(x⋆,u⋆) ≥0,

where the first inequality follows from concavity of in and the second inequality follows from the fact that is a maximizer when is fixed. Therefore, we have equality throughout, and , i.e., also maximizes and is a solution.

Finally, since is a solution, converges to a limit as . Since , we conclude that as . ∎

###### Lemma 6 (Theorem 3.7 of [12]).

Consider the update

 zk+1=zk−αk(ζk+ξk),ζk∈G(zk).

Define and

 x(t)=xk+t−tktk+1−tk(xk+1−xk),t∈[tk,tk+1).

Define the time-shifted process

 xτ(⋅)=x(τ+⋅).

Let the following conditions hold:

• The iterates are bounded, i.e., and .

• The stepsizes satisfy Assumption (A1).

• The weighted noise sequence converges: for some .

• For any increasing sequence such that , we have

 limn→∞dist(1mm∑k=1ζnk,G(z∞))=0.

Then for any sequence , the sequence of functions is relatively compact in . If , all limit points of satisfy the differential inclusion (8).

We verify the conditions of Lemma 6 and make the argument that the noisy discrete time process is close to the noiseless continuous time process and the two processes converge to the same limit.

##### Verifying conditions of Lemma 6.

Condition (i). Let . Write