ODE Analysis of Stochastic Gradient Methods with
Optimism and Anchoring
for
Minimax Problems and GANs
Abstract
Despite remarkable empirical success, the training dynamics of generative adversarial networks (GANs), which involve solving a minimax game using stochastic gradients, are still poorly understood. In this work, we analyze last-iterate convergence of simultaneous gradient descent (simGD) and its variants under the assumption of convex-concavity, guided by a continuous-time analysis with differential equations. First, we show that simGD, as is, converges with stochastic subgradients under strict convexity in the primal variable. Second, we generalize optimistic simGD to accommodate an optimism rate separate from the learning rate and show its convergence with full gradients. Finally, we present anchored simGD, a new method, and show its convergence with stochastic subgradients.
Ernest K. Ryu (Department of Mathematics, UCLA), Kun Yuan (Electrical and Computer Engineering, UCLA), Wotao Yin (DAMO Academy, Alibaba US). Emails: eryu@math.ucla.edu, kunyuan@ucla.edu, wotao.yin@alibaba-inc.com
1 Introduction
Training generative adversarial networks (GANs) [19], which amounts to solving a minimax game using stochastic gradients, is known to be difficult. Despite the remarkable empirical success of GANs, further understanding their global training dynamics, empirically and theoretically, is considered a major open problem [18, 54, 39, 37, 48].
The local training dynamics of GANs are understood reasonably well. Several works have analyzed convergence assuming the loss functions have linear gradients and the training uses full (deterministic) gradients. Although the linear-gradient assumption is reasonable for local analysis (even though the loss functions may not be continuously differentiable due to ReLU activation functions), such results say very little about global convergence. Likewise, although the full-gradient assumption is reasonable when the learning rate is small, such results say very little about how randomness affects the training.
This work investigates global convergence of simultaneous gradient descent (simGD) and its variants for zero-sum games with a convex-concave cost using stochastic subgradients. We specifically study convergence of the last iterates rather than the averaged iterates.
Section 2 presents convergence of simGD with stochastic subgradients under strict convexity in the primal variable. The goal is to establish a minimal sufficient condition for global convergence of simGD without modifications. Section 3 presents a generalization of optimistic simGD [8], which allows an optimism rate separate from the learning rate. We prove that the generalized optimistic simGD using full gradients converges, and experimentally demonstrate that the optimism rate must be tuned separately from the learning rate when using stochastic gradients; however, it is unclear whether optimistic simGD is theoretically compatible with stochastic gradients. Section 4 presents anchored simGD, a new method, and establishes its convergence with stochastic subgradients. The presentation and analyses of Sections 2, 3, and 4 are guided by continuous-time first-order ordinary differential equations (ODEs). In particular, we interpret optimism and anchoring as discretizations of certain regularized dynamics. Section 5 experimentally demonstrates the benefit of optimism and anchoring for training GANs in some setups.
Prior work.
There are several independent directions for improving the training of GANs such as designing better architectures, choosing good loss functions, or adding appropriate regularizers [54, 2, 60, 1, 20, 64, 59, 37, 38, 41]. In this work, we accept these factors as a given and focus on how to train (optimize) the model effectively.
Optimism is a simple modification that remedies the cycling behavior of simGD, which can occur even in the bilinear convex-concave setup [8, 9, 10, 36, 17, 28, 42, 50]. These prior works assume the gradients are linear and use full gradients. Although the recent name 'optimism' originates from its use in online optimization [6, 55, 56, 62], the idea dates back to Popov's work in the 1980s [53] and has been studied independently in the mathematical programming community [35, 32, 34, 33, 7].
The classical literature analyzes convergence of the Polyak-averaged iterates (which assign less weight to newer iterates) when solving convex-concave saddle-point problems using stochastic subgradients [47, 46, 24, 17]. For GANs, however, the last iterates or exponentially averaged iterates [66] (which assign more weight to newer iterates) are used in practice. Therefore, the classical work using Polyak averaging does not fully explain the empirical success of GANs.
The classical techniques used in our analyses, namely the stochastic approximation technique [12, 22], ideas from control theory [22, 45], ideas from variational inequalities and monotone operator theory [16, 17], and continuous-time ODE analysis [22, 7], have previously been utilized for analyzing GANs.
2 Stochastic simultaneous subgradient descent
Consider a cost function $f:\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$ and the minimax game $\min_{x}\max_{u} f(x,u)$. We say $(x^\star,u^\star)$ is a solution to the minimax game, or a saddle point of $f$, if
$$f(x^\star,u)\le f(x^\star,u^\star)\le f(x,u^\star)\qquad\text{for all } x\in\mathbb{R}^m,\ u\in\mathbb{R}^n.$$
We assume
(A0) 
By convex-concave, we mean $f(x,u)$ is a convex function of $x$ for fixed $u$ and a concave function of $u$ for fixed $x$. Define
$$G(x,u) = \begin{bmatrix}\partial_x f(x,u)\\ \partial_u (-f)(x,u)\end{bmatrix},$$
where $\partial_x$ and $\partial_u$ respectively denote the convex subdifferential with respect to $x$ and $u$. For simplicity, write $z=(x,u)$ and $G(z)=G(x,u)$. Note that $0\in G(z)$ if and only if $z$ is a saddle point. Since $f$ is convex-concave, the operator $G$ is monotone [58]:
$$\langle g - g',\; z - z'\rangle \ge 0\qquad\text{for all } g\in G(z),\ g'\in G(z'). \tag{1}$$
Let $g$ be a stochastic subgradient oracle, i.e., $\mathbb{E}_\omega\, g(z;\omega)\in G(z)$ for all $z$, where $\omega$ is a random variable. Consider Simultaneous Stochastic SubGradient Descent
$$z^{k+1} = z^k - \alpha_k\, g(z^k;\omega^k) \qquad\text{(SSSGD)}$$
for $k=0,1,\dots$, where $z^0$ is a starting point, $\alpha_0,\alpha_1,\dots$ are positive learning rates, and $\omega^0,\omega^1,\dots$ are IID random variables. (We read SSSGD as "triple-SGD".) In this section, we provide convergence of SSSGD when $f(x,u)$ is strictly convex in $x$.
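As a concrete illustration (not one of the paper's experiments), the sketch below runs the SSSGD update on a hypothetical cost $f(x,u) = x^2/2 + xu$, which is strictly convex in $x$, with a noisy gradient oracle and learning rates $\alpha_k = k^{-0.6}$ satisfying the usual summability conditions. All parameter values are illustrative assumptions.

```python
import numpy as np

def G(z):
    # Operator of the toy cost f(x, u) = x**2/2 + x*u (strictly convex in x):
    # G(x, u) = (df/dx, -df/du) = (x + u, -x); the unique saddle point is (0, 0).
    x, u = z
    return np.array([x + u, -x])

def sssgd(z0, steps=50_000, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for k in range(1, steps + 1):
        alpha = k ** -0.6                           # sum(alpha) = inf, sum(alpha**2) < inf
        g = G(z) + noise * rng.standard_normal(2)   # stochastic (sub)gradient oracle
        z = z - alpha * g
    return z

z_final = sssgd([1.0, 1.0])   # the last iterate approaches the saddle point (0, 0)
```

With strict convexity in the primal variable, the last iterate itself settles near the saddle point, which is the behavior Theorem 1 formalizes.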
2.1 Continuous-time illustration
To understand the asymptotic dynamics of the stochastic discrete-time system, we consider a corresponding deterministic continuous-time system. For simplicity, assume $G$ is single-valued and smooth. Consider
$$\dot z(t) = -G(z(t))$$
with an initial value $z(0)=z^0$. Let $z^\star$ be a saddle point, i.e., $G(z^\star)=0$. Then $z(t)$ does not move away from $z^\star$:
$$\frac{d}{dt}\,\frac12\|z(t)-z^\star\|^2 = -\langle G(z(t))-G(z^\star),\; z(t)-z^\star\rangle \le 0,$$
where we used (1). However, there is no mechanism forcing $z(t)$ to converge to a solution.
Consider the two examples and with
(2) 
where and and . Note that the bilinear example is the canonical counterexample that also arises as the DiracGAN [37]. See Figure 1.
The classical LaSalle–Krasovskii invariance principle [26, 27] states that (paraphrased) if is a limit point of , then the dynamics starting at will have a constant distance to . On the left of Figure 1, we can see that is constant as for all . On the right of Figure 1, we can see that although when for (the dotted line), this derivative is only temporary, as will soon move past the dotted line. Therefore, can maintain a constant distance to only if it starts at , and is the only limit point of .
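The failure of plain simGD on the bilinear example is easy to verify numerically. The sketch below (an illustration, assuming the bilinear cost $f(x,u) = xu$ discussed above) runs simGD with a constant learning rate; each step multiplies the distance to the origin by exactly $\sqrt{1+\alpha^2} > 1$, so the iterates spiral outward instead of converging.

```python
import numpy as np

def simgd_bilinear(z0, alpha=0.1, steps=100):
    # simGD on f(x, u) = x*u: the operator G(x, u) = (u, -x) is a pure rotation field.
    z = np.asarray(z0, dtype=float)
    norms = [np.linalg.norm(z)]
    for _ in range(steps):
        x, u = z
        z = z - alpha * np.array([u, -x])
        norms.append(np.linalg.norm(z))
    return norms

norms = simgd_bilinear([1.0, 0.0])
# ||z_{k+1}||^2 = (1 + alpha**2) * ||z_k||^2, so the norm grows geometrically.
```

The continuous-time flow merely circles the solution at constant distance; the explicit Euler discretization turns those circles into an outward spiral.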
2.2 Discrete-time convergence analysis
Consider the further assumptions
(A1) $\alpha_k > 0$ for all $k$, $\displaystyle\sum_{k=0}^{\infty} \alpha_k = \infty$, and $\displaystyle\sum_{k=0}^{\infty} \alpha_k^2 < \infty$.
(A2) 
where and are independent random variables and and . These assumptions are standard in the sense that analogous assumptions are used in convex minimization to establish almost sure convergence of stochastic gradient descent.
Theorem 1.
We can alternatively assume $f(x,u)$ is strictly concave in $u$ for all $x$ and obtain the same result.
The proof uses the stochastic approximation technique of [12]. We show that the discrete-time process converges (in an appropriate topology) to continuous-time trajectories satisfying a differential inclusion and use the LaSalle–Krasovskii invariance principle to argue that limit points are solutions.
Related prior work.
Theorem 3.1 of [36] proves a similar convergence result under the stronger assumption of strict convex-concavity in both the primal and dual variables, for the more general mirror-descent setup.
3 Simultaneous GD with optimism
Consider the setup where is continuously differentiable and we access full (deterministic) gradients
Consider Optimistic Simultaneous Gradient Descent
$$z^{k+1} = z^k - \alpha\, G(z^k) - \beta\left(G(z^k) - G(z^{k-1})\right) \qquad\text{(SimGDO)}$$
for $k=1,2,\dots$, where $z^0$ is a starting point, $\alpha>0$ is the learning rate, and $\beta>0$ is the optimism rate. Optimism is a modification of simGD that remedies the cycling behavior; for the bilinear example of (2), simGD (the case $\beta=0$) diverges while SimGDO with appropriate $\alpha$ and $\beta$ converges. In this section, we provide a continuous-time interpretation of SimGDO as a regularized dynamics and provide convergence for the deterministic setup.
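As an illustrative sketch (with hypothetical parameter values, the classical choice of optimism rate equal to the learning rate, and the convention that the first correction term is zero), the optimistic correction $-\beta\,(G(z^k)-G(z^{k-1}))$ turns the divergent bilinear example into a convergent one:

```python
import numpy as np

def G(z):
    x, u = z
    return np.array([u, -x])   # bilinear cost f(x, u) = x * u

def simgdo(z0, alpha=0.1, beta=0.1, steps=2000):
    # Optimistic simGD: z_{k+1} = z_k - alpha*G(z_k) - beta*(G(z_k) - G(z_{k-1})).
    z = np.asarray(z0, dtype=float)
    g_prev = G(z)              # convention: the first correction term is zero
    for _ in range(steps):
        g = G(z)
        z = z - alpha * g - beta * (g - g_prev)
        g_prev = g
    return z

z_final = simgdo([1.0, 1.0])   # spirals inward toward the solution (0, 0)
```

For this skew (rotational) field, the correction term cancels the outward drift of the plain Euler step, so the iterates contract toward the equilibrium.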
3.1 Continuous-time illustration
Consider the regularized continuoustime dynamics
where $G_\lambda$ is the Moreau–Yosida regularization of $G$. With a change of variables we get
and the discretization and yields SimGDO.
We explain further. The Moreau–Yosida [43, 67] regularization of $G$ with parameter $\lambda>0$ is
$$G_\lambda = \frac{1}{\lambda}\left(I - (I+\lambda G)^{-1}\right).$$
To clarify, $I$ is the identity mapping and $(I+\lambda G)^{-1}$ is the inverse (as a function) of $I+\lambda G$, which is well-defined by Minty's theorem [40]. It is straightforward to verify that $G_\lambda(z)=0$ if and only if $0\in G(z)$, i.e., $G$ and $G_\lambda$ share the same equilibrium points. For small $\lambda$, we can think of $G_\lambda$ as an approximation of $G$ that is better-behaved. Specifically, $G$ is merely monotone (satisfies (1)), but $G_\lambda$ is furthermore cocoercive, i.e.,
$$\langle G_\lambda(z) - G_\lambda(z'),\; z - z'\rangle \ge \lambda\,\|G_\lambda(z) - G_\lambda(z')\|^2 \qquad\text{for all } z,z'. \tag{3}$$
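For a linear monotone operator, the resolvent is a matrix inverse, so the cocoercivity inequality (3) can be checked directly. The sketch below is illustrative: the skew matrix represents the bilinear example's operator, and $\lambda = 0.5$ is an arbitrary choice. It computes $G_\lambda = (I - (I+\lambda G)^{-1})/\lambda$ and verifies (3); for a skew-symmetric operator the inequality in fact holds with equality.

```python
import numpy as np

lam = 0.5
J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # G(x, u) = (u, -x): monotone, not cocoercive

# Moreau-Yosida regularization of a linear operator: G_lam = (I - (I + lam*G)^{-1}) / lam
I2 = np.eye(2)
G_lam = (I2 - np.linalg.inv(I2 + lam * J)) / lam

rng = np.random.default_rng(0)
z, zp = rng.standard_normal(2), rng.standard_normal(2)
d = G_lam @ (z - zp)
lhs = d @ (z - zp)     # <G_lam(z) - G_lam(z'), z - z'>
rhs = lam * (d @ d)    # lam * ||G_lam(z) - G_lam(z')||^2
# Cocoercivity (3): lhs >= rhs; equality holds here because J is skew-symmetric.
```

Note also that `G_lam` vanishes exactly at the zeros of `J`, illustrating that regularization preserves the equilibrium points.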
We reparameterize the dynamics with and to get and
This gives us .
We now investigate convergence. Let satisfy (and therefore ). Then
where we use cocoercivity, (3). This translates to
(4) 
The quantity is nonincreasing since
where we use cocoercivity, (3). Finally, integrating (4) on both sides gives us
This analysis was inspired by [3, 7]: Attouch et al. [3] studied continuous-time dynamics with Moreau–Yosida regularization, and Csetnek et al. [7] interpreted a forward-backward-forward-type method as a discretization of continuous-time dynamics with the Douglas–Rachford operator.
Other interpretations of optimism.
Daskalakis et al. [8] interpret optimism as augmenting "follow the regularized leader" with the (optimistic) prediction that the next gradient will equal the current gradient in the online learning setup. Peng et al. [50] interpret optimism as "centripetal acceleration" but do not provide a formal analysis with differential equations.
3.2 Discrete-time convergence analysis
The discretetime method SimGDO converges under the assumption
$f$ is differentiable and $G$ is Lipschitz continuous.  (A3)
Theorem 2.
The proof can be considered a discretization of the continuoustime analysis.
Related prior work.
3.3 Difficulty with stochastic gradients
Training in machine learning usually relies on stochastic gradients rather than full gradients. We can consider a stochastic variation of SimGDO:
(SimGDOS) 
with learning rate $\alpha_k$ and optimism rate $\beta_k$.
Figure 2 presents experiments with SimGDOS on a simple bilinear problem. The choice where does not lead to convergence. Discretizing with a diminishing step leads to the choice and , but this choice does not converge either. Rather, both and must be diminishing with , i.e., must diminish faster than , for convergence; it is therefore necessary to tune and separately, as in Theorem 2, and the dynamics appear to be sensitive to the choice of and . One explanation of this difficulty is that the finite-difference approximation is unreliable when using stochastic gradients.
Whether the observed convergence holds generally in the nonlinear convex-concave setup, and whether optimism is compatible with subgradients, is unclear. This motivates the anchoring of the following section, which is provably compatible with stochastic subgradients.
Related prior work.
Gidel et al. [17] show that the averaged iterates of SimGDOS converge if the iterates are projected onto a compact set. Mertikopoulos et al. [36] show almost sure convergence of SimGDOS under strict convex-concavity. However, such analyses do not provide a compelling reason to use optimism, since simGD without optimism already converges in these setups.
4 Simultaneous GD with anchoring
Consider the setup of Section 3. We propose Anchored Simultaneous Gradient Descent
for , where is a starting point, , and is the anchor rate. The last term, the anchoring term, was inspired by Halpern's method [21, 65, 29] and the James–Stein estimator [61, 23]. In this section, we provide a continuous-time illustration of SimGDA and convergence results for both the deterministic and stochastic setups.
4.1 Continuous-time illustration
Consider the continuoustime dynamics
for , where and . We obtain SimGDA by discretizing this ODE with diminishing steps .
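The stabilizing effect of the anchor is easy to see numerically. The sketch below is illustrative only: it uses a hypothetical Halpern-style discretization $z^{k+1} = \lambda_k z^0 + (1-\lambda_k)\,(z^k - \alpha\,G(z^k))$ with $\lambda_k = 1/(k+2)$, not necessarily the exact SimGDA schedule, on the bilinear example. The decaying pull toward the anchor $z^0$ keeps the iterates bounded while plain simGD spirals outward.

```python
import numpy as np

def G(z):
    x, u = z
    return np.array([u, -x])   # bilinear cost f(x, u) = x * u

def run(z0, anchored, alpha=0.1, steps=500):
    z0 = np.asarray(z0, dtype=float)
    z = z0.copy()
    for k in range(steps):
        step = z - alpha * G(z)            # plain simGD step
        if anchored:
            lam = 1.0 / (k + 2)            # decaying pull toward the anchor z0
            z = lam * z0 + (1.0 - lam) * step
        else:
            z = step
    return float(np.linalg.norm(z))

plain_norm = run([1.0, 0.0], anchored=False)     # spirals outward
anchored_norm = run([1.0, 0.0], anchored=True)   # stays bounded near the solution
```

The anchor acts like a vanishing Tikhonov-type regularizer: it damps the rotational drift early on and fades away so the equilibrium is not biased in the limit.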
Define . Then
Using this, we have
Using , we have
Multiplying by and integrating both sides gives us
Reorganizing, we get
Using , the monotonicity inequality, and Young’s inequality, we get
and conclude
Interestingly, anchoring leads to a faster rate in continuous time than the rate of optimism. The discretized method, however, does not retain this faster rate.
4.2 Discrete-time convergence analysis and compatibility with stochastic subgradients
The constant is computable, although it is complicated. The proof can be considered a discretization of the continuoustime analysis.
Consider the setup of Section 2. We propose Anchored Simultaneous Stochastic SubGradient Descent
Theorem 4.
To the best of our knowledge, Theorem 4 is the first result establishing last-iterate convergence for convex-concave cost functions using stochastic subgradients.
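The significance of this result is that the cost need only be convex-concave, not strictly convex in either variable. The illustrative sketch below (again a hypothetical Halpern-style anchored schedule, not necessarily the paper's exact one) runs an anchored stochastic iteration on the merely convex-concave bilinear example, where plain SSSGD fails, with an anchor weight $\lambda_k = 1/(k+2)$ decaying faster than the learning rate $\alpha_k = 0.1/\sqrt{k+1}$.

```python
import numpy as np

def G(z):
    x, u = z
    return np.array([u, -x])   # bilinear f(x, u) = x*u: convex-concave, not strictly

def anchored_sssgd(z0, steps=50_000, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    z0 = np.asarray(z0, dtype=float)
    z = z0.copy()
    for k in range(steps):
        alpha = 0.1 / np.sqrt(k + 1)                # learning rate
        lam = 1.0 / (k + 2)                         # anchor weight, decays faster than alpha
        g = G(z) + noise * rng.standard_normal(2)   # stochastic gradient oracle
        z = lam * z0 + (1.0 - lam) * (z - alpha * g)  # Halpern-style anchored step
    return z

z_final = anchored_sssgd([1.0, 0.0])   # last iterate drifts toward the solution (0, 0)
```

Because the anchor weight vanishes relative to the learning rate, the bias toward the starting point disappears asymptotically while the anchoring still suppresses the noisy cycling.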
5 Experiments
In this section, we experimentally demonstrate the effectiveness of optimism and anchoring for training GANs. We train Wasserstein GANs [2] with gradient penalty [20] on the MNIST and CIFAR10 datasets and plot the Fréchet Inception Distance (FID) [22, 30]. The experiments were implemented in PyTorch [49]. We combine Adam with optimism and anchoring (described precisely in Appendix F) and compare them against the baseline Adam optimizer [25]. The generator and discriminator architectures and the hyperparameters are described in Appendix F. For optimistic and anchored Adam, we roughly tune the optimism and anchor rates and show the curve corresponding to the best parameter choice. Figure 3 shows an ensemble of samples generated at the end of the training period.
Figure 4 shows that the MNIST setup benefits from anchoring but not from optimism, while the CIFAR10 setup benefits from optimism but not from anchoring. We leave comparing the effects of optimism and anchoring in practical GAN training (where the cost function is not convexconcave) as a topic of future work.
6 Conclusion
In this work, we analyzed the convergence of SSSGD, Optimistic simGD, and Anchored SSSGD. Under the assumption that the cost is convex-concave, Anchored SSSGD provably converges under the most general setup. Through experiments, we showed that practical GAN training benefits from optimism and anchoring in some (but not all) setups.
Generalizing these results to accommodate projections and proximal operators, analogous to projected and proximal gradient methods, is an interesting direction of future work. Weight clipping [2] and spectral normalization [41] are instances where projections are used in training GANs.
Acknowledgments
We thank Yura Malitsky and Matthew Tam for the discussion on the reflection mechanism, which inspired our work of Section 3. We thank Adrien Taylor who brought to our attention recent work on the Halpern iteration, which inspired our work of Section 4. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This work was partially supported by AFOSR MURI FA9550-18-1-0502, NSF DMS-1720237, and ONR N0001417121.
References
 [1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. ICLR, 2017.
 [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. ICML, pages 214–223, 2017.
 [3] H. Attouch and J. Peypouquet. Convergence of inertial dynamics and proximal algorithms governed by maximally monotone operators. Math. Program., 174(1):391–432, 2019.
 [4] J. P. Aubin and A. Cellina. Differential Inclusions: Set-Valued Maps and Viability Theory. Springer-Verlag, 1984.
 [5] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, 2nd edition, 2017.
 [6] C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. COLT, 2012.
 [7] E. R. Csetnek, Y. Malitsky, and M. K. Tam. Shadow Douglas–Rachford splitting for monotone inclusions. arXiv:1903.03393, 2019.
 [8] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. ICLR, 2018.
 [9] C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization. NeurIPS, 2018.
 [10] C. Daskalakis and I. Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. ITCS, 2019.
 [11] A. Dembo. Lecture notes on probability theory: Stanford statistics 310. http://statweb.stanford.edu/~adembo/stat310b/lnotes.pdf, 2019. Accessed: 2019-05-10.
 [12] J. Duchi and F. Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM J. Optim., 28(4):3229–3259, 2018.
 [13] H. Edwards and A. Storkey. Censoring representations with an adversary. ICLR, 2016.
 [14] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. ICML, 2015.
 [15] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
 [16] I. Gemp and S. Mahadevan. Global convergence to the equilibrium of GANs using variational inequalities. arXiv:1808.01531, 2018.
 [17] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. ICLR, 2019.
 [18] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv:1701.00160, 2016.
 [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NeurIPS, 2014.
 [20] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. NeurIPS, 2017.
 [21] B. Halpern. Fixed points of nonexpanding maps. Bull. Amer. Math. Soc., 73(6):957–961, 1967.
 [22] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two timescale update rule converge to a local Nash equilibrium. NeurIPS, 2017.
 [23] W. James and C. Stein. Estimation with quadratic loss. In J. Neyman, editor, Proc. Fourth Berkeley Symp. Math. Statist. Prob, volume 1, pages 361–379, 1961.
 [24] A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst., 1(1):17–58, 2011.
 [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 [26] N. N. Krasovskii. Some Problems in the Theory of Motion Stability. Fizmatgiz, Moscow, 1959.
 [27] J. LaSalle. Some extensions of Liapunov’s second method. IRE Trans. Circuit Theory, 7(4):520–527, 1960.
 [28] T. Liang and J. Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. AISTATS, 2019.
 [29] F. Lieder. On the convergence rate of the Halpern iteration. Optimization Online:2017116336, 2017.
 [30] M. Lucic, K. Kurach, M. Michalski, O. Bousquet, and S. Gelly. Are GANs created equal? A large-scale study. NeurIPS, 2018.
 [31] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. ICLR, 2016.
 [32] Y. Malitsky. Projected reflected gradient methods for monotone variational inequalities. SIAM J. Optim., 25(1):502–520, 2015.
 [33] Y. Malitsky. Golden ratio algorithms for variational inequalities. arXiv:1803.08832, 2018.
 [34] Y. Malitsky and M. K. Tam. A forward-backward splitting method for monotone inclusions without cocoercivity. arXiv:1808.04162, 2018.
 [35] Y. V. Malitsky and V. V. Semenov. An extragradient algorithm for monotone variational inequalities. Cybern. Syst. Anal., 50(2):271–277, 2014.
 [36] P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(gradient) mile. ICLR, 2019.
 [37] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? ICML, 2018.
 [38] L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. NeurIPS, 2017.
 [39] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. ICLR, 2017.
 [40] G. J. Minty. Monotone (nonlinear) operators in Hilbert space. Duke Math. J., 29(3):341–346, 1962.
 [41] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. ICLR, 2018.
 [42] A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extragradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv:1901.08511, 2019.
 [43] J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.
 [44] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. NeurIPS, 2011.
 [45] V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. NeurIPS, 2017.
 [46] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
 [47] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
 [48] A. Odena. Open questions about generative adversarial networks (online article). https://distill.pub/2019/gan-open-problems/, 2019. Accessed: 2019-05-10.
 [49] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. NeurIPS Autodiff Workshop, 2017.
 [50] W. Peng, Y. Dai, H. Zhang, and L. Cheng. Training GANs with centripetal acceleration. arXiv:1902.08949, 2019.
 [51] D. Pfau and O. Vinyals. Connecting generative adversarial networks and actor-critic methods. NeurIPS Workshop on Adversarial Training, 2016.
 [52] B. T. Polyak. Introduction to Optimization. Optimization Software, 1987.
 [53] L. D. Popov. A modification of the Arrow–Hurwicz method for search of saddle points. Mat. Zametki, 28(5):777–784, 1980.
 [54] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, 2016.
 [55] A. Rakhlin and K. Sridharan. Online learning with predictable sequences. COLT, 2013.
 [56] A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. NeurIPS, 2013.
 [57] H. Robbins and D. Siegmund. A convergence theorem for nonnegative almost supermartingales and some applications. In Jagdish S. Rustagi, editor, Optimizing Methods in Statistics, pages 233–257. Academic Press, 1971.
 [58] R. T. Rockafellar. Monotone operators associated with saddle-functions and minimax problems. In F. E. Browder, editor, Nonlinear Functional Analysis, Part 1, volume 18 of Proceedings of Symposia in Pure Mathematics, pages 241–250. American Mathematical Society, 1970.
 [59] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. NeurIPS, 2017.
 [60] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. ICLR, 2017.
 [61] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In J. Neyman, editor, Proc. Third Berkeley Symp. Math. Statist. Prob., volume 1, pages 197–206, 1956.
 [62] V. Syrgkanis, A. Agarwal, H. Luo, and R. E. Schapire. Fast convergence of regularized learning in games. NeurIPS, 2015.
 [63] A. Taylor and F. Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. COLT, 2019.
 [64] X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. ICLR, 2018.
 [65] R. Wittmann. Approximation of fixed points of nonexpansive mappings. Arch. Math., 58(5):486–491, 1992.
 [66] Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectiveness of averaging in GAN training. ICLR, 2019.
 [67] K. Yosida. On the differentiability and the representation of one-parameter semigroup of linear operators. J. Math. Soc. Japan, 1(1):15–21, 1948.
Appendix A Notation and preliminaries
Write $\mathbb{R}_{\ge 0}$ to denote the set of nonnegative real numbers and $\langle\cdot,\cdot\rangle$ to denote the inner product, i.e., $\langle a,b\rangle = a^\top b$ for $a,b\in\mathbb{R}^n$.
We say is a point-to-set mapping on if it maps points of to subsets of . For notational simplicity, we write
Using this notation, we define monotonicity of with
where the inequality requires every member of the set to be nonnegative. We say a monotone operator is maximal if there is no other monotone operator such that the containment
is proper. If is convexconcave, then the subdifferential operator
is maximal monotone [58]. By [5, Proposition 20.36], is closed and convex for any . By [5, Proposition 20.38(iii)], maximal monotone operators are upper semicontinuous in the sense that if is maximal monotone, then for and imply . (In other words, the graph of is closed.) Define , which is the set of saddle points or equilibrium points. When is maximal monotone, is a closed convex set. Write
for the projection onto .
Write for the space of valued continuous functions on . For , we say in if uniformly on bounded intervals, i.e., for all , we have
In other words, we consider the topology of uniform convergence on compact sets.
We rely on the following inequalities, which hold for any .
(5)  
(6) 
In particular, (6) is called Young’s inequality.
Lemma 1 (Theorem 5.3.33 [11]).
Let be a martingale such that
for all and
then converges almost surely to a limit.
Lemma 2 (Robbins–Siegmund [57]).
Let , , , and be nonnegative measurable random sequences satisfying
If, furthermore,
holds almost surely, then
almost surely, where is a random limit.
Define
Note that is possible even when is not continuously differentiable.
Appendix B Proof of Theorem 1
Consider the differential inclusion
$$\dot z(t) \in -G(z(t)) \tag{7}$$
with the initial condition $z(0)=z^0$. We say $z$ satisfies (7) if there is a Lebesgue integrable $v$ such that
$$z(t) = z(0) + \int_0^t v(s)\,ds,\qquad v(s)\in -G(z(s))\ \text{for almost all } s\ge 0. \tag{8}$$
Lemma 4 (Theorem 5.2.1 [4]).
If is maximal monotone, the solution to (7) exists and is unique. Furthermore, is Lipschitz continuous for all .
Write and call the time evolution operator. In other words, maps the initial condition of the differential inclusion to the point at time .
Lemma 5 (LaSalle–Krasovskii).
If satisfies (7), then as and .
This proof can be considered an adaptation of the LaSalle–Krasovskii invariance principle [26, 27] to the setup of differential inclusions. The standard result applies to differential equations.
Proof.
Consider any , which exists by Assumption (A0). Since is absolutely continuous, so is , and we have
for almost all , where is as defined in (8) and the inequality follows from (1), monotonicity of . Therefore, is a nonincreasing function of . Hence is bounded and
for some limit, since nonincreasing lower-bounded sequences have limits.
Let such that , i.e., is a limit point of . Then, . Since (with fixed ) is continuous by Lemma 4, we have
for all . This means is also a limit point of and
for all . Therefore
(9) 
for all .
Write . Then
where the first inequality follows from concavity of in and the second inequality follows from the fact that is a maximizer when is fixed. Therefore, we have equality throughout, and , i.e., also maximizes and is a solution.
Finally, since is a solution, converges to a limit as . Since , we conclude that as . ∎
Lemma 6 (Theorem 3.7 of [12]).
Consider the update
Define and
Define the timeshifted process
Let the following conditions hold:

The iterates are bounded, i.e., and .

The stepsizes satisfy Assumption (A1).

The weighted noise sequence converges: for some .

For any increasing sequence such that , we have
Then for any sequence , the sequence of functions is relatively compact in . If , all limit points of satisfy the differential inclusion (8).
We verify the conditions of Lemma 6 and make the argument that the noisy discrete-time process is close to the noiseless continuous-time process and that the two processes converge to the same limit.
Verifying conditions of Lemma 6.
Condition (i).
Let . Write