Training GANs with Optimism
Abstract
We address the issue of limit cycling behavior in training Generative Adversarial Networks and propose the use of Optimistic Mirror Descent (OMD) for training Wasserstein GANs. Recent theoretical results have shown that OMD can enjoy faster regret rates in the context of zero-sum games. WGAN training is exactly such a context: it solves a zero-sum game with simultaneous no-regret dynamics. Moreover, we show that OMD addresses the limit cycling problem in training WGANs. We formally show that in the case of bilinear zero-sum games the last iterate of OMD dynamics converges to an equilibrium, in contrast to GD dynamics, which are bound to cycle. We also portray the huge qualitative difference between GD and OMD dynamics with toy examples, even when GD is modified with many adaptations proposed in the recent literature, such as gradient penalty or momentum. We apply OMD WGAN training to a bioinformatics problem of generating DNA sequences. We observe that models trained with OMD achieve consistently smaller KL divergence with respect to the true underlying distribution than models trained with GD variants. Finally, we introduce a new algorithm, Optimistic Adam, which is an optimistic variant of Adam. We apply it to WGAN training on CIFAR10 and observe improved performance in terms of inception score as compared to Adam.
Constantinos Daskalakis†
MIT, EECS
costis@mit.edu
Andrew Ilyas†
MIT, EECS
ailyas@mit.edu
Vasilis Syrgkanis†
Microsoft Research
vasy@microsoft.com
Haoyang Zeng†
MIT, EECS
haoyangz@mit.edu
† These authors contribute equally to this work.
1 Introduction
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have proven a very successful approach for fitting generative models in complex structured spaces, such as distributions over images. GANs frame the question of fitting a generative model from a data set of samples from some distribution as a zero-sum game between a Generator (G) and a Discriminator (D). The generator is represented as a deep neural network which takes random noise as input and outputs a sample in the same space as the sampled data set, trying to approximate a sample from the underlying distribution of data. The discriminator, also modeled as a deep neural network, attempts to discriminate between a true sample and a sample generated by the generator. The hope is that at the equilibrium of this zero-sum game the generator will learn to generate samples in a manner that is indistinguishable from the true samples and hence has essentially learned the underlying data distribution.
Despite their success at generating visually appealing samples when applied to image generation tasks, GANs are very finicky to train. One particular problem, raised for instance in a recent survey as a major issue (Goodfellow, 2017), is the instability of the training process. Typically, GANs are trained by solving the zero-sum game via simultaneously running a variant of a Stochastic Gradient Descent algorithm for both players (potentially training the discriminator more frequently than the generator).
The latter amounts essentially to solving the zero-sum game via running no-regret dynamics for each player. However, it is known from results in game theory that no-regret dynamics in zero-sum games can very often lead to limit oscillatory behavior, rather than converge to an equilibrium. Even in convex-concave zero-sum games it is only the average of the weights of the two players that constitutes an equilibrium and not the last iterate. In fact, recent theoretical results of Mertikopoulos et al. (2017) show the strong result that no variant of GD that falls in the large class of Follow-the-Regularized-Leader (FTRL) algorithms can converge to an equilibrium in terms of the last iterate; such algorithms are bound to converge to limit cycles around the equilibrium.
Averaging the weights of neural nets is a prohibitive approach, in particular because the zero-sum game that is defined by training one deep net against another is not a convex-concave zero-sum game. Thus it seems essential to identify training algorithms that make the last iterate of the training be very close to the equilibrium, rather than only the average.
Contributions.
In this paper we propose training GANs, and in particular Wasserstein GANs (Arjovsky et al., 2017), via a variant of gradient descent known as Optimistic Mirror Descent. Optimistic Mirror Descent (OMD) takes advantage of the fact that the opponent in a zero-sum game is also training via a similar algorithm and uses the predictability of the strategy of the opponent to achieve faster regret rates. It has been shown in the recent literature that Optimistic Mirror Descent and its generalization, Optimistic Follow-the-Regularized-Leader (OFTRL), achieve faster convergence rates than gradient descent in convex-concave zero-sum games (Rakhlin & Sridharan, 2013a; b) and even in general normal form games (Syrgkanis et al., 2015). Hence, even from the perspective of faster training, OMD should be preferred over GD due to its better worst-case guarantees and since it is a very small change over GD.
Moreover, we prove the surprising theoretical result that for a large class of zero-sum games (namely bilinear games), OMD actually converges to an equilibrium in terms of the last iterate. Hence, we give strong theoretical evidence that OMD can help in achieving the long sought-after stability and last-iterate convergence required for GAN training. The latter theoretical result is of independent interest, since solving zero-sum games via no-regret dynamics has found applications in many areas of machine learning, such as boosting (Freund & Schapire, 1996). Avoiding limit cycles in such approaches could help improve the performance of the resulting solutions.
We complement our theoretical result with toy simulations that portray the large qualitative difference between OMD and GD (and its many variants, including gradient penalty, momentum, adaptive step size etc.). We show that even in a simple distribution learning setting where the generator simply needs to learn the mean of a multivariate distribution, GD leads to limit cycles, while OMD converges pointwise.
Moreover, we give a more complex application to the problem of learning to generate distributions of DNA sequences of the same cellular function. DNA sequences that carry out the same function in the genome, such as binding to a specific transcription factor, follow the same nucleotide distribution. Characterizing the DNA distribution of different cellular functions is essential for understanding the functional landscape of the human genome and predicting the clinical consequence of DNA mutations (Zeng et al., 2015; 2016; Zeng & Gifford, 2017). We perform a simulation study where we generate samples of DNA sequences from a known distribution. Subsequently we train a GAN to attempt to learn this underlying distribution. We show that OMD achieves consistently better performance than GD variants in terms of the Kullback-Leibler (KL) divergence between the distribution learned by the generator and the true distribution.
Finally, we apply optimism to training GANs for images and introduce the Optimistic Adam algorithm. We show that it achieves better performance than Adam, in terms of inception score, when trained on CIFAR10.
2 Preliminaries: WGANs and Optimistic Mirror Descent
We consider the problem of learning a generative model of a distribution $Q$ of data points $x \in \mathbb{R}^d$. Our goal is, given a set of samples from $Q$, to learn an approximation to the distribution $Q$ in the form of a deep neural network $G_\theta(\cdot)$, with weight parameters $\theta$, that takes as input random noise $z \sim F$ (from some simple distribution $F$) and outputs a sample $G_\theta(z) \in \mathbb{R}^d$. We will focus on addressing this problem via a Generative Adversarial Network (GAN) training strategy.
The GAN training strategy defines a zero-sum game between a generator deep neural network $G_\theta(\cdot)$ and a discriminator neural network $D_w(\cdot)$. The generator takes as input random noise $z \sim F$, and outputs a sample $G_\theta(z) \in \mathbb{R}^d$. The discriminator takes as input a sample $x$ (either drawn from the true distribution $Q$ or from the generator) and attempts to classify it as real or fake. The goal of the generator is to fool the discriminator.
In the original GAN training strategy (Goodfellow et al., 2014), the discriminator of the zero-sum game was formulated as a classifier, i.e. $D_w(x) \in [0, 1]$ with a multinomial logistic loss. The latter boils down to the following expected zero-sum game (ignoring sampling noise).
$$\inf_\theta \sup_w \; \mathbb{E}_{x \sim Q}\left[\log D_w(x)\right] + \mathbb{E}_{z \sim F}\left[\log\left(1 - D_w(G_\theta(z))\right)\right] \tag{1}$$
If the discriminator is very powerful and learns to accurately classify all samples, then the problem of the generator amounts to minimizing the Jensen-Shannon divergence between the true distribution and the generator's distribution. However, if the discriminator is very powerful, then in practice the latter leads to vanishing gradients for the generator and an inability to train in a stable manner.
The latter problem led to the formulation of Wasserstein GANs (WGANs) (Arjovsky et al., 2017), where the discriminator, rather than being treated as a classifier (equivalently, approximating the JS divergence), instead tries to approximate the Wasserstein or earth-mover metric between the true distribution and the distribution of the generator. In this case, the function $D_w(x)$ is not constrained to being a probability in $[0, 1]$ but rather is an arbitrary Lipschitz function of $x$. This reasoning leads to the following zero-sum game:
$$\inf_\theta \sup_w \; \mathbb{E}_{x \sim Q}\left[D_w(x)\right] - \mathbb{E}_{z \sim F}\left[D_w(G_\theta(z))\right] \tag{2}$$
If the function space of the discriminator covers all Lipschitz functions of $x$, then the quantity that the generator is trying to minimize corresponds to the earth-mover distance between the true distribution and the distribution of the generator. Given the success of WGANs we will focus on WGANs in this paper.
2.1 Gradient Descent vs Optimistic Mirror Descent
The standard approach to training WGANs is to train simultaneously the parameters of both networks via stochastic gradient descent. We begin by presenting the most basic version of adversarial training via stochastic gradient descent and then comment on the multiple variants that have been proposed in the literature in the following section, where we compare their performance with our proposed algorithm for a simple example.
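Let us start with how training a GAN with gradient descent would look in the absence of sampling error, i.e. if we had access to the true distribution $Q$. For simplicity of notation, let:
$$L(\theta, w) = \mathbb{E}_{x \sim Q}\left[D_w(x)\right] - \mathbb{E}_{z \sim F}\left[D_w(G_\theta(z))\right] \tag{3}$$
denote the loss in the expected zero-sum game of WGAN, as defined in Equation (2), i.e. $\inf_\theta \sup_w L(\theta, w)$. The classic WGAN approach is to solve this game by running gradient descent (GD) for each player, i.e. for $t \in \{1, \ldots, T\}$, with $\nabla_{w,t} = \nabla_w L(\theta_t, w_t)$ and $\nabla_{\theta,t} = \nabla_\theta L(\theta_t, w_t)$:
$$w_{t+1} = w_t + \eta \cdot \nabla_{w,t}, \qquad \theta_{t+1} = \theta_t - \eta \cdot \nabla_{\theta,t} \tag{4}$$
If the loss function $L(\theta, w)$ were convex in $\theta$ and concave in $w$, if $\theta$ and $w$ lie in some bounded convex set, and if the step size $\eta$ is chosen of the order $1/\sqrt{T}$, then standard results in game theory and no-regret learning (see e.g. Freund & Schapire (1999)) imply that the pair of average parameters, i.e. $\bar{\theta} = \frac{1}{T}\sum_{t=1}^{T} \theta_t$ and $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$, is an $\epsilon$-equilibrium of the zero-sum game, for $\epsilon = O(1/\sqrt{T})$. However, no guarantees are known beyond the convex-concave setting and, more importantly for this paper, even in convex-concave games, no guarantees are known for the last-iterate pair $(\theta_T, w_T)$.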
Rakhlin and Sridharan (Rakhlin & Sridharan, 2013a) proposed an alternative algorithm for solving zero-sum games in a decentralized manner, namely Optimistic Mirror Descent (OMD), that achieves faster convergence rates to equilibrium of $\epsilon = O(1/T)$ for the average of parameters. The algorithm essentially uses the last iteration's gradient as a predictor for the next iteration's gradient. This follows from the intuition that if the opponent in the game is using a stable (or regularized) algorithm, then the gradients between the two iterations will not change much. Later Syrgkanis et al. (2015) showed that this intuition extends to show faster convergence of each individual player's regret in general normal form games.
Given these favorable properties of OMD when learning in games, we propose replacing GD with OMD when training WGANs. The update rule of OMD is a small adaptation of GD. OMD is parameterized by a predictor $M_{t+1}$ of the next iteration's gradient, which could be simply the last iteration's gradient, an average of a window of recent gradients, or a discounted average of past gradients. In the case where the predictor is simply the last iteration's gradient, the update rule for OMD boils down to the following simple form:
$$w_{t+1} = w_t + 2\eta \cdot \nabla_{w,t} - \eta \cdot \nabla_{w,t-1}, \qquad \theta_{t+1} = \theta_t - 2\eta \cdot \nabla_{\theta,t} + \eta \cdot \nabla_{\theta,t-1} \tag{5}$$
This simple modification of the GD update rule is inherently different from any of the existing adaptations used in GAN training, such as Nesterov's momentum or gradient penalty.
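To make the difference between the two update rules concrete, the following is a minimal sketch of one simultaneous step of each, assuming the gradients are already available as NumPy arrays. The function names `gd_step` and `omd_step` and their argument names are illustrative choices, not part of the paper.

```python
import numpy as np

def gd_step(w, theta, grad_w, grad_theta, eta):
    """One simultaneous GD step: the discriminator ascends, the generator descends."""
    return w + eta * grad_w, theta - eta * grad_theta

def omd_step(w, theta, grad_w, grad_theta, prev_grad_w, prev_grad_theta, eta):
    """One OMD step with the last gradient as predictor.

    Each player takes twice the usual gradient step and then undoes
    the step that the previous gradient would have produced.
    """
    w_next = w + 2 * eta * grad_w - eta * prev_grad_w
    theta_next = theta - 2 * eta * grad_theta + eta * prev_grad_theta
    return w_next, theta_next
```

When the previous gradient equals the current one, the OMD step coincides with the GD step, so the two rules only differ when the gradient changes between iterations.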
General OMD and intuition.
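The intuition behind OMD can be more easily understood when GD is viewed through the lens of the Follow-the-Regularized-Leader formulation. In particular, it is well known that GD is equivalent to the Follow-the-Regularized-Leader algorithm with an $\ell_2^2$ regularizer (see e.g. Shalev-Shwartz (2012)), i.e.:
$$w_{t+1} = \arg\max_w \left\{ \eta \sum_{\tau=1}^{t} \langle w, \nabla_{w,\tau} \rangle - \tfrac{1}{2}\|w\|_2^2 \right\}, \qquad \theta_{t+1} = \arg\min_\theta \left\{ \eta \sum_{\tau=1}^{t} \langle \theta, \nabla_{\theta,\tau} \rangle + \tfrac{1}{2}\|\theta\|_2^2 \right\} \tag{6}$$
It is known that if the learner knew in advance the gradient at the next iteration, then adding it to the above optimization would lead to constant regret that comes solely from the regularization term (a consequence of the be-the-leader lemma; Kalai & Vempala (2005); Rigollet (2015)). OMD essentially augments FTRL by adding a predictor $M_{t+1}$ of the next iteration's gradient, i.e.:
$$w_{t+1} = \arg\max_w \left\{ \eta \left( \sum_{\tau=1}^{t} \langle w, \nabla_{w,\tau} \rangle + \langle w, M_{w,t+1} \rangle \right) - \tfrac{1}{2}\|w\|_2^2 \right\}, \qquad \theta_{t+1} = \arg\min_\theta \left\{ \eta \left( \sum_{\tau=1}^{t} \langle \theta, \nabla_{\theta,\tau} \rangle + \langle \theta, M_{\theta,t+1} \rangle \right) + \tfrac{1}{2}\|\theta\|_2^2 \right\} \tag{7}$$
For an arbitrary set of predictors, the latter boils down to the following set of update rules:
$$w_{t+1} = w_t + \eta \cdot \left(\nabla_{w,t} + M_{w,t+1} - M_{w,t}\right), \qquad \theta_{t+1} = \theta_t - \eta \cdot \left(\nabla_{\theta,t} + M_{\theta,t+1} - M_{\theta,t}\right) \tag{8}$$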
In the theoretical part of the paper we will focus on the case where the predictor is simply the last iteration gradient, leading to update rules in Equation (5). In the experimental section we will also explore performance of other alternatives for predictors.
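As a quick check of how the general rule specializes, the short derivation below substitutes the last-gradient predictor into the general optimistic update. The notation is assumed here: $\nabla_{w,t}$ is the discriminator gradient at step $t$, $M_{w,t+1}$ is the predictor of the next gradient, $\eta$ is the step size, and the general update is taken to have the form $w_{t+1} = w_t + \eta(\nabla_{w,t} + M_{w,t+1} - M_{w,t})$.

$$
\begin{aligned}
w_{t+1} &= w_t + \eta\left(\nabla_{w,t} + M_{w,t+1} - M_{w,t}\right) \\
        &= w_t + \eta\left(\nabla_{w,t} + \nabla_{w,t} - \nabla_{w,t-1}\right) && \text{using } M_{w,t+1} = \nabla_{w,t},\ M_{w,t} = \nabla_{w,t-1} \\
        &= w_t + 2\eta\,\nabla_{w,t} - \eta\,\nabla_{w,t-1}.
\end{aligned}
$$

The generator update follows from the same steps with the sign of the step flipped, giving $\theta_{t+1} = \theta_t - 2\eta\,\nabla_{\theta,t} + \eta\,\nabla_{\theta,t-1}$.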
2.2 Stochastic Gradient Descent and Stochastic Optimistic Mirror Descent
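In practice we don't really have access to the true distribution $Q$ and hence we replace $Q$ with an empirical distribution over samples $\{x_1, \ldots, x_n\}$ and $F$ with random noise samples $\{z_1, \ldots, z_n\}$, leading to the empirical loss for the zero-sum game:
$$L_n(\theta, w) = \frac{1}{n}\sum_{i=1}^{n} D_w(x_i) - \frac{1}{n}\sum_{i=1}^{n} D_w(G_\theta(z_i)) \tag{9}$$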
Even in this setting it might be impractical to compute the gradient of the expected loss with respect to or , e.g. .
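However, GD and OMD still lead to small loss if we replace gradients with unbiased estimators of them. Hence, we can replace the expectation with respect to $x$ or $z$ by simply evaluating the gradients at a single sample or on a small batch $B$ of samples, and replace the gradients at each iteration with the variants:
$$\hat{\nabla}_{w,t} = \frac{1}{|B|}\sum_{i \in B} \left( \nabla_w D_{w_t}(x_i) - \nabla_w D_{w_t}(G_{\theta_t}(z_i)) \right), \qquad \hat{\nabla}_{\theta,t} = -\frac{1}{|B|}\sum_{i \in B} \nabla_\theta D_{w_t}(G_{\theta_t}(z_i)) \tag{10}$$
Replacing $\nabla_{w,t}$ and $\nabla_{\theta,t}$ with the above estimates in Equations (4) and (5) leads to Stochastic Gradient Descent (SGD) and Stochastic Optimistic Mirror Descent (SOMD), respectively.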
3 An Illustrative Example: Learning the Mean of a Distribution
We consider the following very simple WGAN example: the data are generated by a multivariate normal distribution, i.e. $Q = N(v, I)$ for some $v \in \mathbb{R}^d$. The goal is for the generator to learn the unknown parameter $v$. In Appendix C we also consider a more complex example where the generator is trying to learn a covariance matrix.
We consider a WGAN where the discriminator is a linear function and the generator is a simple additive displacement of the input noise $z$, which is drawn from $F = N(0, I)$, i.e.:
$$D_w(x) = \langle w, x \rangle, \qquad G_\theta(z) = z + \theta \tag{11}$$
The goal of the generator is to figure out the true distribution, i.e. to converge to $\theta = v$. The WGAN loss then takes the simple form:
$$L(\theta, w) = \mathbb{E}_{x \sim N(v, I)}\left[\langle w, x \rangle\right] - \mathbb{E}_{z \sim N(0, I)}\left[\langle w, z + \theta \rangle\right] \tag{12}$$
We first consider the case where we optimize the true expectations above rather than assuming that we only get samples of $x$ and samples of $z$. Due to linearity of expectation, the expected zero-sum game takes the form:
$$\inf_\theta \sup_w \; \langle w, v - \theta \rangle \tag{13}$$
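We see here that the unique equilibrium of the above game is for the generator to choose $\theta = v$ and for the discriminator to choose $w = 0$. For this simple zero-sum game, we have $\nabla_{w,t} = v - \theta_t$ and $\nabla_{\theta,t} = -w_t$. Hence, the GD dynamics take the form:
$$w_{t+1} = w_t + \eta \cdot (v - \theta_t), \qquad \theta_{t+1} = \theta_t + \eta \cdot w_t \qquad \text{(GD Dynamics for Learning Means)}$$
while the OMD dynamics take the form:
$$w_{t+1} = w_t + 2\eta \cdot (v - \theta_t) - \eta \cdot (v - \theta_{t-1}), \qquad \theta_{t+1} = \theta_t + 2\eta \cdot w_t - \eta \cdot w_{t-1} \qquad \text{(OMD Dynamics for Learning Means)}$$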
We simulated simultaneous training in this zero-sum game under the GD and OMD dynamics, and we find that GD dynamics always lead to a limit cycle irrespective of the step size or other modifications. In Figure 1 we present the behavior of the GD vs OMD dynamics in this game for a particular choice of $v$. We see that even though GD dynamics lead to a limit cycle (whose average does indeed equal the true vector), the OMD dynamics converge to $v$ in terms of the last iterate. In Figure 2 we see that the stability of OMD even carries over to the case of stochastic gradients, as long as the batch size is reasonably large.
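The qualitative behavior described above can be reproduced with a few lines of NumPy. The sketch below is not the authors' simulation code: the target `v = (3, 4)`, the step size, the number of steps, the random initialization, and the convention that the "previous" gradient at the first step equals the initial gradient are all assumptions made for illustration. It tracks the distance of the last iterate from the equilibrium $(\theta, w) = (v, 0)$.

```python
import numpy as np

def simulate(v, eta=0.1, steps=3000, optimistic=False, seed=0):
    """Run GD or OMD on the game inf_theta sup_w <w, v - theta>.

    Returns the distance of (theta_t, w_t) from the equilibrium (v, 0)
    after every step.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=v.shape)
    w = rng.normal(size=v.shape)
    # Assumption: at the first step the previous gradient equals the current one.
    prev_gw, prev_gt = v - theta, -w
    dist = []
    for _ in range(steps):
        gw, gt = v - theta, -w  # gradients w.r.t. w and theta
        if optimistic:
            w_next = w + 2 * eta * gw - eta * prev_gw
            theta_next = theta - 2 * eta * gt + eta * prev_gt
        else:
            w_next = w + eta * gw
            theta_next = theta - eta * gt
        prev_gw, prev_gt = gw, gt
        w, theta = w_next, theta_next
        dist.append(np.linalg.norm(np.concatenate([theta - v, w])))
    return np.array(dist)

v = np.array([3.0, 4.0])  # illustrative target mean
gd_dist = simulate(v, optimistic=False)
omd_dist = simulate(v, optimistic=True)
print("GD  final distance:", gd_dist[-1])
print("OMD final distance:", omd_dist[-1])
```

With these settings the OMD distance shrinks toward zero, while the GD iterates keep circling the equilibrium without approaching it; in exact discrete time the GD orbit in fact slowly spirals outward.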
In the appendix we also portray the behavior of the GD dynamics when we add a gradient penalty (Gulrajani et al., 2017) to the game loss (instead of weight clipping), when we add Nesterov momentum to the GD update rule (Nesterov, 1983), or when we train the discriminator multiple times between training iterations of the generator. We see that even though these modifications do improve the stability of the GD dynamics, in the sense that they narrow the band of the limit cycle, they still lead to a non-vanishing limit cycle, unlike OMD.
In the next section, we will in fact prove formally that for a large class of zero-sum games including the one presented in this section, OMD dynamics converge to equilibrium in the sense of last-iterate convergence, as opposed to average-iterate convergence.
4 LastIterate Convergence of Optimistic Adversarial Training
In this section, we show that Optimistic Mirror Descent exhibits final-iterate, rather than only average-iterate, convergence to min-max solutions for bilinear functions. More precisely, we consider the problem $\inf_x \sup_y x^T A y$, for some matrix $A$, where $x$ and $y$ are unconstrained. In Appendix D, we also show that our convergence result appropriately extends to the general case, where the bilinear game also contains terms that are linear in the players' individual strategies, i.e. games of the form:
$$\inf_x \sup_y \left( x^T A y + b^T x + c^T y \right) \tag{14}$$
In the simpler problem, Optimistic Mirror Descent takes the following form, for all $t \ge 1$:
$$x_t = x_{t-1} - 2\eta A y_{t-1} + \eta A y_{t-2} \tag{15}$$
$$y_t = y_{t-1} + 2\eta A^T x_{t-1} - \eta A^T x_{t-2} \tag{16}$$
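Initialization: For the above iteration to be meaningful we need to specify $x_0, x_{-1}, y_0, y_{-1}$. We choose any $x_0 \in \mathcal{R}(A)$ and $y_0 \in \mathcal{R}(A^T)$, and set $x_{-1} = 2x_0$ and $y_{-1} = 2y_0$, where $\mathcal{R}(A)$ represents the column space of $A$. In particular, our initialization means that the first step taken by the dynamics gives $x_1 = x_0$ and $y_1 = y_0$.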
We will analyze Optimistic Mirror Descent under the assumption , where and denotes spectral norm of matrices. We can always enforce that by appropriately scaling . Scaling by some positive factor clearly does not change the minmax solutions , only scales the optimal value by the same factor.
We remark that the set of equilibrium solutions of this min-max problem is the set of pairs $(x, y)$ such that $x$ is in the null space of $A^T$ and $y$ is in the null space of $A$. In this section we rigorously show that Optimistic Mirror Descent converges to the set of such min-max solutions. This is interesting in light of the fact that Gradient Descent actually diverges, even in the special case where $A$ is the identity matrix, as per the following proposition, whose proof is provided in Appendix D.3.
Proposition 1.
Gradient descent applied to the problem $\inf_x \sup_y x^T y$ diverges starting from any initialization $x_0, y_0$ such that $(x_0, y_0) \neq (0, 0)$.
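A short calculation, consistent with the proposition but not a substitute for the proof in Appendix D.3, indicates why. Assume the GD updates for $\inf_x \sup_y x^T y$ with step size $\eta > 0$ are written as $x_{t+1} = x_t - \eta y_t$ and $y_{t+1} = y_t + \eta x_t$ (this notation is introduced here for illustration). Then the squared distance from the equilibrium $(0, 0)$ satisfies:

$$
\begin{aligned}
\|x_{t+1}\|^2 + \|y_{t+1}\|^2
  &= \|x_t - \eta y_t\|^2 + \|y_t + \eta x_t\|^2 \\
  &= \|x_t\|^2 - 2\eta \langle x_t, y_t \rangle + \eta^2 \|y_t\|^2 + \|y_t\|^2 + 2\eta \langle x_t, y_t \rangle + \eta^2 \|x_t\|^2 \\
  &= (1 + \eta^2)\left(\|x_t\|^2 + \|y_t\|^2\right).
\end{aligned}
$$

The cross terms cancel, so the distance from the equilibrium grows by a factor of $\sqrt{1 + \eta^2}$ at every step unless the dynamics start exactly at $(0, 0)$.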
Next, we state our main result of this section, whose proof can be found in Appendix D, where we also state its appropriate generalization to the general case (14).
Theorem 1 (Last Iterate Convergence of OMD).
Consider the dynamics of Eq. (15) and (16) and any initialization , and . Let also
where for a matrix we denote by its generalized inverse and by its spectral norm. Suppose that and that is a small enough constant satisfying . Letting , the OMD dynamics satisfy the following:
In particular, , as , and for large enough , the last iterate of OMD is within distance from the space of equilibrium points of the game, where is the distance of the initial point from the equilibrium space, and where both distances are taken with respect to the norm .
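The convergence to the set of equilibria can be illustrated numerically. The sketch below runs the optimistic dynamics on a small rank-deficient bilinear game; the matrix `A`, the starting point, the step size, and the choice of setting the iterate before the first one equal to the first one are assumptions made for this illustration and differ from the initialization used in the analysis. The residual $\|A^T x\| + \|A y\|$ is zero exactly on the set of min-max solutions.

```python
import numpy as np

def omd_bilinear(A, x0, y0, eta=0.2, steps=4000):
    """Optimistic dynamics for inf_x sup_y x^T A y with last-gradient predictors."""
    # Assumption: the iterate before the first one equals the first one.
    x_prev, y_prev = x0.copy(), y0.copy()
    x, y = x0.copy(), y0.copy()
    for _ in range(steps):
        x_next = x - 2 * eta * (A @ y) + eta * (A @ y_prev)
        y_next = y + 2 * eta * (A.T @ x) - eta * (A.T @ x_prev)
        x_prev, y_prev, x, y = x, y, x_next, y_next
    return x, y

# A has a nontrivial null space: any y = (0, 0, c) is an equilibrium strategy.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
x0 = np.array([1.0, -2.0])
y0 = np.array([0.5, 1.5, 3.0])

x_T, y_T = omd_bilinear(A, x0, y0)
residual = np.linalg.norm(A.T @ x_T) + np.linalg.norm(A @ y_T)
print("residual:", residual)
print("y_T:", y_T)
```

In this example the last iterate approaches an equilibrium rather than the origin: the component of `y` in the null space of `A` is never updated, so the limit point depends on the initialization.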
5 Experimental Results for Generating DNA Sequences
We take our theoretical intuition to practice, applying OMD to the problem of generating DNA sequences from an observed distribution of sequences. DNA sequences that carry out the same function can be viewed as samples from some distribution. For many important cellular functions, this distribution can be well modeled by a position-weight matrix (PWM) that specifies the probability of different nucleotides occurring at each position (Stormo, 2000). Thus, training GANs from DNA sequences sampled from a PWM distribution serves as a practically motivated problem where we know the ground truth and can thus quantify the performance of different training methods in terms of the KL divergence between the trained generator distribution and the true distribution.
In our experiments, we generated 40,000 DNA sequences of six nucleotides according to a given position weight matrix. A random 10% of the sequences were held out as the validation set. Each sequence was then embedded into a matrix by encoding each of the four nucleotides with a one-hot vector. On this dataset, we trained WGANs with different variants of OMD and SGD and evaluated their performance in terms of the KL divergence between the empirical distribution of the WGAN-generated samples and the true distribution described by the position weight matrix. Both the discriminator and generator of the WGAN used in this analysis were chosen to be convolutional neural networks (CNN), given the recent success of CNNs in modeling DNA-protein binding (Zeng et al., 2016; Alipanahi et al., 2015). The detailed structure of the chosen CNNs can be found in Appendix E.
To account for the impact of learning rate and training epochs, we explored two different ways of model selection when comparing different optimization strategies: (1) using the iteration and learning rate that yield the lowest discriminator loss on the held-out validation set, which is inspired by the observation in Arjovsky et al. (2017) that the discriminator loss negatively correlates with the quality of the generated samples; and (2) using the model obtained after the last epoch of training. To account for the stochastic nature of the initialization and optimizers, we trained 50 independent models for each learning rate and optimizer, and compared the optimizer strategies by the resulting distribution of KL divergences across the 50 runs.
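The data pipeline and the evaluation metric described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the PWM is drawn at random here, the one-hot layout `(n, length, 4)` is one possible choice, and the direction of the KL divergence (empirical relative to true) is an assumption. The helper names are hypothetical.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def sample_sequences(pwm, n, rng):
    """Draw n sequences as integer index arrays; positions are independent.

    pwm has shape (length, 4) and each row is a distribution over A, C, G, T.
    """
    cols = [rng.choice(4, size=n, p=pwm[i]) for i in range(pwm.shape[0])]
    return np.stack(cols, axis=1)  # shape (n, length)

def one_hot(seqs):
    """Encode index sequences as one-hot arrays of shape (n, length, 4)."""
    return np.eye(4)[seqs]

def kl_to_pwm(seqs, pwm):
    """KL(empirical || PWM) over whole sequences, summed over observed sequences."""
    uniq, counts = np.unique(seqs, axis=0, return_counts=True)
    p_emp = counts / counts.sum()
    # Probability of each distinct sequence under the PWM (product over positions).
    p_true = np.prod(pwm[np.arange(pwm.shape[0]), uniq], axis=1)
    return float(np.sum(p_emp * np.log(p_emp / p_true)))

rng = np.random.default_rng(0)
pwm = rng.dirichlet(np.ones(4), size=6)   # hypothetical PWM for six positions
seqs = sample_sequences(pwm, 40000, rng)
X = one_hot(seqs)
kl = kl_to_pwm(seqs, pwm)
print("".join(NUCLEOTIDES[i] for i in seqs[0]), X.shape, kl)
```

Samples drawn from the PWM itself give a small but nonzero KL value because of finite sampling, which provides a rough floor against which generated samples can be compared.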
For GD, we used variants of Equation (4) to examine the effect of using momentum and an adaptive step size. Specifically, we considered momentum, Nesterov momentum and Adagrad. The specific form of all these modifications is given for reference in Appendix A.
For OMD we used the general predictor version of Equation (10) with a fixed step size and with the following variants of the next iteration predictor $M_{t+1}$: (v1) last iteration gradient: $M_{t+1} = \nabla_t$; (v2) running average of past gradients: $M_{t+1} = \frac{1}{t}\sum_{i=1}^{t} \nabla_i$; (v3) hyperbolic discounted average of past gradients. We explored two training schemes: (1) training the discriminator 5 times for each generator training step, as suggested in Arjovsky et al. (2017); (2) training the discriminator once for each generator training step. The latter is in line with the intuition behind the use of optimism: optimism hinges on the fact that the gradient at the next iteration is very predictable since it is coming from another regularized algorithm, and if we train the other algorithm multiple times, then the gradient is not that predictable and the benefits of optimism are lost.
For all the algorithms described above, we experimented with their stochastic variants. Figure 3 shows the KL divergence between the WGAN-generated samples and the true distribution. When evaluated by the epoch and learning rate that yield the lowest discriminator loss on the validation set, WGAN trained with Stochastic OMD (SOMD) achieves lower KL divergence than the competing SGD variants. Evaluated by the last epoch, the best performance across different learning rates is achieved by Optimistic Adam (see Section 6). We note that in both metrics, SOMD with a 1:1 generator-discriminator training ratio yields better KL divergence than the alternative training scheme (1:5 ratio), which validates the intuition behind the use of optimism.
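The three predictor variants can be sketched as small functions over the list of past gradients. The exact weighting of the hyperbolic variant is an assumption here (weight $1/k$ for the gradient that is $k$ steps old, normalized to sum to one), as is the form of the general update in `omd_update`; all function names are hypothetical.

```python
import numpy as np

def predictor_last(grads):
    """(v1) Predict the next gradient by the most recent one."""
    return grads[-1]

def predictor_running_average(grads):
    """(v2) Predict the next gradient by the plain average of all past gradients."""
    return np.mean(grads, axis=0)

def predictor_hyperbolic(grads):
    """(v3) Hyperbolically discounted average of past gradients.

    Assumption: the gradient that is k steps old gets weight 1/k
    (k = 1 for the most recent), and weights are normalized.
    """
    t = len(grads)
    weights = 1.0 / np.arange(t, 0, -1)  # oldest gets 1/t, newest gets 1
    weights /= weights.sum()
    return np.tensordot(weights, np.asarray(grads), axes=1)

def omd_update(param, grad, m_next, m_curr, eta, sign):
    """Assumed general optimistic update: param + sign * eta * (grad + M_{t+1} - M_t).

    Use sign=+1 for the discriminator (ascent) and sign=-1 for the generator (descent).
    """
    return param + sign * eta * (grad + m_next - m_curr)
```

With `predictor_last`, `m_next` is the current gradient and `m_curr` is the previous one, which recovers the "twice the current gradient minus the previous gradient" step.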
6 Generating Images from CIFAR10 with Optimistic Adam
In this section we apply optimistic WGAN training to generating images, after training on CIFAR10. Given the success of Adam on training image WGANs, we use an optimistic version of the Adam algorithm, rather than vanilla OMD. We denote the latter by Optimistic Adam. Optimistic Adam could be of independent interest even beyond training WGANs. We present Optimistic Adam for (G) but the analog is also used for training (D).
We trained on CIFAR10 images with Optimistic Adam with the hyperparameters matched to Gulrajani et al. (2017), and we observe that it outperforms Adam in terms of inception score (see Figure 14), a standard metric of quality of WGANs (Gulrajani et al., 2017; Salimans et al., 2016). In particular we see that Optimistic Adam achieves high inception scores after very few epochs of training. We observe that for Optimistic Adam, training the discriminator once after one iteration of the generator training, which matches the intuition behind the use of optimism, outperforms the 1:5 generator-discriminator training scheme. We see that vanilla Adam performs poorly when the discriminator is trained only once between iterations of the generator training. Moreover, even if we use vanilla Adam and train (D) 5 times between each training step of (G), as proposed by Arjovsky et al. (2017), performance is again worse than Optimistic Adam with a 1:1 training ratio. The same learning rate and betas as in Appendix B of Gulrajani et al. (2017) were used for all the methods compared. We also matched other hyperparameters such as the gradient penalty coefficient and batch size. For a larger sample of images see Appendix G.
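One plausible reading of an optimistic variant of Adam is sketched below for a single parameter array: the usual Adam moment estimates and bias corrections are kept, and the optimistic correction (twice the current normalized step minus the previous one) is applied to the parameter update. This is an assumption made for illustration rather than a transcription of the algorithm used in the experiments; the class name, the default hyperparameters (the common Adam defaults), and the zero initialization of the previous step are all hypothetical choices.

```python
import numpy as np

class OptimisticAdam:
    """Sketch of an optimistic Adam update for a descending player (the generator)."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0          # first-moment estimate
        self.v = 0.0          # second-moment estimate
        self.t = 0            # step counter
        self.prev_step = 0.0  # assumption: no previous step before the first update

    def update(self, theta, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)   # bias-corrected moments
        v_hat = self.v / (1 - self.beta2 ** self.t)
        step = m_hat / (np.sqrt(v_hat) + self.eps)    # the usual Adam direction
        # Optimistic correction: 2x the current step minus the previous step.
        new_theta = theta - 2 * self.lr * step + self.lr * self.prev_step
        self.prev_step = step
        return new_theta
```

For the ascending player (the discriminator), the same update can be used with the sign of the gradient flipped.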
References
 Alipanahi et al. (2015) Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831–838, 2015.
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Freund & Schapire (1996) Yoav Freund and Robert E. Schapire. Game theory, online prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, COLT ’96, pp. 325–332, New York, NY, USA, 1996. ACM. ISBN 0897918118. doi: 10.1145/238061.238163. URL http://doi.acm.org/10.1145/238061.238163.
 Freund & Schapire (1999) Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79 – 103, 1999. ISSN 08998256. doi: https://doi.org/10.1006/game.1999.0738. URL http://www.sciencedirect.com/science/article/pii/S0899825699907388.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423generativeadversarialnets.pdf.
 Goodfellow (2017) Ian J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. CoRR, abs/1701.00160, 2017. URL http://arxiv.org/abs/1701.00160.
 Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. CoRR, abs/1704.00028, 2017. URL http://arxiv.org/abs/1704.00028.
 Kalai & Vempala (2005) Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291 – 307, 2005. ISSN 00220000. doi: https://doi.org/10.1016/j.jcss.2004.10.016. URL http://www.sciencedirect.com/science/article/pii/S0022000004001394. Learning Theory 2003.
 Mertikopoulos et al. (2017) Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. arXiv preprint arXiv:1709.02738, 2017.
 Nesterov (1983) Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.
 Rakhlin & Sridharan (2013a) Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT, pp. 993–1019, 2013a.
 Rakhlin & Sridharan (2013b) Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 2, NIPS’13, pp. 3066–3074, USA, 2013b. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999792.2999954.
 Rigollet (2015) Philippe Rigollet. Mit 18.657: Mathematics of machine learning, lecture 16. 2015.
 Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
 Shalev-Shwartz (2012) Shai Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107–194, February 2012. ISSN 19358237. doi: 10.1561/2200000018. URL http://dx.doi.org/10.1561/2200000018.
 Stormo (2000) Gary D Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16(1):16–23, 2000.
 Syrgkanis et al. (2015) Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, pp. 2989–2997, 2015.
 Zeng & Gifford (2017) Haoyang Zeng and David K Gifford. Predicting the impact of noncoding variants on dna methylation. Nucleic Acids Research, 2017.
 Zeng et al. (2015) Haoyang Zeng, Tatsunori Hashimoto, Daniel D Kang, and David K Gifford. Gerv: a statistical method for generative evaluation of regulatory variants for transcription factor binding. Bioinformatics, 32(4):490–496, 2015.
 Zeng et al. (2016) Haoyang Zeng, Matthew D Edwards, Ge Liu, and David K Gifford. Convolutional neural network architectures for predicting dna–protein binding. Bioinformatics, 32(12):i121–i127, 2016.
Appendix A Variants of GD Training
For ease of reference we briefly describe the exact form of update rules for several modifications of GD training that we have used in our experimental results.
Adagrad:
(17)  
Momentum:
(18)  
Nesterov momentum:
(19)  
Appendix B Persistence of Limit Cycles in GD Training
In Figure 5 we portray example Gradient Descent dynamics in the illustrative example described in Section 3 under multiple adaptations proposed in the literature. We observe that oscillations persist in all such modified GD dynamics, though alleviated by some. We briefly describe the modifications in detail first.
Gradient penalty.
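The Wasserstein GAN is based on the idea that the discriminator is approximating all Lipschitz functions of the data. Hence, when training the discriminator we need to make sure that the function $D_w(x)$ has a bounded gradient with respect to $x$. One approach to achieving this is weight clipping, i.e. clipping the weights to lie in some interval. However, the latter might introduce extra instability during training. Gulrajani et al. (2017) introduce an alternative approach by adding a penalty to the loss function of the zero-sum game that is essentially the $\ell_2$ norm of the gradient of $D_w(x)$ with respect to $x$. In particular they propose the following regularized WGAN loss:
$$L_\lambda(\theta, w) = \mathbb{E}_{x \sim Q}\left[D_w(x)\right] - \mathbb{E}_{z \sim F}\left[D_w(G_\theta(z))\right] - \lambda\, \mathbb{E}_{\hat{x} \sim Q_\epsilon}\left[\left(\|\nabla_x D_w(\hat{x})\| - 1\right)^2\right]$$
where $Q_\epsilon$ is the distribution of the random vector $\epsilon x + (1 - \epsilon) G_\theta(z)$ when $x \sim Q$ and $z \sim F$. The expectations in the latter can also be replaced with sample estimates in stochastic variants of the training algorithms.
For our simple example, $\nabla_x D_w(x) = w$. Hence, we get the gradient penalty modified WGAN:
$$L_\lambda(\theta, w) = \langle w, v - \theta \rangle - \lambda \left(\|w\| - 1\right)^2 \tag{20}$$
Hence, the gradient of the modified loss function with respect to $\theta$ remains unchanged, but the gradient with respect to $w$ becomes:
$$\nabla_{w,t} = v - \theta_t - 2\lambda \left(\|w_t\| - 1\right) \frac{w_t}{\|w_t\|} \tag{21}$$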
Momentum.
GD with momentum was defined in Equation (18). For the case of the simple illustrative example, these dynamics boil down to:
(22)  
Nesterov momentum.
GD with Nesterov’s momentum was defined in Equation (19). For the illustrative example, Nesterov’s momentum is identical to momentum in the absence of gradient penalty, because the loss function is bilinear. However, with a gradient penalty, Nesterov’s momentum boils down to the following update rule:
(23)  
Asymmetric training.
Another approach to reducing cycling is to train the discriminator more frequently than the generator. Observe that if we could exactly solve the supremum problem of the discriminator after every iteration of the generator, then the generator would simply be solving a convex minimization problem and GD should converge pointwise. This approach could lead to slow convergence given the finiteness of samples in the case of stochastic training. Hence, we cannot really afford to completely solve the discriminator’s problem. However, training the discriminator for multiple iterations brings the problem faced by the generator closer to convex minimization than to solving an equilibrium problem. Hence, asymmetric training could help with cycling. We observe below that asymmetric training is the most effective modification in reducing the range of the cycles, and hence in making the last iterate close to the equilibrium. However, it does not eliminate the cycles; it simply makes their range smaller.
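The asymmetric schedule itself is simple to write down. The sketch below assumes the unpenalized bilinear loss $\langle w, v - \theta \rangle$ for the illustrative example and an alternating schedule in which the generator uses the freshly updated discriminator; the function name, the parameter `n_disc_steps`, and the defaults are hypothetical. It only illustrates the schedule and makes no claim about the resulting cycle range beyond what the text reports.

```python
import numpy as np

def asymmetric_gd(v, theta0, w0, lr=0.1, n_disc_steps=5, n_rounds=100):
    """Run several discriminator ascent steps per generator descent step."""
    v = np.asarray(v, dtype=float)
    theta = np.array(theta0, dtype=float)
    w = np.array(w0, dtype=float)
    thetas = [theta.copy()]
    for _ in range(n_rounds):
        for _ in range(n_disc_steps):
            w = w + lr * (v - theta)     # ascent: grad_w of <w, v - theta>
        theta = theta + lr * w           # descent: grad_theta is -w
        thetas.append(theta.copy())
    return np.array(thetas), w
```

Calling `asymmetric_gd` with different values of `n_disc_steps` and comparing `thetas.max(axis=0) - thetas.min(axis=0)` is one way to measure the range of the orbit under each schedule.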
Appendix C Another Example: Learning a Covariance Matrix
We demonstrate the benefits of using OMD over GD in another simple illustrative example. In this case, the example does not boil down to a bilinear game, and therefore the simulation results show that the theoretical results we provided for bilinear games carry over qualitatively beyond the bilinear case.
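Consider the case where the data distribution is a mean zero multivariate normal with an unknown covariance matrix, i.e., $x \sim N(0, \Sigma)$. We will consider the case where the discriminator is the set of all quadratic functions:
(24) $D_W(x) = \sum_{ij} W_{ij} x_i x_j = x^T W x$
The generator is a linear function of the random input noise $z \sim N(0, I)$, of the form:
(25) $G_V(z) = V z$
The parameters $W$ and $V$ are both $d \times d$ matrices. The WGAN game loss associated with these functions is then:
(26) $L(V, W) = \mathbb{E}_{x \sim N(0, \Sigma)}[x^T W x] - \mathbb{E}_{z \sim N(0, I)}[z^T V^T W V z]$
Expanding the latter we get:
$L(V, W) = \sum_{ij} W_{ij} \left( \Sigma_{ij} - \sum_k V_{ik} V_{jk} \right)$
Given that the covariance matrix is symmetric positive definite, we can write it as $\Sigma = U U^T$. Then the loss simplifies to:
(27) $L(V, W) = \sum_{ij} W_{ij} \sum_k \left( U_{ik} U_{jk} - V_{ik} V_{jk} \right)$
The equilibrium of this game is for the generator to choose $V_{ik} = U_{ik}$ for all $i, k$, and for the discriminator to pick $W_{ij} = 0$. For instance, in the case of a single dimension we have $L(V, W) = W (\sigma^2 - V^2)$, where $\sigma^2$ is the variance of the Gaussian. Hence, the equilibrium is for the generator to pick $V = \sigma$ and the discriminator to pick $W = 0$.
Dynamics without sampling noise.
For the mean GD dynamics the update rules are as follows:
(28) $W_{ij}^{t} = W_{ij}^{t-1} + \eta \left( \Sigma_{ij} - \sum_k V_{ik}^{t-1} V_{jk}^{t-1} \right)$
$V_{ij}^{t} = V_{ij}^{t-1} + \eta \sum_k \left( W_{ik}^{t-1} + W_{ki}^{t-1} \right) V_{kj}^{t-1}$
We can write the latter updates in a simpler matrix form:
(GD for Covariance) $W_t = W_{t-1} + \eta \left( \Sigma - V_{t-1} V_{t-1}^T \right)$
$V_t = V_{t-1} + \eta \left( W_{t-1} + W_{t-1}^T \right) V_{t-1}$
Similarly the OMD dynamics are:
(OMD for Covariance) $W_t = W_{t-1} + 2 \eta \left( \Sigma - V_{t-1} V_{t-1}^T \right) - \eta \left( \Sigma - V_{t-2} V_{t-2}^T \right)$
$V_t = V_{t-1} + 2 \eta \left( W_{t-1} + W_{t-1}^T \right) V_{t-1} - \eta \left( W_{t-2} + W_{t-2}^T \right) V_{t-2}$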
Due to the nonconvexity of the generator’s problem and because there might be multiple optimal solutions (e.g. if the covariance matrix $\Sigma$ is not strictly positive definite), it helps in this setting to add regularization to the loss of the game. The latter simply adds an extra at each gradient term for the discriminator and a at each gradient term for the generator. In Figures 7 and 6 we give the weights and the implied covariance matrix of the generator’s distribution for each of the dynamics for an example setting of the step-size and regularization parameters and for two- and three-dimensional Gaussians respectively. We again see how OMD can stabilize the dynamics to converge pointwise.
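The noiseless dynamics in matrix form can be sketched as follows. The sketch assumes a quadratic discriminator $x^T W x$ and a linear generator $V z$ with standard normal noise, so that the expected loss is $\sum_{ij} W_{ij} (\Sigma - V V^T)_{ij}$; regularization is omitted, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def game_loss(V, W, Sigma):
    """Assumed expected WGAN loss: sum_ij W_ij * (Sigma - V V^T)_ij."""
    return np.sum(W * (Sigma - V @ V.T))

def mean_gradients(V, W, Sigma):
    """Exact gradients of game_loss with respect to V and W."""
    grad_W = Sigma - V @ V.T
    grad_V = -(W + W.T) @ V
    return grad_V, grad_W

def gd_step(V, W, Sigma, lr):
    """Simultaneous GD: V descends on the loss, W ascends."""
    grad_V, grad_W = mean_gradients(V, W, Sigma)
    return V - lr * grad_V, W + lr * grad_W

def omd_step(V, W, V_prev, W_prev, Sigma, lr):
    """Optimistic step: twice the current gradient minus the previous one."""
    grad_V, grad_W = mean_gradients(V, W, Sigma)
    grad_V_prev, grad_W_prev = mean_gradients(V_prev, W_prev, Sigma)
    V_next = V - 2.0 * lr * grad_V + lr * grad_V_prev
    W_next = W + 2.0 * lr * grad_W - lr * grad_W_prev
    return V_next, W_next
```

A training loop would call `omd_step` repeatedly while carrying the previous pair `(V_prev, W_prev)` forward; when the previous and current iterates coincide, the optimistic step reduces to a plain GD step.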
Stochastic dynamics.
In Figures 8 and 9 we also portray the instability of GD and the robust stability of OMD under stochastic dynamics. In the case of stochastic dynamics the gradients are replaced with unbiased estimates or with averages of unbiased estimates over a small minibatch. In the case of a minibatch of one, the unbiased estimates of the gradients in this setting take the following form:
(Stochastic Gradients)  
where are samples drawn from the true distribution and from the random noise distribution respectively. Hence, the stochastic dynamics simply follow by replacing gradients with unbiased estimates:
(SGD for Covariance)  
(SOMD for Covariance)  
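A single-sample version of these dynamics can be sketched as below, under the same assumed setup (quadratic discriminator $x^T W x$, linear generator $V z$, standard normal noise). The estimates are obtained by differentiating the per-sample loss $x^T W x - z^T V^T W V z$; the function names, the zero initialization of the previous gradient, and the defaults are assumptions for illustration.

```python
import numpy as np

def stochastic_gradients(V, W, x, z):
    """Single-sample gradient estimates for one data point x and noise z."""
    g = V @ z                                   # generated sample V z
    grad_W = np.outer(x, x) - np.outer(g, g)
    grad_V = -(W + W.T) @ np.outer(g, z)
    return grad_V, grad_W

def stochastic_dynamics(Sigma, V0, W0, lr=0.02, steps=1000,
                        optimistic=True, seed=0):
    """Run stochastic GD (optimistic=False) or stochastic OMD (True)."""
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    V, W = V0.copy(), W0.copy()
    # Assumption: the "previous" gradient is zero before the first step.
    gV_prev, gW_prev = np.zeros_like(V), np.zeros_like(W)
    for _ in range(steps):
        x = rng.multivariate_normal(np.zeros(d), Sigma)
        z = rng.standard_normal(d)
        gV, gW = stochastic_gradients(V, W, x, z)
        if optimistic:
            V = V - 2.0 * lr * gV + lr * gV_prev
            W = W + 2.0 * lr * gW - lr * gW_prev
        else:
            V = V - lr * gV
            W = W + lr * gW
        gV_prev, gW_prev = gV, gW
    return V, W
```

Comparing `V @ V.T` against `Sigma` after running both settings of `optimistic` is one way to inspect how far each variant ends from the target covariance.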
Appendix D Last Iterate Convergence of OMD in Bilinear Case
The goal of this section is to show that Optimistic Mirror Descent exhibits last iterate convergence to min-max solutions for bilinear functions. In Section D.1, we provide the proof of Theorem 1, that OMD exhibits last iterate convergence to min-max solutions of the following min-max problem
(29) $\inf_x \sup_y \; x^T A y$
where $A$ is an arbitrary matrix and $x$ and $y$ are unconstrained. In Section D.2, we state the appropriate extension of our theorem to the general case:
(30) 
D.1 Proof of Theorem 1
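As stated in Section 4, for the min-max problem (29) Optimistic Mirror Descent takes the following form, for all $t \geq 1$:
(31) $x_t = x_{t-1} - 2 \eta A y_{t-1} + \eta A y_{t-2}$
(32) $y_t = y_{t-1} + 2 \eta A^T x_{t-1} - \eta A^T x_{t-2}$
where for the above iterations to be meaningful we need to specify $x_0, x_{-1}, y_0, y_{-1}$.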
As stated in Section 4 we allow any initialization , and , and set and , where represents column space. In particular, our initialization means that the first step taken by the dynamics gives and .
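The contrast between the two dynamics on a bilinear problem can be reproduced with a few lines of NumPy. The sketch below assumes the objective $x^T A y$ with $x$ minimizing and $y$ maximizing, and it simply sets the previous iterate equal to the initial one, which is a simplification and not necessarily the initialization used in the proof; function names, step size, and iteration count are illustrative.

```python
import numpy as np

def omd_bilinear(A, x0, y0, lr=0.1, steps=5000):
    """Optimistic updates for min_x max_y x^T A y (unconstrained)."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    x_prev, y_prev = x.copy(), y.copy()   # assumed initialization
    for _ in range(steps):
        x_next = x - 2.0 * lr * (A @ y) + lr * (A @ y_prev)
        y_next = y + 2.0 * lr * (A.T @ x) - lr * (A.T @ x_prev)
        x_prev, y_prev, x, y = x, y, x_next, y_next
    return x, y

def gd_bilinear(A, x0, y0, lr=0.1, steps=5000):
    """Simultaneous gradient descent/ascent on the same objective."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for _ in range(steps):
        x, y = x - lr * (A @ y), y + lr * (A.T @ x)
    return x, y

A = np.array([[1.0, 0.5], [0.0, 1.0]])    # invertible, so (0, 0) is the solution
x_omd, y_omd = omd_bilinear(A, [1.0, 1.0], [1.0, -1.0])
x_gd, y_gd = gd_bilinear(A, [1.0, 1.0], [1.0, -1.0])
omd_dist = np.linalg.norm(np.concatenate([x_omd, y_omd]))
gd_dist = np.linalg.norm(np.concatenate([x_gd, y_gd]))
```

With this step size, `omd_dist` ends up close to zero while `gd_dist` exceeds the initial distance of 2, since each simultaneous GD step strictly increases the distance to the solution of a bilinear game.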
Before giving our proof of Theorem 1, we need some further notation. For all , we set:
With this notation, , , , etc.
We also use the notation , for vectors and square matrices . We similarly define the norm notation . Given our notation, we have the following claim, shown in Appendix D.3.
Claim 1.
For all matrices and vectors of the appropriate dimensions:
With our notation in place, we show (through iterated expansion of the update rule), the following lemma, proved in Appendix D.3:
Lemma 2.
We are ready to prove Theorem 1. Its proof is implied by the following stronger theorem, and Corollary 7.