Reducing Noise in GAN Training with Variance Reduced Extragradient
Abstract
Using large mini-batches when training generative adversarial networks (GANs) has been recently shown to significantly improve the quality of the generated samples. This can be seen as a simple but computationally expensive way of reducing the noise of the gradient estimates. In this paper, we investigate the effect of this noise and show that it can prevent the convergence of standard stochastic game optimization methods, while their respective batch versions converge. To address this issue, we propose a variance-reduced version of the stochastic extragradient algorithm (SVRE). We show experimentally that it performs similarly to a batch method while being computationally cheaper, and we prove its theoretical convergence, improving upon the best rates proposed in the literature. Experiments on several datasets show that SVRE improves over baselines. Notably, to our knowledge SVRE is the first optimization method for GANs that can produce near state-of-the-art results without using an adaptive step size such as Adam.
Tatjana Chavdarova*¹ ², Gauthier Gidel*¹, François Fleuret², Simon Lacoste-Julien¹ ³
¹ Mila & DIRO, Université de Montréal; ² École Polytechnique Fédérale de Lausanne and Idiap Research Institute; ³ CIFAR fellow, Canada CIFAR AI chair (* equal contribution)
Correspondence to: Tatjana Chavdarova and Gauthier Gidel, firstname.lastname@umontreal.ca
Keywords: Machine Learning, ICML
1 Introduction
The current success of largescale machine learning algorithms in large part relies on incremental gradientbased optimization methods to minimize empirical losses. These iterative methods handle large training datasets by computing estimates of the said gradient on minibatches instead of using all the samples at every step, resulting in a method called stochastic gradient descent (SGD) (Robbins and Monro, 1951; Bottou, 2010).
While this method is remarkably efficient for classical loss minimization, such as cross-entropy for classification or squared loss for regression, recent works go beyond that setup and aim at making several agents with competing objectives interact. The associated optimization paradigm requires a multi-objective joint minimization.
One very popular class of models in that family are the generative adversarial networks (GANs, Goodfellow et al., 2014), which aim at finding a Nash equilibrium of a twoplayer minimax game, where the players are deep neural networks (DNNs).
1.1 Failure of SGD on multiobjective problems
Due to their success on supervised tasks, SGD-based algorithms have been adopted for GAN training as well. However, convergence failures, poor performance (sometimes referred to as “mode collapse”), or sensitivity to hyperparameters are reported much more commonly than in classical supervised DNN optimization.
Recent works (Li et al., 2018; Gidel et al., 2019) argue that the currently used first-order methods may fail to converge on simple examples. Gidel et al. (2019) proposed instead to use an optimization technique from the variational inequality literature called extragradient (Korpelevich, 1976), which has provable convergence guarantees for games. However, we argue that the multi-objective minimization formulation of GANs introduces new challenges in terms of optimization. In particular, the noise due to stochasticity may break standard optimization techniques for stochastic games such as the stochastic extragradient method: we provide an example of a stochastic bilinear game for which it provably does not converge.
This theoretical consideration is further supported empirically by the fact that using larger mini-batch sizes for GAN training has been shown to considerably improve the quality of the samples produced by the resulting generative model. More precisely, Brock et al. (2019) report a significant relative improvement of the Inception Score metric (see § 5) on ImageNet when the batch size is increased several-fold. Nevertheless, this comes at the cost of an increase in the required computational budget that is prohibitively expensive for most academic researchers in the machine learning community.
1.2 Our contributions
In this paper, we investigate the interplay between noise and multiobjective problems in the context of GAN training, and propose the novel “stochastic variance reduced extragradient” (SVRE) algorithm.
Our contributions can be summarized as follows:

- We show in a motivating example how noise can make standard stochastic extragradient fail (see § 3.1).
- We propose a new method that combines variance reduction and extrapolation (SVRE) and show experimentally that SVRE effectively reduces the noise on real-world datasets (see § 4.2).
- We demonstrate experimentally the performance of SVRE for training GANs on MNIST, SVHN, ImageNet, and CIFAR-10 with fixed step sizes. To our knowledge, SVRE is the first optimization method that can produce near state-of-the-art results without using an adaptive step size such as Adam (Kingma and Ba, 2015) (see § 5).
Table 1: Comparison of variance-reduced methods for games in terms of convergence complexity and adaptivity to local strong monotonicity: SVRG and accelerated SVRG are not adaptive (✗), whereas SVRE is (✓). The complexities are discussed in § 4 and Thm. 3.
2 GANs as a game
The models/players in a GAN are respectively a generator $G$, which maps an embedding space to the signal space and should eventually map a fixed noise distribution to the training data distribution, and a discriminator $D$, whose only role is to allow the training of the generator by classifying genuine samples against generated ones. The central idea is that as long as $D$ is able to do better than random, $G$ is not properly modeling the data.
In this algorithm, at each iteration of the stochastic gradient descent (SGD), the discriminator is updated to improve its “real vs. generated” classification performance, and the generator is updated to degrade it.
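For concreteness, the original GAN objective of Goodfellow et al. (2014) instantiates this two-player scheme as the minimax problem

```latex
\min_{G}\,\max_{D}\;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big],
```

where $D$ is trained to classify real against generated samples and $G$ is trained to fool it.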
Game theory formulation of GANs
From a game theory point of view, GAN training is a differentiable two-player game: the discriminator $D_\varphi$ aims at minimizing its cost function $\mathcal{L}^{(\varphi)}$ and the generator $G_\theta$ aims at minimizing its own cost function $\mathcal{L}^{(\theta)}$. Using the same formulation as Mescheder et al. (2017) and Gidel et al. (2019), the GAN objective has the following form,

$$\theta^\star \in \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}^{(\theta)}(\theta, \varphi^\star) \quad \text{and} \quad \varphi^\star \in \operatorname*{arg\,min}_{\varphi \in \Phi} \mathcal{L}^{(\varphi)}(\theta^\star, \varphi)\,. \tag{2PG}$$
When $\mathcal{L}^{(\theta)} = -\mathcal{L}^{(\varphi)} \eqqcolon \mathcal{L}$, this game is called a zero-sum game, and (2PG) can be formulated as a minimax problem:

$$\min_{\theta \in \Theta} \max_{\varphi \in \Phi} \mathcal{L}(\theta, \varphi)\,. \tag{SP}$$
The gradient method is known to not converge for some convex-concave examples (Gidel et al., 2019). To fix this issue, Korpelevich (1976) proposed to use the extragradient method:³

$$\begin{aligned} \text{Extrapolation:}\quad &\theta_{t+1/2} = \theta_t - \eta \nabla_\theta \mathcal{L}^{(\theta)}(\theta_t, \varphi_t)\,, && \varphi_{t+1/2} = \varphi_t - \eta \nabla_\varphi \mathcal{L}^{(\varphi)}(\theta_t, \varphi_t)\,, \\ \text{Update:}\quad &\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}^{(\theta)}(\theta_{t+1/2}, \varphi_{t+1/2})\,, && \varphi_{t+1} = \varphi_t - \eta \nabla_\varphi \mathcal{L}^{(\varphi)}(\theta_{t+1/2}, \varphi_{t+1/2})\,. \end{aligned} \tag{EG}$$

³ For simplicity of presentation, we describe the algorithms for the unconstrained setting. In the constrained scenario, a Euclidean projection onto the constraint set should be added at every update of the extragradient method. This more general version (25) is the one analyzed in the appendix.
This method performs a lookahead step in order to get signal from an extrapolated point, damping the oscillations.
It can be shown that, in the context of a zero-sum game, for any convex-concave function $\mathcal{L}$ and any closed convex sets $\Theta$ and $\Phi$, the extragradient method does converge. We give below such a convergence result, without rates for simplicity.

Theorem 1 (Korpelevich, 1976). If $\mathcal{L}$ is convex-concave, $\Theta$ and $\Phi$ are closed convex sets, and a solution of (SP) exists, then the iterates of the extragradient method (EG) with a small enough step size converge to a solution of (SP).
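As a small illustration of why the extrapolation step matters, the following sketch (our own toy example, not from the paper's experiments; the step size and iteration count are arbitrary) runs the simultaneous gradient method and extragradient on the scalar bilinear game $\min_\theta \max_\varphi \theta\varphi$:

```python
def simultaneous_gradient_step(theta, phi, eta):
    # Both players step at once using gradients at the *current* point.
    return theta - eta * phi, phi + eta * theta

def extragradient_step(theta, phi, eta):
    # Extrapolation: a lookahead step using gradients at the current point.
    theta_e, phi_e = theta - eta * phi, phi + eta * theta
    # Update: step from the current point using gradients at the extrapolated point.
    return theta - eta * phi_e, phi + eta * theta_e

theta_g = phi_g = theta_x = phi_x = 1.0
for _ in range(100):
    theta_g, phi_g = simultaneous_gradient_step(theta_g, phi_g, eta=0.1)
    theta_x, phi_x = extragradient_step(theta_x, phi_x, eta=0.1)

print(theta_g**2 + phi_g**2)  # grows: the gradient method spirals away
print(theta_x**2 + phi_x**2)  # shrinks toward the equilibrium (0, 0)
```

Each gradient step multiplies the squared distance to the equilibrium by exactly $1 + \eta^2$, while each extragradient step multiplies it by $(1-\eta^2)^2 + \eta^2 < 1$, which is the damping effect described above.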
3 Noise in Games
In standard large-scale machine learning applications such as GANs, one cannot afford to compute the full-batch gradient of the objective function at each time step because the dataset is too large. The standard way to cope with this issue is to sample mini-batches of reasonable size and to use the gradient estimated on each mini-batch as an unbiased estimator of the “full” gradient. Unfortunately, the resulting noise in the gradient estimate may interact with the oscillations due to the adversarial component of the game.
We illustrate this phenomenon in Fig. 1 by contrasting the direction given by the noisy gradient on the following game and minimization problem, respectively:

$$\min_{\theta} \max_{\varphi}\; \theta^\top \varphi \qquad \text{and} \qquad \min_{\theta, \varphi}\; \tfrac{1}{2}\|\theta\|^2 + \tfrac{1}{2}\|\varphi\|^2\,. \tag{1}$$
Since the batch version of the gradient method (updating both players simultaneously or alternately) fails to converge for some convex (bilinear) games, there is no hope of convergence for its stochastic version. However, since (EG) does converge (Thm. 1), we could reasonably expect that its stochastic version does too (at least to a neighborhood). In the following section, we show that even under reasonable assumptions this assertion is false: we present a simple example on which the extragradient method converges linearly (Gidel et al., 2019, Corollary 1) using the full gradient, but diverges geometrically when using a stochastic estimate of it.⁴

⁴ On this example, standard gradient methods diverge.
3.1 Stochasticity Breaks Extragradient
In the following, we show that (a) on one hand, if we use standard stochastic estimates of the gradients of a simple finite-sum objective, then the iterates produced by the stochastic extragradient method (SEG) diverge geometrically, and (b) on the other hand, Theorem 1 ensures that the full-batch extragradient method does converge to the Nash equilibrium of this game. All detailed proofs can be found in § B.
Theorem 2 (Noise may induce divergence).
There exists a zero-sum stochastic game such that, if the mini-batch size $B$ satisfies $2B \le n$, then for any step size $\eta > 0$, the iterates $(\theta_t, \varphi_t)$ computed by the stochastic extragradient method diverge geometrically, i.e., there exists $\rho > 1$ such that $\mathbb{E}\big[\|\theta_t\|^2 + \|\varphi_t\|^2\big] \ge \rho^t \big(\|\theta_0\|^2 + \|\varphi_0\|^2\big)$.
Proof sketch.
We consider the following stochastic optimization problem,

$$\min_{\theta \in \mathbb{R}^n} \max_{\varphi \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^{n} \theta^\top A_i \varphi\,, \tag{2}$$

where $[A_i]_{kl} = n$ if $k = l = i$ and $0$ otherwise. Note that $\frac{1}{n}\sum_{i=1}^{n} A_i = I_n$, so this problem is a simple dot product between $\theta$ and $\varphi$; thus we can compute the batch gradient and notice that the Nash equilibrium of this problem is $\theta^\star = \varphi^\star = 0$. However, as we will see below, this simple problem actually breaks standard stochastic optimization methods.
Sampling a mini-batch $I \subset \{1, \ldots, n\}$ without replacement, we denote $A_I \coloneqq \frac{1}{|I|}\sum_{i \in I} A_i$. The extragradient update rule can be written as

$$\theta_{t+1} = \theta_t - \eta A_I \varphi_{t+1/2}\,, \quad \varphi_{t+1} = \varphi_t + \eta A_I^\top \theta_{t+1/2}\,, \quad \text{where} \quad \theta_{t+1/2} = \theta_t - \eta A_J \varphi_t\,, \quad \varphi_{t+1/2} = \varphi_t + \eta A_J^\top \theta_t\,, \tag{3}$$

and where $I$ and $J$ are the mini-batches respectively sampled for the update and the extrapolation step. Let us write $N_t \coloneqq \|\theta_t\|^2 + \|\varphi_t\|^2$. Noticing that $[A_I]_{kk} = n/|I|$ if $k \in I$ and $0$ otherwise, we have,

$$\mathbb{E}[N_{t+1}] = \Big(1 + \eta^2 \big(\tfrac{n}{B} - 2\big) + \eta^4 \tfrac{n^2}{B^2}\Big) N_t\,, \qquad \text{where } B \coloneqq |I| = |J|.$$

Consequently, if the mini-batch size is smaller than half of the dataset size, i.e., $2B \le n$, we have $\mathbb{E}[N_{t+1}] \ge \big(1 + \eta^4 \tfrac{n^2}{B^2}\big) N_t > N_t$. ∎
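The divergence in the proof sketch can be checked numerically; the sketch below (our own verification, with arbitrary choices $n = 10$, $B = 2$, $\eta = 0.1$) simulates SEG on problem (2) alongside full-batch extragradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B, eta, T = 10, 2, 0.1, 300   # dataset size, mini-batch size, step size, iterations

def A_mb(idx):
    # Mini-batch coupling matrix A_I = (1/|I|) sum_{i in I} A_i with A_i = n e_i e_i^T,
    # so that the full average (1/n) sum_i A_i is the identity.
    A = np.zeros((n, n))
    A[idx, idx] = n / len(idx)
    return A

def eg_step(theta, phi, A_extra, A_upd):
    theta_e = theta - eta * A_extra @ phi           # extrapolation step
    phi_e = phi + eta * A_extra @ theta
    return theta - eta * A_upd @ phi_e, phi + eta * A_upd @ theta_e  # update step

th_s, ph_s = np.ones(n), np.ones(n)                 # stochastic extragradient (SEG)
th_b, ph_b = np.ones(n), np.ones(n)                 # full-batch extragradient
for _ in range(T):
    J = rng.choice(n, size=B, replace=False)        # extrapolation mini-batch
    I = rng.choice(n, size=B, replace=False)        # update mini-batch
    th_s, ph_s = eg_step(th_s, ph_s, A_mb(J), A_mb(I))
    th_b, ph_b = eg_step(th_b, ph_b, np.eye(n), np.eye(n))

print(np.sum(th_s**2 + ph_s**2))  # SEG: grows, since 2B <= n
print(np.sum(th_b**2 + ph_b**2))  # batch EG: shrinks toward the equilibrium at 0
```

The same code with $2B > n$ (e.g., $B = 6$) removes the divergence, matching the condition in Theorem 2.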
This result may seem to contradict the standard result on SEG (Juditsky et al., 2011), which states that the average of the iterates computed by SEG does converge to the Nash equilibrium of the game. However, one fundamental assumption made by Juditsky et al. is that the estimator of the gradient has a finite variance. This assumption does not hold in our example, since the variance of the estimator grows with the norm of the parameters.
Thus, constraining the optimization problem (2) to bounded domains $\Theta$ and $\Phi$,

$$\min_{\theta \in \Theta} \max_{\varphi \in \Phi} \frac{1}{n} \sum_{i=1}^{n} \theta^\top A_i \varphi\,, \tag{4}$$

would make the finite-variance assumption from Juditsky et al. (2011) hold. Consequently, the averaged iterate would converge to the Nash equilibrium of (4). However, we argue in the next section that such a result may not be satisfying for non-convex problems.
3.2 Why is convergence of the last iterate preferable?
In the light of Theorem 2, the behavior of the iterates on the constrained problem (4) is the following: they diverge until they reach the boundary of $\Theta$ and $\Phi$, and then they cycle around the Nash equilibrium of (4) along these boundaries. Using convexity properties, one can then show that the averaged iterates converge to the Nash equilibrium of the problem. However, with an arbitrarily large domain, this convergence may be arbitrarily slow (since the rate depends on the diameter of the domain).
Moreover, this behavior might be even more problematic in a non-convex framework: even if by chance we initialize close to the Nash equilibrium, the iterates would move away from it, and we cannot rely on convexity to expect the average of the iterates to converge.
Consequently, we would like optimization algorithms generating iterates that stay close to the Nash equilibrium.
4 Reducing Noise with VR Methods
One straightforward way to reduce the noise in the estimation of the gradient is to use mini-batches of samples instead of a single sample. However, mini-batch stochastic extragradient fails to converge on (4) whenever the mini-batch size is smaller than half of the dataset size (see § B.1 for more details). In order to get an estimator of the gradient with a vanishing variance, the optimization literature proposes to take advantage of the finite-sum formulation that often appears in machine learning (Schmidt et al., 2017, and references therein).
4.1 Variance Reduced Methods
Motivated by the GAN setup, let us assume that the objectives in (2PG) can be decomposed as finite sums such that,

$$\mathcal{L}^{(\theta)}(\theta, \varphi) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}^{(\theta)}_i(\theta, \varphi) \quad \text{and} \quad \mathcal{L}^{(\varphi)}(\theta, \varphi) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}^{(\varphi)}_i(\theta, \varphi)\,. \tag{5}$$
Johnson and Zhang (2013) propose the “stochastic variance reduced gradient” (SVRG) as an unbiased estimator of the gradient with a smaller variance than the vanilla mini-batch estimate. The idea is to occasionally take a snapshot $(\tilde\theta, \tilde\varphi)$ of the current model's parameters, and to store the full gradient at this snapshot. Computing the full gradient at $(\tilde\theta, \tilde\varphi)$ is an expensive operation, but not a prohibitive one if it is done infrequently enough (for instance, once per pass over the dataset).
Assuming that we have stored $(\tilde\theta, \tilde\varphi)$ and the full gradients $\mu^{(\theta)} \coloneqq \nabla_\theta \mathcal{L}^{(\theta)}(\tilde\theta, \tilde\varphi)$ and $\mu^{(\varphi)} \coloneqq \nabla_\varphi \mathcal{L}^{(\varphi)}(\tilde\theta, \tilde\varphi)$, unbiased estimates of the respective gradients of $\mathcal{L}^{(\theta)}$ and $\mathcal{L}^{(\varphi)}$ are,

$$d_\theta(\theta, \varphi) \coloneqq \mu^{(\theta)} + \frac{1}{n p_i}\Big(\nabla_\theta \mathcal{L}^{(\theta)}_i(\theta, \varphi) - \nabla_\theta \mathcal{L}^{(\theta)}_i(\tilde\theta, \tilde\varphi)\Big)\,, \tag{6}$$

$$d_\varphi(\theta, \varphi) \coloneqq \mu^{(\varphi)} + \frac{1}{n p_i}\Big(\nabla_\varphi \mathcal{L}^{(\varphi)}_i(\theta, \varphi) - \nabla_\varphi \mathcal{L}^{(\varphi)}_i(\tilde\theta, \tilde\varphi)\Big)\,. \tag{7}$$

Indeed, $\mathbb{E}_i[d_\theta] = \nabla_\theta \mathcal{L}^{(\theta)}(\theta, \varphi)$, where the expectation is taken over $i$, picked with probability $p_i$. We call this unbiased estimate of the gradient the SVRG estimate. SVRG is an epoch-based algorithm: an epoch is the inner loop which incrementally updates the parameters using the SVRG estimate between two snapshots, which are updated in the outer loop.
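For intuition, here is a minimal sketch of the SVRG estimate with uniform sampling ($p_i = 1/n$) on a toy least-squares objective (the data, dimensions, and variable names are arbitrary stand-ins, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_i(w, i):
    # Per-sample gradient of f_i(w) = 1/2 (x_i^T w - y_i)^2.
    return X[i] * (X[i] @ w - y[i])

def full_grad(w):
    return X.T @ (X @ w - y) / n

w_snap = rng.normal(size=d)                 # snapshot parameters
mu = full_grad(w_snap)                      # full gradient stored at the snapshot
w = w_snap + 0.01 * rng.normal(size=d)      # current point, close to the snapshot

sgd = np.array([grad_i(w, i) for i in range(n)])                           # plain estimates
svrg = np.array([grad_i(w, i) - grad_i(w_snap, i) + mu for i in range(n)])  # SVRG estimates

# Both estimators average to the true gradient (unbiased), but the SVRG
# estimate has a much smaller variance when w is close to the snapshot.
print(sgd.var(axis=0).sum(), svrg.var(axis=0).sum())
```

The correction term $\nabla f_i(w) - \nabla f_i(\tilde w)$ vanishes as $w \to \tilde w$, which is what makes the variance of the estimate vanishing rather than merely bounded.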
Non-uniform sampling probabilities are used to balance the samples despite large variations of the Lipschitz constants of the gradient estimates. This strategy was first introduced for variance-reduced methods by Xiao and Zhang (2014) for SVRG, and has been discussed for saddle-point optimization by Palaniappan and Bach (2016).
Originally, SVRG was introduced as an epoch-based algorithm with a fixed epoch size: in Alg. 1, one epoch is an inner loop of fixed length (Line 7). However, Hofmann et al. (2015) proposed instead to sample the size of each epoch from a geometric distribution. Doing so, Hofmann et al. defined a notion of memorization algorithm that unifies SAGA, SVRG, and other variants of variance reduction for incremental gradient methods with similar convergence rates. We generalize their notion of memorization algorithm to handle the extrapolation step (EG) and provide a convergence proof for such memorization algorithms in § B.2.
One advantage of Hofmann et al. (2015)'s framework is that the sampling of the epoch size does not depend on the condition number of the problem, whereas the original proof for SVRG had to consider an epoch size larger than the condition number (see Leblond et al. (2018, Corollary 16) for a detailed discussion of the convergence rate of SVRG). Thus, this version of SVRG with a random epoch size becomes adaptive to the local strong convexity, since none of its hyperparameters depend on the strong convexity constant.
However, because of some technical aspects introduced with monotone operators, Palaniappan and Bach (2016)’s proofs (both for SAGA and SVRG) require a stepsize that depends on the strong monotonicity constant making these algorithms not adaptive to local strong monotonicity. This motivates the proposed SVRE algorithm, which is adaptive to local strong monotonicity, and is thus more appropriate for nonconvex optimization.
4.2 SVRE: Variance Reduced Extragradient
We describe our proposed algorithm, stochastic variance reduced extragradient (SVRE), in Alg. 1. In a manner analogous to how Palaniappan and Bach (2016) combined SVRG with the gradient method to solve games, SVRE combines the SVRG estimates of the gradient (6–7) with the extragradient method (EG). While the algorithmic proposal is simple, the proof of convergence is non-trivial. Moreover, with this method we are able to improve the best known convergence rates for variance-reduced methods for stochastic games (Table 1 and Thm. 3), and we show in § 4.3 that it is the only method that empirically converges on the simple example of § 3.1. We now describe the theoretical setup for the convergence result.
A standard assumption in convex optimization is strong convexity of the objective. However, in a game, the operator

$$F(\theta, \varphi) \coloneqq \begin{pmatrix} \nabla_\theta \mathcal{L}^{(\theta)}(\theta, \varphi) \\ \nabla_\varphi \mathcal{L}^{(\varphi)}(\theta, \varphi) \end{pmatrix} \tag{8}$$

associated with the updates is no longer the gradient of a single function. In order to make a similar assumption in the context of games, the optimization literature considers the notion of strong monotonicity.
Definition 1.
An operator $F \colon \mathbb{R}^{p+d} \to \mathbb{R}^{p+d}$ is said to be $(\mu_\theta, \mu_\varphi)$-strongly monotone if, for all $\omega, \omega' \in \mathbb{R}^{p+d}$,

$$\langle F(\omega) - F(\omega'),\, \omega - \omega' \rangle \;\ge\; \|\omega - \omega'\|_\mu^2\,,$$

where we write $\omega \coloneqq (\theta, \varphi)$. A monotone operator is a $(0,0)$-strongly monotone operator.
This definition is a generalization of strong convexity to operators: if $f$ is $\mu$-strongly convex, then $\nabla f$ is a $\mu$-strongly monotone operator. Note that, in this definition, we used the same weighted Euclidean norm,

$$\|\omega\|_\mu^2 \coloneqq \mu_\theta \|\theta\|^2 + \mu_\varphi \|\varphi\|^2\,, \tag{9}$$

as Palaniappan and Bach (2016). They point out that this rescaling of the Euclidean norm is crucial in order to balance the respective players' objectives and to get better constants in the convergence result.
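A quick numeric illustration of Definition 1 (our own toy check, not from the paper): the operator of the bilinear game from § 3.1 is monotone but not strongly monotone, whereas the gradient of a strongly convex quadratic is strongly monotone:

```python
import numpy as np

rng = np.random.default_rng(3)

def F_bilinear(w):
    # Operator (8) for the bilinear game min_th max_ph th^T ph, with w = (th, ph):
    # grad_th (th^T ph) = ph, and grad_ph (-th^T ph) = -th.
    th, ph = np.split(w, 2)
    return np.concatenate([ph, -th])

w1, w2 = rng.normal(size=6), rng.normal(size=6)
gap_bilinear = np.dot(F_bilinear(w1) - F_bilinear(w2), w1 - w2)
gap_quadratic = np.dot(w1 - w2, w1 - w2)   # F(w) = w, the gradient of 1/2 ||w||^2

print(gap_bilinear)   # 0: monotone, but with no strong-monotonicity slack
print(gap_quadratic)  # = ||w1 - w2||^2 > 0: 1-strongly monotone
```

The vanishing inner product for the bilinear operator is exactly why the examples of § 3.1 lie outside the strongly monotone regime covered by Theorem 3.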
Assumption 1.
For $1 \le i \le n$, the functions $\mathcal{L}^{(\theta)}_i$ and $\mathcal{L}^{(\varphi)}_i$ are respectively $L_\theta$-smooth and $L_\varphi$-smooth, and the associated game operator (8) is $(\mu_\theta, \mu_\varphi)$-strongly monotone.
A function $f$ is said to be $L$-smooth if its gradient is $L$-Lipschitz. Under this smoothness assumption on each cost function of the game operator, we can define a smoothness constant adapted to the non-uniform sampling scheme of our stochastic algorithm:

$$\bar L \coloneqq \max_{1 \le i \le n} \frac{L_i}{n\, p_i}\,. \tag{10}$$

The standard uniform sampling scheme corresponds to $p_i = 1/n$, and the optimal non-uniform sampling scheme corresponds to $p_i = L_i / \sum_{j=1}^n L_j$. We always have the bounds:

$$\frac{1}{n}\sum_{i=1}^{n} L_i \;\le\; \bar L \;\le\; \max_{1 \le i \le n} L_i\,. \tag{11}$$
We now present our convergence result for SVRE with nonuniform sampling (to make our constants more comparable to those of Palaniappan and Bach (2016)), but note that we have used uniform sampling in all our experiments (for simplicity).
Theorem 3.
We prove this theorem in § B.2. Note that only the respective condition numbers of $\mathcal{L}^{(\theta)}$ and $\mathcal{L}^{(\varphi)}$, defined as $\kappa_\theta \coloneqq \bar L_\theta / \mu_\theta$ and $\kappa_\varphi \coloneqq \bar L_\varphi / \mu_\varphi$, appear in our convergence rate. The convergence rate of the (non-accelerated) algorithm of Palaniappan and Bach (2016) depends on the product $\kappa_\theta \kappa_\varphi$ of these condition numbers. They avoid a dependence on the maximum of the condition numbers squared by using the weighted Euclidean norm defined in (9) and rescaling the functions $\mathcal{L}^{(\theta)}$ and $\mathcal{L}^{(\varphi)}$ with their strong-monotonicity constants. However, this rescaling trick suffers from two issues: (i) in practice, we do not know a good estimate of the strong-monotonicity constant, which was not the case in Palaniappan and Bach (2016)'s application; (ii) the algorithm does not adapt to (larger) local strong monotonicity. The latter property is fundamental in non-convex optimization, since we want the algorithm to exploit the (potential) local stability properties of a stationary point.
4.3 Motivating example
The example presented in (2) appears to be challenging in the stochastic setting, since even the stochastic extragradient method fails to find the Nash equilibrium of this game. This problem is actually an unconstrained version of (Gidel et al., 2019, Fig. 3). To explore how SVRE behaves on this problem, we tested its empirical performance on the finite-sum objective

$$\min_{\theta \in \mathbb{R}^n} \max_{\varphi \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^{n} \theta^\top A_i \varphi\,, \quad \text{where } [A_i]_{kl} = n \text{ if } k = l = i \text{ and } 0 \text{ otherwise}. \tag{12}$$
We compared the following algorithms (all using uniform sampling): (i) AltSGD: the standard stochastic gradient method, alternating the updates of the two players — the standard method to train GANs; (ii) SVRE: the algorithm presented in this paper (Alg. 1). The Avg prefix denotes the uniform average of the iterates, $\bar\omega_t \coloneqq \frac{1}{t}\sum_{s=1}^{t}\omega_s$.

We observe in Fig. 2 that AvgSVRE converges sublinearly (whereas AvgAltSGD fails to converge). This motivates a new variant of SVRE based on the ideas developed in § 3.2: even if the averaged iterate converges, we do not compute the gradient at that point, and thus do not benefit from the fact that this iterate is closer to the optimum. The idea is hence to occasionally restart the algorithm, i.e., to consider the averaged iterate as the new starting point and compute the gradient there. Restarting combines well with SVRG, since we already occasionally pause the inner loop to recompute a snapshot; we use this pause to decide (with a fixed probability) whether or not to restart the algorithm, taking the snapshot at the averaged iterate $\bar\omega_t$ instead of the last iterate $\omega_t$. This variant of SVRE is described in Alg. 3 in § D, and the variant combining VRAd is described in § C.1.
In Fig. 2 we observe that the only methods that converge are SVRE and its variants. Note that we do not provide convergence guarantees for Alg. 3 and leave its analysis for future work. Nevertheless, to our knowledge, this is the only stochastic algorithm (excluding batch extragradient, which is not a stochastic algorithm) that converges on (2). Note also that we tried all the algorithms presented in Fig. 3 of Gidel et al. (2019) on this unconstrained problem and that all of them diverge.
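To make the mechanism concrete, the following sketch implements an SVRE-style loop on the toy problem (12), with simplifying assumptions of our own: uniform sampling, a fixed epoch length equal to $n$, and an arbitrary step size. Unlike SEG, the iterates contract toward the equilibrium:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eta, epochs = 10, 0.01, 5000   # problem size, step size, number of snapshot epochs

def F_i(th, ph, i):
    # i-th component of the game operator for (12): with A_i = n e_i e_i^T,
    # only coordinate i of each player receives a (scaled-up) signal.
    g_th, g_ph = np.zeros(n), np.zeros(n)
    g_th[i], g_ph[i] = n * ph[i], -n * th[i]
    return g_th, g_ph

def svrg_op(th, ph, i, snap, mu):
    # SVRG estimate of the full operator: full operator at the snapshot plus
    # a per-sample correction, cf. (6)-(7) with uniform sampling.
    g_th, g_ph = F_i(th, ph, i)
    s_th, s_ph = F_i(snap[0], snap[1], i)
    return mu[0] + g_th - s_th, mu[1] + g_ph - s_ph

th, ph = np.ones(n), np.ones(n)
for _ in range(epochs):
    snap = (th.copy(), ph.copy())
    mu = (snap[1].copy(), -snap[0].copy())   # full operator (1/n) sum_i F_i at snapshot
    for _ in range(n):                       # one inner epoch between snapshots
        d_th, d_ph = svrg_op(th, ph, rng.integers(n), snap, mu)
        th_e, ph_e = th - eta * d_th, ph - eta * d_ph             # extrapolation
        d_th, d_ph = svrg_op(th_e, ph_e, rng.integers(n), snap, mu)
        th, ph = th - eta * d_th, ph - eta * d_ph                 # update

N_final = np.sum(th**2 + ph**2)
print(N_final)  # decreases from the initial value of 20 toward the equilibrium
```

Right after a snapshot the estimate coincides with the full operator, so the loop behaves like exact extragradient there, and the per-sample correction keeps the noise proportional to the distance from the snapshot.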
5 Experiments
Datasets.
To evaluate SVRE, we used the following datasets: (i) MNIST (LeCun and Cortes), (ii) CIFAR-10 (Krizhevsky, 2009, §3), (iii) SVHN (Netzer et al., 2011), and (iv) ImageNet ILSVRC 2012 (Russakovsky et al., 2015); see § E for the image resolution used for each dataset.
Metrics.
To evaluate the performance of SVRE, we conducted experiments on image synthesis with GANs and used the Inception score (IS, Salimans et al., 2016) and the Fréchet Inception distance (FID, Heusel et al., 2017) as performance metrics. IS takes into account both the sample “diversity” and how realistic the samples look, by considering, respectively, the class distribution and the prediction confidence of a trained Inception network (Szegedy et al., 2015). FID, on the other hand, compares the synthetic images with those of the training dataset by embedding large samples of both in a lower-dimensional space using a trained Inception network. To assess whether SVRE indeed reduces the variance of the gradient estimates over the iterations, we used the second-moment estimate (uncentered variance, herein denoted SME), computed with an exponential moving average. SME is computed for each parameter as done in Adam (Kingma and Ba, 2015), and we plot the value averaged over all parameters. We refer the reader to § E for details on these metrics.
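The SME statistic can be sketched as follows (a minimal illustration with hypothetical gradient values; $\beta_2 = 0.999$ as in Adam's defaults):

```python
import numpy as np

def sme_update(v, grads, beta2=0.999):
    # Adam-style second-moment estimate: an exponential moving average of the
    # squared gradient, kept per parameter; we report the mean over parameters.
    return beta2 * v + (1 - beta2) * grads**2

v = np.zeros(4)
g = np.array([0.5, -1.0, 2.0, 0.0])   # a fixed, hypothetical gradient
for _ in range(10000):
    v = sme_update(v, g)

print(v.mean())  # approaches mean(g^2) = (0.25 + 1 + 4 + 0) / 4 = 1.3125
```

With a constant gradient the average converges to the mean squared gradient; in training, a shrinking SME indicates that the (uncentered) variability of the gradient estimates is decreasing over the iterations.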
Optimization methods.
We conduct experiments using the following optimization methods for GANs: (i) BatchE: full-batch extragradient, (ii) SE: stochastic extragradient, and (iii) SVRE: stochastic variance reduced extragradient. These can be combined with adaptive learning-rate methods such as Adam, or with parameter averaging and warm-starting of it, hereafter denoted as –A, AVG–, and WA–, respectively. With BatchE–A and SE–A we denote full-batch and stochastic extragradient with Adam, respectively. In § C.1, we present a variant of Adam adapted to variance-reduced algorithms, referred to as VRAd. All our experiments use mini-batching with uniform sampling probabilities; the mini-batch version of SVRE for GANs is described in more detail in § C.2.
DNN architectures.
For experiments on MNIST, we use the DCGAN (Radford et al., 2016) architecture. For the remaining datasets, we used the SAGAN (Zhang et al., 2018) architecture, as it was used by Brock et al. (2019) to demonstrate state-of-the-art performance on ImageNet. The focus of our experiments is to compare SVRE with baselines in a normalized and realistic setting. However, we lack the computational resources needed for the extensive hyperparameter tuning typically required to obtain new state-of-the-art results with the latest architectures. In particular, for efficiency, we used architectures of approximately half the depth of the CIFAR-10 architectures listed in Miyato et al. (2018) (see § E for the details of our implementation).
Table 2: Best Inception Score (higher is better) and Fréchet Inception Distance (lower is better) obtained by SE–A, SVRE, and SVRE–V on MNIST, CIFAR-10, SVHN, and ImageNet (numerical scores omitted; see § F).
5.1 Results
We conducted experiments on the MNIST common benchmark, for which full-batch extragradient is feasible to compute. Fig. 3 depicts the IS metric when using the stochastic, full-batch, or variance-reduced version of extragradient (see details of SVRE-GAN in § C.2). For the stochastic baseline (SE), we always use the Adam optimization method, as originally proposed (Gidel et al., 2019). From Fig. 2(a), we observe that, in terms of the number of parameter updates, SVRE performs similarly to BatchE–A. Note, however, that the latter requires significantly more computation: Fig. 2(b) depicts the IS metric using the number of mini-batch computations on the x-axis (a surrogate for wall-clock time; see the discussion below). We observe that, as SE–A has a slower per-iteration convergence rate, SVRE converges faster on this set of experiments.
Computational cost.
The relative cost of one pass over the dataset for SVRE versus vanilla SGD is a constant factor: computing the full-batch gradient (on average) once per pass over the dataset gives a slowdown of 2, and a further factor accounts for the extra stochastic gradient computations required for the variance reduction, as well as for the extrapolation step overhead. Note that some training methods for GANs have similar overhead, e.g., Arjovsky et al. (2017) used 5 discriminator updates per generator update for training WGAN. Moreover, as SVRE provides less noisy gradients, it may converge faster per iteration, compensating for the extra per-update cost. In Fig. 2(b), the x-axis uses an implementation-independent surrogate for wall-clock time that counts the number of mini-batch gradient computations, which is the bottleneck step.
Second moment estimate and Adam.
Fig. 2(c) depicts the averaged second-moment estimate for the parameters of the generator, where we observe that SVRE effectively reduces it over the iterations. The reduction of these values may be the reason why Adam combined with SVRE performs poorly (these values appear in the denominator of Adam's update rule; see § C.1). Nonetheless, SVRE produces good results on several high-dimensional datasets. To our knowledge, SVRE is the first optimization method with a constant step size that has worked empirically for GANs on non-trivial datasets.
Comparison on larger datasets.
In Fig. 4 & 5, we compare SVRE with the SE–A baseline on the SVHN and ImageNet datasets. We observe that, although SE–A sometimes obtains better performance in the early iterations (it starts to converge faster), SVRE obtains better final performance. In Tab. 2 we summarize the best scores obtained on MNIST, CIFAR-10, SVHN, and ImageNet. Note that our experiments were devised to investigate whether SVRE has potential advantages in a realistic GAN setup; due to a lack of computational resources, we did not perform an exhaustive hyperparameter search or multiple runs with different random seeds. § F lists additional experimental analyses.
Comparison using deeper architectures on CIFAR10 and SVHN.
We report here our preliminary results using the ResNet (He et al., 2015) architecture (see § E.2) used for CIFAR-10 and SVHN in Miyato et al. (2018), on these two datasets as a reference. We compare the FID scores of SE–A and SVRE on SVHN and CIFAR-10 after the same number of training iterations (see § F for details). We observe that these experiments are less stable, and a fair comparison requires a more exhaustive hyperparameter search. More precisely, SE–A diverged in a significant fraction of the experiments, although using more discriminator updates per generator update improved its stability. On the other hand, SVRE did not diverge in any of the experiments, but on deeper architectures we observe that SVRE takes longer to converge compared to SE–A. Moreover, warm-starting from the SE–A baseline and then continuing the training with SVRE further improved the FID score. This could indicate that SVRE may benefit substantially from being combined with an adaptive step size method, which we leave for future work.
6 Related work
The two main stochastic variance-reduction algorithms considered in the literature nowadays are SAGA (Defazio et al., 2014) and SVRG (Johnson and Zhang, 2013). The former is memory-based, storing the latest stochastic gradient seen for each sample, whereas the latter is epoch-based, computing a full-batch gradient at the beginning of each epoch. Interestingly, Hofmann et al. (2015) provide a randomized version (for the epoch length) of SVRG under a framework unifying SVRG and SAGA, and it is under that framework that we provide our analysis, thus proving convergence at the same time for SVRE and for SAGA combined with extragradient.
Surprisingly, there exist only a few works on variance reduction methods for monotone operators, namely Palaniappan and Bach (2016) and Davis (2016). As mentioned by Palaniappan and Bach (2016), the latter requires a co-coercivity assumption on the operator, and thus only considers convex optimization. Our work provides a new way to use variance reduction for monotone operators, using the extragradient method (Korpelevich, 1976). Recently, Iusem et al. (2017) proposed an extragradient method with variance reduction for an infinite sum of operators. They use mini-batches of growing size in order to reduce the variance of their algorithm and to converge with a constant step size. However, this growing mini-batch size is prohibitively expensive in our application. Moreover, they do not use the SAGA/SVRG style of updates that exploits the finite-sum formulation, leading to a sublinear convergence rate, while our method benefits from a linear convergence rate by exploiting the finite-sum assumption.
Daskalakis et al. (2018) proposed a method called Optimistic Adam, inspired by game theory. This method is closely related to extragradient, with a slightly different update scheme. More recently, Gidel et al. (2019) proposed to use extragradient to train GANs, introducing a method called ExtraAdam, which outperformed Optimistic Adam when trained on CIFAR-10. Our work is also an attempt to find principled ways to train GANs. Considering that the game aspect is better handled by the extragradient method, we focus on the optimization issues arising from the noise in the training procedure, an under-considered potential issue in GAN training.
Noise in Deep Learning and Games.
In the context of deep learning, despite some very interesting theoretical results on non-convex minimization (Reddi et al., 2016; Allen-Zhu and Hazan, 2016), the effectiveness of variance-reduced methods is still an open question, and a recent technical report by Defazio and Bottou (2018) provides negative empirical results on the variance reduction aspect.
In addition, a recent large-scale study (Shallue et al., 2018) investigated increasing the batch size as a simple way to reduce noise and observed only a marginal impact on single-objective training, whereas an increased batch size has a large effect on GAN results (Brock et al., 2019). In our work, we are able to show positive results for variance reduction in a real-world deep learning setting. This surprising difference seems to confirm the remarkable, and still poorly understood, discrepancy between multi-objective optimization and standard minimization.
7 Conclusion
We considered a bilinear game optimization problem and, despite its simplicity compared to real-world GAN optimization problems, showed that stochasticity breaks the convergence of standard stochastic methods on it. We proposed the stochastic variance reduced extragradient (SVRE) method, which combines SVRG with the extragradient method for optimizing games. On the theory side, SVRE improves upon the previous best results for strongly-convex games, whereas empirically, it is the only stochastic method that converges on the bilinear game.
Our results showed that SVRE empirically matches the convergence speed of batch extragradient on MNIST, while the latter is computationally infeasible for larger datasets. For standard-size convolutional neural networks, SVRE showed improved final performance on 2 out of 4 datasets.
Our preliminary experiments with deeper architectures show that SVRE is notably more stable with respect to the choice of hyperparameters: its stochastic counterpart diverged in all of our experiments, whereas SVRE did not. However, we observe that SVRE takes more iterations to converge when using deeper architectures, suggesting that it could be further extended with an adaptive step-size scheme. Given the notable instabilities of standard optimization methods on such architectures, our results suggest that variance reduction methods, unlike in single-objective optimization, may be necessary for improving GAN training.
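The combination of the two ingredients summarized above can be sketched on a toy finite-sum bilinear game. This is our own minimal illustration, not the full algorithm of the paper: there are no projections, restarts, or averaging schemes, and the coefficients, step size, and epoch length are hypothetical choices.

```python
import random

random.seed(0)

# Toy finite-sum bilinear game: min_theta max_phi (1/n) * sum_i a_i * theta * phi,
# whose unique Nash equilibrium is (0, 0).
a = [0.5, 1.0, 1.5, 2.0]
n = len(a)
a_mean = sum(a) / n

def field_i(theta, phi, i):
    """Per-sample game vector field (d/dtheta, -d/dphi) of a_i * theta * phi."""
    return a[i] * phi, -a[i] * theta

def field_full(theta, phi):
    return a_mean * phi, -a_mean * theta

def svre_field(theta, phi, s_theta, s_phi, mu):
    """SVRG-style estimate of the field: F_i(w) - F_i(snapshot) + F(snapshot)."""
    i = random.randrange(n)
    gt, gp = field_i(theta, phi, i)
    st, sp = field_i(s_theta, s_phi, i)
    return gt - st + mu[0], gp - sp + mu[1]

eta = 0.3
theta, phi = 1.0, 1.0

for epoch in range(200):
    # SVRG ingredient: take a snapshot and store its full-batch field.
    s_theta, s_phi = theta, phi
    mu = field_full(s_theta, s_phi)
    for _ in range(n):
        # Extragradient ingredient: extrapolate, then update from the
        # original point using the field evaluated at the extrapolated point.
        g = svre_field(theta, phi, s_theta, s_phi, mu)
        t_half, p_half = theta - eta * g[0], phi - eta * g[1]
        g_half = svre_field(t_half, p_half, s_theta, s_phi, mu)
        theta, phi = theta - eta * g_half[0], phi - eta * g_half[1]

dist = (theta ** 2 + phi ** 2) ** 0.5
assert dist < 1e-2  # the iterates contract toward the Nash equilibrium
```

Because the snapshot point follows the iterates toward the equilibrium, the variance of the estimated field vanishes over time, which is what allows a constant step size here, unlike plain stochastic extragradient.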
Acknowledgements
This research was partially supported by the Canada CIFAR AI Chair Program, the Canada Excellence Research Chair in “Data Science for Real-time Decision-making”, by the NSERC Discovery Grant RGPIN-2017-06936, and by a Google Focused Research award. The authors would like to thank Compute Canada for providing the GPUs used for this research.
References
 Allen-Zhu and Hazan (2016) Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In ICML, 2016.
 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
 Bottou (2010) L. Bottou. Largescale machine learning with stochastic gradient descent. In COMPSTAT, 2010.
 Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
 Brock et al. (2019) A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
 Daskalakis et al. (2018) C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.
 Davis (2016) D. Davis. SMART: The stochastic monotone aggregated root-finding algorithm. arXiv:1601.00698, 2016.
 Defazio and Bottou (2018) A. Defazio and L. Bottou. On the ineffectiveness of variance reduced optimization for deep learning. arXiv:1812.04529, 2018.
 Defazio et al. (2014) A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.
 Gidel et al. (2019) G. Gidel, H. Berard, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial nets. In ICLR, 2019.
 Glorot and Bengio (2010) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 Harker and Pang (1990) P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 1990.
 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
 Heusel et al. (2017) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
 Hofmann et al. (2015) T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In NIPS, 2015.
 Iusem et al. (2017) A. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.
 Johnson and Zhang (2013) R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
 Juditsky et al. (2011) A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirrorprox algorithm. Stochastic Systems, 2011.
 Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Korpelevich (1976) G. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 1976.
 Krizhevsky (2009) A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, 2009.
 Leblond et al. (2018) R. Leblond, F. Pedregosa, and S. Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. JMLR, 19(81):1–68, 2018.
 LeCun and Cortes Y. LeCun and C. Cortes. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.
 Li et al. (2018) J. Li, A. Madry, J. Peebles, and L. Schmidt. On the limitations of first order approximation in GAN dynamics. In ICML, 2018.
 Lim and Ye (2017) J. H. Lim and J. C. Ye. Geometric GAN. arXiv:1705.02894, 2017.
 Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NIPS, 2017.
 Miyato et al. (2018) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
 Netzer et al. (2011) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011. URL http://ufldl.stanford.edu/housenumbers/.
 Palaniappan and Bach (2016) B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In NIPS, 2016.
 Radford et al. (2016) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
 Reddi et al. (2016) S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for non-convex optimization. In ICML, 2016.
 Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
 Russakovsky et al. (2015) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
 Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
 Schaul et al. (2013) T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In ICML, 2013.
 Schmidt et al. (2017) M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2017.
 Shallue et al. (2018) C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv:1811.03600, 2018.
 Szegedy et al. (2015) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015.
 Wilson et al. (2017) A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In NIPS, 2017.
 Xiao and Zhang (2014) L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
 Zhang et al. (2018) H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-Attention Generative Adversarial Networks. arXiv:1805.08318, 2018.
Appendix A Definitions and Lemmas
A.1 Smoothness and Monotonicity of the Operator
Another important property used is the Lipschitzness of an operator.
Definition 2.
A mapping $F: \mathbb{R}^d \to \mathbb{R}^d$ is said to be $L$-Lipschitz if, for all $\omega, \omega' \in \mathbb{R}^d$,
(13) $\|F(\omega) - F(\omega')\| \leq L \|\omega - \omega'\| \,.$
Definition 3.
A differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be $\mu$-strongly convex if, for all $\omega, \omega' \in \mathbb{R}^d$,
(14) $f(\omega') \geq f(\omega) + \nabla f(\omega)^\top (\omega' - \omega) + \tfrac{\mu}{2} \|\omega' - \omega\|^2 \,.$
Definition 4.
A function $f: \Theta \times \Phi \to \mathbb{R}$ is said to be convex-concave if $f(\cdot, \varphi)$ is convex for all $\varphi \in \Phi$ and $f(\theta, \cdot)$ is concave for all $\theta \in \Theta$. A function $f$ is said to be $\mu$-strongly convex-concave if $(\theta, \varphi) \mapsto f(\theta, \varphi) - \frac{\mu}{2}\|\theta\|^2 + \frac{\mu}{2}\|\varphi\|^2$ is convex-concave.
Definition 5.
For $\mu > 0$, an operator $F: \Omega \to \mathbb{R}^d$ is said to be $\mu$-strongly monotone if, for all $\omega, \omega' \in \Omega$,
$(F(\omega) - F(\omega'))^\top (\omega - \omega') \geq \mu \|\omega - \omega'\|^2 \,,$
where we noted $\omega := (\theta, \varphi)$.
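As a quick numerical sanity check of strong monotonicity, the sketch below (our own illustration; the toy function and the value of $\mu$ are hypothetical) verifies that the game vector field $F = (\partial_\theta f, -\partial_\varphi f)$ of a $\mu$-strongly convex-concave function satisfies the inequality of Definition 5:

```python
import random

random.seed(0)
mu = 0.1

# f(theta, phi) = theta*phi + (mu/2)*theta^2 - (mu/2)*phi^2 is
# mu-strongly convex-concave; its game vector field is F = (df/dtheta, -df/dphi).
def F(theta, phi):
    return phi + mu * theta, -theta + mu * phi

for _ in range(1000):
    w = (random.uniform(-5, 5), random.uniform(-5, 5))
    v = (random.uniform(-5, 5), random.uniform(-5, 5))
    Fw, Fv = F(*w), F(*v)
    # <F(w) - F(v), w - v> >= mu * ||w - v||^2
    lhs = (Fw[0] - Fv[0]) * (w[0] - v[0]) + (Fw[1] - Fv[1]) * (w[1] - v[1])
    rhs = mu * ((w[0] - v[0]) ** 2 + (w[1] - v[1]) ** 2)
    assert lhs >= rhs - 1e-9
```

For this particular bilinear-plus-regularization example the inequality in fact holds with equality: the antisymmetric bilinear part contributes nothing to the inner product, and the regularization contributes exactly $\mu\|\omega - \omega'\|^2$.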
Appendix B Proof of Theorems
B.1 Proof of Theorem 2
Proof.
Let us consider the optimization problem,
(15) 
where if and otherwise. Note that for . Let us consider the extragradient method where to compute an unbiased estimator of the gradients at we sample and use as estimator of the vector flow.
In this proof we note, and the vector such that if and otherwise. Note that and that .
Thus the extragradient update rule can be noted as
(16) 
where is the minibatch (without replacement) sampled for the update and the minibatch (without replacement) sampled for the extrapolation.
We can thus notice that, when , we have
(17) 
and otherwise,
(18) 
The intuition is that, on one hand, when (which happens with high probability when , e.g., when , ), the algorithm performs an update that gets away from the Nash equilibrium:
(19) 
where and . On the other hand, the “good” updates only happen when is large (which happens with low probability, e.g., when , ):
(20) 
Conditioning on and , we get that
(21) 
Leading to,
(22) 
Plugging these expectations in (20), we get that,
(23) 
Consequently,
(24) 
Thus, converges if and only if , which is not possible when .
To sum up, if is not large enough (more precisely, if ), we have the geometric divergence of the quantity . ∎
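The divergence mechanism of the proof can be illustrated numerically. The sketch below is in the spirit of the construction above but with our own hypothetical dimensions and step size: each sample only “sees” one coordinate of a bilinear game (scaled so the estimator is unbiased), and independent samples are drawn for the extrapolation and update steps. Stochastic extragradient then diverges geometrically, while its full-batch counterpart converges.

```python
import random

random.seed(0)

n = 5      # number of samples = number of coordinates (hypothetical)
eta = 0.1  # hypothetical step size (eta * n = 0.5)

def sampled_field(theta, phi, i):
    """Unbiased estimate of the full field F(theta, phi) = (phi, -theta):
    sample i only touches coordinate i, scaled by n."""
    gt, gp = [0.0] * n, [0.0] * n
    gt[i] = n * phi[i]
    gp[i] = -n * theta[i]
    return gt, gp

def sq_norm(theta, phi):
    return sum(t * t for t in theta) + sum(p * p for p in phi)

def run(stochastic, steps=2000):
    theta, phi = [1.0] * n, [1.0] * n
    for _ in range(steps):
        if stochastic:
            # Independent samples for extrapolation and update, as in the proof.
            g = sampled_field(theta, phi, random.randrange(n))
        else:
            g = (phi[:], [-t for t in theta])  # full-batch field
        th = [t - eta * x for t, x in zip(theta, g[0])]   # extrapolation
        ph = [p - eta * x for p, x in zip(phi, g[1])]
        if stochastic:
            g2 = sampled_field(th, ph, random.randrange(n))
        else:
            g2 = (ph[:], [-t for t in th])
        theta = [t - eta * x for t, x in zip(theta, g2[0])]  # update
        phi = [p - eta * x for p, x in zip(phi, g2[1])]
    return sq_norm(theta, phi)

batch_sq = run(stochastic=False)
stoch_sq = run(stochastic=True)
assert batch_sq < 1e-3  # batch extragradient contracts to the equilibrium
assert stoch_sq > 1e3   # stochastic extragradient diverges geometrically
```

The reason matches the intuition above: when the update sample differs from the extrapolation sample, the extrapolation has not moved the relevant coordinate, so the step degenerates to a plain gradient step, which spirals outward on a bilinear game; the contracting “good” steps, which require the two samples to coincide, are too rare to compensate.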
B.2 Proof of Theorem 3
Setting of the Proof.
We will prove a slightly more general result than Theorem 3. We will work in the context of monotone operators. Let us consider the general extrapolation update rule,
(25) $\omega_{t+1/2} = P_\Omega(\omega_t - \eta g_t) \,, \qquad \omega_{t+1} = P_\Omega(\omega_t - \eta g_{t+1/2}) \,,$
where $g_t$ depends on $\omega_t$ and $g_{t+1/2}$ depends on $\omega_{t+1/2}$. Here $P_\Omega$ denotes the Euclidean projection operator onto $\Omega$, i.e., $P_\Omega(\omega) := \arg\min_{\omega' \in \Omega} \|\omega - \omega'\|$.
Let us first state a lemma standard in convex analysis (see for instance [Boyd and Vandenberghe, 2004]),
Lemma 1.
Let $\omega \in \Omega$ and $\omega^+ := P_\Omega(\omega + u)$, then for all $\omega' \in \Omega$ we have,
(26) $\|\omega^+ - \omega'\|^2 \leq \|\omega - \omega'\|^2 + 2 u^\top (\omega^+ - \omega') - \|\omega^+ - \omega\|^2 \,.$
Proof of Lemma 1.
We start by simply developing,
$\|\omega^+ - \omega'\|^2 = \|\omega - \omega'\|^2 - \|\omega^+ - \omega\|^2 + 2 (\omega^+ - \omega)^\top (\omega^+ - \omega') \,.$
Then, since $\omega^+$ is the projection onto the convex set $\Omega$ of $\omega + u$, we have that $(\omega^+ - (\omega + u))^\top (\omega^+ - \omega') \leq 0$, i.e., $(\omega^+ - \omega)^\top (\omega^+ - \omega') \leq u^\top (\omega^+ - \omega')$, leading to the result of the Lemma. ∎
Lemma 2.
If $F$ is $\mu$-strongly monotone, then for any $\omega \in \Omega$ we have,
(27) 
where we noted .
Proof.
By strong monotonicity and optimality of $\omega^\star$,
(28) $\mu \|\omega - \omega^\star\|^2 \leq (F(\omega) - F(\omega^\star))^\top (\omega - \omega^\star)$
(29) $\leq F(\omega)^\top (\omega - \omega^\star) \,,$
and then we use a standard inequality on the inner product to get the result claimed. ∎
Using this update rule, we can thus deduce the following lemma; its derivation is very similar to that of Harker and Pang [1990, Lemma 12.1.10].
Lemma 3.
Considering the update rule (25), we have for any and any ,
(30) 
Proof.
Applying Lem. 1 for and , we get,