Reducing Noise in GAN Training with Variance Reduced Extragradient


Abstract

Using large mini-batches when training generative adversarial networks (GANs) has recently been shown to significantly improve the quality of the generated samples. This can be seen as a simple but computationally expensive way of reducing the noise of the gradient estimates. In this paper, we investigate the effect of this noise and show that it can prevent the convergence of standard stochastic game optimization methods, while their respective batch versions converge. To address this issue, we propose a variance-reduced version of the stochastic extragradient algorithm (SVRE). We show experimentally that it performs similarly to a batch method while being computationally cheaper, and we prove its convergence, improving upon the best rates proposed in the literature. Experiments on several datasets show that SVRE improves over baselines. Notably, SVRE is, to our knowledge, the first optimization method for GANs to produce near state-of-the-art results without using an adaptive step-size method such as Adam.

Tatjana Chavdarova (1,2, equal contribution), Gauthier Gidel (1, equal contribution), François Fleuret (2), Simon Lacoste-Julien (1,3)

(1) Mila & DIRO, Université de Montréal
(2) École Polytechnique Fédérale de Lausanne and Idiap Research Institute
(3) CIFAR fellow, Canada CIFAR AI chair

Correspondence: Tatjana Chavdarova and Gauthier Gidel, firstname.lastname@umontreal.ca

Keywords: Machine Learning, ICML

1 Introduction

The current success of large-scale machine learning algorithms relies in large part on incremental gradient-based optimization methods to minimize empirical losses. These iterative methods handle large training datasets by computing gradient estimates on mini-batches instead of using all the samples at every step, resulting in a method called stochastic gradient descent (SGD) (Robbins and Monro, 1951; Bottou, 2010).

While this method is reliably efficient for classical loss minimization, such as the cross-entropy for classification or the squared loss for regression, recent works go beyond this setup and aim at making several models interact with competing objectives. The associated optimization paradigm requires the joint minimization of multiple objectives.

One very popular class of models in that family are the generative adversarial networks (GANs, Goodfellow et al., 2014), which aim at finding a Nash equilibrium of a two-player minimax game, where the players are deep neural networks (DNNs).

1.1 Failure of SGD on multi-objective problems

Due to their success on supervised tasks, SGD-based algorithms have been adopted for GAN training as well. However, convergence failures, poor performance (sometimes referred to as “mode collapse”), or sensitivity to hyperparameters are reported much more commonly than in classical supervised DNN optimization.

Recent works (Li et al., 2018; Gidel et al., 2019) argue that the currently used first-order methods may fail to converge on simple examples. Gidel et al. (2019) proposed instead to use an optimization technique from the variational inequality literature called extragradient (Korpelevich, 1976), which has provable convergence guarantees for games. However, we argue that the multi-objective minimization formulation of GANs introduces new challenges in terms of optimization. In particular, we point out that the noise due to stochasticity may break standard optimization techniques for stochastic games such as the stochastic extragradient method, by providing an example of a stochastic bilinear game for which it provably does not converge.

This theoretical consideration is further supported empirically by the fact that using larger mini-batch sizes for GAN training has been shown to considerably improve the quality of the samples produced by the resulting generative model. More precisely, Brock et al. (2019) report a 46% relative improvement of the Inception Score metric (see § 5) on ImageNet when the mini-batch size is increased 8-fold. Nevertheless, this comes at the cost of an increase in the required computational budget that is prohibitively expensive for most academic researchers in the machine learning community.

1.2 Our contributions

In this paper, we investigate the interplay between noise and multi-objective problems in the context of GAN training, and propose the novel “stochastic variance reduced extragradient” (SVRE) algorithm.

Our contributions can be summarized as follows:

  • We show in a motivating example how the noise can make standard stochastic extragradient fail (see § 3.1).

  • We propose a new method that combines variance reduction and extrapolation (SVRE) and show experimentally that SVRE effectively reduces the noise on real-world datasets (see § 4.2).

  • We prove the convergence of SVRE under local strong convexity assumptions, improving over the best rates proposed in the literature (see § 4.2 and Table 1).

  • We demonstrate experimentally the performance of SVRE for training GANs on MNIST, SVHN, ImageNet, and CIFAR-10 with fixed step-sizes. To our knowledge, SVRE is the first optimization method that can produce near state-of-the-art GAN results without using an adaptive step-size method such as Adam (Kingma and Ba, 2015) (see § 5).

Method              Complexity                          μ-adaptivity
SVRG                O((n + L̄²/μ²) log(1/ε))            no
Acc. SVRG           O((n + √n · L̄/μ) log(1/ε))         no
SVRE (this paper)   O((n + L̄/μ) log(1/ε))              yes
Table 1: Comparison of SVRE (Alg. 1) with other variance reduced methods for games, for a μ-strongly monotone operator with L-Lipschitz stochastic estimators (see (10) for the definition of L̄). We take the computation of one stochastic gradient as the unit of cost. SVRG and Accelerated SVRG for monotone operators (a generalization of games) have been proposed by Palaniappan and Bach (2016, Thm. 2 & Thm. 3). (Note that the complexity results expressed in their Table 1 focus on a more specific bilinear problem, analyzed in terms of the row and column dimensions of a matrix; the product of these dimensions corresponds to our n. We focus instead on the general finite-sum setting, using their Thm. 2 & 3.) The column μ-adaptivity indicates whether the algorithm’s hyper-parameters (step size & epoch length) that guarantee convergence depend on the strong monotonicity parameter μ: if not, then the algorithm is adaptive to local strong monotonicity, a useful property for applying these results in the non-convex setting (see the discussion after Thm. 3).

2 GANs as a game

The models/players in a GAN are respectively a generator G, which maps an embedding space to the signal space and should eventually map a fixed noise distribution to the training data distribution, and a discriminator D, whose only role is to enable the training of the generator by classifying genuine samples against generated ones. The central idea is that as long as D can do better than chance, G is not properly modeling the data.

At each iteration of stochastic gradient descent (SGD), the discriminator is updated to improve its “real vs. generated” classification performance, and the generator is updated to degrade it.

Game theory formulation of GANs

From a game theory point of view, GAN training is a differentiable two-player game: the discriminator aims at minimizing its cost function L^φ and the generator aims at minimizing its own cost function L^θ. Using the same formulation as Mescheder et al. (2017) and Gidel et al. (2019), the GAN objective has the following form:

    θ* ∈ argmin_{θ∈Θ} L^θ(θ, φ*)   and   φ* ∈ argmin_{φ∈Φ} L^φ(θ*, φ).   (2P-G)

When L^θ = −L^φ =: L, this game is called a zero-sum game and (2P-G) can be formulated as a minimax problem:

    min_{θ∈Θ} max_{φ∈Φ} L(θ, φ).   (SP)

The gradient method is known to not converge for some convex-concave examples (Gidel et al., 2019). To fix this issue, Korpelevich (1976) proposed the extragradient method (for simplicity of presentation, we describe the algorithms in the unconstrained setting Θ = ℝ^d and Φ = ℝ^p; in the constrained scenario, a Euclidean projection onto the constraint set should be added at every update of the extragradient method, and this more general version (25) is the one analyzed in the appendix):

    Extrapolation:  θ_{t+1/2} = θ_t − η ∇_θ L(θ_t, φ_t),   φ_{t+1/2} = φ_t + η ∇_φ L(θ_t, φ_t),
    Update:         θ_{t+1} = θ_t − η ∇_θ L(θ_{t+1/2}, φ_{t+1/2}),   φ_{t+1} = φ_t + η ∇_φ L(θ_{t+1/2}, φ_{t+1/2}).   (EG)

This method performs a lookahead step in order to get signal from an extrapolated point, damping the oscillations.
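For concreteness, here is a minimal sketch of one (EG) step for the zero-sum problem (SP); the gradient oracles grad_theta and grad_phi (returning ∇_θ L and ∇_φ L) are hypothetical names of ours:

def extragradient_step(theta, phi, grad_theta, grad_phi, eta):
    # Extrapolation: look ahead from the current point.
    theta_half = theta - eta * grad_theta(theta, phi)
    phi_half = phi + eta * grad_phi(theta, phi)   # ascent for the max player
    # Update: move from (theta, phi) using gradients at the extrapolated point.
    theta_next = theta - eta * grad_theta(theta_half, phi_half)
    phi_next = phi + eta * grad_phi(theta_half, phi_half)
    return theta_next, phi_next

On the bilinear game min_θ max_φ θ·φ, simultaneous gradient descent-ascent spirals away from the equilibrium (0, 0), whereas iterating this step contracts toward it for a small enough η.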

It can be shown that, in the context of a zero-sum game, for any convex-concave function L and any closed convex sets Θ and Φ, the extragradient method does converge. We state below such a convergence result, without rates for simplicity.

Theorem 1 (Theorem 12.1.11, Harker and Pang, 1990).

Let Θ and Φ be two closed convex sets, and let L be an L-smooth function such that L(·, φ) is convex for all φ ∈ Φ and L(θ, ·) is concave for all θ ∈ Θ. If η < 1/L, then the sequence (θ_t, φ_t) generated by the extragradient method (EG) converges to a solution of (SP).

3 Noise in Games

Figure 1: Illustration of the discrepancy between games and minimization on the simple examples in (1). Left: minimization. Up to a neighborhood, the noisy gradient always points in a direction that brings the iterate closer to the minimum θ*. Right: game. The noisy gradient may point in a direction (red arrow) that pushes the iterate away from the Nash equilibrium (θ*, φ*).

In standard large-scale machine learning applications such as GANs, one cannot afford to compute the full-batch gradient of the objective function at each time step, as the dataset is too large. The standard way to cope with this issue is to sample mini-batches of reasonable size and to use the gradient estimated on each mini-batch as an unbiased estimator of the “full” gradient. Unfortunately, the resulting noise in the gradient estimate may interact with the oscillations due to the adversarial component of the game.

We illustrate this phenomenon in Fig. 1 by contrasting the direction given by the noisy gradient on the following game and minimization problem, respectively:

    (game)  min_θ max_φ θ·φ,      (minimization)  min_{θ,φ} ½(θ² + φ²).   (1)

Since the batch version of the gradient method (updating both players simultaneously or alternately) fails to converge for some convex (bilinear) games, there is no hope of convergence for its stochastic version. However, since (EG) does converge (Thm. 1), we could reasonably expect that its stochastic version does too (at least to a neighborhood). In the following section, we show that even under reasonable assumptions this assertion is false: we present a simple example on which the extragradient method converges linearly using the full gradient (Gidel et al., 2019, Corollary 1) but diverges geometrically when using a stochastic estimate of it (on this example, standard gradient methods diverge as well).

3.1 Stochasticity Breaks Extragradient

In the following, we show that (a) on one hand, if we use standard stochastic estimates of the gradients of a simple finite-sum objective, then the iterates produced by the stochastic extragradient method (SEG) diverge geometrically, while (b) on the other hand, Theorem 1 ensures that the full-batch extragradient method does converge to the Nash equilibrium of this game. All detailed proofs can be found in § B.

Theorem 2 (Noise may induce divergence).

There exists a zero-sum stochastic game such that, if the mini-batch size b is smaller than half the dataset size n, then for any step-size η > 0 the iterates (θ_t, φ_t) computed by the stochastic extragradient method diverge geometrically, i.e., there exists ρ > 1 such that E[‖θ_t‖² + ‖φ_t‖²] ≥ ρ^t (‖θ_0‖² + ‖φ_0‖²).

Proof sketch.

We consider the following stochastic optimization problem,

    min_{θ∈ℝⁿ} max_{φ∈ℝⁿ} (1/n) Σ_{i=1}^n θᵀ A_i φ,   where [A_i]_{kl} = n if k = l = i and 0 otherwise.   (2)

Note that (1/n) Σ_{i=1}^n A_i = I_n, so this problem is a simple dot product between θ and φ; we can thus compute the batch gradient and notice that the Nash equilibrium of this problem is θ* = φ* = 0. However, as we will see below, this simple problem actually breaks standard stochastic optimization methods.

Sampling a mini-batch I ⊂ {1, …, n} of size b without replacement, we denote A_I := (1/b) Σ_{i∈I} A_i, so that E[A_I] = I_n. The extragradient update rule can then be written as

    θ_{t+1} = (I_n − η² A_I A_J) θ_t − η A_I φ_t,
    φ_{t+1} = (I_n − η² A_I A_J) φ_t + η A_I θ_t,   (3)

where I and J are the mini-batches respectively sampled for the update and the extrapolation step. Let us write N_t := E[‖θ_t‖² + ‖φ_t‖²]. Noticing that [A_I A_J]_{kk} = n²/b² if k ∈ I ∩ J and 0 otherwise, we have

    N_{t+1} = (1 + η²(n/b − 2) + η⁴ n²/b²) N_t.

Consequently, if the mini-batch size is smaller than (or equal to) half of the dataset size, i.e., b ≤ n/2, we have that ρ := 1 + η²(n/b − 2) + η⁴ n²/b² > 1 for any η > 0. ∎
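This divergence is easy to check numerically. Below is a short simulation of SEG on the example above, under the reconstruction of (2) given in this sketch; the constants n, b, eta and T are illustrative choices of ours:

import numpy as np

rng = np.random.default_rng(0)
n, b, eta, T = 100, 1, 0.01, 1000   # b <= n/2, so divergence is expected

theta = rng.standard_normal(n)
phi = rng.standard_normal(n)

def stoch_grads(th, ph, batch):
    # A_I = (n/b) diag(eps_I): stochastic gradients of (1/n) sum_i th^T A_i ph.
    mask = np.zeros(n)
    mask[batch] = n / b
    return mask * ph, mask * th   # (d/d theta, d/d phi)

start = np.sum(theta ** 2 + phi ** 2)
for _ in range(T):
    J = rng.choice(n, size=b, replace=False)   # extrapolation mini-batch
    g_th, g_ph = stoch_grads(theta, phi, J)
    th_half, ph_half = theta - eta * g_th, phi + eta * g_ph
    I = rng.choice(n, size=b, replace=False)   # update mini-batch
    g_th, g_ph = stoch_grads(th_half, ph_half, I)
    theta, phi = theta - eta * g_th, phi + eta * g_ph

print(np.sum(theta ** 2 + phi ** 2) / start)   # grows roughly like rho^T, rho > 1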

This result may seem to contradict the standard result on SEG (Juditsky et al., 2011), which states that the average of the iterates computed by SEG converges to the Nash equilibrium of the game. But one fundamental assumption made by Juditsky et al. is that the estimator of the gradient has finite variance. This assumption does not hold in this example, since the variance of the estimator is proportional to the norm of the parameters.

Thus, constraining the optimization problem (2) to bounded domains Θ and Φ,

    min_{θ∈Θ} max_{φ∈Φ} (1/n) Σ_{i=1}^n θᵀ A_i φ,   (4)

would make the finite-variance assumption of Juditsky et al. (2011) hold. Consequently, the averaged iterate would converge to the Nash equilibrium of (4). However, we argue in the next section that such a result may not be satisfying for non-convex problems.

3.2 Why is convergence of the last iterate preferable?

In light of Theorem 2, the behavior of the iterates on the constrained problem (4) is the following: they diverge until they reach the boundaries of Θ and Φ, and then they circle around the Nash equilibrium of (4), which lies on these boundaries. Using convexity properties, one can then show that the averaged iterates converge to the Nash equilibrium. However, with an arbitrarily large domain, this convergence may be arbitrarily slow (since the rate depends on the diameter of the domain).

Moreover, this behavior might be even more problematic in a non-convex framework: even if by chance we initialize close to the Nash equilibrium, the iterates would move away from it, and we cannot rely on convexity to expect the average of the iterates to converge.

Consequently, we would like optimization algorithms generating iterates that stay close to the Nash equilibrium.

4 Reducing Noise with VR Methods

One straightforward way to reduce the noise in the estimation of the gradient is to use mini-batches of samples instead of a single sample. However, mini-batch stochastic extragradient fails to converge on (4) if the mini-batch size is smaller than half of the dataset size (see § B.1 for more details). In order to get a gradient estimator with vanishing variance, the optimization literature proposes to take advantage of the finite-sum formulation that often appears in machine learning (Schmidt et al., 2017, and references therein).

4.1 Variance Reduced Methods

Motivated by the GAN setup, let us assume that the objectives in (2P-G) can be decomposed as finite sums:

    L^θ(θ, φ) = (1/n) Σ_{i=1}^n L_i^θ(θ, φ)   and   L^φ(θ, φ) = (1/n) Σ_{i=1}^n L_i^φ(θ, φ),   (5)

where n typically corresponds to the number of training samples.

Johnson and Zhang (2013) propose the “stochastic variance reduced gradient” (SVRG) as an unbiased estimator of the gradient with a smaller variance than the vanilla mini-batch estimate. The idea is to occasionally take a snapshot (θ^S, φ^S) of the current model parameters and store the full gradient at this point. Computing the full gradient at (θ^S, φ^S) is an expensive operation, but not prohibitive if it is done infrequently enough (for instance, once per pass over the data).

Assuming that we have stored the snapshot (θ^S, φ^S) and the corresponding full gradients μ_θ := ∇_θ L^θ(θ^S, φ^S) and μ_φ := ∇_φ L^φ(θ^S, φ^S), unbiased estimates of the respective gradients of L^θ and L^φ are

    d_θ(θ, φ) := μ_θ + (1/(n p_i)) (∇_θ L_i^θ(θ, φ) − ∇_θ L_i^θ(θ^S, φ^S)),   (6)
    d_φ(θ, φ) := μ_φ + (1/(n p_j)) (∇_φ L_j^φ(θ, φ) − ∇_φ L_j^φ(θ^S, φ^S)).   (7)

Indeed, E[d_θ(θ, φ)] = ∇_θ L^θ(θ, φ), where the expectation is taken over i, picked with probability p_i (and similarly for d_φ). We call this unbiased estimate of the gradient the SVRG estimate. SVRG is an epoch-based algorithm: an epoch is the inner loop that incrementally updates the parameters using the SVRG estimate between two snapshots, which are updated in the outer loop.
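In code, the estimate (6) for one player reads as follows; the oracle grad_i and all variable names are ours, and probs can be omitted for uniform sampling:

def svrg_estimate(grad_i, i, params, snapshot, mu_snapshot, n, probs=None):
    # Unbiased SVRG estimate of the full gradient at `params`:
    # grad_i(k, point) returns the gradient of the k-th loss term, and
    # mu_snapshot is the full gradient (1/n) sum_k grad_i(k, snapshot)
    # stored when the snapshot was taken.
    scale = 1.0 if probs is None else 1.0 / (n * probs[i])
    return mu_snapshot + scale * (grad_i(i, params) - grad_i(i, snapshot))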

The non-uniform sampling probabilities p_i are used to balance the contributions of samples whose gradient estimates have widely varying Lipschitz constants. This strategy was first introduced for variance reduced methods by Xiao and Zhang (2014) for SVRG, and has been discussed for saddle point optimization by Palaniappan and Bach (2016).

Originally, SVRG was introduced as an epoch-based algorithm with a fixed epoch size: in Alg. 1, one epoch is an inner loop of size N (Line 7). However, Hofmann et al. (2015) proposed instead to sample the size of each epoch from a geometric distribution. In doing so, Hofmann et al. defined a notion of q-memorization algorithm that unifies SAGA, SVRG and other variants of variance reduction for incremental gradient methods with similar convergence rates. We generalize their notion of q-memorization algorithm to handle the extrapolation step (EG) and provide a convergence proof for such q-memorization algorithms in § B.2.

One advantage of Hofmann et al. (2015)’s framework is that the sampling of the epoch size does not depend on the condition number of the problem, whereas the original proof for SVRG had to consider an epoch size larger than the condition number (see Leblond et al. (2018, Corollary 16) for a detailed discussion of the convergence rate of SVRG). Thus, this version of SVRG with a random epoch size becomes adaptive to local strong convexity, since none of its hyper-parameters depend on the strong convexity constant.

However, because of some technical aspects introduced by working with monotone operators, Palaniappan and Bach (2016)’s proofs (both for SAGA and SVRG) require a step-size that depends on the strong monotonicity constant, making these algorithms not adaptive to local strong monotonicity. This motivates the proposed SVRE algorithm, which is adaptive to local strong monotonicity and is thus more appropriate for non-convex optimization.

4.2 SVRE: Variance Reduced Extragradient

We describe our proposed stochastic variance reduced extragradient (SVRE) algorithm in Alg. 1. In a manner analogous to how Palaniappan and Bach (2016) combined SVRG with the gradient method to solve games, SVRE combines the SVRG estimates of the gradient (6)-(7) with the extragradient method (EG). While the algorithmic proposal is simple, the proof of convergence is non-trivial. Moreover, with this method we are able to improve the best known convergence rates for variance reduced methods for stochastic games (Table 1 and Thm. 3), and we show in § 4.3 that it is the only method that empirically converges on the simple example of § 3.1. We now describe the theoretical setup for the convergence result.

A standard assumption in convex optimization is strong convexity of the objective. However, in a game, the operator

    v(θ, φ) := (∇_θ L^θ(θ, φ), ∇_φ L^φ(θ, φ))   (8)

associated with the updates is no longer the gradient of a single function. To make a similar assumption in the context of games, the optimization literature considers the notion of strong monotonicity.

Definition 1.

An operator v is said to be μ-strongly monotone if, for all ω, ω' ∈ Ω,

    ⟨v(ω) − v(ω'), ω − ω'⟩ ≥ μ ‖ω − ω'‖²,

where we write ω := (θ, φ). A monotone operator is a 0-strongly monotone operator.
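As a concrete check of this definition, consider the operator of the bilinear game (2), equivalently of min_θ max_φ θᵀφ:

    v(θ, φ) = (φ, −θ),   ⟨v(ω) − v(ω'), ω − ω'⟩ = (φ − φ')ᵀ(θ − θ') − (θ − θ')ᵀ(φ − φ') = 0,

so this operator is monotone (μ = 0) but not strongly monotone; this is consistent with the fact that the convergence guarantee below (Thm. 3), which requires μ > 0, does not directly cover the example of § 3.1.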

This definition generalizes strong convexity to operators: if f is μ-strongly convex, then ∇f is a μ-strongly monotone operator. Note that, in this definition, we used the same weighted Euclidean norm

    ‖ω‖² := λ_θ ‖θ‖² + λ_φ ‖φ‖²,   with player-specific weights λ_θ, λ_φ > 0,   (9)

as Palaniappan and Bach (2016). They point out that this rescaling of the Euclidean norm is crucial in order to balance the respective players’ objectives and to get better constants in the convergence result.

Assumption 1.

For i ∈ {1, …, n}, the functions L_i^θ and L_i^φ are respectively ℓ_i^θ- and ℓ_i^φ-smooth, and the associated game operator (8) is μ-strongly monotone.

A function f is said to be ℓ-smooth if its gradient is ℓ-Lipschitz. Under this smoothness assumption on each cost function in the game operator, we can define a smoothness constant adapted to the non-uniform sampling scheme of our stochastic algorithm (defined for each player separately, giving ℓ̄_θ and ℓ̄_φ):

    ℓ̄ := (1/n) max_{1≤i≤n} ℓ_i / p_i.   (10)

The standard uniform sampling scheme corresponds to p_i = 1/n, giving ℓ̄ = max_i ℓ_i, and the optimal non-uniform sampling scheme corresponds to p_i = ℓ_i / Σ_j ℓ_j, giving ℓ̄ = (1/n) Σ_i ℓ_i. We always have the bounds:

    (1/n) Σ_{i=1}^n ℓ_i ≤ ℓ̄ ≤ max_{1≤i≤n} ℓ_i.   (11)
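As a worked instance of (10) under this definition: take n = 3 and (ℓ_1, ℓ_2, ℓ_3) = (1, 1, 10). Uniform sampling p_i = 1/3 gives ℓ̄ = max_i ℓ_i = 10, while the optimal scheme p_i = ℓ_i/12 gives ℓ̄ = (1/3) · 12 = 4 = (1/n) Σ_i ℓ_i, matching the two extremes of the bounds in (11).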

We now present our convergence result for SVRE with non-uniform sampling (to make our constants more comparable to those of Palaniappan and Bach (2016)), but note that we have used uniform sampling in all our experiments (for simplicity).

Theorem 3.

Under Assumption 1, after t iterations, the iterate ω_t = (θ_t, φ_t) computed by SVRE (Alg. 1) with step-sizes η_θ = Θ(1/ℓ̄_θ) and η_φ = Θ(1/ℓ̄_φ) and sampling scheme p_i ∝ ℓ_i verifies:

    E[‖ω_t − ω*‖²] ≤ (1 − Θ(min{1/n, μ/ℓ̄_θ, μ/ℓ̄_φ}))^t ‖ω_0 − ω*‖²,

where ℓ̄_θ and ℓ̄_φ are defined as in (10).

We prove this theorem in § B.2. Notice that only the respective condition numbers of L^θ and L^φ, defined as κ_θ := ℓ̄_θ/μ and κ_φ := ℓ̄_φ/μ, appear in our convergence rate. The convergence rate of the (non-accelerated) algorithm of Palaniappan and Bach (2016) depends on the product of these condition numbers, κ_θ κ_φ. They avoid a dependence on the maximum of the condition numbers squared by using the weighted Euclidean norm defined in (9) and rescaling the functions L^θ and L^φ with their strong-monotonicity constants. However, this rescaling trick suffers from two issues: (i) in practice we do not know a good estimate of the strong monotonicity constant, which was not the case in Palaniappan and Bach (2016)’s application; (ii) the algorithm does not adapt to (larger) local strong monotonicity. This property is fundamental in non-convex optimization, since we want the algorithm to exploit the (potential) local stability properties of a stationary point.

1:  Input: stopping time T, learning rates η_θ and η_φ, both players’ losses L^θ and L^φ.
2:  Initialize: θ, φ
3:  for t = 0 to T − 1 do
4:     θ^S ← θ and φ^S ← φ \hfill(take a snapshot)
5:     μ_θ^S ← ∇_θ L^θ(θ^S, φ^S) and μ_φ^S ← ∇_φ L^φ(θ^S, φ^S) \hfill(store the full gradients)
6:     N ∼ Geom(1/n) \hfill(sample the length of the epoch)
7:     for k = 0 to N − 1 do {Beginning of the epoch}
8:        Sample i_θ ∼ p_θ, i_φ ∼ p_φ, do extrapolation:
9:        θ̃ ← θ − η_θ d_θ(θ, φ) \hfill(d_θ as in (6))
10:       φ̃ ← φ − η_φ d_φ(θ, φ) \hfill(d_φ as in (7))
11:       Sample i_θ ∼ p_θ, i_φ ∼ p_φ, do update:
12:       θ ← θ − η_θ d_θ(θ̃, φ̃) \hfill(6)
13:       φ ← φ − η_φ d_φ(θ̃, φ̃) \hfill(7)
14:    end for
15:  end for
16:  Output: θ, φ
Algorithm 1 Pseudocode for SVRE.
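As an illustration, a compact numpy instantiation of Alg. 1 on the bilinear example (2), with uniform sampling (p_i = 1/n) and a Geom(1/n) epoch length; the constants are illustrative choices of ours, and the step size may need tuning in practice (cf. Fig. 2):

import numpy as np

rng = np.random.default_rng(0)
n, eta, n_epochs = 10, 0.1, 300

theta = rng.standard_normal(n)
phi = rng.standard_normal(n)

def g_th(i, th, ph):   # gradient w.r.t. theta of L_i = th^T A_i ph, A_i = n e_i e_i^T
    g = np.zeros(n); g[i] = n * ph[i]; return g

def g_ph(i, th, ph):   # gradient w.r.t. phi of L_i (ascent on phi = descent on -L_i)
    g = np.zeros(n); g[i] = n * th[i]; return g

for _ in range(n_epochs):
    th_s, ph_s = theta.copy(), phi.copy()       # snapshot (lines 4-5 of Alg. 1)
    mu_th, mu_ph = ph_s.copy(), th_s.copy()     # full grads: (1/n) sum_i A_i phi = phi, etc.
    for _ in range(rng.geometric(1.0 / n)):     # epoch length ~ Geom(1/n) (line 6)
        i = rng.integers(n)                     # extrapolation (lines 8-10)
        d_th = mu_th + g_th(i, theta, phi) - g_th(i, th_s, ph_s)
        d_ph = mu_ph + g_ph(i, theta, phi) - g_ph(i, th_s, ph_s)
        th_half, ph_half = theta - eta * d_th, phi + eta * d_ph
        j = rng.integers(n)                     # update (lines 11-13)
        d_th = mu_th + g_th(j, th_half, ph_half) - g_th(j, th_s, ph_s)
        d_ph = mu_ph + g_ph(j, th_half, ph_half) - g_ph(j, th_s, ph_s)
        theta, phi = theta - eta * d_th, phi + eta * d_ph

print(np.sum(theta ** 2 + phi ** 2))   # squared distance to the NE (0, 0)

In contrast to the SEG run of § 3.1, the printed distance should decrease here (cf. Fig. 2), since the SVRG corrections vanish as the iterates approach the snapshot.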

4.3 Motivating example

The example presented in (2) appears to be a challenging problem in the stochastic setting, since even the stochastic extragradient method fails to find its Nash equilibrium. This problem is actually an unconstrained version of (Gidel et al., 2019, Fig. 3). To explore how SVRE behaves on this problem, we tested its empirical performance. We set the dimension equal to the number of samples, d = n, and used the matrices A_i from the proof of Theorem 2, i.e., [A_i]_{kl} = n if k = l = i and 0 otherwise. Our optimization problem was:

    min_{θ∈ℝᵈ} max_{φ∈ℝᵈ} (1/n) Σ_{i=1}^n θᵀ A_i φ.   (12)

We compared the following algorithms (all using uniform sampling): (i) AltSGD: the standard stochastic gradient method, alternating the updates of the two players; this is the standard method to train GANs. (ii) SVRE: the algorithm presented in this paper (Alg. 1). The Avg prefix denotes the uniform average of the iterates. We observe in Fig. 2 that AvgSVRE converges sublinearly, whereas AvgAltSGD fails to converge. This motivates a new variant of SVRE based on the ideas developed in § 3.2: even if the averaged iterate converges, we never compute the gradient at that point, and thus we do not benefit from the fact that this iterate is closer to the optimum. The idea is therefore to occasionally restart the algorithm, i.e., to take the averaged iterate as the new starting point and compute the gradient there. Restarting combines naturally with SVRG: since we already occasionally pause the inner loop to recompute a snapshot, we use this pause to decide (with a fixed probability) whether or not to restart the algorithm, taking the snapshot at the averaged iterate instead of the last one (see the sketch below). This variant of SVRE is described in Alg. 3 in § D, and the variant combining VRAd is described in § C.1.
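A minimal sketch of this restart decision, at the granularity of the snapshot step of Alg. 1 (the names p_restart, avg_theta and avg_phi are ours):

def maybe_restart(theta, phi, avg_theta, avg_phi, p_restart, rng):
    # At the end of an epoch, optionally restart from the averaged iterate
    # before taking the new snapshot (cf. Alg. 3 in § D): the averaged
    # iterate becomes the new point and the full gradient is computed there.
    if rng.random() < p_restart:
        return avg_theta.copy(), avg_phi.copy()
    return theta, phi  # otherwise, take the snapshot at the last iterate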

In Fig. 2, we observe that the only methods that converge are SVRE and its variants. Note that we do not provide convergence guarantees for Alg. 3 and leave its analysis for future work. It is nevertheless notable that, to our knowledge, this algorithm is the only stochastic algorithm (excluding batch extragradient, which is not a stochastic algorithm) that converges on (2). Note also that we tried all the algorithms presented in Fig. 3 of Gidel et al. (2019) on this unconstrained problem, and all of them diverge.

Figure 2: Distance to the Nash equilibrium (NE) of (12). We ran the experiment with 5 different seeds and plotted the average distances to the NE and their standard deviation.

5 Experiments

Figure 3: Stochastic, full-batch and variance reduced extragradient optimization on MNIST (see § 5 for the naming of the methods). Panels (a) and (b) show the IS metric (higher is better) against the number of parameter updates and the number of mini-batch computations, respectively; panel (c) shows the second-moment estimate for the generator. See § C for details on the implementation. SE–A achieves IS performance similar to the methods shown and is omitted from panels (a) & (b) for clarity.

Datasets.

To evaluate SVRE, we used the following datasets: (i) MNIST (Lecun and Cortes), (ii) CIFAR-10 (Krizhevsky, 2009, § 3), (iii) SVHN (Netzer et al., 2011), as well as (iv) ImageNet ILSVRC 2012 (Russakovsky et al., 2015), using 28×28, 32×32, 32×32, and 64×64 resolution, respectively.

Metrics.

To evaluate the performance of SVRE, we conducted experiments on image synthesis with GANs and used the Inception score (IS, Salimans et al., 2016) and the Fréchet Inception distance (FID, Heusel et al., 2017) as performance metrics. IS takes into account both the sample “diversity” and how realistic the samples look, by considering the class distribution and the prediction confidence of a trained Inception network (Szegedy et al., 2015), respectively. FID, on the other hand, compares the synthetic images with those of the training dataset by embedding large samples of both into a lower-dimensional feature space of a trained Inception network. To gain insight into whether SVRE indeed reduces the variance of the gradient estimates over the iterations, we used the second-moment estimate (uncentered variance, herein denoted SME), computed with an exponential moving average. SME is computed for each parameter as done in Adam (Kingma and Ba, 2015), and we plot the value averaged over all parameters. We refer the reader to § E for details on these metrics.
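Concretely, the SME we plot is Adam’s second-moment accumulator, tracked per parameter; a minimal sketch (beta2 set to Adam’s default):

def update_sme(sme, grad, beta2=0.999):
    # Exponential moving average of the squared gradient (Adam's v_t),
    # maintained per parameter; we then average over all parameters for the plot.
    return beta2 * sme + (1.0 - beta2) * grad ** 2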

Optimization methods.

We conduct experiments using the following optimization methods for GANs: (i) BatchE: full-batch extragradient, (ii) SE: stochastic extragradient, and (iii) SVRE: stochastic variance reduced extragradient. These can be combined with adaptive learning rates such as Adam, or with parameter averaging and a warm-start of it, hereafter denoted as –A, AVG– and WA–, respectively. BatchE–A and SE–A thus denote full-batch and stochastic extragradient using Adam. In § C.1, we present a variant of Adam adapted to variance reduced algorithms, referred to as VRAd. All our experiments use mini-batching with uniform sampling probabilities; the mini-batch version of SVRE for GANs is described in more detail in § C.2.

DNN architectures.

For the experiments on MNIST, we use the DCGAN architecture (Radford et al., 2016). For the remaining datasets, we used the SAGAN architecture (Zhang et al., 2018), as it was used by Brock et al. (2019) to demonstrate state-of-the-art performance on ImageNet. The focus of our experiments is to compare SVRE with baselines in a normalized and realistic setting. However, we lack the computational resources needed for the extensive hyperparameter tuning typically required to obtain new state-of-the-art results with the latest architectures. In particular, for efficiency, we used architectures of approximately half the depth of the CIFAR-10 architectures listed in Miyato et al. (2018) (see § E for the details of our implementation).

Figure 4: FID comparison (lower is better) between SVRE and the SE–A baseline on ImageNet.
Figure 5: Comparison of the SVRE methods with the SE baseline on SVHN using the FID metric (lower is better). SE–A did not converge and is omitted for clarity.
Inception Score Fréchet Inception Distance
SE–A SVRE SVRE–V SE–A SVRE SVRE–V
MNIST
CIFAR-10
SVHN
ImageNet
Table 2: Summary of the best IS and FID scores obtained by the different optimization methods for a fixed number of iterations (see § E), where SVRE–V denotes SVRE–VRAd (see § 5). The standard deviation of the Inception scores is small and is omitted. Although the IS metric is not very meaningful on SVHN due to the dataset’s properties (see § E.1), we include it for completeness.

5.1 Results

We first conducted experiments on the common MNIST benchmark, for which full-batch extragradient is feasible to compute. Fig. 3 depicts the IS metric when using the stochastic, full-batch or variance reduced version of extragradient (see § C.2 for details of SVRE-GAN). For the stochastic baseline (SE), we always use the Adam optimization method, as originally proposed (Gidel et al., 2019). From Fig. 2(a), we observe that in terms of the number of parameter updates, SVRE performs similarly to BatchE–A. Note, however, that the latter requires significantly more computation: Fig. 2(b) depicts the IS metric using the number of mini-batch computations on the x-axis (a surrogate for wall-clock time, see the discussion below). We observe that, as SE–A has a slower per-iteration convergence rate, SVRE converges faster on this set of experiments.

Computational cost.

The relative cost of one pass over the dataset for SVRE versus vanilla SGD is a factor of 5: the full-batch gradient is computed (on average) after one pass over the dataset, giving a slowdown of 2; the remaining factor accounts for the extra stochastic gradient computations required by the variance reduction, as well as for the extrapolation step overhead (a back-of-the-envelope count follows below). Note that some training methods for GANs have a similar overhead, e.g., Arjovsky et al. (2017) used 5 discriminator updates per generator update when training WGAN. Moreover, as SVRE provides less noisy gradients, it may converge faster per iteration, compensating for the extra per-update cost. In Fig. 2(b), the x-axis uses an implementation-independent surrogate for wall-clock time that counts the number of mini-batch gradient computations, which are the bottleneck step.
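One way to account for this factor, counting stochastic-gradient evaluations per pass of n samples (a back-of-the-envelope count of ours):

    n  (snapshot full gradient, one per epoch on average)
    + n inner steps × 2 points (extrapolation and update) × 2 terms ((6)-(7) evaluate the sampled gradient at both the current and the snapshot point)
    = n + 4n = 5n,

i.e., 5 times the n gradient evaluations of one SGD pass, of which the snapshot alone accounts for the factor of 2 mentioned above.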

Second moment estimate and Adam.

Fig. 2(c) depicts the averaged second-moment estimate for the parameters of the generator, where we observe that SVRE effectively reduces it over the iterations. The reduction of these values may be the reason why Adam combined with SVRE performs poorly (these values appear in the denominator of Adam’s update, see § C.1). Nonetheless, SVRE produces good results on several high-dimensional datasets. To our knowledge, SVRE is the first optimization method with a constant step size that has worked empirically for GANs on non-trivial datasets.

Comparison on larger datasets.

In Figs. 4 and 5, we compare SVRE with the SE–A baseline on the ImageNet and SVHN datasets. We observe that, although SE–A sometimes obtains better performance in the early iterations (it starts to converge faster), SVRE obtains better final performance. In Tab. 2, we summarize the best scores obtained on MNIST, CIFAR-10, SVHN and ImageNet. Note that our experiments were devised to investigate whether SVRE has potential advantages in a realistic GAN setup; we did not perform an exhaustive hyperparameter search or multiple runs with different random seeds, due to a lack of computational resources. § F lists additional experimental analyses.

Comparison using deeper architectures on CIFAR-10 and SVHN.

We report here, as a reference, our preliminary results on CIFAR-10 and SVHN using the deeper ResNet (He et al., 2015) architectures used for these datasets in Miyato et al. (2018) (see § E.2). We compare SE–A and SVRE after the same fixed number of training iterations (see § F for the detailed FID scores on SVHN and CIFAR-10). We observe that these experiments are less stable, and a fair comparison requires a more exhaustive hyper-parameter search. More precisely, SE–A diverged in a notable fraction of the experiments, although using more discriminator updates per generator update improved its stability. On the other hand, SVRE did not diverge in any of the experiments, but with these deeper architectures we observe that SVRE takes longer to converge than SE–A. Moreover, warm-starting SVRE from the SE–A baseline and continuing training further improved the FID. This could indicate that SVRE may benefit substantially from being combined with an adaptive step-size method, which we leave for future work.

6 Related work

The two main stochastic variance reduction algorithms considered in the literature are SAGA (Defazio et al., 2014) and SVRG (Johnson and Zhang, 2013). The former is memory based, storing the latest stochastic gradient seen for each sample, whereas the latter is epoch based, computing a full-batch gradient at the beginning of each epoch. Interestingly, Hofmann et al. (2015) provide a randomized version (for the epoch length) of SVRG under a framework unifying SVRG and SAGA, and it is under that framework that we provide our analysis, thus proving convergence at the same time for SVRE and for SAGA combined with extragradient.

Surprisingly, there exist only a few works on variance reduction methods for monotone operators, namely Palaniappan and Bach (2016) and Davis (2016). As mentioned by Palaniappan and Bach (2016), the latter requires a co-coercivity assumption on the operator, and thus only convex optimization is considered. Our work provides a new way to use variance reduction for monotone operators, using the extragradient method (Korpelevich, 1976). Recently, Iusem et al. (2017) proposed an extragradient method with variance reduction for an infinite sum of operators. They use mini-batches of growing size in order to reduce the variance of their algorithm and to converge with a constant step-size; however, this growing mini-batch size is prohibitively expensive in our application. Moreover, they do not use the SAGA/SVRG style of updates exploiting the finite-sum formulation, leading to a sublinear convergence rate, while our method benefits from a linear convergence rate by exploiting the finite-sum assumption.

Daskalakis et al. (2018) proposed a method called Optimistic-Adam, inspired by game theory. This method is closely related to extragradient, with a slightly different update scheme. More recently, Gidel et al. (2019) proposed to use extragradient to train GANs, introducing a method called ExtraAdam, which outperformed Optimistic-Adam when trained on CIFAR-10. Our work is also an attempt to find principled ways to train GANs. Considering that the game aspect is better handled by the extragradient method, we focus on the optimization issues arising from the noise in the training procedure, an under-considered potential issue in GAN training.

Noise in Deep Learning and Games.

In the context of deep learning, despite some very interesting theoretical results on non-convex minimization (Reddi et al., 2016; Allen-Zhu and Hazan, 2016), the effectiveness of variance reduced methods is still an open question, and a recent technical report by Defazio and Bottou (2018) provides negative empirical results on the variance reduction aspect.

In addition, a recent large-scale study (Shallue et al., 2018) observed that increasing the batch size has only a marginal impact on single-objective training, while in contrast it has a substantial effect on GAN results (Brock et al., 2019). In our work, we are able to show positive results for variance reduction in a real-world deep learning setting. This surprising difference seems to confirm the remarkable, and still poorly understood, discrepancy between multi-objective optimization and standard minimization.

7 Conclusion

We considered a bilinear game and, despite its simplicity compared to real-world GAN optimization problems, showed that stochasticity breaks the convergence of standard stochastic methods on it. We proposed the stochastic variance reduced extragradient (SVRE) method, which combines SVRG with the extragradient method for optimizing games. On the theory side, SVRE improves upon the previous best results for strongly-convex games, whereas empirically it is the only stochastic method that converges on the bilinear game.

Our results showed that SVRE obtains a convergence speed empirically similar to Batch-Extragradient on MNIST, while the latter is computationally infeasible for larger datasets. For standard-size convolutional neural networks, SVRE improved the final performance on 2 out of 4 datasets.

Our preliminary experiments with deeper architectures show that SVRE is notably more stable with respect to the choice of hyperparameters: its stochastic counterpart diverged in many of our experiments, whereas SVRE did not. However, SVRE takes more iterations to converge when using deeper architectures, suggesting that it could be further extended with an adaptive step-size scheme. Considering the notable instabilities of such architectures under standard optimization methods, our results suggest that variance reduction may be necessary for improving GAN training, in contrast to single-objective optimization.

Acknowledgements

This research was partially supported by the Canada CIFAR AI Chair Program, the Canada Excellence Research Chair in “Data Science for Realtime Decision-making”, by the NSERC Discovery Grant RGPIN-2017-06936 and by a Google Focused Research award. Authors would like to thank Compute Canada for providing the GPUs used for this research.

References

  • Allen-Zhu and Hazan (2016) Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In ICML, 2016.
  • Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • Bottou (2010) L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Brock et al. (2019) A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
  • Daskalakis et al. (2018) C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.
  • Davis (2016) D. Davis. Smart: The stochastic monotone aggregated root-finding algorithm. arXiv:1601.00698, 2016.
  • Defazio and Bottou (2018) A. Defazio and L. Bottou. On the ineffectiveness of variance reduced optimization for deep learning. arXiv:1812.04529, 2018.
  • Defazio et al. (2014) A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.
  • Gidel et al. (2019) G. Gidel, H. Berard, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial nets. In ICLR (to appear), 2019.
  • Glorot and Bengio (2010) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • Harker and Pang (1990) P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical programming, 1990.
  • He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
  • Heusel et al. (2017) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
  • Hofmann et al. (2015) T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In NIPS, 2015.
  • Iusem et al. (2017) A. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.
  • Johnson and Zhang (2013) R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
  • Juditsky et al. (2011) A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 2011.
  • Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Korpelevich (1976) G. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 1976.
  • Krizhevsky (2009) A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, 2009.
  • Leblond et al. (2018) R. Leblond, F. Pederegosa, and S. Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. JMLR, 19(81):1–68, 2018.
  • Lecun and Cortes. Y. Lecun and C. Cortes. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.
  • Li et al. (2018) J. Li, A. Madry, J. Peebles, and L. Schmidt. On the limitations of first order approximation in GAN dynamics. In ICML, 2018.
  • Lim and Ye (2017) J. H. Lim and J. C. Ye. Geometric GAN. arXiv:1705.02894, 2017.
  • Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NIPS, 2017.
  • Miyato et al. (2018) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • Netzer et al. (2011) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011. URL http://ufldl.stanford.edu/housenumbers/.
  • Palaniappan and Bach (2016) B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In NIPS, 2016.
  • Radford et al. (2016) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • Reddi et al. (2016) S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In ICML, 2016.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
  • Russakovsky et al. (2015) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
  • Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • Schaul et al. (2013) T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In ICML, 2013.
  • Schmidt et al. (2017) M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2017.
  • Shallue et al. (2018) C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv:1811.03600, 2018.
  • Szegedy et al. (2015) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015.
  • Wilson et al. (2017) A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In NIPS, 2017.
  • Xiao and Zhang (2014) L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
  • Zhang et al. (2018) H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-Attention Generative Adversarial Networks. arXiv:1805.08318, 2018.

Appendix A Definitions and Lemmas

A.1 Smoothness and Monotonicity of the Operator

Another important property used is the Lipschitzness of an operator.

Definition 2.

A mapping F : ℝ^d → ℝ^d is said to be L-Lipschitz if, for all ω, ω' ∈ ℝ^d,

    ‖F(ω) − F(ω')‖ ≤ L ‖ω − ω'‖.   (13)
Definition 3.

A differentiable function f is said to be μ-strongly convex if, for all ω, ω' ∈ ℝ^d,

    f(ω') ≥ f(ω) + ∇f(ω)ᵀ(ω' − ω) + (μ/2) ‖ω' − ω‖².   (14)
Definition 4.

A function L is said to be convex-concave if L(·, φ) is convex for all φ and L(θ, ·) is concave for all θ. A function L is said to be μ-strongly convex-concave if (θ, φ) ↦ L(θ, φ) − (μ/2)‖θ‖² + (μ/2)‖φ‖² is convex-concave.

Definition 5.

For μ > 0, an operator F is said to be μ-strongly monotone if, for all ω, ω',

    ⟨F(ω) − F(ω'), ω − ω'⟩ ≥ μ ‖ω − ω'‖²,

where we write ω := (θ, φ).

Appendix B Proof of Theorems

B.1 Proof of Theorem 2

Proof.

Let us consider the optimization problem

    min_{θ∈ℝⁿ} max_{φ∈ℝⁿ} (1/n) Σ_{i=1}^n θᵀ A_i φ,   (15)

where [A_i]_{kl} = n if k = l = i and 0 otherwise. Note that (1/n) Σ_{i=1}^n A_i = I_n, so the objective reduces to θᵀφ. Let us consider the extragradient method where, to compute an unbiased estimator of the gradients at (θ, φ), we sample a mini-batch I and use (A_I φ, A_I θ) as the estimator of the vector flow.

In this proof we write b := |I|, A_I := (1/b) Σ_{i∈I} A_i, and ε_I for the vector such that [ε_I]_k = 1 if k ∈ I and 0 otherwise. Note that A_I = (n/b) diag(ε_I) and that E[A_I] = I_n.

Thus the extragradient update rule can be written as

    θ_{t+1} = (I_n − η² A_I A_J) θ_t − η A_I φ_t,
    φ_{t+1} = (I_n − η² A_I A_J) φ_t + η A_I θ_t,   (16)

where I is the mini-batch (without replacement) sampled for the update and J is the mini-batch (without replacement) sampled for the extrapolation.

We can thus notice that, coordinate-wise, when k ∈ I we have

    [θ_{t+1}]_k = (1 − η² (n²/b²) 1_{k∈J}) [θ_t]_k − η (n/b) [φ_t]_k,
    [φ_{t+1}]_k = (1 − η² (n²/b²) 1_{k∈J}) [φ_t]_k + η (n/b) [θ_t]_k,   (17)

and otherwise, when k ∉ I,

    [θ_{t+1}]_k = [θ_t]_k,   [φ_{t+1}]_k = [φ_t]_k.   (18)

The intuition is that, on the one hand, when k ∈ I \ J (which happens with high probability when b ≪ n, e.g., with probability 1 − 1/n given k ∈ I when b = 1), the algorithm performs a plain, non-extrapolated update that moves away from the Nash equilibrium:

    [θ_{t+1}]_k² + [φ_{t+1}]_k² = (1 + η² n²/b²) ([θ_t]_k² + [φ_t]_k²).   (19)

On the other hand, the “good”, extrapolated updates only happen when k ∈ I ∩ J (which happens with low probability, e.g., with probability 1/n given k ∈ I when b = 1):

    [θ_{t+1}]_k² + [φ_{t+1}]_k² = ((1 − η² n²/b²)² + η² n²/b²) ([θ_t]_k² + [φ_t]_k²).   (20)

Conditioning on the events k ∈ I and k ∈ I ∩ J, we get that

    P(k ∈ I) = b/n   and   P(k ∈ I ∩ J) = b²/n².   (21)

Leading to,

    E[1_{k∈I\J}] = b/n − b²/n²   and   E[1_{k∈I∩J}] = b²/n².   (22)

Plugging these expectations into (19) and (20), we get that,

    E[[θ_{t+1}]_k² + [φ_{t+1}]_k²] = (1 + η²(n/b − 2) + η⁴ n²/b²) E[[θ_t]_k² + [φ_t]_k²].   (23)

Consequently, summing over the coordinates and writing N_t := E[‖θ_t‖² + ‖φ_t‖²],

    N_{t+1} = (1 + η²(n/b − 2) + η⁴ n²/b²) N_t =: ρ N_t.   (24)

Thus, N_t converges if and only if η²(n/b − 2) + η⁴ n²/b² < 0, which is not possible when b ≤ n/2.

To sum up, if the mini-batch size b is not large enough (more precisely, if b ≤ n/2), we have geometric divergence of the quantity N_t = E[‖θ_t‖² + ‖φ_t‖²]. ∎

B.2 Proof of Theorem 3

Setting of the Proof.

We will prove a slightly more general result than Theorem 3, working in the context of monotone operators. Let us consider the general extrapolation update rule,

    ω_{t+1/2} = P_Ω(ω_t − η d_t),   ω_{t+1} = P_Ω(ω_t − η d_{t+1/2}),   (25)

where d_t depends on ω_t and d_{t+1/2} depends on ω_{t+1/2}. Here P_Ω denotes the Euclidean projection operator onto Ω, i.e., P_Ω(ω) := argmin_{ω'∈Ω} ‖ω − ω'‖².

This update rule generalizes (EG) for 2-player games (2P-G) and ExtraSVRG (Alg. 2).

Let us first state a standard lemma from convex analysis (see for instance [Boyd and Vandenberghe, 2004]).

Lemma 1.

Let ω ∈ Ω and ω⁺ := P_Ω(ω + u); then for all ω' ∈ Ω we have,

    ‖ω⁺ − ω'‖² ≤ ‖ω − ω'‖² + 2 uᵀ(ω⁺ − ω') − ‖ω⁺ − ω‖².   (26)

Proof of Lemma 1.

We start by simply developing,

    ‖ω⁺ − ω'‖² = ‖ω − ω'‖² − ‖ω⁺ − ω‖² + 2 (ω⁺ − ω)ᵀ(ω⁺ − ω').

Then, since ω⁺ is the projection of ω + u onto the convex set Ω, we have (ω⁺ − (ω + u))ᵀ(ω⁺ − ω') ≤ 0, i.e., (ω⁺ − ω)ᵀ(ω⁺ − ω') ≤ uᵀ(ω⁺ − ω'), leading to the result of the Lemma. ∎

Lemma 2.

If F is μ-strongly monotone, then for any ω, ω' ∈ Ω we have,

    ⟨F(ω), ω − ω*⟩ ≥ (μ/2) ‖ω' − ω*‖² − μ ‖ω' − ω‖²,   (27)

where we write ω* for a solution, i.e., a point such that ⟨F(ω*), ω − ω*⟩ ≥ 0 for all ω ∈ Ω.

Proof.

By μ-strong monotonicity and optimality of ω*,

    ⟨F(ω), ω − ω*⟩ ≥ ⟨F(ω*), ω − ω*⟩ + μ ‖ω − ω*‖²   (28)
    ≥ μ ‖ω − ω*‖²,   (29)

and then we use the inequality ‖ω' − ω*‖² ≤ 2 ‖ω − ω*‖² + 2 ‖ω' − ω‖² to get the result claimed. ∎

Using this update rule, we can deduce the following lemma; its derivation is very similar to that of Harker and Pang [1990, Lemma 12.1.10].

Lemma 3.

Considering the update rule (25), we have for any ω' ∈ Ω and any t ≥ 0,

    ‖ω_{t+1} − ω'‖² ≤ ‖ω_t − ω'‖² − 2η d_{t+1/2}ᵀ(ω_{t+1/2} − ω') + η² ‖d_{t+1/2} − d_t‖² − ‖ω_{t+1/2} − ω_t‖².   (30)

Proof.

Applying Lem. 1 with (ω, u, ω⁺, ω') = (ω_t, −η d_{t+1/2}, ω_{t+1}, ω') and with (ω, u, ω⁺, ω') = (ω_t, −η d_t, ω_{t+1/2}, ω_{t+1}), we get,