# Train simultaneously, generalize better: Stability of gradient-based minimax learners

## Abstract

The success of minimax learning problems of generative adversarial networks (GANs) has been observed to depend on the minimax optimization algorithm used for their training. This dependence is commonly attributed to the convergence speed and robustness properties of the underlying optimization algorithm. In this paper, we show that the optimization algorithm also plays a key role in the generalization performance of the trained minimax model. To this end, we analyze the generalization properties of standard gradient descent ascent (GDA) and proximal point method (PPM) algorithms through the lens of algorithmic stability under both convex concave and non-convex non-concave minimax settings. While the GDA algorithm is not guaranteed to have a vanishing excess risk in convex concave problems, we show the PPM algorithm enjoys a bounded excess risk in the same setup. For non-convex non-concave problems, we compare the generalization performance of stochastic GDA and GDmax algorithms where the latter fully solves the maximization subproblem at every iteration. Our generalization analysis suggests the superiority of GDA provided that the minimization and maximization subproblems are solved simultaneously with similar learning rates. We discuss several numerical results indicating the role of optimization algorithms in the generalization of the learned minimax models.

## 1 Introduction

Minimax learning frameworks including generative adversarial networks (GANs) (Goodfellow et al., 2014) and adversarial training (Madry et al., 2017) have recently achieved great success over a wide array of learning tasks. In these frameworks, the learning problem is modeled as a zero-sum game between a “min” and a “max” player that is solved by a minimax optimization algorithm. The minimax optimization problem of these learning frameworks is typically formulated using deep neural networks, which greatly complicates the theoretical and numerical analysis of the optimization problem. Current studies in the machine learning literature focus on fundamental understanding of general minimax problems with emphasis on convergence speed and optimality.

The primary focus of optimization-related studies of minimax learning problems has been on the convergence and robustness properties of minimax optimization algorithms. Several recently proposed algorithms have been shown to achieve faster convergence rates and more robust behavior around local solutions. However, training speed and robustness are not the only factors required for the success of a minimax optimization algorithm in a learning task. In this work, our goal is to show that the generalization performance of the learned minimax model is another key property that is influenced by the underlying optimization algorithm. To this end, we present theoretical and numerical results demonstrating that:

Different minimax optimization algorithms can learn models with different generalization properties.

In order to analyze the generalization behavior of minimax optimization algorithms, we use the algorithmic stability framework as defined by Bousquet and Elisseeff (2002) for general learning problems and applied by Hardt et al. (2016) for analyzing stochastic gradient descent. Our extension of the stability approach of Bousquet and Elisseeff (2002) to minimax settings allows us to analyze and compare the generalization properties of standard gradient descent ascent (GDA) and proximal point method (PPM) algorithms. Furthermore, we compare the generalization performance between the following two types of minimax optimization algorithms: 1) simultaneous update algorithms such as GDA where the minimization and maximization subproblems are simultaneously solved, and 2) non-simultaneous update algorithms such as GDmax where the maximization variable is fully optimized at every iteration.

In our generalization analysis, we consider both the traditional convex concave and general non-convex non-concave classes of minimax optimization problems. For convex concave minimax problems, our bounds indicate a similar generalization performance for simultaneous and non-simultaneous update methods. Specifically, we show that for strongly-convex strongly-concave minimax problems all the mentioned algorithms have a bounded generalization risk on the order of $O(1/n)$, with $n$ denoting the number of training samples. However, in general convex concave problems we show that the GDA algorithm with a constant learning rate is not guaranteed to have a bounded generalization risk. On the other hand, we prove that proximal point methods still achieve a controlled generalization risk, resulting in a vanishing excess risk with respect to the best minimax learner with the optimal performance on the underlying distribution.

For more general minimax problems, our results indicate that models trained by simultaneous and non-simultaneous update algorithms can achieve different generalization performances. Specifically, we consider the class of non-convex strongly-concave problems where we establish stability-based generalization bounds for both stochastic GDA and GDmax algorithms. Our generalization bounds indicate that the stochastic GDA learner is expected to generalize better provided that the min and max players are trained simultaneously with similar learning rates. In addition, we show a generalization bound for the stochastic GDA algorithm in general non-convex non-concave problems, which further supports the simultaneous optimization of the two min and max players in general minimax settings. Our results indicate that simultaneous training of the two players not only can provide faster training, but also can learn a model with better generalization performance. Our generalization analysis, therefore, revisits the notion of implicit competitive regularization introduced by Schäfer et al. (2019) for simultaneous gradient methods in training GANs.

Finally, we discuss the results of our numerical experiments and compare the generalization performance of GDA and PPM algorithms in convex concave settings and single-step and multi-step gradient-based methods in non-convex non-concave GAN problems. Our numerical results also suggest that in general non-convex non-concave problems the models learned by simultaneous optimization algorithms can generalize better than the models learned by non-simultaneous optimization methods. We can summarize the main contributions of this paper as follows:

• Extending the algorithmic stability framework for analyzing generalization in minimax settings,

• Analyzing the generalization properties of minimax models learned by GDA and PPM algorithms in convex concave problems,

• Studying the generalization of stochastic GDA and GDmax learners in non-convex non-concave problems,

• Providing numerical results on the role of optimization algorithms in the generalization performance of learned minimax models.

## 2 Related Work

Generalization in GANs: Several related papers have studied the generalization properties of GANs. Arora et al. (2017) study the generalization behavior of GANs’ learned models and prove a uniform convergence generalization bound in terms of the number of the discriminator’s parameters. Wu et al. (2019) connect the algorithmic stability notion to differential privacy in GANs and numerically analyze the generalization behavior of GANs. References (Zhang et al., 2017; Bai et al., 2018) show uniform convergence bounds for GANs by analyzing the Rademacher complexity of the players. Feizi et al. (2020) provide a uniform convergence bound for the W2GAN problem. Unlike the mentioned related papers, our work provides algorithm-dependent generalization bounds by analyzing the stability of gradient-based optimization algorithms. Also, the related works (Arora and Zhang, 2017; Thanh-Tung et al., 2019) conduct empirical studies of generalization in GANs using birthday paradox-based and gradient penalty-based approaches, respectively.

Generalization in adversarial training: Understanding generalization in the context of adversarial training has recently received great attention. Schmidt et al. (2018) show that in a simplified Gaussian setting generalization in adversarial training requires more training samples than standard non-adversarial learning. Farnia et al. (2018); Yin et al. (2019); Khim and Loh (2018); Wei and Ma (2019); Attias et al. (2019) prove uniform convergence generalization bounds for adversarial training schemes through PAC-Bayes (McAllester, 1999; Neyshabur et al., 2017b), Rademacher analysis, margin-based, and VC analysis approaches. Zhai et al. (2019) study the value of unlabeled samples in obtaining a better generalization performance in adversarial training. We note that, unlike our work, the generalization analyses in the mentioned papers prove uniform convergence results. In another related work, Rice et al. (2020) empirically study the generalization performance of adversarially-trained models and suggest that the generalization behavior can significantly change during training.

Stability-based generalization analysis: Algorithmic stability and its connections to the generalization properties of learning algorithms have been studied in several related works. Shalev-Shwartz et al. (2010) discuss learning problems where learnability is feasible considering algorithmic stability, while it is infeasible with uniform convergence. Hardt et al. (2016) bound the generalization risk of the stochastic gradient descent learner by analyzing its algorithmic stability. Feldman and Vondrak (2018, 2019); Bousquet et al. (2020) provide sharper stability-based generalization bounds for standard learning problems. While the above works focus on standard learning problems with a single learner, we use algorithmic stability to analyze generalization in minimax settings with two players.

Connections between generalization and optimization in deep learning: The connections between generalization and optimization in deep learning have been studied in several related works. Analyses of the double descent phenomenon (Belkin et al., 2019; Nakkiran et al., 2019; Mei and Montanari, 2019), the effect of overparameterization on generalization (Li and Liang, 2018; Allen-Zhu et al., 2019; Arora et al., 2019; Cao and Gu, 2019; Wei et al., 2019; Bietti and Mairal, 2019; Allen-Zhu and Li, 2019; Ongie et al., 2019; Ji and Telgarsky, 2019; Bai and Lee, 2019), and the sharpness of local minima (Keskar et al., 2016; Dinh et al., 2017; Neyshabur et al., 2017a) have been carried out to understand the implicit regularization of gradient methods in deep learning (Neyshabur et al., 2014; Zhang et al., 2016; Ma et al., 2018; Lyu and Li, 2019; Chatterjee, 2020). Schäfer et al. (2019) extend the notion of implicit regularization to simultaneous gradient methods in GAN settings and discuss an optimization-based perspective on this regularization mechanism. In contrast, we focus on the generalization aspect of the implicit regularization mechanism. Also, Nagarajan and Kolter (2019) suggest that uniform convergence bounds may be unable to explain generalization in supervised deep learning.

Analyzing convergence and stability of minimax optimization algorithms: A large body of related papers (Heusel et al., 2017; Sanjabi et al., 2018; Lin et al., 2019; Schäfer and Anandkumar, 2019; Fiez et al., 2019; Nouiehed et al., 2019; Hsieh et al., 2019; Du and Hu, 2019; Wang et al., 2019; Mazumdar et al., 2019; Thekumparampil et al., 2019; Farnia and Ozdaglar, 2020; Mazumdar et al., 2020; Zhang et al., 2020) study convergence properties of first-order and second-order minimax optimization algorithms. Also, the related works (Daskalakis et al., 2017; Daskalakis and Panageas, 2018; Gidel et al., 2018; Liang and Stokes, 2019; Mokhtari et al., 2020) analyze the convergence behavior of optimistic methods and extra gradient (EG) methods as approximations of the proximal point method. We also note that we use the algorithmic stability notion as defined by Bousquet and Elisseeff (2002), which is different from the local and global stability properties of GDA methods around optimal solutions studied in the related papers (Mescheder et al., 2017; Nagarajan and Kolter, 2017; Mescheder et al., 2018; Feizi et al., 2020).

## 3 Preliminaries

In this paper, we focus on two standard families of minimax optimization algorithms: Gradient Descent Ascent (GDA) and Proximal Point Method (PPM). To review the update rules of these algorithms, consider the following minimax optimization problem for minimax objective $f(w,\theta)$ and feasible sets $\mathcal{W},\Theta$:

$$\min_{w\in\mathcal{W}}\ \max_{\theta\in\Theta}\ f(w,\theta). \tag{1}$$

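Then, for stepsize values $\alpha_w,\alpha_\theta$, the following are the update rules of GDA and GDmax:

$$G_{\rm GDA}\!\left(\begin{bmatrix} w\\ \theta\end{bmatrix}\right):=\begin{bmatrix} w-\alpha_w\nabla_w f(w,\theta)\\ \theta+\alpha_\theta\nabla_\theta f(w,\theta)\end{bmatrix},\qquad G_{\rm GDmax}\!\left(\begin{bmatrix} w\\ \theta\end{bmatrix}\right):=\begin{bmatrix} w-\alpha_w\nabla_w f(w,\theta)\\ \arg\max_{\tilde\theta\in\Theta} f(w,\tilde\theta)\end{bmatrix}. \tag{2}$$

In the above, $\arg\max_{\tilde\theta\in\Theta} f(w,\tilde\theta)$ is the optimal maximizer for $f(w,\cdot)$. Also, given stepsize parameter $\eta$, the update rule of PPM is as follows:

$$G_{\rm PPM}\!\left(\begin{bmatrix} w\\ \theta\end{bmatrix}\right):=\underset{\tilde w\in\mathcal{W}}{\arg\min}\ \underset{\tilde\theta\in\Theta}{\arg\max}\ \Big\{ f(\tilde w,\tilde\theta)+\frac{1}{2\eta}\|\tilde w-w\|_2^2-\frac{1}{2\eta}\|\tilde\theta-\theta\|_2^2\Big\}. \tag{3}$$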

In the Appendix, we also consider and analyze the PPmax algorithm that is a proximal point method fully solving the maximization subproblem at every iteration. Throughout the paper, we commonly use the following assumptions on the Lipschitzness and smoothness of the minimax objective.
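As a rough illustration of how the three update rules differ, the sketch below implements one step of GDA, GDmax, and PPM for a scalar toy objective $f(w,\theta)=\frac{\mu}{2}w^2+w\theta-\frac{\mu}{2}\theta^2$ with saddle point $(0,0)$. The objective, the function names, and the GDmax indexing used here (a descent step with the current $\theta$, while $\theta$ is reset to the exact maximizer for the current $w$) are assumptions made for illustration; they are not taken from the paper's experiments.

```python
# Toy objective (an illustrative assumption):
#   f(w, theta) = (mu/2) * w^2 + w * theta - (mu/2) * theta^2

def grad_w(w, theta, mu):
    return mu * w + theta

def grad_theta(w, theta, mu):
    return w - mu * theta

def gda_step(w, theta, mu, alpha_w, alpha_theta):
    """Simultaneous step: both gradients are evaluated at the current iterate."""
    return (w - alpha_w * grad_w(w, theta, mu),
            theta + alpha_theta * grad_theta(w, theta, mu))

def gdmax_step(w, theta, mu, alpha_w):
    """Descent step on w; theta is reset to argmax_theta f(w, theta) = w / mu."""
    return w - alpha_w * grad_w(w, theta, mu), w / mu

def ppm_step(w, theta, mu, eta):
    """Implicit (proximal) step: gradients are evaluated at the new iterate.

    For this quadratic the conditions
        w' = w - eta * grad_w(w', theta'),  theta' = theta + eta * grad_theta(w', theta')
    form a 2x2 linear system, solved here in closed form.
    """
    a = 1.0 + eta * mu
    det = a * a + eta * eta
    return (a * w - eta * theta) / det, (a * theta + eta * w) / det
```

The only structural difference between `gda_step` and `ppm_step` is where the gradient is evaluated (current versus next iterate), which is what makes the proximal step implicit.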

###### Assumption 1.

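$f(w,\theta)$ is jointly $L$-Lipschitz in $(w,\theta)$ and $L_w$-Lipschitz in $w$ over $\mathcal{W}\times\Theta$, i.e., for every $w,w'\in\mathcal{W},\ \theta,\theta'\in\Theta$ we have

$$\big|f(w,\theta)-f(w',\theta')\big|\le L\sqrt{\|w-w'\|_2^2+\|\theta-\theta'\|_2^2},\qquad \big|f(w,\theta)-f(w',\theta)\big|\le L_w\|w-w'\|_2. \tag{4}$$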
###### Assumption 2.

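$f(w,\theta)$ is continuously differentiable and $\ell$-smooth on $\mathcal{W}\times\Theta$, i.e., $[\nabla_w f(w,\theta),\nabla_\theta f(w,\theta)]$ is $\ell$-Lipschitz on $\mathcal{W}\times\Theta$ and for every $w,w'\in\mathcal{W},\ \theta,\theta'\in\Theta$ we have

$$\sqrt{\|\nabla_w f(w,\theta)-\nabla_w f(w',\theta')\|_2^2+\|\nabla_\theta f(w,\theta)-\nabla_\theta f(w',\theta')\|_2^2}\ \le\ \ell\sqrt{\|w-w'\|_2^2+\|\theta-\theta'\|_2^2}. \tag{5}$$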

We focus on several classes of minimax optimization problems based on the convexity properties of the objective function. Note that a differentiable function $g(u)$ is called convex in $u$ if it satisfies the following inequality for every $u_1,u_2$:

$$g(u_2)\ \ge\ g(u_1)+\nabla g(u_1)^\top(u_2-u_1). \tag{6}$$

Furthermore, $g$ is called $\mu$-strongly-convex if for every $u_1,u_2$ it satisfies

$$g(u_2)\ \ge\ g(u_1)+\nabla g(u_1)^\top(u_2-u_1)+\frac{\mu}{2}\|u_2-u_1\|_2^2. \tag{7}$$

Also, $g$ is called concave and $\mu$-strongly-concave if $-g$ is convex and $\mu$-strongly-convex, respectively.

###### Definition 1.

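Consider convex feasible sets $\mathcal{W},\Theta$ in minimax problem (1). Then,

• The problem is called convex concave if $f(\cdot,\theta)$ and $f(w,\cdot)$ are respectively convex and concave functions for every $w\in\mathcal{W},\ \theta\in\Theta$.

• The problem is called $\mu$-strongly-convex strongly-concave if $f(\cdot,\theta)$ and $f(w,\cdot)$ are respectively $\mu$-strongly-convex and $\mu$-strongly-concave functions for every $w\in\mathcal{W},\ \theta\in\Theta$.

• The problem is called non-convex $\mu$-strongly-concave if $f(w,\cdot)$ is $\mu$-strongly-concave for every $w\in\mathcal{W}$.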

## 4 Stability-based Generalization Analysis in Minimax Settings

Consider the following optimization problem for a minimax learning task:

$$\min_{w\in\mathcal{W}}\ \max_{\theta\in\Theta}\ R(w,\theta):=\mathbb{E}_{Z\sim P_Z}\big[f(w,\theta;Z)\big] \tag{8}$$

The above minimax objective represents a cost function $f(w,\theta;Z)$ for minimization and maximization variables $w$ and $\theta$ and data variable $Z$ that is averaged under the underlying distribution $P_Z$. We call the objective function $R(w,\theta)$ the true minimax risk. We also define $R(w)$ as the worst-case minimax risk over the maximization variable $\theta$:

$$R(w):=\max_{\theta\in\Theta}R(w,\theta) \tag{9}$$

In the context of GANs, the worst-case risk $R(w)$ represents a divergence measure between the learned and true distributions, and in the context of adversarial training it represents the learner’s risk under adversarial perturbations. Since the learner does not have access to the underlying distribution $P_Z$, we estimate the minimax objective using the empirical samples in dataset $S=(z_1,\ldots,z_n)$ which are drawn according to $P_Z$. We define the empirical minimax risk as:

$$R_S(w,\theta):=\frac{1}{n}\sum_{i=1}^{n}f(w,\theta;z_i). \tag{10}$$

Then, the worst-case empirical risk over the maximization variable $\theta$ is defined as

$$R_S(w):=\max_{\theta\in\Theta}R_S(w,\theta). \tag{11}$$

We define the minimax generalization risk $\epsilon_{\rm gen}(w)$ of minimization variable $w$ as the difference between the worst-case true and empirical risks:

$$\epsilon_{\rm gen}(w):=R(w)-R_S(w). \tag{12}$$

The above generalization score measures the difference between the true and empirical worst-case minimax risks. For a randomized algorithm $A$ which outputs random outcome $A(S)=(A_w(S),A_\theta(S))$ for dataset $S$, we define $A$’s expected generalization risk as

$$\epsilon_{\rm gen}(A):=\mathbb{E}_{S,A}\big[R(A_w(S))-R_S(A_w(S))\big]. \tag{13}$$
###### Definition 2.

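A randomized minimax optimization algorithm $A$ is called $\epsilon$-uniformly stable in minimization if for every two datasets $S,S'\in\mathcal{Z}^n$ which differ in only one sample, for every $z\in\mathcal{Z},\ \theta\in\Theta$ we have

$$\mathbb{E}_{A}\big[f(A_w(S),\theta;z)-f(A_w(S'),\theta;z)\big]\le\epsilon. \tag{14}$$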

Considering the above definition, we show the following theorem that connects the definition of uniform stability to the generalization risk of the learned minimax model.

###### Theorem 1.

Assume minimax learner $A$ is $\epsilon$-uniformly stable in minimization. Then, $A$’s expected generalization risk is bounded as

$$\epsilon_{\rm gen}(A)\le\epsilon. \tag{15}$$
###### Proof.

We defer the proof to the Appendix. ∎

In the following sections, we apply the above result to analyze generalization for convex concave and non-convex non-concave minimax learning problems.
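To make Definition 2 concrete, the following sketch measures how far the minimization output of full-batch GDA moves when a single training sample is replaced. It uses a scalar objective $f(w,\theta;z)=w(z-\theta)+\frac{\mu}{2}(w^2-\theta^2)$, synthetic data, and default hyperparameters that are all assumptions chosen for illustration. Since $f$ is $L_w$-Lipschitz in $w$ under Assumption 1, a bound on $|A_w(S)-A_w(S')|$ translates into a bound on the stability parameter $\epsilon$ in (14).

```python
def run_gda(data, mu=1.0, alpha=0.1, T=200):
    """Full-batch GDA on f(w, theta; z) = w*(z - theta) + (mu/2)*(w^2 - theta^2)."""
    m = sum(data) / len(data)        # the data enter only through their mean
    w, theta = 0.0, 0.0
    for _ in range(T):
        g_w = (m - theta) + mu * w   # d/dw of the empirical risk
        g_theta = -w - mu * theta    # d/dtheta of the empirical risk
        w, theta = w - alpha * g_w, theta + alpha * g_theta
    return w, theta

def w_sensitivity(n, delta=1.0):
    """|A_w(S) - A_w(S')| when S' replaces one sample z by z + delta."""
    S = [((7 * i) % 11) / 11.0 - 0.5 for i in range(n)]  # fixed synthetic data
    S_prime = [S[0] + delta] + S[1:]
    return abs(run_gda(S)[0] - run_gda(S_prime)[0])
```

Because the updates are linear in the data mean, replacing one sample shifts the output by an amount proportional to `delta / n`, so `w_sensitivity(100)` is ten times `w_sensitivity(1000)`. This is the $1/n$ scaling one would expect from a uniformly stable learner in a strongly-convex strongly-concave problem.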

## 5 Generalization Analysis for Convex Concave Minimax Problems

Analyzing convergence rates for convex concave minimax problems is well-explored in the optimization literature. Here, we use the algorithmic stability framework to bound the expected generalization risk in convex concave minimax learning problems. We start by analyzing the generalization risk in strongly-convex strongly-concave problems. The following theorem applies the stability framework to bound the expected generalization risk under this scenario.

###### Theorem 2.

Let minimax learning objective be -strongly-convex strongly-concave and satisfy Assumption 2 for every . Assume that Assumption 1 holds for convex-concave and every . Then, full-batch and stochastic GDA and GDmax algorithms with stepsize will satisfy the following bounds over iterations:

$$\epsilon_{\rm gen}({\rm GDA})\le\frac{2LL_w}{\big(\mu-\frac{\alpha_w\ell^2}{2}\big)n},\qquad \epsilon_{\rm gen}({\rm GDmax})\le\frac{2L_w^2}{\mu n}. \tag{16}$$
###### Proof.

We defer the proof to the Appendix. In the Appendix, we also prove similar bounds for full-batch and stochastic proximal point methods. ∎

Note that regarding Assumption 1 in the above theorem, we suppose the assumption holds for the deregularized convex-concave objective rather than for $f$ itself, because a strongly-convex strongly-concave objective cannot be Lipschitz over an unbounded feasible set. We still note that the theorem’s bounds will hold for the original $f$ if in Assumption 1 we define $f$’s Lipschitz constants over bounded feasible sets $\mathcal{W},\Theta$.

Given sufficiently small stepsizes for GDA, Theorem 2 suggests a similar generalization performance for GDA and GDmax, whose bounds differ by a factor of $L/L_w$. For general convex concave problems, it is well-known in the minimax optimization literature that the GDA algorithm can diverge from an optimal saddle point solution. As we show in the following remark, the generalization bound suggested by the stability framework will also grow exponentially with the iteration count in this scenario.

###### Remark 1.

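Consider a convex concave minimax objective $f(\cdot,\cdot;z)$ satisfying Assumptions 1 and 2. Given constant stepsizes $\alpha_w=\alpha_\theta=\alpha$, the GDA’s generalization risk over $T$ iterations will be bounded as:

$$\epsilon_{\rm gen}({\rm GDA})\le O\!\left(\frac{\alpha LL_w\,(1+\alpha^2\ell^2)^{T/2}}{n}\right). \tag{17}$$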

In particular, the bound’s exponential dependence on is tight for the GDA’s generalization risk in the special case of .

###### Proof.

We defer the proof to the Appendix. ∎

On the other hand, proximal point methods have been shown to resolve the convergence issues of GDA methods in convex concave problems (Mokhtari et al., 2019, 2020). Here, we also show that these algorithms enjoy a generalization risk growing at most linearly with $T$.
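The contrast between the two algorithms can be checked on the scalar bilinear objective $f(w,\theta)=w\theta$, used here as an illustrative stand-in for which $\ell=1$. In this sketch (function names are hypothetical), one GDA step multiplies the distance between any two trajectories by $\sqrt{1+\alpha^2}$, matching the $(1+\alpha^2\ell^2)^{T/2}$ factor in (17), while one PPM step divides it by $\sqrt{1+\eta^2}$.

```python
import math

def gda_bilinear(w, theta, alpha):
    # f(w, theta) = w * theta, so grad_w = theta and grad_theta = w.
    return w - alpha * theta, theta + alpha * w

def ppm_bilinear(w, theta, eta):
    # Closed-form solution of w' = w - eta*theta', theta' = theta + eta*w'.
    det = 1.0 + eta * eta
    return (w - eta * theta) / det, (theta + eta * w) / det

def distance_after(step, param, T, start=(1.0, 0.0), shift=1e-3):
    """Distance between two runs whose starting points differ by `shift` in w."""
    a, b = start, (start[0] + shift, start[1])
    for _ in range(T):
        a, b = step(a[0], a[1], param), step(b[0], b[1], param)
    return math.hypot(a[0] - b[0], a[1] - b[1])
```

With `alpha = eta = 0.1` and `T = 100`, `distance_after(gda_bilinear, 0.1, 100)` equals the initial shift times $1.01^{50}$, whereas `distance_after(ppm_bilinear, 0.1, 100)` equals the shift times $1.01^{-50}$. A perturbation of the trajectory, such as the one caused by swapping a training sample, is therefore amplified by GDA and damped by PPM.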

###### Theorem 3.

Consider a convex-concave minimax learning objective $f(\cdot,\cdot;z)$ satisfying Assumptions 1 and 2 for every $z$. Then, full-batch and stochastic PPM with parameter $\eta$ will satisfy the following bound over $T$ iterations:

$$\epsilon_{\rm gen}({\rm PPM})\le\frac{2\eta LL_wT}{n}. \tag{18}$$
###### Proof.

We defer the proof to the Appendix. In the Appendix, we also show a similar bound for the PPmax algorithm. ∎

The above generalization bound allows us to analyze the true worst-case minimax risk of PPM learners in convex concave problems. To this end, we decompose the true worst-case risk into the sum of the stability and empirical worst-case risks and optimize the sum of these two error components’ upper-bounds. Note that Theorem 3 bounds the generalization risk of PPM in terms of stepsize parameter $\eta$ and number of iterations $T$. Therefore, we only need to bound the iteration complexity of PPM’s convergence to an $\epsilon$-approximate saddle point. To do this, we show the following theorem that extends Mokhtari et al. (2019)’s result for PPM to stochastic PPM.

###### Theorem 4.

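Given a differentiable minimax objective $f(w,\theta;z)$, the average iterate $\bar w^{(T)}$ of stochastic PPM (SPPM) with stepsize parameter $\eta$ will satisfy the following for a saddle point $[w^*_S,\theta^*_S]$ of the empirical risk under dataset $S$:

$$\mathbb{E}_{A}\big[R_S(\bar w^{(T)})\big]-R_S(w^*_S)\le\frac{\big\|[w^{(0)},\theta^{(0)}]-[w^*_S,\theta^*_S]\big\|_2^2}{2\eta T}. \tag{19}$$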
###### Proof.

We defer the proof to the Appendix. In the Appendix, we also prove a similar result for stochastic PPmax. ∎

The above convergence result suggests that the expected empirical worst-case risk after $T$ iterations of stochastic PPM will exceed its saddle-point value by at most $O(1/(\eta T))$. In addition, Theorem 3 shows that with that number of iterations the generalization risk will be bounded by $O(\eta T/n)$. Minimizing the sum of these two error components, the following corollary bounds the excess risk suffered by the PPM algorithm.

###### Corollary 1.

Consider a convex concave minimax objective and a proximal point method optimizer with constant parameter . Given that holds with probability for optimal saddle solution of the minimax risk, it will take iterations for the average iterate of full-batch and stochastic PPM to have the following bounded excess risk:

$$\mathbb{E}_{S,A}\big[R(\bar w^{(T_{\rm PPM})})\big]-R(w^*)\le\sqrt{\frac{2D^2LL_w}{n}}. \tag{20}$$
###### Proof.

We defer the proof to the Appendix. In the Appendix, we prove a similar bound for full-batch and stochastic PPmax as well. ∎

## 6 Generalization Analysis for Non-convex Non-concave Minimax Problems

In the previous section, we showed that in convex-concave minimax problems simultaneous and non-simultaneous optimization algorithms have similar generalization error bounds which differ by a constant factor $L/L_w$. However, here we demonstrate that this result does not extend to general non-convex non-concave problems. We first study the case of non-convex strongly-concave minimax learning problems, where we can analytically characterize the generalization bounds for both stochastic GDA and GDmax algorithms. The following theorem states the results of applying the algorithmic stability framework to bound the generalization risk in such minimax problems.

###### Theorem 5.

Let learning objective be non-convex -strongly-concave and satisfy Assumptions 1 and 2. Also, we assume that is bounded as for every . Then, defining we have

1. The stochastic GDA (SGDA) algorithm with stepsizes for constants satisfies the following bound over iterations:

$$\epsilon_{\rm gen}({\rm SGDA})\le\frac{1+\frac{1}{(r+1)c\ell}}{n}\,\big(12(r+1)cLL_w\big)^{\frac{1}{(r+1)c\ell+1}}\,T^{\frac{(r+1)c\ell}{(r+1)c\ell+1}}. \tag{21}$$
2. The stochastic GDmax (SGDmax) algorithm with stepsize for constant satisfies the following bound over iterations:

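$$\epsilon_{\rm gen}({\rm SGDmax})\le\frac{1+\frac{2}{(\kappa+2)\ell c}}{n}\,\big(2cL_w^2\big)^{\frac{2}{(\kappa+2)\ell c+2}}\,T^{\frac{(\kappa+2)\ell c}{(\kappa+2)\ell c+2}}. \tag{22}$$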
###### Proof.

We defer the proof to the Appendix. ∎

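The above result shows that the generalization risks of stochastic GDA and GDmax change with the number of iterations $T$ and training set size $n$ as:

$$\epsilon_{\rm gen}({\rm SGDA})\approx O\!\left(T^{\frac{\ell(r+1)c}{\ell(r+1)c+1}}\Big/n\right),\qquad \epsilon_{\rm gen}({\rm SGDmax})\approx O\!\left(T^{\frac{\ell(\frac{\kappa}{2}+1)c}{\ell(\frac{\kappa}{2}+1)c+1}}\Big/n\right). \tag{23}$$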

Therefore, considering a maximization to minimization stepsize ratio of will result in a better generalization bound for stochastic GDA compared to stochastic GDmax over a fixed and sufficiently large number of iterations.
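The comparison can be made tangible by evaluating the exponents of $T$ that appear in (21) and (22). The helper names below are hypothetical, and `r` and `kappa` simply denote the constants appearing in those two bounds. Since $x\mapsto x/(x+1)$ is increasing, the SGDA exponent is the smaller one exactly when $r+1<\frac{\kappa}{2}+1$, that is, when $r<\kappa/2$.

```python
def sgda_exponent(ell, c, r):
    """Exponent of T in (21): (r+1)*c*ell / ((r+1)*c*ell + 1)."""
    a = (r + 1) * c * ell
    return a / (a + 1)

def sgdmax_exponent(ell, c, kappa):
    """Exponent of T in (22): (kappa+2)*ell*c / ((kappa+2)*ell*c + 2)."""
    a = (kappa + 2) * ell * c
    return a / (a + 2)

# Illustrative values only: ell = 1, c = 0.1, kappa = 10.
for r in (1, 5, 8):
    print(r, sgda_exponent(1.0, 0.1, r), sgdmax_exponent(1.0, 0.1, 10))
```

For these illustrative values the two exponents coincide at `r = 5` (that is, $r=\kappa/2$), the SGDA exponent is smaller for `r = 1`, and it is larger for `r = 8`.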

Next, we consider general non-convex non-concave minimax problems and apply the algorithmic stability framework to bound the generalization risk of the stochastic GDA algorithm. Note that the maximized value of a non-strongly-concave function is in general non-smooth. Consequently, the stability framework does not result in a bounded generalization risk for the GDmax algorithm in general non-convex non-concave problems.

###### Theorem 6.

Let be a bounded non-convex non-concave objective satisfying Assumptions 1 and 2. Then, the SGDA algorithm with stepsizes for constant satisfies the following bound over iterations:

$$\epsilon_{\rm gen}({\rm SGDA})\le\frac{1+\frac{1}{\ell c}}{n}\,\big(2cLL_w\big)^{\frac{1}{\ell c+1}}\,T^{\frac{\ell c}{\ell c+1}}. \tag{24}$$
###### Proof.

We defer the proof to the Appendix. ∎

Theorem 6 also shows that the SGDA algorithm with vanishing stepsize values will have a bounded generalization risk of $O\big(T^{\frac{\ell c}{\ell c+1}}/n\big)$ over $T$ iterations. On the other hand, the stochastic GDmax algorithm does not enjoy a bounded algorithmic stability degree in non-convex non-concave problems, since the optimal maximization value behaves non-smoothly in general.

## 7 Numerical Experiments

Here, we numerically examine the theoretical results of the previous sections. We first focus on a Gaussian setting for analyzing strongly-convex strongly-concave and convex concave minimax problems. Then, we empirically study generative adversarial networks (GANs) as non-convex non-concave minimax learning tasks.

### 7.1 Convex Concave Minimax Problems

To analyze our generalization results for convex concave minimax settings, we considered an isotropic Gaussian data vector with zero mean and identity covariance. In our experiments, we chose ’s dimension to be . We drew independent samples from the underlying Gaussian distribution to form a training dataset . For the -strongly-convex strongly-concave scenario, we considered the following minimax objective:

$$f_1(w,\theta;z)=w^\top(z-\theta)+\frac{\mu}{2}\big(\|w\|_2^2-\|\theta\|_2^2\big). \tag{25}$$

In our experiments, we used and constrained the optimization variables to satisfy the norm bounds which we enforced by projection after every optimization step. Note that for the above minimax objective we have

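$$\epsilon_{\rm gen}(w)=w^\top\big(\mathbb{E}[Z]-\mathbb{E}_S[Z]\big), \tag{26}$$

where $\mathbb{E}[Z]$ is the underlying mean and $\mathbb{E}_S[Z]:=\frac{1}{n}\sum_{i=1}^{n}z_i$ is the empirical mean.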
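The identity (26) holds because the data enter $f_1$ only through the linear term $w^\top z$. A short derivation, writing $m$ for either the true or the empirical mean and assuming the same feasible set $\Theta$ is used in both problems:

```latex
\begin{aligned}
\max_{\theta\in\Theta}\Big\{ w^\top(m-\theta)+\tfrac{\mu}{2}\big(\|w\|_2^2-\|\theta\|_2^2\big)\Big\}
 &= w^\top m+\tfrac{\mu}{2}\|w\|_2^2
   +\underbrace{\max_{\theta\in\Theta}\Big\{-w^\top\theta-\tfrac{\mu}{2}\|\theta\|_2^2\Big\}}_{=:\,h(w)},\\
R(w) &= w^\top\mathbb{E}[Z]+\tfrac{\mu}{2}\|w\|_2^2+h(w),\\
R_S(w) &= w^\top\mathbb{E}_S[Z]+\tfrac{\mu}{2}\|w\|_2^2+h(w),\\
R(w)-R_S(w) &= w^\top\big(\mathbb{E}[Z]-\mathbb{E}_S[Z]\big).
\end{aligned}
```

The term $h(w)$ does not depend on the data, so it cancels in the difference regardless of whether the norm constraint on $\theta$ is active.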

To optimize the empirical minimax risk, we applied stochastic GDA with stepsize parameters and stochastic PPM with parameter each for iterations. Figure 1(a) shows the generalization risk values over the course of optimization for the stochastic GDA (top) and PPM (bottom) algorithms. As shown in this figure, the absolute value of the generalization risk remained bounded during the optimization for both learning algorithms. In our experiments, we also observed a similar generalization behavior with full-batch GDA and PPM algorithms. We defer the results of those experiments to the supplementary document. Hence, our experimental results support Theorem 2’s generalization bounds.

Regarding convex concave minimax problems, as suggested by Remark 1 we considered the following bilinear minimax objective in our experiments:

$$f_2(w,\theta;z)=w^\top(z-\theta). \tag{27}$$

We constrained the norm of optimization variables as which we enforced through projection after every optimization iteration. Similar to the strongly-convex strongly-concave objective (25), for the above minimax objective we have the generalization risk in (26) with and being the true and empirical mean vectors.

We optimized the minimax objective (27) via stochastic and full-batch GDA and PPM algorithms. Figure 1(b) demonstrates the generalization risk evaluated at different iterations of applying stochastic GDA and PPM algorithms. As suggested by Remark 1, the generalization risk of stochastic GDA grew exponentially over the first 15,000 iterations before the variables reached the boundary of their feasible sets and then the generalization risk oscillated with a nearly constant amplitude of . On the other hand, we observed that the generalization risk of the stochastic PPM algorithm stayed bounded and below for all the 20,000 iterations (Figure 1(b)-bottom). Therefore, our numerical experiments also indicate that while in general convex concave problems the stochastic GDA learner can potentially suffer from a poor generalization performance, the PPM algorithm has a bounded generalization risk as shown by Theorem 3.

### 7.2 Non-convex Non-concave Problems

To numerically analyze generalization in general non-convex non-concave minimax problems, we examined the performance of simultaneous and non-simultaneous optimization algorithms in training GANs. In our GAN experiments, we considered the standard architecture of DC-GANs (Radford et al., 2015) with 4-layer convolutional neural net generator and discriminator functions. For the minimax objective, we used the formulation of vanilla GAN (Goodfellow et al., 2014) that is

$$f(w,\theta;z)=\log\big(D_w(z)\big)+\mathbb{E}_{\nu}\big[\log\big(1-D_w(G_\theta(\nu))\big)\big]. \tag{28}$$

For computing the above objective, we used Monte-Carlo simulation using fresh latent samples to approximate the expected value over the generator’s latent variable at every optimization step. We followed all the experimental details from Gulrajani et al. (2017)’s standard implementation of DC-GAN. Furthermore, we applied spectral normalization (Miyato et al., 2018) to regularize the discriminator function and to help reach a near-optimal discriminator solution within the bounded number of iterations needed for non-simultaneous optimization methods. We trained the spectrally-normalized GAN (SN-GAN) problem over CIFAR-10 (Krizhevsky and Hinton, 2009) and CelebA (Liu et al., 2018) datasets. We divided the CIFAR-10 and CelebA datasets into 50,000 and 160,000 training samples and 10,000 and 40,000 test samples, respectively.

To optimize the minimax risk function, we used the standard Adam algorithm (Kingma and Ba, 2014) with batch-size . For simultaneous optimization algorithms we applied 1,1 Adam descent ascent with the parameters for both minimization and maximization updates. To apply a non-simultaneous algorithm, we used 100 Adam maximization steps per minimization step and increased the maximization learning rate to 5. We ran each GAN experiment for 100,000 iterations.
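The difference between the simultaneous (1,1) and non-simultaneous (1,100) schedules can be sketched as follows. This is a schematic under stated assumptions: it uses plain gradient steps rather than Adam, a scalar toy objective $f(w,\theta)=\frac{\mu}{2}w^2+w\theta-\frac{\mu}{2}\theta^2$ instead of a GAN, and hypothetical function names.

```python
def grad_w(w, theta, mu=1.0):
    return mu * w + theta

def grad_theta(w, theta, mu=1.0):
    return w - mu * theta

def run_schedule(w, theta, lr_w, lr_theta, num_iters, ascent_steps):
    """One descent step on w per iteration, paired with `ascent_steps` ascent steps on theta.

    ascent_steps = 1   : simultaneous update (both gradients taken at the same iterate).
    ascent_steps = 100 : theta is driven close to its maximizer before w moves on.
    """
    for _ in range(num_iters):
        g_w = grad_w(w, theta)            # descent direction at the current iterate
        for _ in range(ascent_steps):
            theta = theta + lr_theta * grad_theta(w, theta)
        w = w - lr_w * g_w
    return w, theta
```

Starting from `(w, theta) = (1.0, 0.0)` with both learning rates set to `0.1`, a single iteration with `ascent_steps=1` moves `theta` only slightly toward its maximizer `w / mu = 1`, whereas `ascent_steps=100` brings it within about `1e-4` of that maximizer. The latter mimics the behavior of fully solving the maximization subproblem.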

Figure 2 shows the estimates of the empirical and true minimax risks in the CIFAR-10 and CelebA experiments, respectively. We used randomly-selected samples from the training and test sets for every estimation task. As seen in Figure 2’s plots, for the experiments applying simultaneous 1,1 Adam optimization the empirical minimax risk generalizes properly from training to test samples (Figure 2-top). In contrast, in both the experiments with non-simultaneous methods after 30,000 iterations the empirical minimax risk suffers from a considerable generalization gap from the true minimax risk (Figure 2-bottom). The gap between the training and test minimax risks grew between iterations 30,000-60,000. The test minimax risk fluctuated over the subsequent iterations, which could be due to the insufficiency of 100 Adam ascent steps to follow the optimal discriminator solution at those iterations.

The numerical results of our GAN experiments suggest that non-simultaneous algorithms which attempt to fully solve the maximization subproblem at every iteration can lead to large generalization errors. On the other hand, standard simultaneous algorithms used for training GANs enjoy a bounded generalization error which can help the training process find a model with nice generalization properties. We defer further experimental results to the supplementary document.

## Appendix A Additional Numerical Results

### A.1 Convex Concave Minimax Settings

Here, we provide the results of the numerical experiments discussed in the main text for full-batch GDA and PPM algorithms as well as stochastic and full-batch GDmax algorithms. Note that in these experiments we use the same minimax objective and hyperparameters mentioned in the main text. Figure 3 shows the generalization risk in our experiments for the GDA algorithm. As seen in Figure 3 (right), the results for full-batch and stochastic GDA algorithms in the bilinear convex concave case look similar, with the only exception that the generalization risk in the full-batch case reached a slightly higher amplitude of 7.8. On the other hand, in the strongly-convex strongly-concave case, full-batch GDA demonstrated a vanishing generalization risk, whereas stochastic GDA could not reach below an amplitude of 0.2.

Figure 4 shows the results of our experiments for full-batch PPM. Observe that the generalization risk in both cases decreases to reach smaller values than those for stochastic PPM. Finally, Figures 5 and 6 include the results for full-batch and stochastic GDmax algorithms. With the exception of the full-batch GDmax case for the bilinear objective (Figure 5-right), in all the other cases the generalization risk did not grow during the optimization, which is comparable to our results in the GDA experiments.

### A.2 Non-convex Non-concave Minimax Settings

Here, we provide the image samples generated by the trained GANs discussed in the main text. Figure 7 shows the CIFAR-10 samples generated by the simultaneous 1,1-Adam training (Figure 7-left) and the non-simultaneous 1,100-Adam optimization (Figure 7-right). While we observed that the simultaneous training experiment generated qualitatively sharper samples, the non-simultaneous optimization did not lead to any significant training failures. However, as we discussed in the main text, the generalization risk in the non-simultaneous training was significantly larger than that of simultaneous training. Figure 8 shows the generated images in the CelebA experiments, which are qualitatively comparable between the two training algorithms. However, as discussed in the main text, the trained discriminator had a harder task in distinguishing the training samples from the generated samples than in distinguishing the test samples from the generated samples, suggesting a potential overfitting of the training samples in the non-simultaneous training experiment.

## Appendix B Proofs

### B.1 The Expansivity Lemma for Minimax Problems

We will apply the following lemma to analyze the stability of gradient-based methods. We call an update rule $G$ $\gamma$-expansive if for every $w,w'\in\mathcal{W}$ and $\theta,\theta'\in\Theta$ we have

$$\|G(w,\theta)-G(w',\theta')\|_2\le\gamma\sqrt{\|w-w'\|_2^2+\|\theta-\theta'\|_2^2}. \tag{29}$$

###### Lemma 1.

Consider the GDA and PPM updates for the following minimax problem

$$\min_{w\in\mathcal{W}}\;\max_{\theta\in\Theta}\; f(w,\theta), \tag{30}$$

where we assume objective $f(w,\theta)$ satisfies Assumptions 1 and 2. Then,

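1. For a non-convex non-concave minimax problem, $G_{\rm GDA}$ is $(1+\ell\max\{\alpha_w,\alpha_\theta\})$-expansive. Assuming $\eta<1/\ell$, $G_{\rm PPM}$ will be $\frac{1}{1-\eta\ell}$-expansive.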

2. For a convex concave minimax problem with $\alpha_w=\alpha_\theta$, $G_{\rm GDA}$ is $\sqrt{1+\alpha_w^2\ell^2}$-expansive and $G_{\rm PPM}$ will be $1$-expansive.

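3. For a $\mu$-strongly-convex strongly-concave minimax problem, given a sufficiently small stepsize $\alpha_w=\alpha_\theta$, $G_{\rm GDA}$ is $\big(1-\mu\alpha_w+\tfrac{\alpha_w^2\ell^2}{2}\big)$-expansive and $G_{\rm PPM}$ will be $\frac{1}{1+\mu\eta}$-expansive.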

###### Proof.

In Case 1 with a non-convex non-concave minimax objective, $f$’s smoothness property implies that for every $w,w'$ and $\theta,\theta'$:

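$$\left\|G_{\rm GDA}\!\left(\begin{bmatrix} w\\ \theta\end{bmatrix}\right)-G_{\rm GDA}\!\left(\begin{bmatrix} w'\\ \theta'\end{bmatrix}\right)\right\| \le \left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|+\left\|\begin{bmatrix}\alpha_w\big(\nabla_w f(w,\theta)-\nabla_w f(w',\theta')\big)\\ \alpha_\theta\big(\nabla_\theta f(w,\theta)-\nabla_\theta f(w',\theta')\big)\end{bmatrix}\right\| \tag{31}$$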

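Since $f$ is $\ell$-smooth, the second term on the right-hand side is at most $\ell\max\{\alpha_w,\alpha_\theta\}\,\big\|[w-w';\,\theta-\theta']\big\|$, which completes the proof for the GDA update. For the proximal operator, note that given $\eta<1/\ell$ the proximal optimization reduces to optimizing a strongly-convex strongly-concave minimax problem with a unique saddle solution, and therefore at $[w_{\rm PPM},\theta_{\rm PPM}]=G_{\rm PPM}([w,\theta])$ we have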

$$w_{\rm PPM}-w=\eta\nabla_w f(w_{\rm PPM},\theta_{\rm PPM}),\qquad \theta-\theta_{\rm PPM}=\eta\nabla_\theta f(w_{\rm PPM},\theta_{\rm PPM}). \tag{32}$$

As a result, we have

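$$\begin{aligned}
&\left\|G_{\rm PPM}\!\left(\begin{bmatrix} w\\ \theta\end{bmatrix}\right)-G_{\rm PPM}\!\left(\begin{bmatrix} w'\\ \theta'\end{bmatrix}\right)\right\| \\
&\quad= \left\|\begin{bmatrix} w-w'+\eta\big(\nabla_w f(G_{\rm PPM}(w,\theta))-\nabla_w f(G_{\rm PPM}(w',\theta'))\big)\\ \theta-\theta'-\eta\big(\nabla_\theta f(G_{\rm PPM}(w,\theta))-\nabla_\theta f(G_{\rm PPM}(w',\theta'))\big)\end{bmatrix}\right\| \\
&\quad\le \left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|+\left\|\begin{bmatrix}\eta\big(\nabla_w f(G_{\rm PPM}(w,\theta))-\nabla_w f(G_{\rm PPM}(w',\theta'))\big)\\ \eta\big(\nabla_\theta f(G_{\rm PPM}(w,\theta))-\nabla_\theta f(G_{\rm PPM}(w',\theta'))\big)\end{bmatrix}\right\| \\
&\quad\le \left\|\begin{bmatrix} w\\ \theta\end{bmatrix}-\begin{bmatrix} w'\\ \theta'\end{bmatrix}\right\|+\eta\ell\left\|G_{\rm PPM}(w,\theta)-G_{\rm PPM}(w',\theta')\right\|.
\end{aligned} \tag{33}$$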

The final result of the above inequalities implies that

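$$(1-\eta\ell)\left\|G_{\rm PPM}(w,\theta)-G_{\rm PPM}(w',\theta')\right\|\le\left\|\begin{bmatrix} w\\ \theta\end{bmatrix}-\begin{bmatrix} w'\\ \theta'\end{bmatrix}\right\|, \tag{34}$$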

which completes the proof for the non-convex non-concave case.

For convex-concave objectives, the proof is mainly based on the monotonicity of a convex-concave objective’s gradients (Rockafellar, 1976), implying that for every $w,w',\theta,\theta'$:

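$$\left(\begin{bmatrix} w\\ \theta\end{bmatrix}-\begin{bmatrix} w'\\ \theta'\end{bmatrix}\right)^{\!\top}\left(\begin{bmatrix}\nabla_w f(w,\theta)\\ -\nabla_\theta f(w,\theta)\end{bmatrix}-\begin{bmatrix}\nabla_w f(w',\theta')\\ -\nabla_\theta f(w',\theta')\end{bmatrix}\right)\ge 0. \tag{35}$$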

As shown by Rockafellar (1976), the above property implies that the proximal operator for a convex-concave minimax objective will also be monotone and $1$-expansive for any positive choice of $\eta$. For the GDA update, note that due to the monotonicity property

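$$\begin{aligned}
\left\|G_{\rm GDA}\!\left(\begin{bmatrix} w\\ \theta\end{bmatrix}\right)-G_{\rm GDA}\!\left(\begin{bmatrix} w'\\ \theta'\end{bmatrix}\right)\right\|_2^2
&= \left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|_2^2-2\alpha_w\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}^{\!\top}\begin{bmatrix}\nabla_w f(w,\theta)-\nabla_w f(w',\theta')\\ -\nabla_\theta f(w,\theta)+\nabla_\theta f(w',\theta')\end{bmatrix} \\
&\quad+\alpha_w^2\left\|\begin{bmatrix}\nabla_w f(w,\theta)-\nabla_w f(w',\theta')\\ \nabla_\theta f(w,\theta)-\nabla_\theta f(w',\theta')\end{bmatrix}\right\|_2^2 \\
&\le (1+\alpha_w^2\ell^2)\left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|_2^2,
\end{aligned} \tag{36}$$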

which results in the following inequality and completes the proof for the convex-concave case:

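$$\left\|G_{\rm GDA}\!\left(\begin{bmatrix} w\\ \theta\end{bmatrix}\right)-G_{\rm GDA}\!\left(\begin{bmatrix} w'\\ \theta'\end{bmatrix}\right)\right\|_2\le\sqrt{1+\alpha_w^2\ell^2}\,\left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|_2. \tag{37}$$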
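As a sanity check of the convex-concave case, the expansivity factors can be evaluated numerically on the scalar bilinear objective $f(w,\theta)=w\theta$, for which $\ell=1$. The following is a minimal sketch under that assumption; the helper names (`gda_step`, `ppm_step`, `expansion_ratio`) are hypothetical, and the PPM step is written in the descent-in-$w$, ascent-in-$\theta$ sign convention, which is our choice for the illustration.

```python
import numpy as np


def gda_step(z, alpha):
    """One GDA step on f(w, theta) = w * theta with alpha_w = alpha_theta = alpha."""
    w, theta = z
    return np.array([w - alpha * theta, theta + alpha * w])


def ppm_step(z, eta):
    """One PPM step on f(w, theta) = w * theta, solving the implicit update exactly."""
    # Implicit equations: w_new = w - eta * theta_new, theta_new = theta + eta * w_new,
    # i.e. [[1, eta], [-eta, 1]] @ [w_new, theta_new] = [w, theta].
    system = np.array([[1.0, eta], [-eta, 1.0]])
    return np.linalg.solve(system, z)


def expansion_ratio(step, z, z_prime):
    """Ratio ||G(z) - G(z')|| / ||z - z'|| for an update rule G."""
    return np.linalg.norm(step(z) - step(z_prime)) / np.linalg.norm(z - z_prime)


rng = np.random.default_rng(0)
z, z_prime = rng.normal(size=2), rng.normal(size=2)
alpha = eta = 0.3

gda_ratio = expansion_ratio(lambda v: gda_step(v, alpha), z, z_prime)
ppm_ratio = expansion_ratio(lambda v: ppm_step(v, eta), z, z_prime)
```

On this example the GDA ratio equals $\sqrt{1+\alpha^2}$ for every pair of points, so the factor in (37) cannot be improved in general, while the PPM ratio equals $1/\sqrt{1+\eta^2}\le 1$.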

Finally, for the strongly-convex strongly-concave case, note that $\tilde f(w,\theta):=f(w,\theta)-\frac{\mu}{2}\|w\|_2^2+\frac{\mu}{2}\|\theta\|_2^2$ will be convex-concave and hence the proximal update will satisfy

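$$\begin{aligned}
\frac{1}{1+\mu\eta}\,w &= w_{\rm PPM}+\frac{\eta}{1+\mu\eta}\nabla_w\tilde f(w_{\rm PPM},\theta_{\rm PPM}), \\
\frac{1}{1+\mu\eta}\,\theta &= \theta_{\rm PPM}-\frac{\eta}{1+\mu\eta}\nabla_\theta\tilde f(w_{\rm PPM},\theta_{\rm PPM}),
\end{aligned} \tag{38}$$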

where the right-hand side follows from the proximal update for $\tilde f$ with stepsize $\frac{\eta}{1+\mu\eta}$, which is hence $1$-expansive. Therefore, the proximal update for $f$ will be $\frac{1}{1+\mu\eta}$-expansive. Furthermore, for GDA updates note that

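$$\begin{aligned}
\left\|G_{\rm GDA}\!\left(\begin{bmatrix} w\\ \theta\end{bmatrix}\right)-G_{\rm GDA}\!\left(\begin{bmatrix} w'\\ \theta'\end{bmatrix}\right)\right\|_2^2
&= (1-\mu\alpha_w)^2\left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|_2^2 \\
&\quad-2(1-\mu\alpha_w)\alpha_w\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}^{\!\top}\begin{bmatrix}\nabla_w\tilde f(w,\theta)-\nabla_w\tilde f(w',\theta')\\ -\nabla_\theta\tilde f(w,\theta)+\nabla_\theta\tilde f(w',\theta')\end{bmatrix} \\
&\quad+\alpha_w^2\left\|\begin{bmatrix}\nabla_w\tilde f(w,\theta)-\nabla_w\tilde f(w',\theta')\\ \nabla_\theta\tilde f(w,\theta)-\nabla_\theta\tilde f(w',\theta')\end{bmatrix}\right\|_2^2 \\
&\le \big((1-\mu\alpha_w)^2+\alpha_w^2(\ell^2-\mu^2)\big)\left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|_2^2 \\
&\le (1-2\mu\alpha_w+\alpha_w^2\ell^2)\left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|_2^2.
\end{aligned} \tag{39}$$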

Note that the above result finishes the proof because $\sqrt{1-2x}\le 1-x$ holds for every $x\le 1/2$, which applies under the lemma’s stepsize assumption. Also, the inequality in (39) involving $\ell^2-\mu^2$ holds since $\tilde f$ will be $\sqrt{\ell^2-\mu^2}$-smooth. This is because $f$ is assumed to be $\ell$-smooth, implying that for every $w,w',\theta,\theta'$ we have

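$$\begin{aligned}
\ell^2\left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|_2^2
&\ge \left\|\begin{bmatrix}\nabla_w f(w,\theta)-\nabla_w f(w',\theta')\\ \nabla_\theta f(w,\theta)-\nabla_\theta f(w',\theta')\end{bmatrix}\right\|_2^2 \\
&= \left\|\begin{bmatrix}\nabla_w\tilde f(w,\theta)-\nabla_w\tilde f(w',\theta')+\mu(w-w')\\ \nabla_\theta\tilde f(w,\theta)-\nabla_\theta\tilde f(w',\theta')-\mu(\theta-\theta')\end{bmatrix}\right\|_2^2 \\
&\ge \left\|\begin{bmatrix}\nabla_w\tilde f(w,\theta)-\nabla_w\tilde f(w',\theta')\\ \nabla_\theta\tilde f(w,\theta)-\nabla_\theta\tilde f(w',\theta')\end{bmatrix}\right\|_2^2+\mu^2\left\|\begin{bmatrix} w-w'\\ \theta-\theta'\end{bmatrix}\right\|_2^2,
\end{aligned} \tag{40}$$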

where the last inequality uses the monotonicity of the gradient operator of the convex-concave $\tilde f$. The final inequality shows that $\tilde f$ will be $\sqrt{\ell^2-\mu^2}$-smooth and hence finishes the proof. ∎
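The strongly-convex strongly-concave bound in (39) can be checked on a simple quadratic. The sketch below assumes the scalar objective $f(w,\theta)=\frac{\mu}{2}w^2+w\theta-\frac{\mu}{2}\theta^2$, whose gradient map has smoothness constant $\ell=\sqrt{1+\mu^2}$; the objective, the helper name `gda_step_sc`, and the parameter values are illustrative assumptions rather than the setting of the experiments.

```python
import numpy as np


def gda_step_sc(z, alpha, mu):
    """One GDA step on f(w, theta) = (mu/2) w^2 + w*theta - (mu/2) theta^2."""
    w, theta = z
    grad_w = mu * w + theta      # partial derivative of f in w
    grad_theta = w - mu * theta  # partial derivative of f in theta
    return np.array([w - alpha * grad_w, theta + alpha * grad_theta])


mu, alpha = 0.5, 0.2
ell = np.sqrt(1.0 + mu**2)  # smoothness constant of this particular quadratic

rng = np.random.default_rng(1)
z, z_prime = rng.normal(size=2), rng.normal(size=2)

# Squared expansion ratio of the GDA update between two random points.
ratio_sq = (
    np.linalg.norm(gda_step_sc(z, alpha, mu) - gda_step_sc(z_prime, alpha, mu))
    / np.linalg.norm(z - z_prime)
) ** 2

# Right-hand side factor of (39) and its relaxation via sqrt(1 - 2x) <= 1 - x.
bound_sq = 1.0 - 2.0 * mu * alpha + alpha**2 * ell**2
relaxed_factor = 1.0 - mu * alpha + 0.5 * alpha**2 * ell**2
```

For this quadratic the GDA map is a scaled rotation, so `ratio_sq` matches `bound_sq` exactly, and the square root of `bound_sq` stays below `relaxed_factor`.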

### B.2 Proof of Theorem 1

###### Theorem.

Assume minimax learner $A$ is $\epsilon$-uniformly stable in minimization. Then, $A$’s expected generalization risk is bounded as $\epsilon_{\rm gen}(A)\le\epsilon$.

###### Proof.

Here, we provide a proof based on standard techniques in stability-based generalization theory (Bousquet and Elisseeff, 2002). To show this theorem, consider two independent datasets $S=(z_1,\ldots,z_n)$ and $S'=(z'_1,\ldots,z'_n)$. Using $S^{(i)}$ to denote the dataset $S$ with the $i$th sample replaced with $z'_i$, we will have

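$$\begin{aligned}
\mathbb{E}_S\mathbb{E}_A\big[R_S(A_w(S))\big]
&= \mathbb{E}_S\mathbb{E}_A\Big[\frac{1}{n}\sum_{i=1}^n\max_{\theta\in\Theta}f(A_w(S),\theta;z_i)\Big] \\
&= \mathbb{E}_S\mathbb{E}_{S'}\mathbb{E}_A\Big[\frac{1}{n}\sum_{i=1}^n\max_{\theta\in\Theta}f(A_w(S^{(i)}),\theta;z'_i)\Big] \\
&= \mathbb{E}_S\mathbb{E}_{S'}\mathbb{E}_A\Big[\frac{1}{n}\sum_{i=1}^n\max_{\theta\in\Theta}f(A_w(S),\theta;z'_i)\Big]+\zeta \\
&= \mathbb{E}_S\mathbb{E}_A\big[R(A_w(S))\big]+\zeta.
\end{aligned} \tag{41}$$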

In the above, $\zeta$ is defined as

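$$\zeta:=\mathbb{E}_S\mathbb{E}_{S'}\mathbb{E}_A\Big[\frac{1}{n}\sum_{i=1}^n\Big[\max_{\theta\in\Theta}f(A_w(S^{(i)}),\theta;z'_i)-\max_{\theta'\in\Theta}f(A_w(S),\theta';z'_i)\Big]\Big]. \tag{42}$$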

Note that due to the uniform stability assumption, for every data point $z$ and datasets $S,S'$ with only one different sample we have

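$$\max_{\theta\in\Theta}f(A_w(S),\theta;z)-\max_{\theta'\in\Theta}f(A_w(S'),\theta';z)\le\max_{\theta\in\Theta}\big\{f(A_w(S),\theta;z)-f(A_w(S'),\theta;z)\big\}\le\epsilon. \tag{43}$$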

Therefore, exchanging the roles of $S$ and $S'$ in the above inequality we obtain

$$\Big|\max_{\theta\in\Theta}f(A_w(S),\theta;z)-\max_{\theta'\in\Theta}f(A_w(S'),\theta';z)\Big|\le\epsilon. \tag{44}$$

As a result, we conclude that $|\zeta|\le\epsilon$, which shows that

$$\Big|\mathbb{E}_S\mathbb{E}_A\big[R_S(A_w(S))\big]-\mathbb{E}_S\mathbb{E}_A\big[R(A_w(S))\big]\Big|\le\epsilon. \tag{45}$$

The proof is hence complete. ∎
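The key step in (43) and (44) is that a difference of two maxima is bounded by the maximum of the differences. This can be checked numerically with a minimal sketch; the objective `f`, the finite grid standing in for $\Theta$, and the helper `worst_case_gap` are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = np.linspace(-1.0, 1.0, 201)  # finite grid standing in for Theta


def f(w, theta, z):
    # Hypothetical objective, concave in theta; chosen only for illustration.
    return w * theta * z - 0.5 * theta**2 + 0.5 * w**2


def worst_case_gap(w, w_prime, z):
    """Return (|max f(w,.) - max f(w',.)|, max |f(w,.) - f(w',.)|) on the grid."""
    vals, vals_prime = f(w, thetas, z), f(w_prime, thetas, z)
    lhs = abs(vals.max() - vals_prime.max())
    rhs = np.abs(vals - vals_prime).max()
    return lhs, rhs


# Random triples (w, w', z): the left-hand side never exceeds the right-hand side.
pairs = [worst_case_gap(*rng.normal(size=3)) for _ in range(1000)]
```

In the proof, `w` and `w_prime` play the roles of $A_w(S)$ and $A_w(S')$, and uniform stability bounds the right-hand side by $\epsilon$.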

### B.3 Proof of Theorem 2

Note that in the following discussion we define PPmax as a proximal point method which fully optimizes the maximization variable at every iteration with the following update rule:

###### Theorem.

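Let minimax learning objective $f(\cdot,\cdot;z)$ be $\mu$-strongly convex strongly-concave and satisfy Assumption 2 for every $z$. Assume that Assumption 1 holds for convex-concave $\tilde f(w,\theta;z):=f(w,\theta;z)-\frac{\mu}{2}\|w\|_2^2+\frac{\mu}{2}\|\theta\|_2^2$ and every $z$. Then,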

1. Full-batch and stochastic GDA and GDmax with constant stepsize $\alpha_w=\alpha_\theta$ for $T$ iterations will satisfy

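$$\epsilon_{\rm gen}({\rm GDA}),\;\epsilon_{\rm gen}({\rm SGDA})\le\frac{2LL_w}{\big(\mu-\frac{\alpha_w\ell^2}{2}\big)n},\qquad \epsilon_{\rm gen}({\rm GDmax}),\;\epsilon_{\rm gen}({\rm SGDmax})\le\frac{2L_w^2}{\mu n}. \tag{46}$$

2. Full-batch and stochastic PPM and PPmax with constant parameter $\eta$ for $T$ iterations will satisfy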

$$\epsilon_{\rm gen}({\rm PPM}),\;\epsilon_{\rm gen}({\rm SPPM})\le\frac{2LL_w}{\mu n},\qquad \epsilon_{\rm gen}({\rm PPmax}),\;\epsilon_{\rm gen}({\rm SPPmax})\le\frac{2L_w^2}{\mu n}. \tag{47}$$
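To get a feel for the scale of the bounds in (46) and (47), they can be evaluated for concrete constants. The helpers below are a minimal sketch: the function names and the numeric values are hypothetical, and `L`, `L_w`, `mu`, `ell`, `alpha_w`, and `n` stand for the quantities $L$, $L_w$, $\mu$, $\ell$, $\alpha_w$, and $n$ appearing in the theorem.

```python
def gda_stability_bound(L, L_w, mu, alpha_w, ell, n):
    """Right-hand side of the GDA/SGDA bound in (46)."""
    margin = mu - alpha_w * ell**2 / 2.0
    if margin <= 0:
        # The expression in (46) is only meaningful for a positive denominator.
        raise ValueError("bound (46) needs mu > alpha_w * ell**2 / 2")
    return 2.0 * L * L_w / (margin * n)


def ppm_stability_bound(L, L_w, mu, n):
    """Right-hand side of the PPM/SPPM bound in (47)."""
    return 2.0 * L * L_w / (mu * n)


def max_oracle_stability_bound(L_w, mu, n):
    """Right-hand side shared by GDmax, SGDmax, PPmax and SPPmax in (46)-(47)."""
    return 2.0 * L_w**2 / (mu * n)


# Illustrative constants only; they are not taken from the experiments.
gda_bound = gda_stability_bound(L=1.0, L_w=1.0, mu=1.0, alpha_w=0.1, ell=2.0, n=1000)
ppm_bound = ppm_stability_bound(L=1.0, L_w=1.0, mu=1.0, n=1000)
max_bound = max_oracle_stability_bound(L_w=1.0, mu=1.0, n=1000)
```

With these values the GDA bound is slightly looser than the PPM bound because of the $\alpha_w\ell^2/2$ term in the denominator, and both decay as $1/n$.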
###### Proof.

We start by proving the following lemmas.

###### Lemma 2 (Growth Lemma).

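Consider two sequences of updates $G_1,\ldots,G_T$ and $G'_1,\ldots,G'_T$ with the same starting point $[w_0,\theta_0]=[w'_0,\theta'_0]$. We define $\delta_t:=\big\|[w_t,\theta_t]-[w'_t,\theta'_t]\big\|$. Then, if $G_t$ is $\xi$-expansive we have $\delta_{t+1}\le\xi\delta_t$ for identical $G_t=G'_t$, and in general we have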

$$\delta_{t+1}\le\min\{\xi,1\}\,\delta_t+\sup_{w,\theta}\big\{\|[w,\theta]-G_t([w,\theta])\|\big\}+\sup_{w,\theta}\big\{\|[w,\theta]-G'_t([w,\theta])\|\big\}. \tag{48}$$

Furthermore, for any constant we have

 δt+1≤ξδt+supw,<