K-Beam Subgradient Descent for Minimax Optimization
Abstract
Minimax optimization plays a key role in adversarial training of machine learning algorithms, such as learning generative models, domain adaptation, privacy preservation, and robust learning. In this paper, we demonstrate the failure of alternating gradient descent in minimax optimization problems due to the discontinuity of solutions of the inner maximization. To address this, we propose a new subgradient descent algorithm that simultaneously tracks K candidate solutions. Practically, the algorithm can find solutions that previous saddle-point algorithms cannot find, with only a sublinear increase of complexity in K. We analyze in detail the conditions under which the algorithm converges to the true solution. A significant improvement in stability and convergence speed of the algorithm is observed in simple representative problems, GAN training, and domain-adaptation problems.
1 Introduction
There is a wide range of problems in machine learning which can be formulated as continuous minimax optimization problems. Examples include generative adversarial nets (GANs) (Goodfellow et al., 2014), privacy preservation (Hamm, 2015; Edwards & Storkey, 2015), domain adaptation (Ganin & Lempitsky, 2015), and robust learning (Globerson & Roweis, 2006), to name a few. More broadly, the problem of finding a worst-case solution or an equilibrium of a leader-follower game (Brückner & Scheffer, 2011) can be formulated as a minimax problem. Furthermore, a solution satisfying the KKT conditions of a convex problem corresponds to a saddle point of the Lagrangian (Arrow et al., 1958). Efficient solvers for minimax problems can have positive impacts on all these fronts.
To define the problem, consider a real-valued function f(u, v) on a subset U × V of a Euclidean space. A (continuous) minimax optimization problem is min_{u ∈ U} max_{v ∈ V} f(u, v).¹ It is called a discrete minimax problem if the maximization domain V is finite. A related notion is the (global) saddle point, which is a point (u*, v*) that satisfies
f(u*, v) ≤ f(u*, v*) ≤ f(u, v*) for all u ∈ U and v ∈ V.
¹ A more general problem is inf_{u ∈ U} sup_{v ∈ V} f(u, v), but we will assume that the min and the max exist and are achievable, as explained further in Sec. 3.
When f is convex in u and concave in v, saddle points coincide with minimax points due to von Neumann's minimax theorem (v. Neumann, 1928). The problem of finding saddle points has been studied intensively since the seminal work of Arrow et al. (1958), and a gradient descent method was proposed by Uzawa (1958). Much theoretical work has ensued, in particular on the stability of saddle points and convergence (see Sec. 2). However, the cost function in realistic machine learning applications is seldom convex-concave and may not have a global saddle point. Fig. 1 shows motivating examples of surfaces on U × V. Examples (a), (b), and (c) are saddle-point problems: all three have a critical point at the origin, which is also a saddle point and a minimax point. However, examples (d), (e), and (f) do not have global saddle points, although each of them has well-defined minimax points. These facts are not obvious until one analyzes each surface (see Appendix for more information). Furthermore, the non-existence of saddle points also happens with unconstrained problems: one can construct a smooth f on R × R for which the inner maximum has a closed-form solution and the outer minimum is attained, so that a global minimax point (which is also a critical point) exists, and yet f has no saddle point, local or global, since f is strictly concave in u and strictly convex in v there.
Despite the fact that saddle points and minimax points are conceptually different, many machine learning applications in the literature do not distinguish the two. This is a mistake, because a local saddle point is only an equilibrium point and is not necessarily the robust or worst-case solution that the problem may be seeking. Furthermore, most papers have used the alternating gradient descent method
u^{t+1} = u^t − ρ^t ∇_u f(u^t, v^t),   v^{t+1} = v^t + σ^t ∇_v f(u^t, v^t).   (1)
Alternating descent fails to find minimax points even for the 2-dimensional examples (d)-(f) in Fig. 1, as we show empirically in Sec. 6.1. To explain the reason for failure, let us define the inner maximum value f*(u) := max_{v ∈ V} f(u, v) and the corresponding set of maximum points R(u) := {v ∈ V : f(u, v) = f*(u)}. The main reason for failure is that the inner solution may not be unique and can be discontinuous w.r.t. u. For example, in Fig. 1 (e), the maximizer jumps between two distinct values as u crosses the point of discontinuity. This discontinuity makes it impossible for a gradient descent-type method to keep track of the true inner maximization solution, as v has to jump between the maximizers.² 
² Also note that a gradient descent-type algorithm will diverge away from an anti-saddle, i.e., a point at which f is concave-convex instead of convex-concave.
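The discontinuity of the inner argmax is easy to reproduce numerically. The sketch below uses a toy surface of our own choosing (not one of the paper's six examples), f(u, v) = v² − v⁴/4 + u·v, whose maximizer in v jumps between two basins as u crosses 0, while the maximum value f*(u) itself stays continuous:

```python
import numpy as np

# Toy surface (our own illustration, not one of the paper's six examples):
# f(u, v) = v^2 - v^4/4 + u*v has two local maxima in v, one per basin.
def f(u, v):
    return v**2 - v**4 / 4.0 + u * v

v_grid = np.linspace(-3.0, 3.0, 6001)

def argmax_v(u):
    """Inner maximizer v*(u), approximated on a dense grid."""
    return v_grid[np.argmax(f(u, v_grid))]

# The argmax jumps discontinuously as u crosses 0 (between the two basins)...
v_neg, v_pos = argmax_v(-0.5), argmax_v(+0.5)
print(v_neg, v_pos)   # opposite signs: the maximizer flips basins

# ...while the maximum value f*(u) itself stays continuous.
fstar = lambda u: np.max(f(u, v_grid))
print(abs(fstar(-1e-3) - fstar(+1e-3)))  # tiny
```

Any gradient ascent on v started in one basin stays in that basin, which is exactly why a single tracked maximizer goes stale when u crosses the discontinuity.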
In this paper, we propose a K-beam approach that tracks K candidate solutions (or "beams") of the inner maximization problem to handle the discontinuity. The proposed ϵ-subgradient algorithm (Algs. 1 and 2) generalizes the alternating gradient-descent method (K = 1) and also exact subgradient methods (ϵ = 0). In the analysis, we prove that it can find minimax points if the inner problem can be approximated well by maximization over a finite candidate set at each step, summarized by Theorem 7, which is the main result of the analysis. For the purpose of analysis we assume that f is convex in u, similar to the majority of the analyses of gradient-type algorithms. However, we allow f to be non-concave in v and have multiple local maxima, which makes our setting much more general than that of classic saddle-point problems with convex-concave f, or previous work which assumed only bilinear couplings between u and v (Chambolle & Pock, 2011; He & Yuan, 2012).
Practically, the algorithm can find solutions that gradient descent cannot find, with only a sublinear increase of time complexity in K. To demonstrate the advantages of the algorithm, we test it on the toy surfaces (Fig. 1) for which we know the true minimax solutions. For real-world demonstrations, we also test the algorithm on GAN problems (Goodfellow et al., 2014) and unsupervised domain-adaptation problems (Ganin & Lempitsky, 2015). Examples were chosen so that the performance can be measured objectively: by the Jensen-Shannon divergence for GANs and by cross-domain classification error for domain adaptation. Evaluations show that the proposed K-beam subgradient-descent approach can significantly improve the stability and convergence speed of minimax optimization.
The remainder of the paper is organized as follows. We discuss related work in Sec. 2 and background in Sec. 3. We propose the main algorithm in Sec. 4 and present the analysis in Sec. 5. The results of experiments are summarized in Sec. 6, and the paper is concluded in Sec. 7. Due to space limits, all proofs in Sec. 5 and additional figures are reported in the Appendix. The code for the project can be found at https://github.com/jihunhamm/kbeamminimax.
2 Related work
Following the seminal work of Arrow et al. (1958) (Chap. 10 by Uzawa (1958) in particular), many researchers have studied the convergence of (sub)gradient descent for saddle-point problems under different stability conditions (Dem'yanov & Pevnyi, 1972; Golshtein, 1972; Maistroskii, 1977; Zabotin, 1988; Nedić & Ozdaglar, 2009). Optimization methods for minimax problems have also been studied somewhat independently. The algorithm proposed by Salmon (1968), referred to as the Salmon-Daraban method by Dem'yanov & Pevnyi (1972), finds continuous minimax points by solving successively larger discrete minimax problems. The algorithm can find minimax points for a differentiable f on compact U and V. However, the Salmon-Daraban method is impractical, as it requires exact minimization and maximization steps at each iteration, and also because its memory footprint increases linearly with the iteration count. Another method of continuous minimax optimization was proposed by Dem'yanov & Malozemov (1971, 1974). This grid method, similar to the Salmon-Daraban method, iteratively solves a discrete minimax problem to a finite precision using the steepest descent method.
Recently, a large number of papers have tried to improve GAN models in particular by modifying the objective (e.g., Uehara et al. (2016); Nowozin et al. (2016); Arjovsky et al. (2017)), but relatively little attention has been paid to the improvement of the optimization itself. Exceptions are the multi-adversarial GAN (Durugkar et al., 2016) and the Bayesian GAN (Saatci & Wilson, 2017), both of which used multiple discriminators and have shown improved performance, although no analysis was provided. Also, gradient-norm regularization has been studied recently to stabilize gradient descent (Mescheder et al., 2017; Nagarajan & Kolter, 2017; Roth et al., 2017), which is orthogonal to and can be used simultaneously with the proposed method. Note that there can be multiple causes of instability in minimax optimization, and what we address here is more general and not GAN-specific.
3 Background
Throughout the paper, we assume that f(u, v) is continuously differentiable in u and v separately. A general form of the minimax problem is min_{u ∈ U} max_{v ∈ V} f(u, v). We assume that U and V are compact and convex subsets of Euclidean spaces, such as a ball with a large but finite radius. Since f is continuous, min and max values are bounded and attainable. In addition, the solutions to the min or max problems are assumed to be in the interiors of U and V, enforced by adding appropriate regularization (e.g., norm penalties on u and v) to the optimization problems if necessary.
As already introduced in Sec. 1, the inner maximum value f*(u) and the maximum points R(u) are the key objects in the analysis of minimax problems.
Definition.
The maximum value is f*(u) := max_{v ∈ V} f(u, v).
Definition.
The set of maximum points is R(u) := {v ∈ V : f(u, v) = f*(u)}, i.e., R(u) = argmax_{v ∈ V} f(u, v).
Note that f*(u) and R(u) are functions of u. With abuse of notation, R(U) is the union of maximum points over all u, i.e., R(U) := ∪_{u ∈ U} R(u).
As a generalization, the ϵ-maximum points R_ϵ(u) are the points whose values are ϵ-close to the maximum:
R_ϵ(u) := {v ∈ V : f(u, v) ≥ f*(u) − ϵ}.
Definition.
L(u) is the set of local maximum points of f(u, ·) on V.
Note that ∇_v f(u, v) = 0 for v ∈ L(u) in the interior of V due to the differentiability assumption, and that R(u) ⊆ L(u).
Definition.
The problem min_{u ∈ U} max_{v ∈ V} f(u, v) is a discrete minimax problem if V is a finite set {v_1, ..., v_N}.
We accordingly define the discrete versions of f*(u), R(u), and R_ϵ(u) by maximizing over the finite set {v_1, ..., v_N}.
We also summarize a few results we will use, which can be found in convex analysis textbooks such as Hiriart-Urruty & Lemaréchal (2001).
Definition.
An ϵ-subgradient of a convex function h at u is a vector g that satisfies h(u′) ≥ h(u) + g^T(u′ − u) − ϵ for all u′.
The ϵ-subdifferential ∂_ϵ h(u) is the set of all ϵ-subgradients at u.
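The ϵ-subgradient inequality can be checked numerically on a simple convex function. The snippet below is our own illustration, using φ(u) = |u| rather than the paper's f*; it verifies that g = 0 fails the exact inequality at u₀ = 0.1 but becomes a valid ϵ-subgradient once ϵ ≥ 0.1:

```python
import numpy as np

# epsilon-subgradient check for the convex function phi(u) = |u|.
# g is an eps-subgradient of phi at u0 iff
#   phi(u') >= phi(u0) + g*(u' - u0) - eps   for all u'.
phi = np.abs
u_grid = np.linspace(-5.0, 5.0, 10001)  # grid proxy for "all u'"

def is_eps_subgradient(g, u0, eps):
    return bool(np.all(phi(u_grid) >= phi(u0) + g * (u_grid - u0) - eps))

# At u0 = 0.1 the exact gradient is +1, so g = +1 is a 0-subgradient.
assert is_eps_subgradient(+1.0, 0.1, eps=0.0)
# g = 0 violates the exact inequality (take u' = 0) ...
assert not is_eps_subgradient(0.0, 0.1, eps=0.05)
# ... but becomes a valid eps-subgradient once eps >= phi(u0) = 0.1.
assert is_eps_subgradient(0.0, 0.1, eps=0.1)
```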
Consider the convex hull of the set of gradients at the (ϵ-)maximum points, conv{∇_u f(u, v) : v ∈ R_ϵ(u)}.
Lemma 1 (Corollary 4.3.2, Theorem 4.4.2, Hiriart-Urruty & Lemaréchal (2001)).
Suppose f(·, v) is convex in u for each v. Then ∂f*(u) = conv{∇_u f(u, v) : v ∈ R(u)}. Similarly, under the same convexity assumption, conv{∇_u f(u, v) : v ∈ R_ϵ(u)} ⊆ ∂_ϵ f*(u).
Definition.
A point u is called an ϵ-stationary point of f* if the directional derivative of f* at u is no less than −ϵ in every direction.
Lemma 2 (Chap. 3.6, Dem'yanov & Malozemov (1974)).
A point u is an ϵ-stationary point of f* if and only if 0 ∈ ∂_ϵ f*(u).
4 Algorithm
The alternating gradient descent method predominantly used in the literature fails when the inner maximization has more than one solution, i.e., when R(u) is not a singleton. To address the problem, we propose the K-beam method, which simultaneously tracks the maximum points by keeping a candidate set {v_1, ..., v_K} for some large K. (The choice of K will be discussed in Analysis and Experiments.) This approach can be exact if the set of maximum points R(U) over the whole domain is finite, as in examples (a), (e), and (f) of Fig. 1 (see Appendix); in other words, the problem becomes a discrete minimax problem. More realistically, R(U) is infinite but R(u) can still be finite for each u, as in all the examples of Fig. 1 except (c). At the t-th iteration, the K-beam method updates the current candidates {v_1^t, ..., v_K^t} such that the discrete maximum max_i f(u, v_i^t) is a good approximation to the true f*(u). In addition, we present an ϵ-subgradient algorithm that generalizes exact subgradient algorithms.
4.1 Details of the algorithm
Alg. 1 is the main algorithm for solving minimax problems. At each iteration, the algorithm alternates between the min step and the max step. In the min step, it approximately minimizes f*(u) by following an ϵ-subgradient direction. In the max step, it updates the candidates {v_1, ..., v_K} to track the local maximum points of f(u, ·) so that the approximate subdifferential remains close to the true subdifferential ∂f*(u).
The hyperparameters of the algorithm are the beam size K, the total number of iterations T, the step-size schedules ρ^t for the min step and σ^t for the max step, and the approximation schedule ϵ^t.
Alg. 2 is the subroutine for finding a descent direction. If ϵ = 0, this subroutine identifies the best candidate v_i among the current set and returns its gradient ∇_u f(u, v_i). If ϵ > 0, it finds the ϵ-approximate candidates and returns any direction in the convex hull of their gradients. We make a few remarks below.

- Alternating gradient descent (1) is a special case of the K-beam algorithm for K = 1 and ϵ = 0.
- As will be shown in the experiments, the algorithm usually performs better with increasing K. However, the increase in computation can be made negligible, since the K updates in the max step can be performed in parallel.
- One can use different schemes for the step sizes ρ^t and σ^t. For the purpose of analysis, we use a non-summable but square-summable step size, e.g., ρ^t ∝ 1/t. Any such decreasing sequence can be used.
- The algorithm uses subgradients since the maximum value f*(u) is non-differentiable, even when f is differentiable, wherever there is more than one maximum point (Danskin, 1967). In practice, when ϵ is close to 0, the approximate maximum set in Alg. 2 is often a singleton, in which case the descent direction from Alg. 2 is simply the gradient ∇_u f(u, v_i) at the best candidate.
- Checking the stopping criterion can be non-trivial (see Sec. 5.4) and may be skipped in practice.
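To make the alternation concrete, here is a minimal numpy sketch of the min/max steps with ϵ = 0, run on a toy surface of our own choosing with two maximizer basins; the function f(u, v) = v² − v⁴/4 + u·v, the step sizes, and the box constraints are illustrative assumptions, not the paper's experimental setup. With K = 2 the candidates cover both basins and u converges near the minimax point u* = 0; with K = 1 the method reduces to alternating gradient descent and repeatedly overshoots:

```python
import numpy as np

# Minimal K-beam sketch (eps = 0: the min step follows the gradient at the
# single best candidate). Toy surface of our own choosing:
#   f(u, v) = v^2 - v^4/4 + u*v,  u, v in [-3, 3],
# whose inner argmax jumps between two basins (near +/-sqrt(2)) at u = 0,
# so the true minimax point is u* = 0.
def f(u, v):       return v**2 - v**4 / 4.0 + u * v
def grad_u(u, v):  return v                      # df/du
def grad_v(u, v):  return 2.0 * v - v**3 + u     # df/dv

def k_beam(u0, v0, T=2000, sigma=0.1):
    u, v = u0, np.array(v0, dtype=float)   # v holds the K candidates
    traj = []
    for t in range(1, T + 1):
        rho = 0.3 / np.sqrt(t)             # diminishing min-step size
        # max step: ascend every candidate in parallel (projected to the box)
        v = np.clip(v + sigma * grad_v(u, v), -3.0, 3.0)
        # min step: descend along the gradient at the best candidate
        best = np.argmax(f(u, v))
        u = float(np.clip(u - rho * grad_u(u, v[best]), -3.0, 3.0))
        traj.append(u)
    return np.array(traj)

traj_beam = k_beam(1.0, [-1.5, 1.5])   # K = 2, one candidate per basin
traj_alt  = k_beam(1.0, [1.5])         # K = 1 reduces to alternating GD

print(abs(traj_beam[-1]))              # small: near the minimax point u* = 0
print(np.max(np.abs(traj_alt)))        # large: K = 1 overshoots past |u| > 1
```

The K = 1 run fails because its single candidate stays in a stale basin after u changes sign, so the min step keeps pushing u in the wrong direction until that basin disappears entirely.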
5 Analysis
We analyze the conditions under which Alg. 1 and Alg. 2 find a minimax point. We want the finite set V^t = {v_1^t, ..., v_K^t} at the t-th iteration to approximate the true maximum points well, which we measure by the following two distances. Firstly, we want the one-sided Hausdorff distance
d(R(u^t), V^t) = max_{v ∈ R(u^t)} min_{v_i ∈ V^t} ‖v − v_i‖   (2)
to be small, i.e., each global maximum is close to at least one candidate in V^t. Secondly, we also want the one-sided Hausdorff distance
d(V^t, L(u^t)) = max_{v_i ∈ V^t} min_{v ∈ L(u^t)} ‖v_i − v‖   (3)
to be small, where L(u^t) is the set of local maxima, i.e., each candidate is close to at least one local maximum. This requires that K be at least as large as |R(u^t)|.
We discuss the consequences of these requirements more precisely in the rest of the section.
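A direct implementation of the two one-sided distances may help fix the notation. The small numpy helper below (with illustrative 1-D point sets of our own) shows that d(A, B) and d(B, A) generally differ, which is why (2) and (3) impose two separate requirements:

```python
import numpy as np

# One-sided Hausdorff distance d(A, B) = max_{a in A} min_{b in B} ||a - b||:
# every point of A should be near some point of B (but not vice versa).
def one_sided_hausdorff(A, B):
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    # pairwise distance matrix of shape (|A|, |B|)
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return D.min(axis=1).max()

R = np.array([[0.0], [2.0]])           # e.g. two global maxima in R(u)
V = np.array([[0.1], [1.0], [2.05]])   # e.g. K = 3 tracked candidates

print(one_sided_hausdorff(R, V))  # 0.1: every maximum has a nearby candidate
print(one_sided_hausdorff(V, R))  # 1.0: the candidate at 1.0 is near no maximum
```

The asymmetry is the point: (2) asks that no global maximum be missed by the candidates, while (3) asks that no candidate wander away from every local maximum.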
For the purpose of analysis, we will make the following additional assumptions.
Assumptions.
f(·, v) is convex in u for each v, and f* achieves its minimum at some u*.
Also, f(u, ·) is Lipschitz in v for all u, and
∇_u f(u, ·) is Lipschitz in v for all u.
Remark on the assumptions. Note that we only assume the convexity of f over u and not the concavity over v, which makes this setting more general than that of classic analyses that assume concavity over v, or that of restricted models with a bilinear coupling between u and v. While we allow f to be non-concave in v and have multiple local maxima, we do require f and ∇_u f to be Lipschitz in v for the purpose of analysis.
5.1 Finite R(u), exact max step
If R(u) is finite for each u, and if the maximization in the max step can be done exactly, as assumed in the Salmon-Daraban method (Salmon, 1968), then the problem is no more difficult than a discrete minimax problem.
Lemma 3.
Suppose R(u) is finite at u. If R(u) ⊆ V^t, then max_{v_i ∈ V^t} f(u, v_i) = f*(u), and therefore the subdifferential of the discrete maximum coincides with ∂f*(u).
Since the subdifferential is exact, Alg. 1 finds a minimax solution just as the subgradient-descent method with the true ∂f*(u) does. We omit the proof and present a more general theorem shortly.
5.2 Finite R(u), inexact max step
Exact maximization in each max step is unrealistic unless max_v f(u, v) can be solved in closed form. Therefore we consider what happens to the convergence of the algorithm with an approximate max step. If the candidates approximate the maxima only to within some δ > 0 in the senses of (2) and (3), how close are the approximate and the true subdifferentials in the vicinity of u? The following lemmas answer this question. (See Appendix for a visual aid.) From the smoothness assumptions on f, we have
Lemma 4.
If d(V^t, L(u)) ≤ δ, then for each v_i ∈ V^t there is one or more v ∈ L(u) such that ‖v_i − v‖ ≤ δ and |f(u, v_i) − f(u, v)| ≤ Lδ, where L is the Lipschitz coefficient of f(u, ·).
The following lemma shows that if V^t approximates the maxima well, then each candidate chosen by Alg. 2 is not far from a true maximum in R(u).
Lemma 5.
Assume R(u) and L(u) are both finite at u. Let Δ be the smallest gap between the global and the non-global local maximum values at u; if all local maxima are global maxima, set Δ = ∞. If the approximation parameters ϵ and δ are small enough relative to Δ, then for each candidate v_i selected by Alg. 2 there is a true maximum v ∈ R(u) such that ‖v_i − v‖ ≤ δ.
Furthermore, the subgradients at the approximate maximum points are close to the subgradients at the true maximum points.
Lemma 6.
Suppose ϵ and δ are chosen as in Lemma 5 and ∇_u f is bounded. Then any direction returned by Alg. 2 is an ϵ-subgradient of f*.
We now state our main theorem: if the max step is accurate enough for a large enough K, in terms of Δ (a property of f) and ϵ (chosen by the user), then the algorithm finds the minimum value using a diminishing step size.
Theorem 7.
The ϵ can be any non-negative value. A large ϵ can make each min step better, since the descent direction in Alg. 2 uses more v_i's and is therefore more robust. The price to pay is that it may take more iterations for the max step to meet the required accuracy condition.
5.3 Infinite R(u)
An infinite R(u) is the most challenging case. We only mention the accuracy of approximating V with a finite and fixed grid V̄, as in the grid methods of Dem'yanov & Malozemov (1971, 1974).
Lemma 8.
For any ϵ > 0, one can choose a fixed finite grid V̄ ⊂ V such that |max_{v ∈ V̄} f(u, v) − f*(u)| ≤ ϵ holds for all u. Furthermore, if ū is the minimizer of the grid approximation, then f*(ū) is within O(ϵ) of min_u f*(u).
If the grid V̄ is dense enough, the solution can be made arbitrarily accurate, but the corresponding grid size |V̄| can be too large and has to be limited in practice.
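The trade-off between grid density and accuracy is easy to see numerically. The sketch below (using an illustrative toy f of our own, not one of the paper's surfaces) compares fixed-grid approximations of f*(u) against a much denser reference grid:

```python
import numpy as np

# Approximating f*(u) = max_v f(u, v) with a fixed finite grid V-bar,
# in the spirit of the grid methods discussed above (toy f of our own).
def f(u, v):
    return v**2 - v**4 / 4.0 + u * v

u_test = np.linspace(-1.0, 1.0, 201)
v_fine = np.linspace(-3.0, 3.0, 20001)   # dense proxy for the true maximum

def fstar_on(v_set):
    """Discrete approximation of f*(u) over a fixed set of v values."""
    return np.max(f(u_test[:, None], v_set[None, :]), axis=1)

exact = fstar_on(v_fine)

def max_err(n_grid):
    """Worst-case approximation gap over u for an n_grid-point V-bar."""
    return np.max(exact - fstar_on(np.linspace(-3.0, 3.0, n_grid)))

# A denser grid gives a uniformly better approximation ...
print(max_err(11), max_err(101), max_err(1001))
# ... but the grid size needed for a given accuracy grows quickly.
```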
5.4 Optional stopping criteria
The function f* is non-smooth and its gradient need not vanish at the minimum, causing oscillations. A stopping criterion can help to terminate early. We can stop at an ϵ-stationary point of f* by checking whether 0 ∈ ∂_ϵ f*(u), from Lemma 2. Algorithmically, this check is done by solving an LP or a QP problem (Dem'janov, 1968). The stopping criterion presented in Alg. 2 is a necessary condition for the approximate stationarity of f*:
Lemma 9.
Let ϵ′ = ϵ + Lδ, where L is the Lipschitz coefficient of ∇_u f(u, ·) in v. If u is an ϵ-stationary point of the discrete approximation, then u is an ϵ′-stationary point of f*.
The QP problem has only K variables, which is small for moderate K, but it can still be costly to solve at every iteration. It is therefore more practical to stop after a maximum number of iterations or to check the stopping criterion only every so often.
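The stationarity check itself reduces to a small QP over the K mixture weights: find the minimum-norm point of the convex hull of the K gradients and test whether its norm is (near) zero. A sketch using scipy's SLSQP solver (an illustrative choice on our part; any QP solver works) could look like:

```python
import numpy as np
from scipy.optimize import minimize

# Stationarity check in the spirit of Lemma 2 / Alg. 2: u is (approximately)
# stationary when 0 lies in the convex hull of the gradients
# {grad_u f(u, v_i)}. This is a small QP in the K mixture weights lambda.
def min_norm_in_hull(G):
    """min ||lam @ G||  s.t.  lam >= 0, sum(lam) = 1  (G is K x dim)."""
    K = G.shape[0]
    obj = lambda lam: np.sum((lam @ G) ** 2)
    res = minimize(obj, np.ones(K) / K, method="SLSQP",
                   bounds=[(0.0, 1.0)] * K,
                   constraints=[{"type": "eq",
                                 "fun": lambda lam: lam.sum() - 1.0}])
    return np.sqrt(max(res.fun, 0.0))

# Gradients pointing in opposite directions: 0 is in the hull -> stationary.
print(min_norm_in_hull(np.array([[-1.0, 0.0], [1.0, 0.0]])))   # ~0
# All gradients in one half-space: 0 is outside the hull -> not stationary.
print(min_norm_in_hull(np.array([[1.0, 0.0], [1.0, 0.5]])))    # ~1
```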
6 Experiments
6.1 Simple surfaces
We test the proposed algorithm by finding minimax points of the simple surfaces in Fig. 1. We compare alternating gradient descent (AltGD) and the proposed K-beam algorithm for several values of K. Note that for K = 1, the K-beam algorithm is basically the same as AltGD. Since the domain is constrained, we use the projected gradient at each step with a common learning rate. In our preliminary tests, the value of ϵ in Alg. 1 did not critically affect the results, and we fix its value for all subsequent tests. The experiments are repeated for 100 trials with random initial conditions.
Fig. 2 shows the convergence of AltGD and K-beam after 200 iterations, measured by the distance min_{u* ∈ S} ‖u^t − u*‖ of the current solution to the closest optimal point, where S is the set of minimax solutions. We plot the average and the confidence interval over the 100 trials. All methods converge well for surfaces (a) and (b). The surface (c) is more difficult: although the origin is a saddle point, it is unstable, as it has no open neighborhood in which f is a local minimum in u and a local maximum in v. For the non-saddle-point problems (d)-(f), one can see that AltGD simply cannot find the true solution, whereas K-beam can find the solution if K is large enough. For the anti-saddle (e), K = 2 is the smallest beam size that finds the solution, since the number of local maximum points is at most 2. However, the concavity-convexity of f (instead of convexity-concavity) makes optimization difficult, and a larger K therefore helps to recover from bad random initial points and find the solution.
6.2 GAN training with MoG
We train GANs with the proposed algorithm to learn a generative model of two-dimensional mixtures of Gaussians (MoGs). Let x be a sample from the MoG, and let z be a sample from a 256-dimensional Gaussian distribution. The optimization problem is the standard GAN objective
min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))],
where G and D are the generator and discriminator networks, respectively. Both G and D are two-layer tanh networks with 128 hidden units per layer, trained with the Adam optimizer with batch size 128 and separate learning rates for the discriminator and the generator.
For evaluation, we measure the Jensen-Shannon divergence (JSD)
between the true MoG and the samples from the generator. We measure the divergence by discretizing the 2D region into bins and comparing the histograms of 64,000 random samples from the generator and 640,000 samples from the MoG. The top row of Fig. 3 shows the JSD curves of K-beam with K = 1, 2, 5, 10. AltGD performs nearly the same as K = 1 and is omitted. The results are from 10 trials with random initialization. Note first that GAN training is sensitive, in that each trial curve is jagged and often falls into "mode collapse", where there is a jump in the curve. With increasing K, the curve converges faster on average and is more stable, as evidenced by the shrinking variance. The bottom row of Fig. 3 shows the corresponding samples from the generators after 10,000, 20,000, and 50,000 iterations from all 10 trials. The generated samples are also qualitatively better with increasing K.
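The histogram-based JSD evaluation can be sketched as follows; the bin count, sample sizes, and Gaussian test distributions below are our own illustrative choices, not the paper's exact settings:

```python
import numpy as np

# Histogram-based Jensen-Shannon divergence between two 2-D sample sets,
# similar in spirit to the evaluation described above.
def jsd_hist(X, Y, bins=30, rng=((-6, 6), (-6, 6))):
    P, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins, range=rng)
    Q, _, _ = np.histogram2d(Y[:, 0], Y[:, 1], bins=bins, range=rng)
    P, Q = P / P.sum(), Q / Q.sum()
    M = 0.5 * (P + Q)
    def kl(A, B):  # KL(A||B), skipping empty bins of A
        mask = A > 0
        return np.sum(A[mask] * np.log(A[mask] / B[mask]))
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)   # in nats; max is log(2) ~ 0.693

rs = np.random.RandomState(0)
same  = jsd_hist(rs.randn(64000, 2), rs.randn(64000, 2))        # same dist.
apart = jsd_hist(rs.randn(64000, 2), rs.randn(64000, 2) + 5.0)  # disjoint
print(same, apart)   # near 0 vs. near log(2)
```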
Additionally, we measure the runtime of the algorithms by wall clock on the same system, using a single NVIDIA GTX 980 4GB GPU with a single Intel Core i7-2600 CPU. Even on a single GPU, the runtime per iteration increases only sublinearly in K: relative to the time required for K = 1, we get 1.07 (K = 2), 1.63 (K = 5), and 2.26 (K = 10). Since the advantages are clear and the incurred overhead is negligible, there is a strong motivation to use the proposed method instead of AltGD.
6.3 Unsupervised domain adaptation
We perform experiments on unsupervised domain adaptation (Ganin & Lempitsky, 2015), which is another example of a minimax problem. In domain adaptation, it is assumed that two data sets belonging to different domains share the same structure. For example, MNIST and MNIST-M are both images of handwritten digits 0-9, but MNIST-M is in color and has random background patches. Not surprisingly, a classifier trained on MNIST does not perform well on digits from MNIST-M out of the box. Unsupervised domain adaptation tries to learn a common transformation of the domains into another representation/feature space such that the distributions of the two domains are as similar as possible while preserving the digit-class information. The discriminator tries to predict the domain accurately, and the target classifier tries to predict the label correctly. The optimization problem can be written as a minimax problem whose cost
is the weighted difference of the expected risks of the domain classifier and the digit classifier. This form of minimax problem has also been proposed earlier by Hamm (2015, 2017) to remove sensitive information from data. In this experiment, we show domain adaptation results. The transformer is a two-layer ReLU convolutional network that maps the input images to an internal representation of dimension 2352. The discriminator is a single-layer ReLU dense network of 100 hidden units, and the digit classifier is a two-layer ReLU dense network of 100 hidden units. All networks are trained with the momentum optimizer with a batch size of 128. The experiments are repeated for 10 trials with random initialization.
We performed the task of predicting the class of MNIST-M digits, trained using labeled examples of MNIST and unlabeled examples of MNIST-M. Fig. 4 shows the classification error of in-domain (top row) and cross-domain (bottom row) prediction tasks as a function of iterations. Again we omit the result of AltGD, as it performs nearly the same as K = 1. With small K, the average error is high for both in-domain and cross-domain tests, due to failed optimization, which can be observed in the traces of the trials. As K increases, the instability disappears and both in-domain and cross-domain errors converge to their lowest values.
Summary and discussions

- Experiments with 2D surfaces clearly show that the alternating gradient-descent method can fail completely when the minimax points are not local saddle points, while the K-beam method can find the true solutions.
- For GAN and domain-adaptation problems involving nonlinear neural networks, K-beam and AltGD can both find good solutions if they converge. The key difference is that K-beam consistently converges to a good solution, whereas AltGD finds the solution only rarely (the bottom yellow curves for K = 1 in Fig. 3 and Fig. 4). Similar results can be observed in the GAN-MNIST experiments in the Appendix.
- The true number of maximum points cannot be computed analytically for non-trivial functions. However, an overestimated K does not hurt the performance theoretically; it is only redundant. On the other hand, an underestimated K can be suboptimal but is still better than K = 1. Therefore, in practice, one can choose K as large as allowed by resource limits, such as K = 5 or 10.
- The K-beam method is different from running AltGD for K times more iterations, since the instability of AltGD hinders convergence regardless of the total number of iterations. The K-beam method is also different from K parallel independent runs of AltGD, which correspond to the K = 1 curves of Fig. 3 and Fig. 4 with K times more trials: the variance would be reduced, but the average curve would remain similar.
7 Conclusions
In this paper, we propose the K-beam subgradient descent algorithm to solve continuous minimax problems that appear frequently in machine learning. While simple to implement, the proposed algorithm can significantly improve the convergence of optimization compared to the alternating gradient descent approach, as demonstrated on synthetic and real-world examples. We analyze the conditions for convergence without assuming concavity or bilinearity, which we believe is the first such result in the literature. There remain open questions regarding possible relaxations of the assumptions, which are left for future work.
References
 Arjovsky et al. (2017) Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Arrow et al. (1958) Arrow, Kenneth Joseph, Hurwicz, Leonid, Uzawa, Hirofumi, and Chenery, Hollis Burnley. Studies in linear and nonlinear programming. Stanford University Press, 1958.
 Boyd et al. (2003) Boyd, Stephen, Xiao, Lin, and Mutapcic, Almir. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter, 2003.
 Brückner & Scheffer (2011) Brückner, Michael and Scheffer, Tobias. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 547–555. ACM, 2011.
 Chambolle & Pock (2011) Chambolle, Antonin and Pock, Thomas. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
 Correa & Lemaréchal (1993) Correa, Rafael and Lemaréchal, Claude. Convergence of some algorithms for convex minimization. Mathematical Programming, 62(1):261–275, 1993.
 Danskin (1967) Danskin, John M. The theory of maxmin and its application to weapons allocation problems. Springer, 1967.
 Dem’janov (1968) Dem’janov, Vladimir F. Algorithms for some minimax problems. Journal of Computer and System Sciences, 2(4):342–380, 1968.
 Dem’yanov & Malozemov (1971) Dem’yanov, Vladimir Fedorovich and Malozemov, Vassili Nikolaevich. On the theory of nonlinear minimax problems. Russian Mathematical Surveys, 26(3):57–115, 1971.
 Dem’yanov & Malozemov (1974) Dem’yanov, Vladimir Fedorovich and Malozemov, Vassili Nikolaevich. Introduction to minimax. John Wiley & Sons, 1974.
 Dem’yanov & Pevnyi (1972) Dem’yanov, Vladimir Fedorovich and Pevnyi, Aleksandr Borisovich. Numerical methods for finding saddle points. USSR Computational Mathematics and Mathematical Physics, 12(5):11–52, 1972.
 Durugkar et al. (2016) Durugkar, Ishan, Gemp, Ian, and Mahadevan, Sridhar. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
 Edwards & Storkey (2015) Edwards, Harrison and Storkey, Amos. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015.
 Ganin & Lempitsky (2015) Ganin, Yaroslav and Lempitsky, Victor. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1180–1189, 2015.
 Globerson & Roweis (2006) Globerson, Amir and Roweis, Sam. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd international conference on Machine learning, pp. 353–360. ACM, 2006.
 Golshtein (1972) Golshtein, EG. Generalized gradient method for finding saddle points. Matekon, 10(3):36–52, 1972.
 Goodfellow et al. (2014) Goodfellow, Ian, PougetAbadie, Jean, Mirza, Mehdi, Xu, Bing, WardeFarley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Hamm (2015) Hamm, Jihun. Preserving privacy of continuous highdimensional data with minimax filters. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 324–332, 2015.
 Hamm (2017) Hamm, Jihun. Minimax filter: learning to preserve privacy from inference attacks. The Journal of Machine Learning Research, 18(1):4704–4734, 2017.
 He & Yuan (2012) He, Bingsheng and Yuan, Xiaoming. Convergence analysis of primal-dual algorithms for a saddle-point problem: From contraction perspective. SIAM Journal on Imaging Sciences, 5(1):119–149, 2012.
 Hiriart-Urruty & Lemaréchal (2001) Hiriart-Urruty, Jean-Baptiste and Lemaréchal, Claude. Fundamentals of convex analysis. Springer, 2001.
 Maistroskii (1977) Maistroskii, D. Gradient methods for finding saddle points. Matekon, 14(1):3–22, 1977.
 Mescheder et al. (2017) Mescheder, Lars, Nowozin, Sebastian, and Geiger, Andreas. The numerics of GANs. In Advances in Neural Information Processing Systems, pp. 1823–1833, 2017.
 Nagarajan & Kolter (2017) Nagarajan, Vaishnavh and Kolter, J Zico. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5591–5600, 2017.
 Nedić & Ozdaglar (2009) Nedić, Angelia and Ozdaglar, Asuman. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.
 Nowozin et al. (2016) Nowozin, Sebastian, Cseke, Botond, and Tomioka, Ryota. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
 Roth et al. (2017) Roth, Kevin, Lucchi, Aurelien, Nowozin, Sebastian, and Hofmann, Thomas. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pp. 2015–2025, 2017.
 Saatci & Wilson (2017) Saatci, Yunus and Wilson, Andrew G. Bayesian GAN. In Advances in Neural Information Processing Systems, pp. 3622–3631, 2017.
 Salmon (1968) Salmon, D. Minimax controller design. IEEE Transactions on Automatic Control, 13(4):369–376, 1968.
 Uehara et al. (2016) Uehara, Masatoshi, Sato, Issei, Suzuki, Masahiro, Nakayama, Kotaro, and Matsuo, Yutaka. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
 Uzawa (1958) Uzawa, H. Iterative methods in concave programming. In Arrow, KJ, Hurwicz, L, and Uzawa, H (eds.), Studies in linear and nonlinear programming, chapter 10, pp. 154–165. Stanford University Press, Stanford, CA, 1958.
 v. Neumann (1928) v. Neumann, J. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
 Zabotin (1988) Zabotin, IY. A subgradient method for finding a saddle point of a convexconcave function. Issled. Prikl. Mat, 15:6–12, 1988.
Appendix
Appendix A Simple surfaces
(Table of critical points, saddle points, and minimax points for surfaces (a)-(f).)
Fig. 1 shows the six surfaces and the maximum value function f*(u). From f*(u) one can check the minima of each surface (a)-(f), the corresponding maxima at each minimum, and the set R(U) over the whole domain; the specific values can be verified by solving the minimax problems in closed form.
Note that the origin is a critical point for all six surfaces. It is also a global saddle point and a minimax point for surfaces (a)-(c), but is neither a saddle point nor a minimax point for surfaces (d)-(f).
Appendix B Proofs
Lemma 1 (Corollary 4.3.2, Theorem 4.4.2, Hiriart-Urruty & Lemaréchal (2001)).
Suppose f(·, v) is convex in u for each v. Then ∂f*(u) = conv{∇_u f(u, v) : v ∈ R(u)}. Similarly, under the same convexity assumption, conv{∇_u f(u, v) : v ∈ R_ϵ(u)} ⊆ ∂_ϵ f*(u).
Lemma 2 (Chap. 3.6, Dem'yanov & Malozemov (1974)).
A point u is an ϵ-stationary point of f* if and only if 0 ∈ ∂_ϵ f*(u).
Lemma 3.
Suppose R(u) is finite at u. If R(u) ⊆ V^t, then max_{v_i ∈ V^t} f(u, v_i) = f*(u), and therefore the subdifferential of the discrete maximum coincides with ∂f*(u).
Proof.
Since R(u) ⊆ V^t, every global maximum point of f(u, ·) belongs to the candidate set, so max_{v_i ∈ V^t} f(u, v_i) = f*(u), and the maximizing candidates are exactly the elements of R(u). The remainder of the lemma follows from the definition of subdifferentials and Lemma 1. ∎
Fig. 6 explains several symbols used in the following lemmas.
Lemma 4.
If d(V^t, L(u)) ≤ δ, then for each v_i ∈ V^t there is one or more v ∈ L(u) such that ‖v_i − v‖ ≤ δ and |f(u, v_i) − f(u, v)| ≤ Lδ, where L is the Lipschitz coefficient of f(u, ·).
The proof follows directly from the Lipschitz assumptions.
Lemma 5.
Assume R(u) and L(u) are both finite at u. Let Δ be the smallest gap between the global and the non-global local maximum values at u; if all local maxima are global maxima, set Δ = ∞. If the approximation parameters ϵ and δ are small enough relative to Δ, then for each candidate v_i selected by Alg. 2 there is a true maximum v ∈ R(u) such that ‖v_i − v‖ ≤ δ.
Proof.
Any candidate close to a global maximum has a value close to f*(u), while any candidate close to a non-global local maximum has a value at least Δ below f*(u), up to the Lipschitz error. Consequently, when ϵ and δ are small relative to Δ, the values of the two groups of candidates are separated, so the ϵ-approximate candidates selected by Alg. 2 are exactly those near the global maxima. ∎
Lemma 6.
Suppose ϵ and δ are chosen as in Lemma 5 and ∇_u f is bounded. Then any direction returned by Alg. 2 is an ϵ-subgradient of f*.
Proof.
The claim follows by combining Lemma 5 with the boundedness of ∇_u f and the definition of the ϵ-subgradient. ∎
Theorem 7.
Note that a stronger result is possible (see, e.g., Correa & Lemaréchal (1993)), but we give a simpler proof similar to Boyd et al. (2003), which assumes that the subgradients are bounded in norm by some constant G.
Proof.
We combine the previous lemmas with the standard proof of the subgradient descent method. Let g^t be the ϵ-subgradient used at iteration t. Then
‖u^{t+1} − u*‖² ≤ ‖u^t − u*‖² − 2ρ^t (g^t)^T(u^t − u*) + (ρ^t)² ‖g^t‖²
≤ ‖u^t − u*‖² − 2ρ^t (f*(u^t) − f*(u*) − ϵ) + (ρ^t)² G²
from the definition of the ϵ-subgradient. Taking the minimum over t on both sides gives us