# K-Beam Subgradient Descent for Minimax Optimization

Jihun Hamm, Yung-Kyun Noh

###### Abstract

Minimax optimization plays a key role in adversarial training of machine learning algorithms, such as learning generative models, domain adaptation, privacy preservation, and robust learning. In this paper, we demonstrate the failure of alternating gradient descent in minimax optimization problems due to the discontinuity of solutions of the inner maximization. To address this, we propose a new $\epsilon$-subgradient descent algorithm that simultaneously tracks $K$ candidate solutions. Practically, the algorithm can find solutions that previous saddle-point algorithms cannot find, with only a sublinear increase of complexity in $K$. We analyze in detail the conditions under which the algorithm converges to the true solution. A significant improvement in stability and convergence speed of the algorithm is observed in simple representative problems, GAN training, and domain-adaptation problems.

## 1 Introduction

There is a wide range of problems in machine learning which can be formulated as continuous minimax optimization problems. Examples include generative adversarial nets (GANs) (Goodfellow et al., 2014), privacy preservation (Hamm, 2015; Edwards & Storkey, 2015), domain adaptation (Ganin & Lempitsky, 2015), and robust learning (Globerson & Roweis, 2006), to name a few. More broadly, the problem of finding a worst-case solution or an equilibrium of a leader-follower game (Brückner & Scheffer, 2011) can be formulated as a minimax problem. Furthermore, the KKT condition for a convex problem can be considered a minimax point of the Lagrangian (Arrow et al., 1958). Efficient solvers for minimax problems can have positive impacts on all these fronts.

To define the problem, consider a real-valued function $f(u,v)$ on a subset $U \times V$ of a Euclidean space. A (continuous) minimax optimization problem is $\min_{u \in U} \max_{v \in V} f(u,v)$. (Footnote 1: A more general problem is $\inf_{u \in U} \sup_{v \in V} f(u,v)$, but we will assume that the min and the max exist and are achievable, as explained further in Sec. 3.) It is called a discrete minimax problem if the maximization domain $V$ is finite. A related notion is the (global) saddle point $(u^*, v^*)$, which is a point that satisfies

$$f(u^*, v) \le f(u^*, v^*) \le f(u, v^*), \quad \forall (u,v) \in U \times V.$$

Despite the fact that saddle points and minimax points are conceptually different, many machine learning applications in the literature do not distinguish the two. This is a mistake, because a local saddle point is only an equilibrium point and is not the robust or worst-case solution that the problem may be seeking. Furthermore, most papers have used the alternating gradient descent method

$$u \leftarrow u - \rho \nabla_u f(u,v), \quad \text{and} \quad v \leftarrow v + \eta \nabla_v f(u,v). \tag{1}$$

Alternating descent fails to find minimax points even for the 2-dimensional examples (d)-(f) in Fig. 1, as we show empirically in Sec. 6.1. To explain the reason for failure, let us define the inner maximum value $\phi(u) := \max_{v \in V} f(u,v)$ and the corresponding maximum points $R(u) := \arg\max_{v \in V} f(u,v)$. The main reason for failure is that the solution $R(u)$ may not be unique and can be discontinuous w.r.t. $u$. For example, in Fig. 1 (e), $R(u)$ takes one value on one side of a point in $u$ and a different value on the other side. This discontinuity makes it impossible for a gradient descent-type method to keep track of the true inner maximization solution, as $v$ has to jump between the two values. (Footnote 2: Also note that gradient descent-type algorithms will diverge away from the anti-saddle point of this surface, i.e., the point at which $f$ is concave-convex instead of convex-concave.)
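The jump of the inner maximizer can be illustrated with a minimal sketch. The bilinear surface $f(u,v) = uv$ on $V = [-1, 1]$ used here is a hypothetical example chosen for simplicity, not one of the surfaces of Fig. 1, and the names `V_GRID` and `inner_argmax` are illustrative only.

```python
import numpy as np

def f(u, v):
    # Hypothetical bilinear surface (an assumption, not taken from Fig. 1).
    return u * v

# Discretized inner domain V = [-1, 1].
V_GRID = np.linspace(-1.0, 1.0, 201)

def inner_argmax(u):
    """Return a maximizer of f(u, .) over the grid, i.e., a point of R(u)."""
    return float(V_GRID[np.argmax(f(u, V_GRID))])

# The maximizer sits at one end of V for u < 0 and at the other end for u > 0,
# so a small step in u across 0 requires a large jump in v.
for u in (-0.1, -0.01, 0.01, 0.1):
    print(u, inner_argmax(u))
```

For this surface $\phi(u) = |u|$, which is minimized at $u = 0$, exactly where $R(u)$ is discontinuous.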

In this paper, we propose a $K$-beam approach that tracks $K$ candidate solutions (or “beams”) of the inner maximization problem to handle the discontinuity. The proposed $\epsilon$-subgradient algorithm (Algs. 1 and 2) generalizes the alternating gradient-descent method ($K$=1) and also exact subgradient methods. In the analysis, we prove that it can find minimax points if the inner problem $\max_{v \in V} f(u,v)$ can be approximated well by $\max_{v \in A} f(u,v)$ over a finite set $A$ at each $u$, summarized by Theorem 7, which is the main result of the analysis. For the purpose of analysis we assume that $f(u,v)$ is convex in $u$, similar to the majority of the analyses on gradient-type algorithms. However, we allow $f(u,v)$ to be non-concave in $v$ and to have multiple local maxima, which makes our setting much more general than that of classic saddle point problems with convex-concave $f$ or previous work which assumed only bilinear couplings between $u$ and $v$ (Chambolle & Pock, 2011; He & Yuan, 2012).

Practically, the algorithm can find solutions that gradient descent cannot find, with only a sublinear increase of time complexity in $K$. To demonstrate the advantages of the algorithm, we test it on the toy surfaces (Fig. 1) for which we know the true minimax solutions. For real-world demonstrations, we also test the algorithm on GAN problems (Goodfellow et al., 2014) and unsupervised domain-adaptation problems (Ganin & Lempitsky, 2015). Examples were chosen so that the performance can be measured objectively: by the Jensen-Shannon divergence for GAN and by cross-domain classification error for domain adaptation. Evaluations show that the proposed $K$-beam subgradient-descent approach can significantly improve stability and convergence speed of minimax optimization.

The remainder of the paper is organized as follows. We discuss related work in Sec. 2 and background in Sec. 3. We propose the main algorithm in Sec. 4 and present the analysis in Sec. 5. The results of experiments are summarized in Sec. 6, and the paper is concluded in Sec. 7. Due to space limits, all proofs in Sec. 5 and additional figures are reported in the Appendix. The code for the project can be found at https://github.com/jihunhamm/k-beam-minimax.

## 2 Related work

Following the seminal work of Arrow et al. (1958) (Chap. 10 of Uzawa (1958) in particular), many researchers have studied the convergence of (sub)gradient descent for saddle point problems under different stability conditions (Dem’yanov & Pevnyi, 1972; Golshtein, 1972; Maistroskii, 1977; Zabotin, 1988; Nedić & Ozdaglar, 2009). Optimization methods for minimax problems have also been studied somewhat independently. The algorithm proposed by Salmon (1968), referred to as the Salmon-Daraban method by Dem’yanov & Pevnyi (1972), finds continuous minimax points by solving successively larger discrete minimax problems. The algorithm can find minimax points for a differentiable $f$ on compact $U$ and $V$. However, the Salmon-Daraban method is impractical, as it requires exact minimization and maximization steps at each iteration, and also because the memory footprint increases linearly with the number of iterations. Another method of continuous minimax optimization was proposed by Dem’yanov & Malozemov (1971, 1974). The grid method, similar to the Salmon-Daraban method, iteratively solves a discrete minimax problem to a finite precision using the $\epsilon$-steepest descent method.

Recently, a large number of papers tried to improve GAN models in particular by modifying the objective (e.g., Uehara et al. (2016); Nowozin et al. (2016); Arjovsky et al. (2017)), but relatively little attention was paid to improving the optimization itself. Exceptions are the multiadversarial GAN (Durugkar et al., 2016) and the Bayesian GAN (Saatci & Wilson, 2017), both of which used multiple discriminators and have shown improved performance, although no analysis was provided. Also, gradient-norm regularization has been studied recently to stabilize gradient descent (Mescheder et al., 2017; Nagarajan & Kolter, 2017; Roth et al., 2017), which is orthogonal to and can be used simultaneously with the proposed method. Note that there can be multiple causes of instability in minimax optimization, and what we address here is more general and not GAN-specific.

## 3 Backgrounds

Throughout the paper, we assume that $f(u,v)$ is a continuously differentiable function in $u$ and $v$ separately. A general form of the minimax problem is $\inf_{u \in U} \sup_{v \in V} f(u,v)$. We assume that $U$ and $V$ are compact and convex subsets of Euclidean spaces, such as a ball with a large but finite radius. Since $f$ is continuous, the min and max values are bounded and attainable. In addition, the solutions to the min or max problems are assumed to be in the interior of $U$ and $V$, enforced by adding appropriate regularization (e.g., $\|u\|^2$ and $-\|v\|^2$) to the optimization problems if necessary.

As already introduced in Sec. 1, the inner maximum value and points are the key objects in the analysis of minimax problems.

###### Definition.

The maximum value is $\phi(u) := \max_{v \in V} f(u,v)$.

###### Definition.

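The corresponding set of maximum points is $R(u) := \arg\max_{v \in V} f(u,v)$, i.e., $R(u) = \{v \in V \mid f(u,v) = \max_{v' \in V} f(u,v')\}$.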

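Note that $\phi(u)$ and $R(u)$ are functions of $u$. With abuse of notation, $R(U)$ is the union of maximum points for all $u \in U$, i.e., $R(U) := \bigcup_{u \in U} R(u)$.

As a generalization, the $\epsilon$-maximum points $R^\epsilon(u)$ are the points whose values are $\epsilon$-close to the maximum: $R^\epsilon(u) := \{v \in V \mid \max_{v' \in V} f(u,v') - f(u,v) \le \epsilon\}$.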

###### Definition.

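$S(u)$ is the set of local maximum points

$$S(u) := \{v_0 \in V \mid \exists r > 0 \ \text{s.t.} \ \forall v \in V, \ \|v_0 - v\| \le r \Rightarrow f(u, v_0) \ge f(u, v)\}.$$

Note that $\nabla_v f(u,v) = 0$ for $v \in S(u)$ due to the differentiability assumption, and that $R(u) \subseteq S(u)$.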

###### Definition.

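$\min_{u \in U} \max_{v \in A} f(u,v)$ is a discrete minimax problem if $A$ is a finite set $\{v^1, \dots, v^K\}$.

We accordingly define $\phi_A(u)$, $R_A(u)$, and $R_A^\epsilon(u)$ by $\phi_A(u) := \max_{v \in A} f(u,v)$, $R_A(u) := \arg\max_{v \in A} f(u,v)$, and $R_A^\epsilon(u) := \{v \in A \mid \max_{v' \in A} f(u,v') - f(u,v) \le \epsilon\}$.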
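For a finite candidate set these quantities reduce to simple enumerations. The sketch below assumes a scalar $v$ and a user-supplied callable `f(u, v)`; the function names `phi_A` and `R_A` are illustrative and not part of the original algorithm listing.

```python
def phi_A(f, u, A):
    """Discrete maximum value: max of f(u, v) over the finite set A."""
    return max(f(u, v) for v in A)

def R_A(f, u, A, eps=0.0):
    """Candidates whose value is within eps of the discrete maximum.

    eps = 0 gives the discrete maximum points; eps > 0 gives the
    eps-maximum points.
    """
    best = phi_A(f, u, A)
    return [v for v in A if best - f(u, v) <= eps]

# Usage with a hypothetical bilinear surface f(u, v) = u * v.
f = lambda u, v: u * v
A = [-1.0, 0.0, 1.0]
print(phi_A(f, 0.5, A))         # value of the best candidate
print(R_A(f, 0.5, A))           # exact maximizers within A
print(R_A(f, 0.5, A, eps=0.5))  # eps-approximate maximizers within A
```

Increasing `eps` can only enlarge the returned set, which is the property that Alg. 2 relies on when it averages over several near-optimal candidates.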

We also summarize a few results we will use, which can be found in convex analysis textbooks such as Hiriart-Urruty & Lemaréchal (2001).

###### Definition.

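An $\epsilon$-subgradient of a convex function $\phi$ at $u_0$ is a vector $g$ that satisfies, for all $u$,

$$\phi(u) - \phi(u_0) \ge \langle g, u - u_0 \rangle - \epsilon.$$

The $\epsilon$-subdifferential $\partial_\epsilon \phi(u_0)$ is the set of all $\epsilon$-subgradients at $u_0$.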
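As a small numerical illustration of this definition, the sketch below checks the $\epsilon$-subgradient inequality on a grid for the convex function $\phi(u) = |u|$. The function, the grid, and the helper name `is_eps_subgradient` are assumptions made for the example; a grid check is only a sanity test, not a proof.

```python
import numpy as np

def is_eps_subgradient(phi, u0, g, eps, grid):
    """Check phi(u) - phi(u0) >= g * (u - u0) - eps for every u on the grid."""
    lhs = phi(grid) - phi(u0)
    rhs = g * (grid - u0) - eps
    return bool(np.all(lhs >= rhs - 1e-12))  # small tolerance for rounding

grid = np.linspace(-2.0, 2.0, 401)

# At u0 = 0.1 the exact subgradient of |u| is g = 1.
print(is_eps_subgradient(np.abs, 0.1, 1.0, 0.0, grid))
# The "wrong-side" slope g = -1 is not an exact subgradient at u0 = 0.1 ...
print(is_eps_subgradient(np.abs, 0.1, -1.0, 0.1, grid))
# ... but it is an eps-subgradient once eps >= 0.2, since |u| + u >= 0.
print(is_eps_subgradient(np.abs, 0.1, -1.0, 0.2, grid))
```

The example mirrors the situation near a kink of $\phi$: a gradient taken at a slightly wrong maximizer is still an $\epsilon$-subgradient, with $\epsilon$ shrinking as $u_0$ approaches the kink.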

Consider the convex hull of the set of gradients.

###### Lemma 1 (Corollary 4.3.2, Theorem 4.4.2, Hiriart-Urruty & Lemaréchal (2001)).

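Suppose $f(u,v)$ is convex in $u$ for each $v \in A$. Then $\partial \phi_A(u) = \mathrm{co}\{\nabla_u f(u,v) \mid v \in R_A(u)\}$, where $\mathrm{co}$ denotes the convex hull. Similarly, suppose $f(u,v)$ is convex in $u$ for each $v \in V$. Then $\partial \phi(u) = \mathrm{co}\{\nabla_u f(u,v) \mid v \in R(u)\}$.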

###### Definition.

A point $u$ is called an $\epsilon$-stationary point of $\phi(u)$ if $\max_{v \in R^\epsilon(u)} \langle \nabla_u f(u,v), g \rangle \ge 0$ for all $g$.

###### Lemma 2 (Chap 3.6, Dem’yanov & Malozemov (1974)).

A point $u$ is an $\epsilon$-stationary point of $\phi(u)$ if and only if $0 \in \mathrm{co}\{\nabla_u f(u,v) \mid v \in R^\epsilon(u)\}$.

## 4 Algorithm

The alternating gradient descent method predominantly used in the literature fails when the inner maximization $\max_{v \in V} f(u,v)$ has more than one solution, i.e., $R(u)$ is not a singleton. To address the problem, we propose the $K$-beam method to simultaneously track the maximum points $R(u)$ by keeping the candidate set $A = \{v^1, \dots, v^K\}$ for some large $K$. (The choice of $K$ will be discussed in Analysis and Experiments.) This approach can be exact if the set of maximum points $R(U)$ over the whole domain is finite, as in examples (a), (e), and (f) of Fig. 1 (see Appendix). In other words, the problem becomes a discrete minimax problem. More realistically, $R(U)$ is infinite but $R(u)$ can still be finite for each $u$, as in all the examples of Fig. 1 except (c). At the $i$-th iteration, the $K$-beam method updates the current candidates $A_i = \{v_i^1, \dots, v_i^K\}$ such that the discrete maximum $\max_{v \in A_i} f(u_i, v)$ is a good approximation to the true $\max_{v \in V} f(u_i, v)$. In addition, we present an $\epsilon$-subgradient algorithm that generalizes exact subgradient algorithms.

### 4.1 Details of the algorithm

Alg. 1 is the main algorithm for solving minimax problems. At each iteration, the algorithm alternates between the min step and the max step. In the min step, it approximately minimizes $\phi(u)$ by following a subgradient direction $g$. In the max step, it updates the candidates $A_i$ to track the local maximum points of $f(u_i, \cdot)$ so that the approximate subdifferential $\partial \phi_{A_i}(u_i)$ remains close to the true subdifferential $\partial \phi(u_i)$.

The hyperparameters of the algorithm are the beam size ($K$), the total number of iterations ($N$), the step size schedules $(\rho_i)$ for the min step and $(\eta_i)$ for the max step, and the approximation schedule $(\epsilon_i)$.

Alg. 2 is the subroutine for finding a descent direction. If $\epsilon$=0, this subroutine identifies the best candidate $v$ among the current set $A$ and returns its gradient $\nabla_u f(u,v)$. If $\epsilon > 0$, it finds the $\epsilon$-approximate candidates and returns any direction in the convex hull of their gradients. We make a few remarks below.

• Alternating gradient descent (1) is a special case of the $K$-beam algorithm for $K=1$ and $\epsilon_i = 0$.

• As will be shown in the experiments, the algorithm usually performs better with increasing $K$. However, the increase in computation can be made negligible, since the updates in the max step can be performed in parallel.

• One can use different schemes for the step sizes $\rho_i$ and $\eta_i$. For the purpose of analysis, we use a non-summable but square-summable step size, e.g., $\rho_i = 1/i$. Any decreasing sequence can be used.

• The algorithm uses subgradients since the maximum value $\phi(u)$ is non-differentiable even if $f$ is differentiable, when there is more than one maximum point (Danskin, 1967). In practice, when $\epsilon$ is close to 0, the approximate maximum set $R_A^\epsilon(u)$ in Alg. 2 is often a singleton, in which case the descent direction from Alg. 2 is simply the gradient $\nabla_u f(u,v)$.

• The convergence of the algorithm (Sec. 5) is not affected by the random choice in Alg. 2. In practice, the random choice can help to avoid local minima if $f$ is not convex in $u$.

• Checking the stopping criterion can be non-trivial (see Sec. 5.4), and may be skipped in practice.
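The description of Algs. 1 and 2 above can be turned into a minimal sketch. Everything specific here is an assumption made for illustration: the toy surface $f(u,v) = uv + u^2/2 - (v^2-1)^2$ (convex in $u$, with two local maxima in $v$ near $\pm 1$, and minimax point $u = 0$ by symmetry), the schedules $\rho_i = 0.5/i$ and constant $\eta$, the order of the min and max steps, the absence of projection, and the names `descent_direction` and `k_beam`.

```python
import numpy as np

# Hypothetical toy surface (not from Fig. 1): convex in u, non-concave in v.
def f(u, v):
    return u * v + 0.5 * u**2 - (v**2 - 1.0) ** 2

def grad_u(u, v):
    return v + u

def grad_v(u, v):
    return u - 4.0 * v * (v**2 - 1.0)

def descent_direction(u, A, eps, rng):
    """Sketch of Alg. 2: random point in the convex hull of the gradients
    of the eps-approximate maximizers among the candidates A."""
    vals = f(u, A)
    near_max = A[vals.max() - vals <= eps]
    z = grad_u(u, near_max)
    w = rng.random(len(z)) + 1e-12   # random convex-combination weights
    w /= w.sum()
    return float(w @ z)

def k_beam(u0, A0, n_iter=500, eps=0.0, eta=0.05, seed=0):
    """Sketch of Alg. 1: alternate a subgradient min step on u with
    gradient-ascent max steps on all K candidates."""
    rng = np.random.default_rng(seed)
    u, A = float(u0), np.array(A0, dtype=float)
    for i in range(1, n_iter + 1):
        u = u - (0.5 / i) * descent_direction(u, A, eps, rng)  # min step
        A = A + eta * grad_v(u, A)                             # max step (parallel over k)
    return u, A

u_beam, A_beam = k_beam(0.8, [-1.0, 1.0])   # K = 2
u_alt, A_alt = k_beam(0.8, [1.0])           # K = 1, i.e., alternating GD
print(u_beam, u_alt)
```

In this toy setup the $K=2$ run approaches $u \approx 0$, while the $K=1$ run settles at a negative $u$ where its single candidate is only a local maximum in $v$, which is the failure mode of alternating gradient descent discussed above.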

## 5 Analysis

We analyze the conditions under which Alg. 1 and Alg. 2 find a minimax point. We want the finite set $A_i$ at the $i$-th iteration to approximate the true maximum points $R(u_i)$ well, which we measure by the following two distances. Firstly, we want the following one-sided Hausdorff distance

$$d_H(R(u_i), A_i) := \max_{v \in R(u_i)} \min_{v' \in A_i} \|v - v'\| \tag{2}$$

to be small, i.e., each global maximum $v \in R(u_i)$ is close to at least one candidate in $A_i$. Secondly, we also want the following one-sided Hausdorff distance

$$d_H(A_i, S(u_i)) := \max_{v' \in A_i} \min_{v \in S(u_i)} \|v - v'\| \tag{3}$$

to be small, where $S(u_i)$ is the set of local maxima, i.e., each candidate $v' \in A_i$ is close to at least one local maximum $v \in S(u_i)$. This requires that $K$ is at least as large as $|R(u_i)|$.
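For finite sets, the two distances (2) and (3) are the same function applied with its arguments swapped. A minimal sketch, assuming points are given as lists of scalars or vectors (the helper name `one_sided_hausdorff` is illustrative):

```python
import numpy as np

def one_sided_hausdorff(X, Y):
    """max over x in X of the distance from x to its nearest point in Y."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return float(dists.min(axis=1).max())

# Two global maxima but candidates clustered near only one of them:
R = [-1.0, 1.0]        # hypothetical true maximum points
A = [0.9, 0.95]        # hypothetical candidate set
print(one_sided_hausdorff(R, A))  # large: the maximum at -1 is not covered
print(one_sided_hausdorff(A, R))  # small: every candidate is near a maximum
```

The asymmetry is the reason both distances are needed: the first detects uncovered maxima, the second detects stray candidates.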

We discuss the consequences of these requirements more precisely in the rest of the section. For the purpose of analysis, we will make the following additional assumptions.

Assumptions. $\phi(u)$ is convex and achieves the minimum $\phi^*$. Also, $f(u,v)$ is $l$-Lipschitz in $v$ for all $u$, and $\nabla_u f(u,v)$ is $r$-Lipschitz in $v$ for all $u$.

Remark on the assumptions. Note that we only assume the convexity of $f$ over $u$ and not the concavity over $v$, which makes this setting more general than that of classic analyses which assume the concavity over $v$, or that of restricted models with a bilinear coupling between $u$ and $v$. While we allow $f$ to be non-concave in $v$ and to have multiple local maxima, we also require $f$ and $\nabla_u f$ to be Lipschitz in $v$ for the purpose of analysis.

### 5.1 Finite R(u), exact max step

If $R(u)$ is finite for each $u$, and if the maximization in the max step can be done exactly, as assumed in the Salmon-Daraban method (Salmon, 1968), then the problem is no more difficult than a discrete minimax problem.

###### Lemma 3.

Suppose $R(u)$ is finite at $u$. If $R(u) \subseteq A$, then $R_A(u) = R(u)$ and therefore $\partial \phi_A(u) = \partial \phi(u)$.

Since the subdifferential is exact, Alg. 1 finds a minimax solution, as does the subgradient-descent method with the true $\phi(u)$. We omit the proof and present a more general theorem shortly.

### 5.2 Finite R(u), inexact max step

Exact maximization in each max step is unrealistic, unless $\max_{v \in V} f(u,v)$ can be solved in closed form. Therefore we consider what happens to the convergence of the algorithm with an approximate max step. If $d_H(R(u_0), A) \le \delta$ and $d_H(A, S(u_0)) \le \delta$ for some $\delta$, how close are $\partial \phi_A$ and $\partial \phi$ in the vicinity of $u_0$? The following lemmas answer this question. (See Appendix for a visual aid.) From the smoothness assumptions on $f$, we have

###### Lemma 4.

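If $d_H(R(u_0), A) \le \delta$, then for each $v \in R(u_0)$ there is one or more $v' \in A$ such that $|f(u_0, v) - f(u_0, v')| \le l\delta$ and $\|\nabla_u f(u_0, v) - \nabla_u f(u_0, v')\| \le r\delta$.

The following lemma shows that if $A$ approximates $R(u_0)$ well, then any $v' \in R_A^\epsilon(u_0)$ chosen by Alg. 2 is not far from a true maximum $v \in R(u_0)$.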

###### Lemma 5.

Assume and are both finite at . Let be the smallest gap between the global and the non-global maximum values at . If all local maxima are global maxima, then set . If and where , then for each , there is such that .

Furthermore, the subgradients at the approximate maximum points are close to the subgradients at the true maximum points.

###### Lemma 6.

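Suppose the conditions of Lemma 5 hold and $U$ is bounded: $\|u\| \le B$ for all $u \in U$. Then any $z' \in \mathrm{co}\{\nabla_u f(u_0, v') \mid v' \in R_A^\epsilon(u_0)\}$ is a $2r\delta B$-subgradient of $\phi(u_0)$.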

Now we state our main theorem that if the max step is accurate enough for a large in terms of (a property of ) and (chosen by a user), then the algorithm finds the minimum value using a step size .

###### Theorem 7.

Suppose the conditions of Lemmas 4, 5, and 6 hold, and also suppose the max step in Alg. 1 is accurate for all sufficiently large $i$, so that the accuracy condition of those lemmas holds at $u_i$ with a tolerance controlled by some non-negative sequence $(\xi_i)$. If the step size satisfies $\rho_i \ge 0$, $\sum_i \rho_i = \infty$, $\sum_i \rho_i^2 < \infty$, and $\sum_i \rho_i \xi_i < \infty$, then the algorithm converges to the minimum value $\phi^*$.

For $\rho_i$ and $\xi_i$ we can also use $1/i$. The $\epsilon_i$ can be any non-negative value. A large $\epsilon_i$ can make each min step better, since the descent direction in Alg. 2 uses more candidates and therefore is more robust. The price to pay is that it may take more iterations for the max step to meet the accuracy condition.

### 5.3 Infinite R(u)

Infinite $R(u)$ is the most challenging case. We only mention the accuracy of approximating $R(u)$ with a finite and fixed $A$, as in the grid methods of Dem’yanov & Malozemov (1971, 1974).

###### Lemma 8.

For any $\epsilon > 0$, one can choose a fixed $A$ such that $\phi(u) - \phi_A(u) \le \epsilon$ holds for all $u \in U$. Furthermore, if $\hat{u}$ is the minimizer of the approximation $\phi_A$, then $\phi(\hat{u}) - \phi^* \le \epsilon$.

If $A$ is dense enough, the solution can be made arbitrarily accurate, but the corresponding $K$ can be too large and has to be limited in practice.
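The trade-off between grid density and accuracy can be seen in a minimal sketch. The surface $f(u,v) = -(v-u)^2$ on $U = V = [-1,1]$ is a hypothetical example for which $\phi(u) = 0$ exactly, so the approximation gap $\phi(u) - \phi_A(u)$ is just the squared distance from $u$ to the nearest grid point; the names `phi_grid` and `approximation_gap` are illustrative.

```python
import numpy as np

def f(u, v):
    # Hypothetical surface with R(U) infinite: the maximizer is v = u.
    return -(v - u) ** 2

def phi_grid(u, K):
    """phi_A(u) for a fixed, evenly spaced candidate grid A with K points."""
    A = np.linspace(-1.0, 1.0, K)
    return float(np.max(f(u, A)))

def approximation_gap(u, K):
    """phi(u) - phi_A(u); here phi(u) = 0 for every u in [-1, 1]."""
    return 0.0 - phi_grid(u, K)

for K in (11, 101, 1001):
    print(K, approximation_gap(0.333, K))
```

For this example the gap is bounded by $(h/2)^2$ with grid spacing $h = 2/(K-1)$, so a tenfold finer grid buys roughly a hundredfold smaller gap, at the cost of tenfold more candidates to evaluate.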

### 5.4 Optional stopping criteria

The function $\phi(u)$ is non-smooth and its gradient need not vanish at the minimum, causing oscillations. A stopping criterion can help to terminate early. We can stop at an $\epsilon$-stationary point of $\phi$ by checking if $0 \in \mathrm{co}\{\nabla_u f(u,v) \mid v \in R^\epsilon(u)\}$ from Lemma 2. Algorithmically, this check is done by solving an LP or a QP problem (Dem’janov, 1968). The stopping criterion presented in Alg. 2 is a necessary condition for the approximate stationarity of $\phi$:

###### Lemma 9.

Let where is the Lipschitz coefficient of in . If is an -stationary point of , then is an -stationary point of .

The size of the QP problem is which is small for , but it can be costly to solve at every iteration. It is therefore more practical to stop after a maximum number of iterations or by checking the stopping criterion only every so often.
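One possible way to carry out the convex-hull check without a QP solver is to estimate the minimum-norm point of the hull iteratively. The sketch below substitutes a plain Frank-Wolfe iteration with exact line search for the LP/QP mentioned above; this is a swapped-in technique for illustration, and the names `min_norm_in_hull` and `contains_zero` as well as the tolerance are assumptions.

```python
import numpy as np

def min_norm_in_hull(Z, n_iter=2000):
    """Frank-Wolfe estimate of the minimum-norm point in co{rows of Z}."""
    Z = np.asarray(Z, dtype=float)
    x = Z[0].copy()
    for _ in range(n_iter):
        k = int(np.argmin(Z @ x))   # vertex minimizing the linearized objective
        d = Z[k] - x
        dd = float(d @ d)
        if dd == 0.0:
            break                   # already at the selected vertex
        gamma = float(np.clip(-(x @ d) / dd, 0.0, 1.0))  # exact line search
        if gamma == 0.0:
            break                   # no further decrease along this direction
        x = x + gamma * d
    return x

def contains_zero(Z, tol=1e-6):
    """Approximate test of 0 in co{rows of Z}."""
    return bool(np.linalg.norm(min_norm_in_hull(Z)) <= tol)

# Rows are hypothetical gradients of the eps-approximate candidates.
print(contains_zero([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]))  # hull contains 0
print(contains_zero([[1.0, 1.0], [2.0, 1.0]]))                # hull excludes 0
```

Frank-Wolfe can converge slowly when the origin lies on the boundary of the hull, so a dedicated QP solver may be preferable when the check has to be tight.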

## 6 Experiments

### 6.1 Simple surfaces

We test the proposed algorithm to find minimax points of the simple surfaces in Fig. 1. We compare Alternating Gradient Descent (Alt-GD), and the proposed -beam algorithm with . Note that for , the minimax algorithm is basically the same as Alt-GD. Since the domain is constrained to , we use the projected gradient at each step with the common learning rate of . In our preliminary tests, the value of in Alg. 1 did not critically affect the results, and we report the case for all subsequent tests. The experiments are repeated for 100 trials with random initial conditions.

Fig. 2 shows the convergence of Alt-GD and $K$-beam after 200 iterations, measured by the distance of the current solution to the closest point in the set of minimax solutions. We plot the average and the confidence level of the 100 trials. All methods converge well for surfaces (a) and (b). The surface (c) is more difficult. Although the origin is a saddle point, it is unstable, as it has no open neighborhood in which $f$ has a local minimum in $u$ and a local maximum in $v$. For the non-saddle point problems (d)-(e), one can see that Alt-GD simply cannot find the true solution, whereas $K$-beam can find the solution if $K$ is large enough. For the anti-saddle (e), $K=2$ is the smallest number to find the solution, since the number of local maximum points is at most 2. However, the concavity-convexity of $f$ (instead of convexity-concavity) makes optimization difficult, and therefore a larger $K$ helps to recover from bad random initial points and find the solution.

### 6.2 GAN training with MoG

We train GANs with the proposed algorithm to learn a generative model of two-dimensional mixtures of Gaussians (MoGs). Let be a sample from the MoG with the density ,

and be a sample from the 256-dimensional Gaussian distribution . The optimization problem is

$$\min_u \max_v \; E\left[\log D(x; v) + \log\left(1 - D(G(z; u); v)\right)\right],$$

where and are generator and discriminator networks respectively. Both and are two-layer tanh networks with 128 hidden units per layer, trained with Adam optimizer with batch size 128 and the learning rate of for the discriminator and for the generator.

For evaluation, we measure the Jensen-Shannon divergence

$$\mathrm{JSD} = \frac{1}{2} KL\left(P, \frac{P+Q}{2}\right) + \frac{1}{2} KL\left(Q, \frac{P+Q}{2}\right)$$

between the true MoG and the samples from the generator. We measure the divergence by discretizing the 2D region into bins and comparing the histograms of 64,000 random samples from the generator and 640,000 samples from the MoG. The top row of Fig. 3 shows the JSD curves of $K$-beam with $K$=1,2,5,10. Alt-GD performs nearly the same as $K$=1 and is omitted. The results are from 10 trials with random initialization. Note first that GAN training is sensitive, in that each trial curve is jagged and often falls into “mode collapse”, where there is a jump in the curve. With increasing $K$, the curve converges faster on average and is more stable, as evidenced by the shrinking variance. The bottom row of Fig. 3 shows the corresponding samples from the generators after 10,000, 20,000, and 50,000 iterations from all 10 trials. The generated samples are also qualitatively better with increasing $K$.
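A histogram-based JSD estimate of this kind can be sketched as follows. The number of bins, the extent of the 2D region, and the function name `jsd_from_samples` are assumptions chosen for the example and are not the settings used in the experiments.

```python
import numpy as np

def jsd_from_samples(x, y, bins=20, extent=((-1.0, 1.0), (-1.0, 1.0))):
    """Jensen-Shannon divergence (in nats) between 2D histograms of x and y.

    x, y: arrays of shape (n, 2). bins and extent are illustrative defaults.
    """
    P, _, _ = np.histogram2d(x[:, 0], x[:, 1], bins=bins, range=extent)
    Q, _, _ = np.histogram2d(y[:, 0], y[:, 1], bins=bins, range=extent)
    P, Q = P / P.sum(), Q / Q.sum()
    M = 0.5 * (P + Q)

    def kl(a, b):
        mask = a > 0                 # terms with a = 0 contribute zero
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 0.0, size=(1000, 2))  # samples in the lower-left quadrant
y = rng.uniform(0.0, 1.0, size=(1000, 2))   # samples in the upper-right quadrant
print(jsd_from_samples(x, x))  # identical histograms
print(jsd_from_samples(x, y))  # disjoint supports: the maximum value log(2)
```

With natural logarithms the estimate lies between 0 and $\log 2$; the histogram resolution controls how sensitive it is to missing modes.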

Additionally, we measure the runtime of the algorithms by wall clock on the same system using a single NVIDIA GTX980 4GB GPU with a single Intel Core i7-2600 CPU. Even on a single GPU, the runtime per iteration increases only sublinearly in $K$: relative to the time required for $K$=1, we get 1.07 ($K$=2), 1.63 ($K$=5), and 2.26 ($K$=10). Since the advantages are clear and the additional time is small, there is a strong motivation to use the proposed method instead of Alt-GD.

### 6.3 Unsupervised domain adaptation

We perform experiments on unsupervised domain adaptation (Ganin & Lempitsky, 2015), which is another example of minimax problems. In domain adaptation, it is assumed that two data sets belonging to different domains share the same structure. For example, MNIST and MNIST-M are both images of handwritten digits 0–9, but MNIST-M is in color and has random background patches. Not surprisingly, the classifier trained on MNIST does not perform well with digits from MNIST-M out of the box. Unsupervised domain adaptation tries to learn a common transformation of the domains into another representation/features such that the distributions of the two domains are as similar as possible while preserving the digit class information. The discriminator tries to predict the domain accurately, and the target classifier tries to predict the label correctly. The optimization problem can be rewritten as a minimax problem with

$$f(u,v) = -E\left[D_1(G(x; u'); v)\right] + \lambda E\left[D_2(G(x; u'); w)\right],$$

which is the weighted difference of the expected risks of the domain classifier and the digit classifier . This form of minimax problem has also been proposed earlier by Hamm (2015, 2017) to remove sensitive information from data. In this experiment, we show domain adaptation results. The transformer is a two-layer ReLU convolutional network that maps the input features (=images) to an internal representation of dim=2352. The discriminator is a single-layer ReLU dense network of 100 hidden units, and the digit classifier is a two-layer ReLU dense network of 100 hidden units. All networks are trained with the momentum optimizer with the batch size of 128 and the learning rate of . The experiments are repeated for 10 trials with random initialization. We use .

We performed the task of predicting the class of MNIST-M digits, trained using labeled examples of MNIST and unlabeled examples of MNIST-M. Fig. 4 shows the classification error of in-domain (top row) and cross-domain (bottom row) prediction tasks as a function of iterations. Again we omit the result of Alt-GD, as it performs nearly the same as $K$=1. With small $K$, the average error is high for both in-domain and cross-domain tests, due to failed optimization, which can be observed in the traces of the trials. As $K$ increases, instability disappears and both in-domain and cross-domain errors converge to their lowest values.

### 6.4 Summary and discussions

• Experiments with 2D surfaces clearly show that the alternating gradient-descent method can fail completely when the minimax points are not local saddle points, while the $K$-beam method can find the true solutions.

• For GAN and domain adaptation problems involving nonlinear neural networks, the $K$-beam method and Alt-GD can both find good solutions if they converge. The key difference is that the $K$-beam method consistently converges to a good solution, whereas Alt-GD finds the solution only rarely (these are the bottom yellow curves for $K$=1 in Fig. 3 and Fig. 4). Similar results can be observed in the GAN-MNIST experiments in the Appendix.

• The true value of $K$ cannot be computed analytically for nontrivial functions. However, an overestimated $K$ does not hurt the performance theoretically; it is only redundant. On the other hand, an underestimated $K$ can be suboptimal but is still better than $K$=1. Therefore, in practice, one can choose as large a number as allowed by resource limits, such as $K$=5 or 10.

• The $K$-beam method is different from running Alt-GD for $K$-times more iterations, since the instability of Alt-GD hinders convergence regardless of the total number of iterations. The $K$-beam method is also different from $K$ parallel independent runs of Alt-GD, which are basically the figures for $K$=1 in Fig. 3 and Fig. 4, but with $K$-times more trials. The variance will be reduced but the average curve will remain similar.

## 7 Conclusions

In this paper, we propose the $K$-beam subgradient descent algorithm to solve continuous minimax problems that appear frequently in machine learning. While simple to implement, the proposed algorithm can significantly improve the convergence of optimization compared to the alternating gradient descent approach, as demonstrated by synthetic and real-world examples. We analyze the conditions for convergence without assuming concavity or bilinearity, which we believe is the first such result in the literature. There are open questions regarding possible relaxations of the assumptions used, which are left for future work.

## References

• Arjovsky et al. (2017) Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
• Arrow et al. (1958) Arrow, Kenneth Joseph, Hurwicz, Leonid, Uzawa, Hirofumi, and Chenery, Hollis Burnley. Studies in linear and non-linear programming. Stanford University Press, 1958.
• Boyd et al. (2003) Boyd, Stephen, Xiao, Lin, and Mutapcic, Almir. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter, 2003.
• Brückner & Scheffer (2011) Brückner, Michael and Scheffer, Tobias. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 547–555. ACM, 2011.
• Chambolle & Pock (2011) Chambolle, Antonin and Pock, Thomas. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40(1):120–145, 2011.
• Correa & Lemaréchal (1993) Correa, Rafael and Lemaréchal, Claude. Convergence of some algorithms for convex minimization. Mathematical Programming, 62(1):261–275, 1993.
• Danskin (1967) Danskin, John M. The theory of max-min and its application to weapons allocation problems. Springer, 1967.
• Dem’janov (1968) Dem’janov, Vladimir F. Algorithms for some minimax problems. Journal of Computer and System Sciences, 2(4):342–380, 1968.
• Dem’yanov & Malozemov (1971) Dem’yanov, Vladimir Fedorovich and Malozemov, Vassili Nikolaevich. On the theory of non-linear minimax problems. Russian Mathematical Surveys, 26(3):57–115, 1971.
• Dem’yanov & Malozemov (1974) Dem’yanov, Vladimir Fedorovich and Malozemov, Vassili Nikolaevich. Introduction to minimax. John Wiley & Sons, 1974.
• Dem’yanov & Pevnyi (1972) Dem’yanov, Vladimir Fedorovich and Pevnyi, Aleksandr Borisovich. Numerical methods for finding saddle points. USSR Computational Mathematics and Mathematical Physics, 12(5):11–52, 1972.
• Durugkar et al. (2016) Durugkar, Ishan, Gemp, Ian, and Mahadevan, Sridhar. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
• Edwards & Storkey (2015) Edwards, Harrison and Storkey, Amos. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015.
• Ganin & Lempitsky (2015) Ganin, Yaroslav and Lempitsky, Victor. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1180–1189, 2015.
• Globerson & Roweis (2006) Globerson, Amir and Roweis, Sam. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd international conference on Machine learning, pp. 353–360. ACM, 2006.
• Golshtein (1972) Golshtein, EG. Generalized gradient method for finding saddlepoints. Matekon, 10(3):36–52, 1972.
• Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
• Hamm (2015) Hamm, Jihun. Preserving privacy of continuous high-dimensional data with minimax filters. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 324–332, 2015.
• Hamm (2017) Hamm, Jihun. Minimax filter: learning to preserve privacy from inference attacks. The Journal of Machine Learning Research, 18(1):4704–4734, 2017.
• He & Yuan (2012) He, Bingsheng and Yuan, Xiaoming. Convergence analysis of primal-dual algorithms for a saddle-point problem: From contraction perspective. SIAM Journal on Imaging Sciences, 5(1):119–149, 2012.
• Hiriart-Urruty & Lemaréchal (2001) Hiriart-Urruty, Jean-Baptiste and Lemaréchal, Claude. Fundamentals of convex analysis. Springer, 2001.
• Maistroskii (1977) Maistroskii, D. Gradient methods for finding saddle points. Matekon, 14(1):3–22, 1977.
• Mescheder et al. (2017) Mescheder, Lars, Nowozin, Sebastian, and Geiger, Andreas. The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1823–1833, 2017.
• Nagarajan & Kolter (2017) Nagarajan, Vaishnavh and Kolter, J Zico. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5591–5600, 2017.
• Nedić & Ozdaglar (2009) Nedić, Angelia and Ozdaglar, Asuman. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.
• Nowozin et al. (2016) Nowozin, Sebastian, Cseke, Botond, and Tomioka, Ryota. f-gan: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
• Roth et al. (2017) Roth, Kevin, Lucchi, Aurelien, Nowozin, Sebastian, and Hofmann, Thomas. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pp. 2015–2025, 2017.
• Saatci & Wilson (2017) Saatci, Yunus and Wilson, Andrew G. Bayesian gan. In Advances in neural information processing systems, pp. 3622–3631, 2017.
• Salmon (1968) Salmon, D. Minimax controller design. IEEE Transactions on Automatic Control, 13(4):369–376, 1968.
• Uehara et al. (2016) Uehara, Masatoshi, Sato, Issei, Suzuki, Masahiro, Nakayama, Kotaro, and Matsuo, Yutaka. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
• Uzawa (1958) Uzawa, H. Iterative methods in concave programming. In Arrow, KJ, Hurwicz, L, and Uzawa, H (eds.), Studies in linear and non-linear programming, chapter 10, pp. 154–165. Stanford University Press, Stanford, CA, 1958.
• v. Neumann (1928) v. Neumann, J. Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1):295–320, 1928.
• Zabotin (1988) Zabotin, IY. A subgradient method for finding a saddle point of a convex-concave function. Issled. Prikl. Mat, 15:6–12, 1988.

## Appendix A Simple surfaces

Fig.1 shows the six surfaces and the maximum value function . From one can check the minima are:
(a) , (b) , (c) , (d) , (e) , and (f) .
The corresponding maxima at the minimum are:
(a) , (b) , (c) , (d) , (e) , and (f) .

Furthermore, for the whole domain is:
(a) , (b) , (c) except for , (d) , (e) , and (f) . These can be verified by solving the minimax problems in closed form.

Note that the origin is a critical point for all surfaces. It is also a global saddle point and minimax point for surfaces (a)-(c), but is neither a saddle nor a minimax point for surfaces (d)-(f).

## Appendix B Proofs

###### Lemma 1 (Corollary 4.3.2, Theorem 4.4.2, (Hiriart-Urruty & Lemaréchal, 2001)).

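Suppose $f(u,v)$ is convex in $u$ for each $v \in A$. Then $\partial \phi_A(u) = \mathrm{co}\{\nabla_u f(u,v) \mid v \in R_A(u)\}$, where $\mathrm{co}$ denotes the convex hull. Similarly, suppose $f(u,v)$ is convex in $u$ for each $v \in V$. Then $\partial \phi(u) = \mathrm{co}\{\nabla_u f(u,v) \mid v \in R(u)\}$.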

###### Lemma 2 (Chap 3.6, (Dem’yanov & Malozemov, 1974)).

A point $u$ is an $\epsilon$-stationary point of $\phi(u)$ if and only if $0 \in \mathrm{co}\{\nabla_u f(u,v) \mid v \in R^\epsilon(u)\}$.

###### Lemma 3.

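Suppose $R(u)$ is finite at $u$. If $R(u) \subseteq A$, then $R_A(u) = R(u)$ and therefore $\partial \phi_A(u) = \partial \phi(u)$.

###### Proof.

Since $R(u) \subseteq A \subseteq V$, we have $\phi_A(u) = \phi(u)$. Therefore, for each $v \in R(u)$, $f(u,v) = \phi(u) = \phi_A(u)$, so $v \in R_A(u)$. Conversely, if $v \in R_A(u)$ then $f(u,v) = \phi_A(u) = \phi(u)$, so $v \in R(u)$. The remainder of the lemma follows from the definition of subdifferentials. ∎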

Fig. 6 explains several symbols used in the following lemmas.

###### Lemma 4.

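If $d_H(R(u_0), A) \le \delta$, then for each $v \in R(u_0)$ there is one or more $v' \in A$ such that $|f(u_0, v) - f(u_0, v')| \le l\delta$ and $\|\nabla_u f(u_0, v) - \nabla_u f(u_0, v')\| \le r\delta$.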

The proof follows directly from the Lipschitz assumptions.

###### Lemma 5.

Assume and are both finite at . Let be the smallest gap between the global and the non-global maximum values at . If all local maxima are global maxima, then set . If and where , then for each , there is such that .

###### Proof.

Let any be -close to a global maximum, then . Similarly, let any be -close to a non-global maximum, then . Consequently, , i.e., any and are separated by at least . Therefore, each satisfies but no satisfies . ∎

###### Lemma 6.

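Suppose the conditions of Lemma 5 hold and $U$ is bounded: $\|u\| \le B$ for all $u \in U$. Then any $z' \in \mathrm{co}\{\nabla_u f(u_0, v') \mid v' \in R_A^\epsilon(u_0)\}$ is a $2r\delta B$-subgradient of $\phi(u_0)$.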

###### Proof.

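From Lemmas 4 and 5, for each $v'_k \in R_A^\epsilon(u_0)$, there is $v_k \in R(u_0)$ such that $\|\nabla_u f(u_0, v_k) - \nabla_u f(u_0, v'_k)\| \le r\delta$. Let $z_k = \nabla_u f(u_0, v_k)$ and $z'_k = \nabla_u f(u_0, v'_k)$. Since $z_k$ is a subgradient of $\phi$ at $u_0$ (Lemma 1), for all $u \in U$ and for all $k$,

$$\begin{aligned}
\phi(u) - \phi(u_0) - \langle z'_k, u - u_0 \rangle
&= \phi(u) - \phi(u_0) - \langle z_k + z'_k - z_k, u - u_0 \rangle \\
&\ge -\langle z'_k - z_k, u - u_0 \rangle \\
&\ge -\|z'_k - z_k\| \, \|u - u_0\| \\
&\ge -r\delta \|u - u_0\| \ge -2r\delta B.
\end{aligned}$$

By taking any convex combination of the $z'_k$ with weights $a_k$ on both sides, we have

$$\phi(u) - \phi(u_0) - \left\langle \sum_{k=1}^{n} a_k z'_k, \, u - u_0 \right\rangle \ge -2r\delta B,$$

and therefore any $z' \in \mathrm{co}\{z'_1, \dots, z'_n\}$ is a $2r\delta B$-subgradient of $\phi(u_0)$. ∎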

###### Theorem 7.

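Suppose the conditions of Lemmas 4, 5, and 6 hold, and also suppose the max step in Alg. 1 is accurate for all sufficiently large $i$, so that the accuracy condition of those lemmas holds at $u_i$ with a tolerance controlled by some non-negative sequence $(\xi_i)$. If the step size satisfies $\rho_i \ge 0$, $\sum_i \rho_i = \infty$, $\sum_i \rho_i^2 < \infty$, and $\sum_i \rho_i \xi_i < \infty$, then the algorithm converges to the minimum value $\phi^*$.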

Note that a stronger result such as is possible (see, e.g., (Correa & Lemaréchal, 1993)), but we give a simpler proof similar to (Boyd et al., 2003) which assumes for some .

###### Proof.

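We combine the previous lemmas with the standard proof of the $\epsilon$-subgradient descent method. Let $u^*$ be a minimizer of $\phi$ and let $g_i$ be the $\xi_i$-subgradient used in the update $u_{i+1} = u_i - \rho_i g_i$. Then,

$$\begin{aligned}
\|u_{i+1} - u^*\|^2 &= \|u_i - u^*\|^2 + \rho_i^2 \|g_i\|^2 + 2\rho_i \langle g_i, u^* - u_i \rangle \\
&\le \|u_i - u^*\|^2 + \rho_i^2 \|g_i\|^2 + 2\rho_i \left(\phi(u^*) - \phi(u_i) + \xi_i\right)
\end{aligned}$$

from the definition of $\xi_i$-subgradients. Taking $\sum_{i=1}^{N}$ on both sides gives us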

 ∥uN+1−u∗∥2 ≤ ∥u1−u∗∥2+N∑i=1ρ2i