A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems
Abstract
Nonconvex-concave min-max problems arise in many machine learning applications, including minimizing a pointwise maximum of a set of nonconvex functions and robust adversarial training of neural networks. A popular approach to solve this problem is the gradient descent-ascent (GDA) algorithm, which unfortunately can exhibit oscillation in the case of nonconvexity. In this paper, we introduce a "smoothing" scheme which can be combined with GDA to stabilize the oscillation and ensure convergence to a stationary solution. We prove that the stabilized GDA algorithm can achieve an $\mathcal{O}(\epsilon^{-2})$ iteration complexity for minimizing the pointwise maximum of a finite collection of nonconvex functions. Moreover, the smoothed GDA algorithm achieves an $\mathcal{O}(\epsilon^{-4})$ iteration complexity for general nonconvex-concave problems. Extensions of this stabilized GDA algorithm to multi-block cases are presented. To the best of our knowledge, this is the first algorithm to achieve an $\mathcal{O}(\epsilon^{-2})$ iteration complexity for this class of nonconvex-concave problems. We illustrate the practical efficiency of the stabilized GDA algorithm on robust training.
1 Introduction
Min-max problems have drawn considerable interest from the machine learning and other engineering communities. They appear in applications such as adversarial learning [18, 1, 30], robust optimization [2, 10, 35, 36], empirical risk minimization [56, 45], and reinforcement learning [11, 8]. Concretely speaking, a min-max problem is of the form:
(1.1) $\min_{x \in \mathcal{X}} \; \max_{y \in \mathcal{Y}} \; f(x, y),$
where $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} \subseteq \mathbb{R}^m$ are convex and closed sets and $f$ is a smooth function. In the literature, the convex-concave min-max problem, where $f$ is convex in $x$ and concave in $y$, is well studied [37, 39, 34, 42, 16, 31, 19, 33]. However, many practical applications involve nonconvexity, and this is the focus of the current paper. Unlike the convex-concave setting, where a global saddle point can be computed efficiently, obtaining a global optimal solution is difficult when $f$ is nonconvex with respect to $x$.
In this paper, we consider the nonconvex-concave min-max problem (1.1), where $f$ is nonconvex in $x$ but concave in $y$, as well as a special case of the following form:
(1.2) $\min_{x \in \mathbb{R}^n} \; \max_{y \in \mathcal{Y}} \; y^\top g(x) \;=\; \min_{x \in \mathbb{R}^n} \; \max_{y \in \mathcal{Y}} \; \sum_{i=1}^{m} y_i \, g_i(x),$
where $\mathcal{Y} = \{y \in \mathbb{R}^m : \sum_{i=1}^m y_i = 1, \; y_i \ge 0\}$ is a probability simplex and $g = (g_1, \dots, g_m)$ is a smooth map from $\mathbb{R}^n$ to $\mathbb{R}^m$. Note that (1.2) is equivalent to the problem of minimizing the pointwise maximum of a finite collection of functions:
(1.3) $\min_{x \in \mathbb{R}^n} \; \max_{1 \le i \le m} \; g_i(x).$
If $g_i(x)$ is a loss function or a negative utility function at a data point $i$, then problem (1.3) finds the best parameter for the worst data points. This formulation is frequently used in machine learning and other fields. For example, adversarial training [40, 30], fairness training [40] and distribution-agnostic meta-learning [7] can be formulated as (1.3). We will discuss the formulations of these applications in detail in Section 2.
Table 1: Comparison with existing algorithms for nonconvex-concave min-max problems (iteration complexity to reach an $\epsilon$-stationary solution).

Algorithm   | Complexity                              | Simplicity  | Multi-block
[41]        | $\tilde{\mathcal{O}}(\epsilon^{-2.5})$  | Triple-loop | ✗
[27]        | $\tilde{\mathcal{O}}(\epsilon^{-2.5})$  | Triple-loop | ✗
[40]        | $\tilde{\mathcal{O}}(\epsilon^{-3.5})$  | Double-loop | ✗
[29]        | $\mathcal{O}(\epsilon^{-4})$            | Single-loop | ✔
This paper  | $\mathcal{O}(\epsilon^{-2})$            | Single-loop | ✔
Recently, various algorithms have been proposed for nonconvex-concave min-max problems [46, 22, 44, 40, 26, 29, 41, 27]. These algorithms can be classified into three types based on their structure: single-loop, double-loop and triple-loop. Here a single-loop algorithm is an iterative algorithm where each iteration step has a closed-form update, while a double-loop algorithm uses an iterative algorithm to approximately solve a subproblem at each iteration. A triple-loop algorithm uses a double-loop algorithm to approximately solve a subproblem at every iteration. To find an $\epsilon$-stationary solution, double-loop and triple-loop algorithms have two main drawbacks. First, these existing multi-loop algorithms require at least $\mathcal{O}(\epsilon^{-2})$ outer iterations, while the iteration numbers of the other inner loop(s) also depend on $\epsilon$. Thus, the iteration complexity of the existing multi-loop algorithms is strictly more than $\mathcal{O}(\epsilon^{-2})$ for (1.2). Among all the existing algorithms, the best known iteration complexity is $\tilde{\mathcal{O}}(\epsilon^{-2.5})$, from two triple-loop algorithms [41, 27]. Since the best known lower bound for solving (1.2) using first-order algorithms is $\Omega(\epsilon^{-2})$, there is a gap between the existing upper bounds and the lower bound. Another drawback of multi-loop algorithms is their difficulty in solving problems with multi-block structure, since the acceleration steps used in their inner loops cannot be easily extended to multi-block cases, and a standard double-loop algorithm without acceleration can be very slow. This is unfortunate because min-max problems with block structure are important for distributed training [29] in machine learning and signal processing.
Due to the aforementioned two drawbacks of double-loop and triple-loop algorithms, we focus in this paper on single-loop algorithms, in the hope of achieving the optimal iteration complexity for the nonconvex-concave problem (1.2). Notice that the nonconvex-concave applications in the aforementioned studies [46, 22, 44, 40, 26, 29, 41, 27] can all be formulated as (1.2), although the iteration complexity results derived in these papers are only for general nonconvex-concave problems. In other words, the structure of (1.2) is not exploited in their theoretical analyses. One natural question to ask is: can we design a single-loop algorithm with an iteration complexity lower than $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ for the min-max problem (1.2)?
Existing single-loop algorithms. A simple single-loop algorithm is the so-called gradient descent ascent (GDA), which alternately performs gradient descent on the minimization problem and gradient ascent on the maximization problem. GDA can generate an $\epsilon$-stationary solution of a nonconvex-strongly-concave problem with iteration complexity $\mathcal{O}(\epsilon^{-2})$ [26]. However, with constant stepsizes, GDA will oscillate around the solution if the maximization problem is not strongly concave [33], so the stepsize should be proportional to $\epsilon$ if we want an $\epsilon$-stationary solution. These limitations slow GDA down to an $\mathcal{O}(\epsilon^{-6})$ iteration complexity for nonconvex-concave problems. Another single-loop algorithm [29] requires diminishing stepsizes to guarantee convergence, and its complexity is $\mathcal{O}(\epsilon^{-4})$. [52] also proposes a single-loop algorithm for min-max problems, performing GDA on a regularized version of the original min-max problem with a diminishing regularization term. The iteration complexity bounds given in the references [29, 26, 52] are worse than the ones from multi-loop algorithms that use acceleration in the subproblems.
In this paper, we propose a single-loop "smoothed gradient descent-ascent" algorithm with optimal iteration complexity for the nonconvex-concave problem (1.2). Inspired by [55], to fix the oscillation issue of GDA discussed above, we introduce an exponentially weighted average $z_t$ of the primal iterates $x_t$ and add a quadratic proximal term centered at $z_t$ to the objective function. We then perform a GDA step on this proximal objective instead of the original one. With this smoothing technique, an $\mathcal{O}(\epsilon^{-2})$ iteration complexity can be achieved for problem (1.2) under mild assumptions. Our contributions are threefold.

Optimal order in convergence rate. We propose a single-loop algorithm, Smoothed GDA, for nonconvex-concave problems, which finds an $\epsilon$-stationary solution of problem (1.2) within $\mathcal{O}(\epsilon^{-2})$ iterations under mild assumptions.

General convergence results. The Smoothed GDA algorithm can also be applied to general nonconvex-concave problems, with an $\mathcal{O}(\epsilon^{-4})$ iteration complexity. This complexity is the same as in [29]. However, our algorithm does not require compactness of the domain $\mathcal{X}$, which significantly extends its applicability.

Multi-block settings. We extend the Smoothed GDA algorithm to the multi-block setting and give the same convergence guarantees as in the one-block case.
The paper is organized as follows. In Section 2, we describe some applications of the nonconvex-concave problems (1.2) and (1.3). The details of the Smoothed GDA algorithm as well as the main theoretical results are given in Section 3. A proof sketch is given in Section 4. The full proofs and the details of the numerical experiments are in the appendix.
2 Representative Applications
We give three application examples which are of the min-max form (1.2).
1. Robust learning from multiple distributions. Suppose the data set is drawn from $m$ distributions $\mathcal{D}_1, \dots, \mathcal{D}_m$, where each $\mathcal{D}_i$ is a different perturbed version of the underlying true distribution $\mathcal{D}_0$. Robust training is formulated as minimizing the maximum of the expected losses over the $m$ distributions:
(2.1) $\min_{\theta} \; \max_{y \in \mathcal{Y}} \; \sum_{i=1}^{m} y_i \, \mathbb{E}_{\xi \sim \mathcal{D}_i}\big[\ell(\theta; \xi)\big],$
where $\mathcal{Y}$ is the probability simplex and $\ell(\theta; \xi)$ represents the loss with model parameter $\theta$ on a data sample $\xi$. Notice that $\mathbb{E}_{\xi \sim \mathcal{D}_i}[\ell(\theta; \xi)]$ is the expected loss under distribution $\mathcal{D}_i$. In adversarial learning [30, 24, 17], $\mathcal{D}_i$ corresponds to the distribution that is used to generate adversarial examples. In Section 5, we will provide a detailed formulation of adversarial learning on the MNIST data set and apply the Smoothed GDA algorithm to this application.
2. Fair models. In machine learning, it is common that models may be unfair, i.e., they might discriminate against individuals based on their membership in some group [20, 12]. For example, an algorithm for predicting a person's salary might use that person's protected attributes, such as gender, race, and color. Another example is training a logistic regression model for classification that is biased against certain categories. To promote fairness, [32] proposes a framework that minimizes the maximum loss incurred over the different categories:
(2.2) $\min_{w} \; \max_{1 \le i \le m} \; \ell_i(w),$
where $w$ represents the model parameters and $\ell_i$ is the corresponding loss for category $i$.
3. Distribution-agnostic meta-learning. Meta-learning is a field about learning to learn, i.e., learning the optimal model properties so that model performance can be improved. One popular meta-learning approach is gradient-based Model-Agnostic Meta-Learning (MAML) [14]. The goal of MAML is to learn a good global initialization such that, for any new task, the model still performs well after one gradient update from this initialization.
One limitation of MAML is that it implicitly assumes the tasks come from a particular distribution and optimizes the expected or sample-average loss over tasks drawn from this distribution. This limitation can lead to arbitrarily bad worst-case performance and unfairness. To mitigate these difficulties, [7] proposed a distribution-agnostic formulation of MAML:
(2.3) $\min_{w \in \mathcal{W}} \; \max_{1 \le i \le m} \; \ell_i\big(w - \alpha \nabla \ell_i(w)\big).$
Here, $\ell_i$ is the loss function associated with the $i$-th task, $w$ is the parameter taken from the feasible set $\mathcal{W}$, and $\alpha$ is the stepsize used by MAML for the gradient update. Notice that each $\ell_i\big(w - \alpha \nabla \ell_i(w)\big)$ is still a function of $w$, even though one gradient step is taken before the function is evaluated. Formulation (2.3) finds an initial point that minimizes the worst objective value after one gradient step, over all the loss functions. It is shown in [7] that solving the distribution-agnostic meta-learning problem improves the worst-case performance across tasks over that of the original MAML.
3 Smoothed GDA Algorithm and Its Convergence
Before we introduce the Smoothed GDA algorithm, we first define the stationary solution and the $\epsilon$-stationary solution of problem (1.1).
Definition 3.1
Let $\iota_{\mathcal{X}}$ and $\iota_{\mathcal{Y}}$ be the indicator functions of the sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. A pair $(x^*, y^*)$ is an $\epsilon$-stationary solution of problem (1.1) if there exists a pair $(\xi^*, \zeta^*)$ with $\xi^* \in \partial \iota_{\mathcal{X}}(x^*)$ and $\zeta^* \in \partial \iota_{\mathcal{Y}}(y^*)$ such that
(3.1) $\big\| \nabla_x f(x^*, y^*) + \xi^* \big\| \le \epsilon \quad \text{and} \quad \big\| \nabla_y f(x^*, y^*) - \zeta^* \big\| \le \epsilon,$
where $\partial h$ denotes the subgradient of a function $h$. A pair is a stationary solution if it is an $\epsilon$-stationary solution with $\epsilon = 0$.
Definition 3.2
The projection of a point $w$ onto a set $\mathcal{S}$ is defined as $\mathcal{P}_{\mathcal{S}}(w) = \arg\min_{u \in \mathcal{S}} \|u - w\|^2$.
3.1 Smoothed Gradient Descent-Ascent (Smoothed GDA)
A simple algorithm for solving min-max problems is the gradient descent ascent (GDA) algorithm (Algorithm 1), which alternately performs a gradient descent step on the minimization problem and a gradient ascent step on the maximization problem. It is well known that, with a constant stepsize, GDA can oscillate between iterates and fail to converge even for the simple bilinear min-max problem $\min_{x \in \mathbb{R}} \max_{y \in \mathbb{R}} \; xy$.
To fix the oscillation issue, we introduce a "smoothing" technique into the primal updates. Note that smoothing is a common technique in traditional optimization, such as Moreau-Yosida smoothing [43] and Nesterov's smoothing [38]. More concretely, we introduce an auxiliary sequence $\{z_t\}$ and define a function $K(x, z; y)$ as
(3.2) $K(x, z; y) = f(x, y) + \frac{p}{2}\, \|x - z\|^2,$
where $p > 0$ is a constant, and we perform gradient descent and gradient ascent steps alternately on this function instead of the original function $f$. After performing one GDA step on $K$, the auxiliary variable $z$ is updated by an averaging step. The "Smoothed GDA" algorithm is formally presented in Algorithm 2. Note that our algorithm is different from the one in [52], as [52] uses a regularization term and requires this term to diminish, whereas our proximal term has a fixed weight $p$ and is centered at the moving average $z_t$.
Notice that when $p = 0$, Smoothed GDA is just the standard GDA. Furthermore, if the variable $x$ has a block structure, i.e., $x$ can be decomposed into $N$ blocks as $x = (x_1, \dots, x_N)$,
then Algorithm 2 can be extended to a multi-block version, which we call the Smoothed Block Gradient Descent Ascent (Smoothed-BGDA) algorithm (see Algorithm 3). In the multi-block version, we update the primal variable blocks alternately and use the same strategy to update the dual variable and the auxiliary variable as in the single-block version.
3.2 Iteration Complexity for Nonconvex-Concave Problems
In this subsection, we present the iteration complexities of Algorithm 2 and Algorithm 3 for the general nonconvex-concave problem (1.1). We first state some basic assumptions.
Assumption 3.3
We assume the following.

The function $f$ is smooth, and its gradient $\nabla f$ is Lipschitz continuous with some constant $L > 0$.

$\mathcal{Y}$ is a closed, convex and compact subset of $\mathbb{R}^m$; $\mathcal{X}$ is a closed and convex set.

The function $f$ is bounded from below on $\mathcal{X} \times \mathcal{Y}$ by some finite constant $\underline{f}$.
Theorem 3.4
Consider solving problem (1.1) by Algorithm 2 (or Algorithm 3). Suppose Assumption 3.3 holds, and that we choose the algorithm parameters to satisfy $p > L$ and condition (3.3):
(3.3) 
Then, the following holds:

(One-block case) For any $\epsilon > 0$, if we further let the parameter $\beta$ be sufficiently small (depending on $\epsilon$), then there exists a $t$ such that $(x_t, y_t)$ is an $\epsilon$-stationary solution. This means we can obtain an $\epsilon$-stationary solution within $\mathcal{O}(\epsilon^{-4})$ iterations.
Remark. The reference [29] derived the same $\mathcal{O}(\epsilon^{-4})$ iteration complexity under an additional compactness assumption on $\mathcal{X}$. This assumption may not be satisfied in applications where $\mathcal{X}$ can be the entire space.
3.3 Convergence Results for Minimizing the Pointwise Maximum of Finitely Many Functions
Now we state the improved iteration complexity results for the special min-max problem (1.2). We claim that our algorithms (Algorithm 2 and Algorithm 3) achieve the optimal order of iteration complexity, $\mathcal{O}(\epsilon^{-2})$, in this case.
For any stationary solution of (1.2), denoted $(x^*, y^*)$, the following KKT conditions hold:
(3.5) $J(x^*)^\top y^* = 0,$
(3.6) $g(x^*) - \mu^* \mathbf{1} + \nu^* = 0,$
(3.7) $\mathbf{1}^\top y^* = 1, \quad y^* \ge 0,$
(3.8) $\nu^* \ge 0,$
(3.9) $\nu_i^* \, y_i^* = 0, \quad i = 1, \dots, m,$
where $J(x^*)$ denotes the Jacobian matrix of the map $g$ at $x^*$, while $\mu^* \in \mathbb{R}$ and $\nu^* \in \mathbb{R}^m$ are the multipliers for the equality constraint $\mathbf{1}^\top y = 1$ and the inequality constraint $y \ge 0$, respectively.
At any stationary solution $(x^*, y^*)$, only the functions $g_i$ with $y_i^* > 0$ contribute to the objective function, and they correspond to the worst cases in the robust learning task. In other words, any function $g_i$ that attains the maximum $\max_j g_j(x^*)$ at $x^*$ contains important information about the solution. We denote by $\mathcal{I}$ the set of indices $i$ for which $g_i(x^*) = \max_j g_j(x^*)$. We will make a mild assumption on this set.
Assumption 3.5
For any $(x^*, y^*, \mu^*, \nu^*)$ satisfying (3.5)-(3.9), we have $y_i^* + \nu_i^* > 0$ for all $i = 1, \dots, m$.
Remark. This assumption is called "strict complementarity", a common assumption in the field of variational inequalities [21, 13], which is closely related to the study of min-max problems. This assumption is used in many other optimization papers [15, 6, 25, 35, 28]. Strict complementarity is generically true (i.e., it holds with probability 1) if there is a linear term in the objective function and the data is drawn from a continuous distribution (similar to [55, 28]). Moreover, we will show that Theorem 3.8 can be proved using a weaker regularity assumption in place of the strict complementarity assumption:
Assumption 3.6
For any $(x^*, y^*, \mu^*, \nu^*)$ satisfying (3.5)-(3.9), the associated matrix, specified in Appendix E, is of full column rank.
We say that Assumption 3.6 (restated as Assumption E.1 in the appendix) is weaker, since the strict complementarity assumption (Assumption 3.5) implies it according to Lemma D.7 in the appendix. In the appendix, we also show that Assumption E.1 holds with probability 1 for a robust regression problem with square loss (see Proposition E.7).
We also make the following common “bounded level set” assumption.
Assumption 3.7
The level set $\{x : h(x) \le a\}$ is bounded for any $a \in \mathbb{R}$. Here $h(x) = \max_{1 \le i \le m} g_i(x)$.
Remark. This bounded-level-set assumption ensures that the iterates stay bounded; in fact, assuming bounded iterates would be enough for our proof. The bounded-level-set assumption, a.k.a. coerciveness, is widely used in many papers [53, 5, 48], and the bounded-iterates assumption itself is also common in optimization [51, 9, 6]. In practice, one usually adds a regularizer to the objective function to make the level sets and the iterates bounded (see [50] for a neural network example).
4 Proof Sketch
In this section, we give a proof sketch of the main theorems in the one-block case; the proof details are given in the appendix.
4.1 The Potential Function and Basic Estimates
To analyze the convergence of the algorithms, we construct a potential function and study its behavior along the iterations. We first give the intuition for why our algorithm works. We define the dual function $d$ and the proximal function $P$ as
$d(y, z) = \min_{x \in \mathcal{X}} K(x, z; y), \qquad P(z) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} K(x, z; y).$
We also let $x(y, z) = \arg\min_{x \in \mathcal{X}} K(x, z; y)$ and let $x^*(z)$ denote the minimizer in the definition of $P(z)$.
Notice that, by Danskin's theorem, $d$ and $P$ are differentiable with $\nabla_y d(y, z) = \nabla_y K\big(x(y, z), z; y\big)$ and $\nabla P(z) = p\,\big(z - x^*(z)\big)$. Recall that in Algorithm 2 the updates of $x$, $y$ and $z$ can be viewed, respectively, as a primal descent step on the function $K$, an approximate dual ascent step on the dual function $d$, and an approximate proximal descent step on the proximal function $P$; in particular, the averaging step $z_{t+1} = z_t + \beta (x_{t+1} - z_t)$ is approximately a gradient descent step on $P$, since $x_{t+1} \approx x^*(z_t)$. We define a potential function as follows:
(4.1) $\Phi_t \;=\; K(x_t, z_t; y_t) \;-\; 2\, d(y_t, z_t) \;+\; 2\, P(z_t),$
which is a linear combination of the primal function $K$, the dual function $d$ and the proximal function $P$. We hope the potential function decreases after each iteration and is bounded from below. In fact, it is easy to prove that $\Phi_t \ge \underline{f}$ for any $t$ (see the appendix), but it is harder to prove the decrease of $\Phi_t$. Since the dual ascent and the proximal descent are only approximate, an error term occurs when estimating the decrease of the potential function. Hence, certain error bounds are needed.
Using some primal error bounds, we can derive a basic descent estimate for the potential function; its precise statement, including the error term (4.3), is deferred to the appendix.
We would like the potential function to decrease sufficiently after each iteration. Concretely speaking, we want to eliminate the negative term (4.3) and show that the following "sufficient decrease" property holds for each iteration $t$:
(4.4) 
It is not hard to prove that if (4.4) holds for $t = 1, \dots, T$, then there exists a $t \le T$ such that $(x_t, y_t)$ is an $\mathcal{O}(1/\sqrt{T})$-stationary solution. Moreover, if (4.4) holds for every $t$, then the iteration complexity is $\mathcal{O}(\epsilon^{-2})$, and we can also prove that every limit point of the iterates is a min-max stationary solution. Therefore, by the above analysis, the key is to bound the error term in (4.3), which is related to the so-called "dual error bound".
If the projected dual-ascent residual vanishes, then $y_t$ is a maximizer of $d(\cdot, z_t)$ over $\mathcal{Y}$, and thus $x(y_t, z_t)$ is the same as $x^*(z_t)$. A natural question is whether we can use this residual to bound $\|x(y_t, z_t) - x^*(z_t)\|$. The answer is yes, and we have the following "dual error bound".
Lemma 4.2
Using this lemma, we can prove Theorem 3.8. We choose $\beta$ sufficiently small; then, when the residuals appearing in (4.3) are large, we can prove that $\Phi_t$ decreases sufficiently using the compactness of $\mathcal{Y}$, and when the residuals are small, the error bound of Lemma 4.2 can be used to guarantee the sufficient decrease of $\Phi_t$. Therefore, (4.4) always holds, which yields Theorem 3.8. However, for the general nonconvex-concave problem (1.1), we only have a "weaker" bound.
Lemma 4.3
Note that this is a non-homogeneous error bound, which can help us bound the error term only when the residual is not too small; therefore, we call it a "weaker" dual error bound. To obtain an $\epsilon$-stationary solution, we need to choose the parameter $\beta$ sufficiently small, proportional to a power of $\epsilon$. In this case, we can prove that once $\Phi_t$ stops decreasing sufficiently, we have already obtained an $\epsilon$-stationary solution, by Lemma 4.3. By the remark after (4.4), we then need $\mathcal{O}(\epsilon^{-4})$ iterations to obtain an $\epsilon$-stationary solution.
Remark. For the general nonconvex-concave problem (1.1), we need to choose $\beta$ proportional to a power of $\epsilon$, and hence the iteration complexity is higher than in the previous case. However, we expect that for a concrete problem with special structure, the "weaker" error bound of Lemma 4.3 can be improved, and with it the iteration complexity bound. This is left as future work.
The proof sketch can be summarized in the following steps:

In Step 1, we introduce the potential function $\Phi_t$, which is shown to be bounded from below. To obtain the convergence rate of the algorithms, we want to prove that the potential function decreases sufficiently at every iteration $t$, i.e., we want to establish the sufficient decrease property (4.4).
5 Numerical Results on Robust Neural Network Training
In this section, we apply the Smoothed GDA algorithm to train a robust neural network on the MNIST data set against adversarial attacks [17, 24, 30]. The optimization formulation is
(5.1) $\min_{\theta} \; \sum_{i=1}^{N} \; \max_{\delta_i \in \Delta} \; \ell\big(h_\theta(x_i + \delta_i), \, y_i\big),$
where $\theta$ is the parameter of the neural network $h_\theta$, the pair $(x_i, y_i)$ denotes the $i$-th data point, $\delta_i$ is the perturbation added to data point $i$, and $\Delta$ is the set of allowed perturbations. As (5.1) is difficult to solve directly, researchers [40] have proposed an approximation of (5.1) as the following nonconvex-concave problem, which is of the form (1.2) discussed before:
(5.2) 
where $\lambda$ is a parameter of the approximation, and each inner term represents an approximated attack on sample $i$ that tries to change the output of the network to label $j$. The details of this formulation and the structure of the network used in the experiments are provided in the appendix.
Table 2: Test accuracy on natural (clean) examples and under the attacks of [17] and [24], each evaluated at three attack strengths.

Algorithm             | Natural | Attacks of [17] (three strengths) | Attacks of [24] (three strengths)
[30]                  | 98.58%  | 96.09% / 94.82% / 89.84%          | 94.64% / 91.41% / 78.67%
[54] (first setting)  | 97.37%  | 95.47% / 94.86% / 79.04%          | 94.41% / 92.69% / 85.74%
[54] (second setting) | 97.21%  | 96.19% / 96.17% / 96.14%          | 95.01% / 94.36% / 94.11%
[40]                  | 98.20%  | 97.04% / 96.66% / 96.23%          | 96.00% / 95.17% / 94.22%
Smoothed GDA          | 98.89%  | 97.87% / 97.23% / 95.81%          | 96.71% / 95.62% / 94.51%
Results: We compare our results with three algorithms from [30, 54, 40]. The references [30, 54] are two classical algorithms in adversarial training, while the recent reference [40] considers the same problem formulation as (1.2) and provides an algorithm with an $\tilde{\mathcal{O}}(\epsilon^{-3.5})$ iteration complexity. The accuracies are summarized in Table 2, which shows that formulation (1.2) leads to comparable or slightly better performance than the other algorithms. We also compare the convergence of the loss function when using the Smoothed GDA algorithm and the algorithm in [40]. As shown in Figure 1, the Smoothed GDA algorithm takes only 5 epochs to bring the loss below 0.2, while the algorithm proposed in [40] takes more than 14 epochs. In addition, the loss obtained by the Smoothed GDA algorithm has a smaller variance.
6 Conclusion
In this paper, we propose a simple single-loop algorithm for the nonconvex min-max problem (1.1). For the important family of problems (1.2), the algorithm is even more efficient due to the dual error bound, and it is well suited to large-scale and distributed settings. The algorithmic framework is flexible; in future work, we plan to extend the algorithm to more practical problems and derive stronger error bounds to attain lower iteration complexity.
Broader Impact
In this paper, we propose a single-loop algorithm for min-max problems. This algorithm is easy to implement, is provably efficient for a family of nonconvex minimax problems, and exhibits good numerical behavior in robust training. This paper focuses on the theoretical study of the algorithms. In industrial applications, several kinds of impact can be expected:

Save energy by improving efficiency. The technique developed in this paper has the potential to accelerate training for machine learning problems involving a minimax formulation, such as robust training with uncertain data, generative adversarial networks (GANs), and AI for games. This means that the actual training time may decrease substantially when using our algorithm. Training neural networks is very energy-consuming, and reducing the training time can help industries and companies save energy.

Promote fairness. We consider min-max problems in this paper. A model trained under this framework does not tolerate poor performance on some objectives in order to boost performance on the others. Therefore, even if the training data itself is biased, the min-max framework prevents a few objectives from dominating the loss. In other words, this framework promotes fairness, and a model trained under it will provide fairer solutions.

Provide a flexible framework. Our algorithmic framework is flexible. Although in this paper we only discuss general formulations, our algorithm can easily be extended to many practical settings. For example, based on our general framework for multi-block problems, we can design algorithms that efficiently solve problems with distributed data, decentralized control, or privacy concerns. Therefore, our algorithm may have an impact on popular big-data applications such as distributed training and federated learning.
Funding Disclosure
This research is supported by the leading talents of Guangdong Province program [Grant 00201501]; the National Science Foundation of China [Grant 61731018]; the Air Force Office of Scientific Research [Grant FA9550-12-1-0396]; the National Science Foundation [Grant CCF-1755847]; Shenzhen Peacock Plan [Grant KQTD2015033114415450]; the Development and Reform Commission of Shenzhen Municipality; and Shenzhen Research Institute of Big Data.
In the appendix, we give the proofs of the main theorems. The appendix is organized as follows:

In Section A, we list some notation used in the appendix.

In Section B, we prove the two main theorems in the one-block case.

In Section C, we briefly state the proof of the two main theorems in the multi-block setting.

In Section E, we show that the strict complementarity assumption can be relaxed to a weaker regularity assumption. We also prove that this weaker regularity assumption is generic for robust regression problems with square loss, i.e., we prove that our regularity assumption holds with probability 1 if the data points are drawn from a continuous distribution.

In the last section, Section F, we give some more details about the experiments.
Appendix A Notations
We first list some notations which will be used in the appendix.

$\mathcal{B}(r)$ denotes the Euclidean ball of radius $r$ of the proper dimension.

$\operatorname{dist}(w, \mathcal{S})$ denotes the Euclidean distance from a point $w$ to a set $\mathcal{S}$.

For a vector $v$, $v_i$ denotes the $i$-th component of $v$. For an index set $\mathcal{I}$, $v_{\mathcal{I}}$ is the vector containing the components $v_i$ with $i \in \mathcal{I}$.

Let $A$ be a matrix and $\mathcal{I}$ be an index set. Then $A_{\mathcal{I}}$ represents the row submatrix of $A$ corresponding to the rows with indices in $\mathcal{I}$.

For a matrix $A$, $\sigma_{\min}(A)$ denotes the smallest singular value of $A$.

The projection of a point $w$ onto a set $\mathcal{S}$ is defined as $\mathcal{P}_{\mathcal{S}}(w) = \arg\min_{u \in \mathcal{S}} \|u - w\|^2$.
Appendix B Proof of the two main theorems: one-block case
In this section, we prove the two main theorems in the one-block case. The proof for the multi-block case is similar and will be given in the next section.
Proof Sketch.

In Step 1, we will introduce the potential function, which is shown to be bounded from below. To obtain the convergence rate of the algorithms, we want to prove that the potential function decreases sufficiently at every iteration $t$, i.e., that $\Phi_t - \Phi_{t+1}$ is bounded below by a positive multiple of the squared iterate residuals.
B.1 The potential function and basic estimates
Recall that the potential function is
$\Phi_t = K(x_t, z_t; y_t) - 2\, d(y_t, z_t) + 2\, P(z_t),$
where $K(x, z; y) = f(x, y) + \frac{p}{2}\|x - z\|^2$, $d(y, z) = \min_{x \in \mathcal{X}} K(x, z; y)$, and $P(z) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} K(x, z; y)$.
Also note that if $p > L$, then $K(\cdot, z; y)$ is strongly convex in $x$ with modulus $p - L$, and $\nabla_x K(\cdot, z; y)$ is Lipschitz continuous with constant $p + L$. We also use the following notation:

.

, .

The set .

.

.
First of all, we can prove that the potential function $\Phi_t$ is bounded from below:
Lemma B.1
We have $\Phi_t \ge P(z_t) \ge \underline{f}$ for every $t$.
Proof By the definitions of $d$ and $P$, we have
(B.1) $K(x_t, z_t; y_t) \ge d(y_t, z_t) \quad \text{and} \quad P(z_t) \ge d(y_t, z_t).$
Hence, we have
$\Phi_t = \big(K(x_t, z_t; y_t) - d(y_t, z_t)\big) + \big(P(z_t) - d(y_t, z_t)\big) + P(z_t) \;\ge\; P(z_t) \;\ge\; \underline{f}.$
Next, we state some “error bounds”.
Lemma B.2
There exist positive constants, independent of $t$ and of the iterates, such that
(B.2)  
(B.3)  
(B.4)  
(B.5) 
for all relevant points, where the constants are given explicitly in the proof.
Proof The proofs of (B.2), (B.3) and (B.5) are the same as those of Lemma 3.6 in [55] and hence are omitted. We only need to prove (B.4). Using the strong convexity of $K$ with respect to $x$, we have
(B.6)  
(B.7) 
Moreover, using the concavity of $K$ with respect to $y$, we have
(B.8)  
Using the Lipschitz continuity of $\nabla K$, we have
(B.9)  
(B.10)  
(B.11)  
(B.12) 
where the last inequality uses the Cauchy-Schwarz inequality and the Lipschitz continuity of $\nabla K$ with respect to $y$.
Let $u$ denote the quantity to be bounded in (B.4). Then (B.10) becomes a quadratic inequality in $u$.
Hence, we only need to solve the above quadratic inequality. We have
where the first inequality is due to the AM-GM inequality and the third inequality follows from the choice of parameters. Therefore
Hence, we can take the constant in (B.4) accordingly and finish the proof.
The following lemma is a direct corollary of the above lemma:
Lemma B.3
The dual function $d(\cdot, z)$ is a differentiable function of $y$ with Lipschitz continuous gradient
$\nabla_y d(y, z) = \nabla_y K\big(x(y, z), z; y\big),$
with a Lipschitz constant that depends only on $L$ and $p$.
Remark. Note that if $y$ maximizes $d(\cdot, z)$ over $\mathcal{Y}$, then we have $x(y, z) = x^*(z)$ and hence
(B.13) 
Proof Using Danskin's theorem in convex analysis [4], we know that $d(\cdot, z)$ is a differentiable function with
To prove the Lipschitz continuity, we have