Convergence Rate of $\mathcal{O}(1/k)$ for Optimistic Gradient and Extra-gradient Methods in Smooth Convex-Concave Saddle Point Problems
We study the iteration complexity of the optimistic gradient descent-ascent (OGDA) method and the extra-gradient (EG) method for finding a saddle point of a convex-concave unconstrained min-max problem. To do so, we first show that both OGDA and EG can be interpreted as approximate variants of the proximal point method. This is similar to the approach taken in [Nemirovski, 2004], which analyzes EG as an approximation of the ‘conceptual mirror prox’. In this paper, we highlight how the gradients used in OGDA and EG approximate the gradient of the proximal point method. We then exploit this interpretation to show that both algorithms produce iterates that remain within a bounded set. We further show that the primal-dual gap of the averaged iterates generated by both of these algorithms converges at a rate of $\mathcal{O}(1/k)$. Our theoretical analysis is of interest as it provides the first convergence rate estimate for OGDA in the general convex-concave setting. Moreover, it provides a simple convergence analysis for the EG algorithm in terms of function value without using a compactness assumption.
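Given a function $f:\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$, we consider finding a saddle point of the problem
\[ \min_{\mathbf{x}\in\mathbb{R}^m}\ \max_{\mathbf{y}\in\mathbb{R}^n}\ f(\mathbf{x},\mathbf{y}), \tag{1} \]
where a saddle point of Problem (1) is defined as a pair $(\mathbf{x}^*,\mathbf{y}^*)\in\mathbb{R}^m\times\mathbb{R}^n$ that satisfies
\[ f(\mathbf{x}^*,\mathbf{y}) \le f(\mathbf{x}^*,\mathbf{y}^*) \le f(\mathbf{x},\mathbf{y}^*) \]
for all $\mathbf{x}\in\mathbb{R}^m$, $\mathbf{y}\in\mathbb{R}^n$. Throughout the paper, we assume that the function $f(\mathbf{x},\mathbf{y})$ is convex-concave, i.e., for any $\mathbf{y}\in\mathbb{R}^n$, the function $f(\cdot,\mathbf{y})$ is a convex function of $\mathbf{x}$ and for any $\mathbf{x}\in\mathbb{R}^m$, the function $f(\mathbf{x},\cdot)$ is a concave function of $\mathbf{y}$. This formulation arises in several areas, including zero-sum games [Basar and Olsder, 1999], robust optimization [Ben-Tal et al., 2009], robust control [Hast et al., 2013] and more recently in machine learning in the context of Generative Adversarial Networks (GANs) (see [Goodfellow et al., 2014] for an introduction to GANs and [Arjovsky et al., 2017] for the formulation of Wasserstein GANs).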
Our goal in this paper is to analyze the convergence rate of some discrete-time gradient-based optimization algorithms for finding a saddle point of Problem (1) in the convex-concave case. In particular, we focus on the Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods because of their widespread use for training GANs (see [Daskalakis et al., 2018; Liang and Stokes, 2019]). The EG method is a classical algorithm for solving saddle point problems introduced by Korpelevich. Its linear rate of convergence for smooth and strongly convex-strongly concave functions
In this paper, we provide a unified convergence analysis for establishing a sublinear convergence rate of $\mathcal{O}(1/k)$ in terms of the function value difference of the averaged iterates and a saddle point for both OGDA and EG for convex-concave saddle point problems. Our analysis holds for unconstrained problems and does not require boundedness of the feasible set, and it establishes rate results using the function value differences as used in [Nemirovski, 2004] (suitably redefined for an unconstrained feasible set, see Section 5). Therefore, we get convergence of the EG method in unconstrained spaces without using the modified termination (error) criterion proposed in [Monteiro and Svaiter, 2010]. The key idea of our approach is to view both OGDA and EG iterates as approximations of the iterates of the proximal point method that was first introduced by Martinet and later studied by Rockafellar. We would like to add that the idea of interpreting EG as an approximation of the proximal point method was first studied in [Nemirovski, 2004]. He considers the conceptual mirror prox, which is similar to the proximal point method, and shows that the mirror prox algorithm (of which EG is a special case) provides a good implementable approximation to this method. Further, Monteiro and Svaiter use a similar interpretation and propose the Hybrid Proximal Extragradient method to establish the convergence of EG in unbounded settings using a different convergence criterion. More recently, Mokhtari et al. study both OGDA and EG as approximations of the proximal point method and analyze these algorithms for bilinear and strongly convex-strongly concave problems.
More specifically, we first consider a proximal point method with error and establish some key properties of its iterates. We then focus on OGDA as an approximation of the proximal point method and use this connection to show that the iterates of OGDA remain in a compact set. We incorporate this result to prove a sublinear convergence rate of $\mathcal{O}(1/k)$ for the primal-dual gap of the averaged iterates generated by the OGDA update. We next consider EG, where two gradient pairs are used in each iteration, one to compute a midpoint and the other to find the new iterate using the gradient at the midpoint. Our first step again is to show boundedness of the iterates generated by EG. We then approximate the evolution of the midpoints using a proximal point method and use this approximation to establish the $\mathcal{O}(1/k)$ convergence rate for the function value of the averaged iterates generated by EG. As the convergence results of EG have already been established in papers including [Nemirovski, 2004] and Monteiro and Svaiter, we relegate the proofs of the lemmas and theorems corresponding to EG to the Appendix.
Several recent papers have studied the convergence rate of OGDA and EG for the case when the objective function is bilinear or strongly convex-strongly concave. Daskalakis et al. showed the convergence of the OGDA iterates to a neighborhood of the solution when the objective function is bilinear. Liang and Stokes used a dynamical system approach to prove the linear convergence of the OGDA method for the special case when $f(\mathbf{x},\mathbf{y}) = \mathbf{x}^\top\mathbf{A}\mathbf{y}$ and the matrix $\mathbf{A}$ is square and full rank. They also presented a linear convergence rate of the vanilla Gradient Descent Ascent (GDA) method when the objective function is strongly convex-strongly concave. Gidel et al. considered a variant of the EG method, relating it to OGDA updates, and showed the linear convergence of the corresponding EG iterates in the case where $f$ is strongly convex-strongly concave (though without showing the convergence rate for the OGDA iterates). Optimistic gradient methods have also been studied in the context of convex online learning [Chiang et al., 2012; Rakhlin and Sridharan, 2013a, b].
Nedić and Ozdaglar analyzed the (sub)Gradient Descent Ascent (GDA) algorithm for convex-concave saddle point problems when the (sub)gradients are bounded over the constraint set, showing a convergence rate of $\mathcal{O}(1/\sqrt{k})$ in terms of the function value difference of the averaged iterates and a saddle point.
Chambolle and Pock focused on a particular case of the saddle point problem where the coupling term in the objective function is bilinear, i.e., $f(\mathbf{x},\mathbf{y}) = G(\mathbf{x}) + \mathbf{x}^\top\mathbf{K}\mathbf{y} - H(\mathbf{y})$ with $G$ and $H$ convex functions. They proposed a proximal point based algorithm which converges at a rate of $\mathcal{O}(1/k)$ and further showed linear convergence when the functions $G$ and $H$ are strongly convex. Chen et al. proposed an accelerated variant of this algorithm when $G$ is smooth and showed an optimal rate of $\mathcal{O}(L_G/k^2 + L_K/k)$, where $L_G$ and $L_K$ are the smoothness parameter of $G$ and the norm of the linear operator $\mathbf{K}$, respectively. When the functions $G$ and $H$ are strongly convex, primal-dual gradient-type methods converge linearly, as shown in Chen and Rockafellar; Bauschke and Combettes. Further, Du and Hu showed that GDA achieves a linear convergence rate in this linearly coupled setting when $G$ is convex and $H$ is strongly convex.
For the case that $f(\mathbf{x},\mathbf{y})$ is strongly concave with respect to $\mathbf{y}$, but possibly nonconvex with respect to $\mathbf{x}$, Sanjabi et al. provided convergence to a first-order stationary point using an algorithm that requires running multiple updates with respect to $\mathbf{y}$ at each step.
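Notation. Lowercase boldface $\mathbf{v}$ denotes a vector and uppercase boldface $\mathbf{A}$ denotes a matrix. We use $\|\mathbf{v}\|$ to denote the Euclidean norm of vector $\mathbf{v}$. Given a multi-input function $f(\mathbf{x},\mathbf{y})$, its gradients with respect to $\mathbf{x}$ and $\mathbf{y}$ at the point $(\mathbf{x}_0,\mathbf{y}_0)$ are denoted by $\nabla_{\mathbf{x}} f(\mathbf{x}_0,\mathbf{y}_0)$ and $\nabla_{\mathbf{y}} f(\mathbf{x}_0,\mathbf{y}_0)$, respectively. We refer to the largest and smallest eigenvalues of a matrix $\mathbf{A}$ by $\lambda_{\max}(\mathbf{A})$ and $\lambda_{\min}(\mathbf{A})$, respectively.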
In this section we present properties and notations used in our results.
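A function $\phi:\mathbb{R}^n\to\mathbb{R}$ is $L$-smooth if it has $L$-Lipschitz continuous gradients on $\mathbb{R}^n$, i.e., for any $\mathbf{x},\hat{\mathbf{x}}\in\mathbb{R}^n$, we have
\[ \|\nabla\phi(\mathbf{x}) - \nabla\phi(\hat{\mathbf{x}})\| \le L\,\|\mathbf{x}-\hat{\mathbf{x}}\|. \]
A continuously differentiable function $\phi:\mathbb{R}^n\to\mathbb{R}$ is convex on $\mathbb{R}^n$ if for any $\mathbf{x},\hat{\mathbf{x}}\in\mathbb{R}^n$, we have
\[ \phi(\hat{\mathbf{x}}) \ge \phi(\mathbf{x}) + \nabla\phi(\mathbf{x})^\top(\hat{\mathbf{x}}-\mathbf{x}). \]
Further, $\phi$ is concave if $-\phi$ is convex.
The pair $(\mathbf{x}^*,\mathbf{y}^*)$ is a saddle point of a convex-concave function $f(\mathbf{x},\mathbf{y})$ if for any $\mathbf{x}\in\mathbb{R}^m$ and $\mathbf{y}\in\mathbb{R}^n$, we have
\[ f(\mathbf{x}^*,\mathbf{y}) \le f(\mathbf{x}^*,\mathbf{y}^*) \le f(\mathbf{x},\mathbf{y}^*). \]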
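As a small illustration of the saddle point definition, the following sketch checks the two inequalities $f(\mathbf{x}^*,\mathbf{y}) \le f(\mathbf{x}^*,\mathbf{y}^*) \le f(\mathbf{x},\mathbf{y}^*)$ on a finite grid. The scalar test function $f(x,y) = \tfrac12 x^2 + xy - \tfrac12 y^2$ and the helper `is_saddle` are hypothetical choices made for this example only; a grid check is a sanity test, not a proof.

```python
def f(x, y):
    # Hypothetical convex-concave test function: convex in x, concave in y.
    return 0.5 * x * x + x * y - 0.5 * y * y


def is_saddle(func, x_star, y_star, grid, tol=1e-12):
    """Check func(x*, y) <= func(x*, y*) <= func(x, y*) for all grid points."""
    f_star = func(x_star, y_star)
    left_ok = all(func(x_star, y) <= f_star + tol for y in grid)
    right_ok = all(f_star <= func(x, y_star) + tol for x in grid)
    return left_ok and right_ok


# Grid of candidate points in [-5, 5].
grid = [i / 10.0 for i in range(-50, 51)]

saddle_at_origin = is_saddle(f, 0.0, 0.0, grid)  # origin is the saddle point
saddle_at_one = is_saddle(f, 1.0, 1.0, grid)     # (1, 1) violates the right inequality
```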
Throughout the paper, we will assume that the following conditions are satisfied.
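The function $f(\mathbf{x},\mathbf{y})$ is continuously differentiable in $\mathbf{x}$ and $\mathbf{y}$. Further, for any $\mathbf{y}\in\mathbb{R}^n$, the function $f(\cdot,\mathbf{y})$ is a convex function of $\mathbf{x}$ and for any $\mathbf{x}\in\mathbb{R}^m$, the function $f(\mathbf{x},\cdot)$ is a concave function of $\mathbf{y}$.
The gradient $\nabla_{\mathbf{x}} f(\mathbf{x},\mathbf{y})$ is $L_{xx}$-Lipschitz with respect to $\mathbf{x}$ and $L_{xy}$-Lipschitz with respect to $\mathbf{y}$, and the gradient $\nabla_{\mathbf{y}} f(\mathbf{x},\mathbf{y})$ is $L_{yy}$-Lipschitz with respect to $\mathbf{y}$ and $L_{yx}$-Lipschitz with respect to $\mathbf{x}$, i.e.,
\[ \|\nabla_{\mathbf{x}} f(\mathbf{x}_1,\mathbf{y}) - \nabla_{\mathbf{x}} f(\mathbf{x}_2,\mathbf{y})\| \le L_{xx}\|\mathbf{x}_1-\mathbf{x}_2\| \quad \text{for all } \mathbf{y}, \]
\[ \|\nabla_{\mathbf{x}} f(\mathbf{x},\mathbf{y}_1) - \nabla_{\mathbf{x}} f(\mathbf{x},\mathbf{y}_2)\| \le L_{xy}\|\mathbf{y}_1-\mathbf{y}_2\| \quad \text{for all } \mathbf{x}, \]
\[ \|\nabla_{\mathbf{y}} f(\mathbf{x},\mathbf{y}_1) - \nabla_{\mathbf{y}} f(\mathbf{x},\mathbf{y}_2)\| \le L_{yy}\|\mathbf{y}_1-\mathbf{y}_2\| \quad \text{for all } \mathbf{x}, \]
\[ \|\nabla_{\mathbf{y}} f(\mathbf{x}_1,\mathbf{y}) - \nabla_{\mathbf{y}} f(\mathbf{x}_2,\mathbf{y})\| \le L_{yx}\|\mathbf{x}_1-\mathbf{x}_2\| \quad \text{for all } \mathbf{y}. \]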
We define .
The solution set defined as
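In the following sections, we present and analyze three different iterative algorithms for solving the saddle point problem introduced in (1). The iterates of these algorithms are denoted by $\{\mathbf{x}_k,\mathbf{y}_k\}$. We denote the averaged (ergodic) iterates by $\hat{\mathbf{x}}_k$ and $\hat{\mathbf{y}}_k$, defined as follows:
\[ \hat{\mathbf{x}}_k = \frac{1}{k}\sum_{i=1}^{k}\mathbf{x}_i, \qquad \hat{\mathbf{y}}_k = \frac{1}{k}\sum_{i=1}^{k}\mathbf{y}_i. \]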
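In our convergence analysis, we use a variational inequality approach in which we define the vector $\mathbf{z} = [\mathbf{x};\mathbf{y}]\in\mathbb{R}^{m+n}$ as our decision variable and define the operator $F:\mathbb{R}^{m+n}\to\mathbb{R}^{m+n}$ as
\[ F(\mathbf{z}) = \begin{bmatrix} \nabla_{\mathbf{x}} f(\mathbf{x},\mathbf{y}) \\ -\nabla_{\mathbf{y}} f(\mathbf{x},\mathbf{y}) \end{bmatrix}. \tag{4} \]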
In the following lemma we characterize the properties of the operator $F$ in (4) when the conditions in Assumptions 1 and 2 are satisfied. We would like to emphasize that the following lemma is well-known (see, e.g., Nemirovski), and we state it for completeness.
According to Lemma 1, when $f$ is convex-concave and smooth, the operator $F$ defined in (4) is monotone and Lipschitz. The third result in Lemma 1 shows that any saddle point $\mathbf{z}^* = [\mathbf{x}^*;\mathbf{y}^*]$ of problem (1) satisfies the first-order optimality condition, i.e., we have:
\[ F(\mathbf{z}^*) = \mathbf{0}. \]
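To make the three properties concrete, here is a minimal numerical sketch for the operator $F(\mathbf{z}) = [\nabla_{\mathbf{x}} f;\,-\nabla_{\mathbf{y}} f]$. The scalar function $f(x,y) = \tfrac12 x^2 + xy - \tfrac12 y^2$ is a hypothetical example, not taken from the analysis; its saddle point is the origin. The code samples random pairs and checks monotonicity, records an empirical Lipschitz ratio, and evaluates $F$ at the saddle point.

```python
import math
import random


def F(z):
    """F(z) = [grad_x f; -grad_y f] for f(x, y) = 0.5*x^2 + x*y - 0.5*y^2."""
    x, y = z
    grad_x = x + y  # partial derivative of f with respect to x
    grad_y = x - y  # partial derivative of f with respect to y
    return (grad_x, -grad_y)


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(inner(u, u))


rng = random.Random(0)
monotone = True
max_ratio = 0.0
for _ in range(1000):
    z1 = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    z2 = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    dz, dF = sub(z1, z2), sub(F(z1), F(z2))
    # Monotonicity: (F(z1) - F(z2))^T (z1 - z2) >= 0.
    monotone = monotone and inner(dF, dz) >= -1e-9
    # Empirical Lipschitz ratio ||F(z1) - F(z2)|| / ||z1 - z2||.
    max_ratio = max(max_ratio, norm(dF) / norm(dz))

# F vanishes at the saddle point (0, 0) of this example.
residual_at_saddle = norm(F((0.0, 0.0)))
```

For this particular $f$ the operator is linear, $F(\mathbf{z}) = \mathbf{A}\mathbf{z}$ with $\mathbf{A}^\top\mathbf{A} = 2\mathbf{I}$, so the sampled ratio equals $\sqrt{2}$ up to rounding.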
Before presenting our main results, we state the following well-known result (see, for example, Nemirovski), which will be used later in the analysis of OGDA and EG. We present the proof here for completeness.
3 Proximal point method with error
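One of the classical algorithms studied for solving the saddle point problem in (1) is the Proximal Point (PP) method, introduced in Martinet and studied in Rockafellar. The PP method generates the iterate $\{\mathbf{x}_{k+1},\mathbf{y}_{k+1}\}$, which is defined as the unique solution to the saddle point problem
\[ \min_{\mathbf{x}\in\mathbb{R}^m}\ \max_{\mathbf{y}\in\mathbb{R}^n}\ \left\{ f(\mathbf{x},\mathbf{y}) + \frac{1}{2\eta}\|\mathbf{x}-\mathbf{x}_k\|^2 - \frac{1}{2\eta}\|\mathbf{y}-\mathbf{y}_k\|^2 \right\}. \tag{9} \]
It can be verified that if the pair $(\mathbf{x}_{k+1},\mathbf{y}_{k+1})$ is the solution of problem (9), then $\mathbf{x}_{k+1}$ and $\mathbf{y}_{k+1}$ satisfy
\[ \mathbf{x}_{k+1} = \operatorname*{argmin}_{\mathbf{x}\in\mathbb{R}^m} \left\{ f(\mathbf{x},\mathbf{y}_{k+1}) + \frac{1}{2\eta}\|\mathbf{x}-\mathbf{x}_k\|^2 \right\}, \tag{10} \]
\[ \mathbf{y}_{k+1} = \operatorname*{argmax}_{\mathbf{y}\in\mathbb{R}^n} \left\{ f(\mathbf{x}_{k+1},\mathbf{y}) - \frac{1}{2\eta}\|\mathbf{y}-\mathbf{y}_k\|^2 \right\}. \tag{11} \]
Using the optimality conditions of the updates in (10) and (11) (which are necessary and sufficient since the problems in (10) and (11) are strongly convex and strongly concave, respectively), the update of the PP method for the saddle point problem in (1) can be written as
\[ \mathbf{x}_{k+1} = \mathbf{x}_k - \eta\nabla_{\mathbf{x}} f(\mathbf{x}_{k+1},\mathbf{y}_{k+1}), \qquad \mathbf{y}_{k+1} = \mathbf{y}_k + \eta\nabla_{\mathbf{y}} f(\mathbf{x}_{k+1},\mathbf{y}_{k+1}). \]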
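The implicit nature of the PP update, $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\nabla_{\mathbf{x}} f(\mathbf{x}_{k+1},\mathbf{y}_{k+1})$ and $\mathbf{y}_{k+1} = \mathbf{y}_k + \eta\nabla_{\mathbf{y}} f(\mathbf{x}_{k+1},\mathbf{y}_{k+1})$, can be seen in a minimal sketch for the scalar bilinear function $f(x,y) = xy$, an assumption made here because the implicit equations then reduce to a 2-by-2 linear system with a closed-form solution. The name `pp_step` and the stepsize value are illustrative.

```python
import math


def pp_step(x, y, eta):
    """One proximal point step for f(x, y) = x*y.

    The implicit equations are x+ = x - eta*y+ and y+ = y + eta*x+
    (since grad_x f = y and grad_y f = x); solving them gives the
    closed form below.
    """
    denom = 1.0 + eta * eta
    return (x - eta * y) / denom, (y + eta * x) / denom


eta = 0.5          # illustrative stepsize
x, y = 1.0, 1.0    # illustrative starting point
norms = [math.hypot(x, y)]
max_residual = 0.0
for _ in range(50):
    x_next, y_next = pp_step(x, y, eta)
    # Residual of the implicit update equations at the new point.
    res = max(abs(x_next - (x - eta * y_next)),
              abs(y_next - (y + eta * x_next)))
    max_residual = max(max_residual, res)
    x, y = x_next, y_next
    norms.append(math.hypot(x, y))
```

In this example each step shrinks the distance to the saddle point $(0,0)$ by the factor $1/\sqrt{1+\eta^2}$, which the recorded `norms` reflect.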
It is well-known that the proximal point method achieves a sublinear rate of $\mathcal{O}(1/k)$, where $k$ is the number of iterations, for convex minimization and for solving monotone variational inequalities (see Güler [1991, 1992]; Bruck Jr; Teboulle; Nemirovski). Note that Nemirovski in fact analyzed the conceptual mirror prox (the proximal point method) as a building block to analyze the mirror-prox algorithm. For completeness, we present the convergence rate of the proximal point method for convex-concave saddle point problems in the following theorem (see Appendix A for the proof).
The result in Theorem 1 shows that, by following the update of the proximal point method, the gap between the function value for the averaged iterates and the function value for a saddle point of problem (1) approaches zero at a sublinear rate of $\mathcal{O}(1/k)$.
Our goal is to provide similar convergence rate estimates for OGDA and EG using the fact that these two methods can be interpreted as approximate versions of the proximal point method. To do so, let us first rewrite the update of the proximal point method given in (3) as
\[ \mathbf{z}_{k+1} = \mathbf{z}_k - \eta F(\mathbf{z}_{k+1}), \]
where $\mathbf{z} = [\mathbf{x};\mathbf{y}]$ and the operator $F$ is defined in (4). In the following proposition, we establish a relation for the iterates of a proximal point method with error. This relation will be used later for our analysis of the OGDA and EG methods.
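Consider the sequence of iterates $\{\mathbf{z}_k\}$ generated by the following update
\[ \mathbf{z}_{k+1} = \mathbf{z}_k - \eta F(\mathbf{z}_{k+1}) + \boldsymbol{\varepsilon}_k, \tag{15} \]
where $F:\mathbb{R}^{m+n}\to\mathbb{R}^{m+n}$ is a monotone and Lipschitz continuous operator, $\boldsymbol{\varepsilon}_k$ is an arbitrary vector, and $\eta$ is a positive constant. Then for any $\mathbf{z}\in\mathbb{R}^{m+n}$ and for each iteration $k$ we have
\[ F(\mathbf{z}_{k+1})^\top(\mathbf{z}_{k+1}-\mathbf{z}) = \frac{1}{2\eta}\|\mathbf{z}_k-\mathbf{z}\|^2 - \frac{1}{2\eta}\|\mathbf{z}_{k+1}-\mathbf{z}\|^2 - \frac{1}{2\eta}\|\mathbf{z}_{k+1}-\mathbf{z}_k\|^2 + \frac{1}{\eta}\boldsymbol{\varepsilon}_k^\top(\mathbf{z}_{k+1}-\mathbf{z}). \]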
According to the update in (15), we can show that for any we have
We add and subtract the inner product to the right hand side and regroup the terms to obtain
Replacing with , we obtain
On rearranging the terms, we obtain the following inequality:
and the proof is complete. ∎
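As a numerical sanity check, the sketch below verifies an identity that follows from the update $\mathbf{z}_{k+1} = \mathbf{z}_k - \eta F(\mathbf{z}_{k+1}) + \boldsymbol{\varepsilon}_k$ by substituting $\eta F(\mathbf{z}_{k+1}) = \mathbf{z}_k - \mathbf{z}_{k+1} + \boldsymbol{\varepsilon}_k$ and expanding $\|\mathbf{z}_k-\mathbf{z}\|^2$, namely $F(\mathbf{z}_{k+1})^\top(\mathbf{z}_{k+1}-\mathbf{z}) = \frac{1}{2\eta}\left(\|\mathbf{z}_k-\mathbf{z}\|^2 - \|\mathbf{z}_{k+1}-\mathbf{z}\|^2 - \|\mathbf{z}_{k+1}-\mathbf{z}_k\|^2\right) + \frac{1}{\eta}\boldsymbol{\varepsilon}_k^\top(\mathbf{z}_{k+1}-\mathbf{z})$. The linear operator and the random test points are assumptions chosen for illustration; the error vector is defined so that the update holds exactly.

```python
import random

ETA = 0.3  # illustrative positive constant


def F(z):
    # Hypothetical monotone linear operator, F(z) = (x + y, y - x).
    x, y = z
    return (x + y, y - x)


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def sqnorm(u):
    return inner(u, u)


rng = random.Random(1)


def rand_point():
    return (rng.uniform(-5, 5), rng.uniform(-5, 5))


max_gap = 0.0
for _ in range(200):
    z_k, z_next, z = rand_point(), rand_point(), rand_point()
    F_next = F(z_next)
    # Error vector that makes z_next = z_k - ETA * F(z_next) + eps hold exactly.
    eps = tuple(zn - zk + ETA * fn for zn, zk, fn in zip(z_next, z_k, F_next))
    lhs = inner(F_next, sub(z_next, z))
    rhs = (sqnorm(sub(z_k, z)) - sqnorm(sub(z_next, z))
           - sqnorm(sub(z_next, z_k))) / (2.0 * ETA)
    rhs += inner(eps, sub(z_next, z)) / ETA
    max_gap = max(max_gap, abs(lhs - rhs))
```

The identity is purely algebraic, so it holds for any pair of consecutive points once the error vector is defined this way; monotonicity of $F$ only enters when the left-hand side is bounded later in the analysis.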
4 Optimistic Gradient Descent Ascent
In this section, we focus on analyzing the performance of optimistic gradient descent ascent (OGDA) for finding a saddle point of a general smooth convex-concave function. It has been shown that the OGDA method achieves the same iteration complexity as the proximal point method for both strongly convex-strongly concave and bilinear problems; see Liang and Stokes, Gidel et al., Mokhtari et al. However, to the best of our knowledge, its iteration complexity for the general smooth convex-concave case has not been established. In this section, we show that the function value of the averaged iterate generated by the OGDA method converges to the function value at a saddle point at a rate of $\mathcal{O}(1/k)$, which matches the convergence rate of the proximal point method shown in Theorem 1.
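Given a stepsize $\eta > 0$, the OGDA method updates the iterates $\mathbf{x}_k$ and $\mathbf{y}_k$ for each $k \ge 0$ as
\[ \mathbf{x}_{k+1} = \mathbf{x}_k - 2\eta\nabla_{\mathbf{x}} f(\mathbf{x}_k,\mathbf{y}_k) + \eta\nabla_{\mathbf{x}} f(\mathbf{x}_{k-1},\mathbf{y}_{k-1}), \qquad \mathbf{y}_{k+1} = \mathbf{y}_k + 2\eta\nabla_{\mathbf{y}} f(\mathbf{x}_k,\mathbf{y}_k) - \eta\nabla_{\mathbf{y}} f(\mathbf{x}_{k-1},\mathbf{y}_{k-1}), \tag{22} \]
with the initial conditions $\mathbf{x}_0 = \mathbf{x}_{-1}$ and $\mathbf{y}_0 = \mathbf{y}_{-1}$. The main difference between the updates of OGDA in (22) and the gradient descent ascent (GDA) method is in the additional “momentum” terms $-\eta\left(\nabla_{\mathbf{x}} f(\mathbf{x}_k,\mathbf{y}_k) - \nabla_{\mathbf{x}} f(\mathbf{x}_{k-1},\mathbf{y}_{k-1})\right)$ and $\eta\left(\nabla_{\mathbf{y}} f(\mathbf{x}_k,\mathbf{y}_k) - \nabla_{\mathbf{y}} f(\mathbf{x}_{k-1},\mathbf{y}_{k-1})\right)$. These additional terms make the update of OGDA a better approximation to the update of the proximal point method than the update of GDA; for more details we refer readers to Proposition 1 in Mokhtari et al.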
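The effect of the extra gradient-difference terms can be seen in a minimal sketch that runs OGDA, in the form $\mathbf{x}_{k+1} = \mathbf{x}_k - 2\eta\nabla_{\mathbf{x}} f(\mathbf{x}_k,\mathbf{y}_k) + \eta\nabla_{\mathbf{x}} f(\mathbf{x}_{k-1},\mathbf{y}_{k-1})$ with the mirrored ascent step in $\mathbf{y}$, next to plain GDA. The test problem $f(x,y) = xy$, the stepsize, the starting point, and the function names are assumptions made for illustration only.

```python
import math


def grad_x(x, y):
    return y  # partial derivative of f(x, y) = x*y in x


def grad_y(x, y):
    return x  # partial derivative of f(x, y) = x*y in y


def ogda(x0, y0, eta, num_iters):
    """OGDA with the initial conditions x_{-1} = x_0 and y_{-1} = y_0."""
    x_prev, y_prev = x0, y0
    x, y = x0, y0
    for _ in range(num_iters):
        x_new = x - 2 * eta * grad_x(x, y) + eta * grad_x(x_prev, y_prev)
        y_new = y + 2 * eta * grad_y(x, y) - eta * grad_y(x_prev, y_prev)
        x_prev, y_prev, x, y = x, y, x_new, y_new
    return x, y


def gda(x0, y0, eta, num_iters):
    """Simultaneous gradient descent ascent, for comparison."""
    x, y = x0, y0
    for _ in range(num_iters):
        x, y = x - eta * grad_x(x, y), y + eta * grad_y(x, y)
    return x, y


eta, num_iters = 0.25, 500
start_norm = math.hypot(1.0, 1.0)
ogda_norm = math.hypot(*ogda(1.0, 1.0, eta, num_iters))
gda_norm = math.hypot(*gda(1.0, 1.0, eta, num_iters))
```

On this bilinear example the OGDA iterates approach the saddle point $(0,0)$, while each GDA step multiplies the distance to it by $\sqrt{1+\eta^2}$, so GDA moves away from the solution.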
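To establish the convergence rate of OGDA for convex-concave problems, we first illustrate the connection between the updates of the proximal point method and OGDA. Note that using the definitions of the vector $\mathbf{z} = [\mathbf{x};\mathbf{y}]$ and the operator $F$, we can rewrite the update of the OGDA algorithm at iteration $k$ as
\[ \mathbf{z}_{k+1} = \mathbf{z}_k - 2\eta F(\mathbf{z}_k) + \eta F(\mathbf{z}_{k-1}). \]
Considering this expression, we can also write the update of OGDA as an approximation of the proximal point update, i.e.,
\[ \mathbf{z}_{k+1} = \mathbf{z}_k - \eta F(\mathbf{z}_{k+1}) + \boldsymbol{\varepsilon}_k, \]
where the error vector $\boldsymbol{\varepsilon}_k$ is given by
\[ \boldsymbol{\varepsilon}_k = \eta\left[\left(F(\mathbf{z}_{k+1}) - F(\mathbf{z}_k)\right) - \left(F(\mathbf{z}_k) - F(\mathbf{z}_{k-1})\right)\right]. \]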
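The equivalence between the two ways of writing the OGDA step can be checked numerically. In the sketch below, the step $\mathbf{z}_{k+1} = \mathbf{z}_k - 2\eta F(\mathbf{z}_k) + \eta F(\mathbf{z}_{k-1})$ is compared with the proximal-point-with-error form $\mathbf{z}_k - \eta F(\mathbf{z}_{k+1}) + \boldsymbol{\varepsilon}_k$, using $\boldsymbol{\varepsilon}_k = \eta[(F(\mathbf{z}_{k+1}) - F(\mathbf{z}_k)) - (F(\mathbf{z}_k) - F(\mathbf{z}_{k-1}))]$. The operator of the bilinear function $f(x,y) = xy$ and the constants are assumptions chosen for illustration.

```python
ETA = 0.25  # illustrative stepsize


def F(z):
    # Operator of the hypothetical test function f(x, y) = x*y: F(z) = (y, -x).
    x, y = z
    return (y, -x)


z_prev = (1.0, 1.0)  # z_{-1}
z_curr = (1.0, 1.0)  # z_0 = z_{-1}
max_residual = 0.0
for _ in range(50):
    F_curr, F_prev = F(z_curr), F(z_prev)
    # OGDA step written with the operator F.
    z_next = tuple(zc - 2 * ETA * fc + ETA * fp
                   for zc, fc, fp in zip(z_curr, F_curr, F_prev))
    F_next = F(z_next)
    # Error vector of the proximal-point-with-error interpretation.
    eps = tuple(ETA * ((fn - fc) - (fc - fp))
                for fn, fc, fp in zip(F_next, F_curr, F_prev))
    # The same step written as a proximal point update with error.
    pp_form = tuple(zc - ETA * fn + e
                    for zc, fn, e in zip(z_curr, F_next, eps))
    residual = max(abs(a - b) for a, b in zip(z_next, pp_form))
    max_residual = max(max_residual, residual)
    z_prev, z_curr = z_curr, z_next
```

The residual is zero up to rounding because the two expressions are algebraically identical; the interpretation is useful when $\boldsymbol{\varepsilon}_k$, a second-order difference of $F$ along the trajectory, is small.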
To derive the convergence rate of OGDA for the unconstrained problem in (1), we first use the result in Proposition 2 to derive a result for the specific case of OGDA updates. We then show that the iterates generated by the OGDA method remain in a bounded set. This is done in the following lemma. (Note that boundedness of the OGDA iterates can be deduced from [Popov, 1980], whereas a result similar to Lemma 2(b) was shown in a recent independent paper by Malitsky and Tam.)
Let be the iterates generated by the optimistic gradient descent ascent (OGDA) method introduced in (22) with the initial conditions and (i.e. ). If Assumptions 1, 2, and 3 hold and the stepsize satisfies the condition , then:
(a) The iterates satisfy the following relation:
(b) The iterates stay within the compact set defined as
where is a saddle point of the problem defined in (1).
We add and subtract the inner product to the right hand side of the preceding relation to obtain
Note that can be upper bounded by
where the second inequality follows as and therefore . This completes the proof of Part (a) of the lemma. Now, taking the sum of the preceding relation from , we obtain
Now set , where , to obtain
Note that each term of the summand in the sum in the left is nonnegative due to monotonicity of and therefore the sum is also nonnegative. Further, we know that . Using these observations we can write
Using Lipschitz continuity of the operator (Lemma 1(b)) and Young’s inequality in the preceding relation, we have
Regrouping the terms gives us
Using the condition , it follows that for any iterate we have
and the claim in Part (b) follows. ∎
According to Lemma 2, the sequence of iterates generated by the OGDA method stays within a closed and bounded convex set. We use this result in the following theorem to prove a sublinear convergence rate of $\mathcal{O}(1/k)$ for the function value of the averaged iterates generated by OGDA to the function value at a saddle point, for smooth and convex-concave functions.
Suppose Assumptions 1, 2 and 3 hold. Let be the iterates generated by the OGDA updates in (4). Let the initial conditions satisfy and . Consider the definition of the averaged iterates in (3) and the compact convex set in (26). If the stepsize satisfies the condition , then for all , we have
where and .
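The $\mathcal{O}(1/k)$ behavior of the averaged iterates can be observed in a small experiment. The sketch below runs OGDA on the hypothetical test function $f(x,y) = xy$, forms the averages $\hat{\mathbf{z}}_k = \frac{1}{k}\sum_{i=1}^{k}\mathbf{z}_i$, and evaluates a primal-dual gap over the box $|x|,|y|\le R$, which for this $f$ equals $R(|\hat{x}_k| + |\hat{y}_k|)$. The box, the radius, the stepsize, and all function names are assumptions made for illustration; the box stands in for the compact set used in the theorem and does not reproduce its constants.

```python
def ogda_iterates(z0, eta, num_iters):
    """Return [z_1, ..., z_K] of OGDA on f(x, y) = x*y with z_{-1} = z_0."""
    def F(z):
        return (z[1], -z[0])  # F(z) = [grad_x f; -grad_y f] = (y, -x)

    z_prev, z = z0, z0
    out = []
    for _ in range(num_iters):
        Fz, Fp = F(z), F(z_prev)
        z_next = (z[0] - 2 * eta * Fz[0] + eta * Fp[0],
                  z[1] - 2 * eta * Fz[1] + eta * Fp[1])
        out.append(z_next)
        z_prev, z = z, z_next
    return out


def averaged_gap(iterates, k, radius):
    """Gap max_y f(x_hat, y) - min_x f(x, y_hat) over the box |x|, |y| <= radius."""
    x_hat = sum(z[0] for z in iterates[:k]) / k
    y_hat = sum(z[1] for z in iterates[:k]) / k
    # For f = x*y the maximum and minimum are attained at the box boundary.
    return radius * (abs(x_hat) + abs(y_hat))


iterates = ogda_iterates((1.0, 1.0), 0.25, 2000)
gaps = {k: averaged_gap(iterates, k, 1.0) for k in (500, 1000, 2000)}
scaled = {k: k * gap for k, gap in gaps.items()}  # k * gap_k stays bounded
```

In this run the product $k \cdot \text{gap}_k$ levels off as $k$ grows, which is the signature of a $1/k$ decay of the averaged gap.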