# Convergence Rate of O(1/k) for Optimistic Gradient and Extra-gradient Methods in Smooth Convex-Concave Saddle Point Problems

## Abstract

We study the iteration complexity of the optimistic gradient descent-ascent (OGDA) method and the extra-gradient (EG) method for finding a saddle point of a convex-concave unconstrained min-max problem. To do so, we first show that both OGDA and EG can be interpreted as approximate variants of the proximal point method. This is similar to the approach taken in [Nemirovski, 2004], which analyzes EG as an approximation of the 'conceptual mirror prox'. In this paper, we highlight how the gradients used in OGDA and EG approximate the gradient of the proximal point method. We then exploit this interpretation to show that both algorithms produce iterates that remain within a bounded set. We further show that the primal-dual gap of the averaged iterates generated by both of these algorithms converges with a rate of $O(1/k)$. Our theoretical analysis is of interest as it provides the first convergence rate estimate for OGDA in the general convex-concave setting. Moreover, it provides a simple convergence analysis for the EG algorithm in terms of function value without using a compactness assumption.

## 1 Introduction

Given a function $f:\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$, we consider finding a saddle point of the problem

$$\min_{x\in\mathbb{R}^m}\,\max_{y\in\mathbb{R}^n}\ f(x,y), \tag{1}$$

where a saddle point of Problem (1) is defined as a pair $(x^*, y^*)$ that satisfies

$$f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$$

for all $x\in\mathbb{R}^m$ and $y\in\mathbb{R}^n$. Throughout the paper, we assume that the function $f$ is convex-concave, i.e., for any $y\in\mathbb{R}^n$, the function $f(\cdot,y)$ is a convex function of $x$, and for any $x\in\mathbb{R}^m$, the function $f(x,\cdot)$ is a concave function of $y$. This formulation arises in several areas, including zero-sum games [Basar and Olsder, 1999], robust optimization [Ben-Tal et al., 2009], robust control [Hast et al., 2013] and more recently in machine learning in the context of Generative Adversarial Networks (GANs) (see [Goodfellow et al., 2014] for an introduction to GANs and [Arjovsky et al., 2017] for the formulation of Wasserstein GANs).
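To make the definition concrete, here is a minimal numerical check of the saddle-point inequalities on a toy instance; the function $f(x,y)=x^2-y^2$ and the candidate saddle point $(0,0)$ are our own illustrative choices, not from the paper:

```python
# Toy illustration (hypothetical example): f(x, y) = x^2 - y^2 is convex
# in x for each fixed y and concave in y for each fixed x, with the
# saddle point (x*, y*) = (0, 0).
def f(x, y):
    return x**2 - y**2

x_star, y_star = 0.0, 0.0
for v in [-2.0, -0.5, 0.3, 1.0, 3.0]:
    # saddle-point inequalities: f(x*, y) <= f(x*, y*) <= f(x, y*)
    assert f(x_star, v) <= f(x_star, y_star) <= f(v, y_star)
```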

Our goal in this paper is to analyze the convergence rate of some discrete-time gradient-based optimization algorithms for finding a saddle point of Problem (1) in the convex-concave case. In particular, we focus on the Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods because of their widespread use for training GANs (see [Daskalakis et al., 2018; Liang and Stokes, 2019]). The EG method is a classical algorithm for solving saddle point problems introduced by Korpelevich [1976]. Its linear rate of convergence for smooth and strongly convex-strongly concave functions and bilinear functions, i.e., $f(x,y) = x^\top A y$ (where $A$ is a square, full-rank matrix), was established in Korpelevich [1976] as well as the variational inequality literature (see [Tseng, 1995] and [Facchinei and Pang, 2007]). Its convergence rate for the constrained convex-concave setting was first established by Nemirovski [2004] under the assumption that the feasible set is convex and compact. Monteiro and Svaiter [2010] established a similar convergence rate for EG without assuming compactness of the feasible set by using a new termination criterion that relies on enlargement of the operator of the VI reformulation of the saddle point problem defined in [Burachik et al., 1997]. OGDA was introduced by Popov [1980] as a variant of the Extra-gradient method, and has gained popularity recently due to its performance in training GANs (see [Daskalakis et al., 2018]). To the best of our knowledge, the iteration complexity of OGDA for the convex-concave case has not been studied before.

In this paper, we provide a unified convergence analysis for establishing a sublinear convergence rate of $O(1/k)$ in terms of the function value difference of the averaged iterates and a saddle point for both OGDA and EG for convex-concave saddle point problems. Our analysis holds for unconstrained problems and does not require boundedness of the feasible set, and it establishes rate results using the function value differences as used in [Nemirovski, 2004] (suitably redefined for an unconstrained feasible set, see Section 5). Therefore, we obtain convergence of the EG method in unconstrained spaces without using the modified termination (error) criterion proposed in [Monteiro and Svaiter, 2010]. The key idea of our approach is to view both OGDA and EG iterates as approximations of the iterates of the proximal point method that was first introduced by Martinet [1970] and later studied by Rockafellar [1976]. We would like to add that the idea of interpreting EG as an approximation of the proximal point method was first studied in [Nemirovski, 2004]. He considers the conceptual mirror prox, which is similar to the proximal point method, and shows that the mirror prox algorithm (of which EG is a special case) provides a good implementable approximation to this method. Further, Monteiro and Svaiter [2010] use a similar interpretation and propose the Hybrid Proximal Extragradient method to establish the convergence of EG in unbounded settings using a different convergence criterion. More recently, Mokhtari et al. [2020] study both OGDA and EG as approximations of the proximal point method and analyze these algorithms for bilinear and strongly convex-strongly concave problems.

More specifically, we first consider a proximal point method with error and establish some key properties of its iterates. We then focus on OGDA as an approximation of the proximal point method and use this connection to show that the iterates of OGDA remain in a compact set. We incorporate this result to prove a sublinear convergence rate of $O(1/k)$ for the primal-dual gap of the averaged iterates generated by the OGDA update. We next consider EG, where two gradient pairs are used in each iteration, one to compute a midpoint and the other to find the new iterate using the gradient at the midpoint. Our first step again is to show boundedness of the iterates generated by EG. We then approximate the evolution of the midpoints using a proximal point method and use this approximation to establish a convergence rate for the function value of the averaged iterates generated by EG. As the convergence results of EG have already been established in papers including [Nemirovski, 2004] and [Monteiro and Svaiter, 2010], we relegate the proofs of the lemmas and theorems corresponding to EG to the Appendix.

### Related Work

Several recent papers have studied the convergence rate of OGDA and EG for the case when the objective function is bilinear or strongly convex-strongly concave. Daskalakis et al. [2018] showed the convergence of the OGDA iterates to a neighborhood of the solution when the objective function is bilinear. Liang and Stokes [2019] used a dynamical system approach to prove the linear convergence of the OGDA method for the special case when $f(x,y) = x^\top A y$ and the matrix $A$ is square and full rank. They also presented a linear convergence rate of the vanilla Gradient Descent Ascent (GDA) method when the objective function is strongly convex-strongly concave. Gidel et al. [2018] considered a variant of the EG method, relating it to OGDA updates, and showed the linear convergence of the corresponding EG iterates in the case where $f$ is strongly convex-strongly concave (though without showing the convergence rate for the OGDA iterates). Optimistic gradient methods have also been studied in the context of convex online learning [Chiang et al., 2012; Rakhlin and Sridharan, 2013a, b].

Nedić and Ozdaglar [2009] analyzed the (sub)Gradient Descent Ascent (GDA) algorithm for convex-concave saddle point problems when the (sub)gradients are bounded over the constraint set, showing a convergence rate of $O(1/\sqrt{k})$ in terms of the function value difference of the averaged iterates and a saddle point.

Chambolle and Pock [2011] focused on a particular case of the saddle point problem where the coupling term in the objective function is bilinear, i.e., $f(x,y) = g(x) + \langle Kx, y\rangle - h(y)$, with $g$ and $h$ convex functions. They proposed a proximal-point-based algorithm which converges at a rate of $O(1/k)$ and further showed linear convergence when the functions $g$ and $h$ are strongly convex. Chen et al. [2014] proposed an accelerated variant of this algorithm when $g$ is smooth and showed an optimal rate of $O(L_g/k^2 + L_K/k)$, where $L_g$ and $L_K$ are the smoothness parameter of $g$ and the norm of the linear operator $K$, respectively. When the functions $g$ and $h$ are strongly convex, primal-dual gradient-type methods converge linearly, as shown in Chen and Rockafellar [1997]; Bauschke and Combettes [2011]. Further, Du and Hu [2019] showed that GDA achieves a linear convergence rate in this bilinearly coupled setting when $g$ is convex and $h$ is strongly convex.

For the case where $f$ is strongly concave with respect to $y$, but possibly nonconvex with respect to $x$, Sanjabi et al. [2018] provided convergence to a first-order stationary point using an algorithm that requires running multiple updates with respect to $y$ at each step.

Notation.  Lowercase boldface $\mathbf{v}$ denotes a vector and uppercase boldface $\mathbf{A}$ denotes a matrix. We use $\|v\|$ to denote the Euclidean norm of vector $v$. Given a multi-input function $f(x,y)$, its gradients with respect to $x$ and $y$ at the point $(x_0, y_0)$ are denoted by $\nabla_x f(x_0, y_0)$ and $\nabla_y f(x_0, y_0)$, respectively. We refer to the largest and smallest eigenvalues of a matrix $A$ by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$, respectively.

## 2 Preliminaries

In this section, we present the definitions and notation used in our results.

###### Definition 1.

A function $\phi:\mathbb{R}^n\to\mathbb{R}$ is $L$-smooth if it has $L$-Lipschitz continuous gradients on $\mathbb{R}^n$, i.e., for any $x, \hat{x}\in\mathbb{R}^n$, we have

$$\|\nabla\phi(x) - \nabla\phi(\hat{x})\| \le L\,\|x - \hat{x}\|.$$
###### Definition 2.

A continuously differentiable function $\phi:\mathbb{R}^n\to\mathbb{R}$ is convex on $\mathbb{R}^n$ if for any $x, \hat{x}\in\mathbb{R}^n$, we have

$$\phi(\hat{x}) \ge \phi(x) + \nabla\phi(x)^\top(\hat{x} - x).$$

Further, $\phi$ is concave if $-\phi$ is convex.

###### Definition 3.

The pair $(x^*, y^*)$ is a saddle point of a convex-concave function $f:\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$ if for any $x\in\mathbb{R}^m$ and $y\in\mathbb{R}^n$, we have

$$f(x^*, y) \le f(x^*, y^*) \le f(x, y^*).$$

Throughout the paper, we will assume that the following conditions are satisfied.

###### Assumption 1.

The function $f(x,y)$ is continuously differentiable in $x$ and $y$. Further, for any $y\in\mathbb{R}^n$, the function $f(\cdot,y)$ is a convex function of $x$, and for any $x\in\mathbb{R}^m$, the function $f(x,\cdot)$ is a concave function of $y$.

###### Assumption 2.

The gradient $\nabla_x f$ is $L_{xx}$-Lipschitz with respect to $x$ and $L_{xy}$-Lipschitz with respect to $y$, and the gradient $\nabla_y f$ is $L_{yy}$-Lipschitz with respect to $y$ and $L_{yx}$-Lipschitz with respect to $x$, i.e.,

$$\begin{aligned}
\|\nabla_x f(x_1,y) - \nabla_x f(x_2,y)\| &\le L_{xx}\|x_1 - x_2\| &&\text{for all } y,\\
\|\nabla_x f(x,y_1) - \nabla_x f(x,y_2)\| &\le L_{xy}\|y_1 - y_2\| &&\text{for all } x,\\
\|\nabla_y f(x,y_1) - \nabla_y f(x,y_2)\| &\le L_{yy}\|y_1 - y_2\| &&\text{for all } x,\\
\|\nabla_y f(x_1,y) - \nabla_y f(x_2,y)\| &\le L_{yx}\|x_1 - x_2\| &&\text{for all } y.
\end{aligned}$$

We define $L := 2\max\{L_{xx}, L_{xy}, L_{yx}, L_{yy}\}$.

###### Assumption 3.

The solution set defined as

$$Z^* := \left\{[x; y]\in\mathbb{R}^{m+n} : (x,y) \text{ is a saddle point of Problem (1)}\right\}, \tag{2}$$

is nonempty.

In the following sections, we present and analyze three different iterative algorithms for solving the saddle point problem introduced in (1). The iterates of these algorithms are denoted by $\{x_k, y_k\}$. We denote the averaged (ergodic) iterates by $(\hat{x}_k, \hat{y}_k)$, defined as follows:

$$\hat{x}_k = \frac{1}{k}\sum_{i=1}^{k} x_i, \qquad \hat{y}_k = \frac{1}{k}\sum_{i=1}^{k} y_i. \tag{3}$$

In our convergence analysis, we use a variational inequality approach in which we define the vector $z := [x; y]\in\mathbb{R}^{m+n}$ as our decision variable and define the operator $F:\mathbb{R}^{m+n}\to\mathbb{R}^{m+n}$ as

$$F(z) = \left[\nabla_x f(x,y);\; -\nabla_y f(x,y)\right]. \tag{4}$$

In the following lemma we characterize the properties of the operator $F$ in (4) when the conditions in Assumptions 1 and 2 are satisfied. We would like to emphasize that the following lemma is well known – see, e.g., Nemirovski [2004] – and we state it for completeness.

###### Lemma 1.

Let $F$ be defined as in Equation (4). Suppose Assumptions 1 and 2 hold. Then
(a) $F$ is a monotone operator, i.e., for any $z_1, z_2\in\mathbb{R}^{m+n}$, we have

$$\langle F(z_1) - F(z_2),\, z_1 - z_2\rangle \ge 0.$$

(b) $F$ is an $L$-Lipschitz continuous operator, i.e., for any $z_1, z_2\in\mathbb{R}^{m+n}$, we have

$$\|F(z_1) - F(z_2)\| \le L\,\|z_1 - z_2\|.$$

(c) For all $z^*\in Z^*$, we have $F(z^*) = 0$.

According to Lemma 1, when $f$ is convex-concave and smooth, the operator $F$ defined in (4) is monotone and Lipschitz. The third result in Lemma 1 shows that any saddle point $(x^*, y^*)$ of Problem (1) satisfies the first-order optimality conditions, i.e., we have

$$\nabla_x f(x^*, y^*) = 0, \qquad \nabla_y f(x^*, y^*) = 0. \tag{5}$$
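As a quick numerical illustration of Lemma 1 (on our own hypothetical instance $f(x,y) = x^2 - y^2$, for which $F(z) = [2x;\, 2y]$ and the saddle point is $z^* = 0$), the monotonicity, Lipschitz continuity, and $F(z^*) = 0$ properties can be checked directly:

```python
import numpy as np

# Toy check of Lemma 1 (hypothetical instance f(x, y) = x^2 - y^2,
# so F(z) = [grad_x f; -grad_y f] = [2x; 2y], with saddle point z* = 0).
def F(z):
    return 2.0 * z           # [2x, 2y] stacked as one vector

L = 2.0                      # Lipschitz constant of F for this instance
rng = np.random.default_rng(0)
for _ in range(100):
    z1, z2 = rng.standard_normal(2), rng.standard_normal(2)
    # (a) monotonicity: <F(z1) - F(z2), z1 - z2> >= 0
    assert (F(z1) - F(z2)) @ (z1 - z2) >= 0.0
    # (b) L-Lipschitz continuity
    assert np.linalg.norm(F(z1) - F(z2)) <= L * np.linalg.norm(z1 - z2) + 1e-12
# (c) F vanishes at the saddle point
assert np.allclose(F(np.zeros(2)), 0.0)
```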

Before presenting our main results, we state the following well-known result (see, for example, Nemirovski [2004]), which will be used later in the analysis of OGDA and EG. We present the proof here for completeness.

###### Proposition 1.

Recall the definition of the operator $F$ in (4) and the averaged iterates $(\hat{x}_N, \hat{y}_N)$ in (3). Suppose Assumptions 1 and 3 hold. Then for any $z = [x; y]\in\mathbb{R}^{m+n}$, we have

$$f(\hat{x}_N, y) - f(x, \hat{y}_N) \le \frac{1}{N}\sum_{k=1}^{N} F(z_k)^\top (z_k - z). \tag{6}$$
###### Proof.

Using the definition of the operator $F$, we can write

$$\begin{aligned}
\frac{1}{N}\sum_{k=1}^{N} F(z_k)^\top(z_k - z) &= \frac{1}{N}\sum_{k=1}^{N}\left[\nabla_x f(x_k, y_k)^\top(x_k - x) + \nabla_y f(x_k, y_k)^\top(y - y_k)\right]\\
&\ge \frac{1}{N}\sum_{k=1}^{N}\left[f(x_k, y) - f(x, y_k)\right], \tag{7}
\end{aligned}$$

where the inequality holds due to the fact that $f$ is convex-concave. Using convexity of $f$ with respect to $x$ and concavity of $f$ with respect to $y$, we have

$$\frac{1}{N}\sum_{k=1}^{N}\left[f(x_k, y) - f(x, y_k)\right] \ge f(\hat{x}_N, y) - f(x, \hat{y}_N). \tag{8}$$

Combining inequalities (7) and (8) yields

$$\frac{1}{N}\sum_{k=1}^{N} F(z_k)^\top(z_k - z) \ge f(\hat{x}_N, y) - f(x, \hat{y}_N),$$

completing the proof. ∎
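The inequality (6) relies only on convexity-concavity, so it holds for arbitrary points $z_k$, not just algorithm iterates, and can be sanity-checked numerically. The instance $f(x,y) = x^2 - y^2 + xy$ below is a hypothetical example, not from the paper:

```python
import numpy as np

# Numerical sanity check of inequality (6) on a hypothetical instance:
# f(x, y) = x^2 - y^2 + x*y, which is convex in x and concave in y.
def f(x, y):
    return x**2 - y**2 + x * y

def F(z):                                    # F(z) = [grad_x f; -grad_y f]
    x, y = z
    return np.array([2 * x + y, 2 * y - x])

rng = np.random.default_rng(1)
zs = rng.standard_normal((50, 2))            # arbitrary points z_1, ..., z_N
x_hat, y_hat = zs.mean(axis=0)               # averaged iterates, cf. (3)
for _ in range(20):
    x, y = rng.standard_normal(2)            # arbitrary comparison point z = (x, y)
    rhs = np.mean([F(z) @ (z - np.array([x, y])) for z in zs])
    assert f(x_hat, y) - f(x, y_hat) <= rhs + 1e-9
```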

## 3 Proximal point method with error

One of the classical algorithms studied for solving the saddle point problem in (1) is the Proximal Point (PP) method, introduced in Martinet [1970] and studied in Rockafellar [1976]. The PP method generates the iterate $z_{k+1} = [x_{k+1}; y_{k+1}]$, which is defined as the unique solution to the saddle point problem

$$\min_{x\in\mathbb{R}^m}\,\max_{y\in\mathbb{R}^n}\left\{f(x,y) + \frac{1}{2\eta}\|x - x_k\|^2 - \frac{1}{2\eta}\|y - y_k\|^2\right\}. \tag{9}$$

It can be verified that if the pair $(x_{k+1}, y_{k+1})$ is the solution of problem (9), then $x_{k+1}$ and $y_{k+1}$ satisfy

$$\begin{aligned}
x_{k+1} &= \operatorname*{argmin}_{x\in\mathbb{R}^m}\left\{f(x, y_{k+1}) + \frac{1}{2\eta}\|x - x_k\|^2\right\}, &\text{(10)}\\
y_{k+1} &= \operatorname*{argmax}_{y\in\mathbb{R}^n}\left\{f(x_{k+1}, y) - \frac{1}{2\eta}\|y - y_k\|^2\right\}. &\text{(11)}
\end{aligned}$$

Using the optimality conditions of the updates in (10) and (11) (which are necessary and sufficient since the problems in (10) and (11) are strongly convex and strongly concave, respectively), the update of the PP method for the saddle point problem in (1) can be written as

$$x_{k+1} = x_k - \eta\nabla_x f(x_{k+1}, y_{k+1}), \qquad y_{k+1} = y_k + \eta\nabla_y f(x_{k+1}, y_{k+1}). \tag{12}$$

It is well known that the proximal point method achieves a sublinear rate of $O(1/k)$, where $k$ is the number of iterations, for convex minimization and for solving monotone variational inequalities (see Güler [1991, 1992]; Bruck Jr [1977]; Teboulle [1997]; Nemirovski [2004]). Note that Nemirovski [2004] in fact analyzed the conceptual mirror prox (the proximal point method) as a building block to analyze the mirror prox algorithm. For completeness, we present the convergence rate of the proximal point method for convex-concave saddle point problems in the following theorem (see Appendix A for the proof).
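For intuition, when $F$ is affine, say $F(z) = Az$ with a monotone matrix $A$ (a hypothetical example we introduce here; the paper treats general smooth convex-concave $f$), the implicit update (12) can be solved in closed form as $z_{k+1} = (I + \eta A)^{-1} z_k$:

```python
import numpy as np

# Hypothetical affine setting: F(z) = A z with monotone A (positive
# semidefinite symmetric part). The implicit PP update
# z_{k+1} = z_k - eta * F(z_{k+1}) then reads (I + eta*A) z_{k+1} = z_k.
A = np.array([[2.0, 1.0],
              [-1.0, 2.0]])   # operator of f(x, y) = x^2 - y^2 + x*y
eta = 0.5
M = np.linalg.inv(np.eye(2) + eta * A)

z = np.array([3.0, -2.0])
for _ in range(200):
    z_next = M @ z
    # z_next indeed solves the implicit equation z_next = z - eta * A z_next
    assert np.allclose(z_next, z - eta * A @ z_next)
    z = z_next

assert np.linalg.norm(z) < 1e-6   # iterates approach the saddle point z* = 0
```

Because the step is implicit, the method is stable for any $\eta > 0$ in this setting; the explicit methods below trade this stability for implementability.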

###### Theorem 1.

Suppose Assumption 1 holds. Let $\{x_k, y_k\}$ be the iterates generated by the proximal point updates in (12). Consider the definition of the averaged iterates $(\hat{x}_N, \hat{y}_N)$ in (3). Then for all $N \ge 1$, we have

$$f(\hat{x}_N, y^*) - f(x^*, \hat{y}_N) \le \frac{\|x_0 - x^*\|^2 + \|y_0 - y^*\|^2}{2\eta N}. \tag{13}$$

The result in Theorem 1 shows that by following the update of the proximal point method, the gap between the function value of the averaged iterates and the function value at a saddle point of Problem (1) approaches zero at a sublinear rate of $O(1/N)$.

Our goal is to provide similar convergence rate estimates for OGDA and EG using the fact that these two methods can be interpreted as approximate versions of the proximal point method. To do so, let us first rewrite the update of the proximal point method given in (12) as

$$z_{k+1} = z_k - \eta F(z_{k+1}), \tag{14}$$

where $z_k = [x_k; y_k]$ and the operator $F$ is defined in (4). In the following proposition, we establish a relation for the iterates of a proximal point method with error. This relation will be used later in our analysis of the OGDA and EG methods.

###### Proposition 2.

Consider the sequence of iterates $\{z_k\}$ generated by the following update

$$z_{k+1} = z_k - \eta F(z_{k+1}) + \varepsilon_k, \tag{15}$$

where $F$ is a monotone and Lipschitz continuous operator, $\varepsilon_k\in\mathbb{R}^{m+n}$ is an arbitrary error vector, and $\eta > 0$ is a positive constant. Then for any $z\in\mathbb{R}^{m+n}$ and for each $k \ge 0$, we have

$$F(z_{k+1})^\top(z_{k+1} - z) = \frac{1}{2\eta}\|z_k - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z_k\|^2 + \frac{1}{\eta}\varepsilon_k^\top(z_{k+1} - z). \tag{16}$$
###### Proof.

According to the update in (15), for any $z$ we have

$$\|z_{k+1} - z\|^2 = \|z_k - z\|^2 - 2\eta(z_k - z)^\top F(z_{k+1}) + \eta^2\|F(z_{k+1})\|^2 + \|\varepsilon_k\|^2 + 2\varepsilon_k^\top\big(z_k - z - \eta F(z_{k+1})\big). \tag{17}$$

We add and subtract the inner product $2\eta\, z_{k+1}^\top F(z_{k+1})$ on the right-hand side and regroup the terms to obtain

$$\begin{aligned}
\|z_{k+1} - z\|^2 &= \|z_k - z\|^2 - 2\eta(z_{k+1} - z)^\top F(z_{k+1}) - 2\eta(z_k - z_{k+1})^\top F(z_{k+1})\\
&\quad + \eta^2\|F(z_{k+1})\|^2 + \|\varepsilon_k\|^2 + 2\varepsilon_k^\top\big(z_k - z - \eta F(z_{k+1})\big). \tag{18}
\end{aligned}$$

Replacing $\eta F(z_{k+1})$ with $z_k - z_{k+1} + \varepsilon_k$ (which follows from the update (15)), we obtain

$$\begin{aligned}
\|z_{k+1} - z\|^2 &= \|z_k - z\|^2 - 2\eta(z_{k+1} - z)^\top F(z_{k+1}) + 2(z_k - z_{k+1})^\top(z_{k+1} - z_k - \varepsilon_k)\\
&\quad + \|z_{k+1} - z_k - \varepsilon_k\|^2 + \|\varepsilon_k\|^2 + 2\varepsilon_k^\top(z_{k+1} - z - \varepsilon_k)\\
&= \|z_k - z\|^2 - 2\eta(z_{k+1} - z)^\top F(z_{k+1}) - \|z_{k+1} - z_k\|^2 + 2\varepsilon_k^\top(z_{k+1} - z). \tag{19}
\end{aligned}$$

Rearranging the terms, we obtain

$$F(z_{k+1})^\top(z_{k+1} - z) = \frac{1}{2\eta}\|z_k - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z_k\|^2 + \frac{1}{\eta}\varepsilon_k^\top(z_{k+1} - z), \tag{20}$$

and the proof is complete. ∎
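Since (16) is an algebraic identity given the update (15), it can be verified numerically by choosing $z_k$, $z_{k+1}$, and $z$ arbitrarily and defining $\varepsilon_k$ so that (15) holds; the operator below is an arbitrary monotone map chosen only for illustration:

```python
import numpy as np

# Numerical check of the identity (16) in Proposition 2: pick arbitrary
# z_k, z_{k+1}, z, then define eps_k = z_{k+1} - z_k + eta * F(z_{k+1})
# so that update (15) holds, and verify that both sides agree.
def F(z):                    # any operator works for the algebraic identity;
    return np.tanh(z)        # tanh is coordinate-wise monotone and 1-Lipschitz

rng = np.random.default_rng(2)
eta = 0.1
for _ in range(100):
    z_k, z_k1, z = rng.standard_normal((3, 4))
    eps = z_k1 - z_k + eta * F(z_k1)      # enforces (15)
    lhs = F(z_k1) @ (z_k1 - z)
    rhs = (np.sum((z_k - z)**2) - np.sum((z_k1 - z)**2)
           - np.sum((z_k1 - z_k)**2)) / (2 * eta) + eps @ (z_k1 - z) / eta
    assert np.isclose(lhs, rhs)
```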

## 4 Optimistic Gradient Descent Ascent

In this section, we focus on analyzing the performance of optimistic gradient descent ascent (OGDA) for finding a saddle point of a general smooth convex-concave function. It has been shown that the OGDA method achieves the same iteration complexity as the proximal point method for both strongly convex-strongly concave and bilinear problems; see Liang and Stokes [2019], Gidel et al. [2018], Mokhtari et al. [2020]. However, its iteration complexity for a general smooth convex-concave case has not been established, to the best of our knowledge. In this section, we show that the function value of the averaged iterates generated by the OGDA method converges to the function value at a saddle point at a rate of $O(1/k)$, which matches the convergence rate of the proximal point method shown in Theorem 1.

Given a stepsize $\eta > 0$, the OGDA method updates the iterates $x_k$ and $y_k$ for each $k \ge 0$ as

$$\begin{aligned}
x_{k+1} &= x_k - 2\eta\nabla_x f(x_k, y_k) + \eta\nabla_x f(x_{k-1}, y_{k-1}),\\
y_{k+1} &= y_k + 2\eta\nabla_y f(x_k, y_k) - \eta\nabla_y f(x_{k-1}, y_{k-1}), \tag{21}
\end{aligned}$$

with the initial conditions $x_{-1} = x_0$ and $y_{-1} = y_0$. The main difference between the updates of OGDA in (21) and the gradient descent ascent (GDA) method is in the additional "momentum" terms $\eta(\nabla_x f(x_k, y_k) - \nabla_x f(x_{k-1}, y_{k-1}))$ and $\eta(\nabla_y f(x_k, y_k) - \nabla_y f(x_{k-1}, y_{k-1}))$. These additional terms make the update of OGDA a better approximation to the update of the proximal point method than the update of GDA; for more details we refer readers to Proposition 1 in Mokhtari et al. [2020].
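A minimal sketch of the OGDA iteration (21) on a hypothetical convex-concave quadratic, $f(x,y) = x^2 - y^2 + xy$ (our own choice of instance and stepsize, not from the paper), with the averaged iterates of (3) approaching the saddle point at the origin:

```python
# OGDA sketch on the hypothetical instance f(x, y) = x^2 - y^2 + x*y,
# which is convex in x, concave in y, with saddle point at the origin.
def grads(x, y):
    return 2 * x + y, -2 * y + x   # grad_x f, grad_y f

eta = 0.1                          # small enough for eta <= 1/(2L) here
x_prev, y_prev = 1.0, -1.0         # initial conditions x_{-1} = x_0, y_{-1} = y_0
x, y = 1.0, -1.0
xs, ys = [], []
for _ in range(2000):
    gx, gy = grads(x, y)
    gxp, gyp = grads(x_prev, y_prev)
    x_new = x - 2 * eta * gx + eta * gxp   # x-update of (21)
    y_new = y + 2 * eta * gy - eta * gyp   # y-update of (21)
    x_prev, y_prev, x, y = x, y, x_new, y_new
    xs.append(x)
    ys.append(y)

x_hat, y_hat = sum(xs) / len(xs), sum(ys) / len(ys)  # averaged iterates, cf. (3)
assert abs(x_hat) < 1e-2 and abs(y_hat) < 1e-2
```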

To establish the convergence rate of OGDA for convex-concave problems, we first illustrate the connection between the updates of the proximal point method and OGDA. Note that using the definitions of the vector $z$ and the operator $F$, we can rewrite the update of the OGDA algorithm at iteration $k$ as

$$z_{k+1} = z_k - 2\eta F(z_k) + \eta F(z_{k-1}). \tag{22}$$

Considering this expression, we can also write the update of OGDA as an approximation of the proximal point update, i.e.,

$$z_{k+1} = z_k - \eta F(z_{k+1}) + \varepsilon_k, \tag{23}$$

where the error vector $\varepsilon_k$ is given by

$$\varepsilon_k = \eta\big[(F(z_{k+1}) - F(z_k)) - (F(z_k) - F(z_{k-1}))\big]. \tag{24}$$
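The equivalence between (22) and (23) with the error vector (24) is an algebraic identity, which can be confirmed numerically; the operator $F$ below corresponds to the hypothetical instance $f(x,y) = x^2 - y^2 + xy$, chosen only for illustration:

```python
import numpy as np

# Check that the OGDA step (22) equals the perturbed proximal step (23)
# with the error vector eps_k of (24). The operator F below corresponds
# to the hypothetical instance f(x, y) = x^2 - y^2 + x*y.
def F(z):
    return np.array([2 * z[0] + z[1], 2 * z[1] - z[0]])

rng = np.random.default_rng(3)
eta = 0.1
for _ in range(50):
    z_km1, z_k = rng.standard_normal((2, 2))
    z_kp1 = z_k - 2 * eta * F(z_k) + eta * F(z_km1)            # OGDA, eq. (22)
    eps = eta * ((F(z_kp1) - F(z_k)) - (F(z_k) - F(z_km1)))    # eq. (24)
    assert np.allclose(z_kp1, z_k - eta * F(z_kp1) + eps)      # eq. (23)
```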

To derive the convergence rate of OGDA for the unconstrained problem in (1), we first use the result in Proposition 2 to derive a result for the specific case of the OGDA updates. We then show that the iterates generated by the OGDA method remain in a bounded set. This is done in the following lemma. (Note that boundedness of the OGDA iterates can be deduced from [Popov, 1980], and a result similar to Lemma 2(b) was shown in a recent independent paper by Malitsky and Tam [2018].)

###### Lemma 2.

Let $\{z_k\}$ be the iterates generated by the optimistic gradient descent ascent (OGDA) method introduced in (22) with the initial conditions $x_{-1} = x_0$ and $y_{-1} = y_0$ (i.e., $z_{-1} = z_0$). If Assumptions 1, 2, and 3 hold and the stepsize satisfies the condition $\eta \le \frac{1}{2L}$, then:
(a) The iterates $\{z_k\}$ satisfy the following relation for any $z\in\mathbb{R}^{m+n}$:

$$\begin{aligned}
F(z_{k+1})^\top(z_{k+1} - z) &\le \frac{1}{2\eta}\|z_k - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z\|^2 - \frac{L}{2}\|z_{k+1} - z_k\|^2 + \frac{L}{2}\|z_k - z_{k-1}\|^2\\
&\quad + (F(z_{k+1}) - F(z_k))^\top(z_{k+1} - z) - (F(z_k) - F(z_{k-1}))^\top(z_k - z). \tag{25}
\end{aligned}$$

(b) The iterates $\{z_k\}$ stay within the compact set $\mathcal{D}$ defined as

$$\mathcal{D} := \left\{(x,y) \mid \|x - x^*\|^2 + \|y - y^*\|^2 \le 2\left(\|x_0 - x^*\|^2 + \|y_0 - y^*\|^2\right)\right\}, \tag{26}$$

where $(x^*, y^*)$ is a saddle point of the problem defined in (1).

###### Proof.

Since the OGDA iterates satisfy Equation (23) with the error vector $\varepsilon_k$ given in Equation (24), using Proposition 2 with this error vector leads to

$$\begin{aligned}
F(z_{k+1})^\top(z_{k+1} - z) &= \frac{1}{2\eta}\|z_k - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z_k\|^2\\
&\quad + (F(z_{k+1}) - F(z_k))^\top(z_{k+1} - z) - (F(z_k) - F(z_{k-1}))^\top(z_{k+1} - z). \tag{27}
\end{aligned}$$

We add and subtract the inner product $(F(z_k) - F(z_{k-1}))^\top z_k$ on the right-hand side of the preceding relation to obtain

$$\begin{aligned}
F(z_{k+1})^\top(z_{k+1} - z) &= \frac{1}{2\eta}\|z_k - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z_k\|^2\\
&\quad + (F(z_{k+1}) - F(z_k))^\top(z_{k+1} - z) - (F(z_k) - F(z_{k-1}))^\top(z_k - z)\\
&\quad + (F(z_k) - F(z_{k-1}))^\top(z_k - z_{k+1}). \tag{28}
\end{aligned}$$

Note that $(F(z_k) - F(z_{k-1}))^\top(z_k - z_{k+1})$ can be upper bounded by

$$\begin{aligned}
(F(z_k) - F(z_{k-1}))^\top(z_k - z_{k+1}) &\le \|F(z_k) - F(z_{k-1})\|\,\|z_k - z_{k+1}\|\\
&\le L\|z_k - z_{k-1}\|\,\|z_k - z_{k+1}\|\\
&\le \frac{L}{2}\|z_k - z_{k-1}\|^2 + \frac{L}{2}\|z_k - z_{k+1}\|^2, \tag{29}
\end{aligned}$$

where the second inequality holds due to Lipschitz continuity of the operator $F$ (Lemma 1(b)) and the last inequality holds due to Young's inequality. Replacing $(F(z_k) - F(z_{k-1}))^\top(z_k - z_{k+1})$ in (28) by its upper bound in (29) yields

$$\begin{aligned}
F(z_{k+1})^\top(z_{k+1} - z) &\le \frac{1}{2\eta}\|z_k - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z_k\|^2\\
&\quad + (F(z_{k+1}) - F(z_k))^\top(z_{k+1} - z) - (F(z_k) - F(z_{k-1}))^\top(z_k - z)\\
&\quad + \frac{L}{2}\|z_k - z_{k-1}\|^2 + \frac{L}{2}\|z_{k+1} - z_k\|^2\\
&\le \frac{1}{2\eta}\|z_k - z\|^2 - \frac{1}{2\eta}\|z_{k+1} - z\|^2 - \frac{L}{2}\|z_{k+1} - z_k\|^2 + \frac{L}{2}\|z_k - z_{k-1}\|^2\\
&\quad + (F(z_{k+1}) - F(z_k))^\top(z_{k+1} - z) - (F(z_k) - F(z_{k-1}))^\top(z_k - z), \tag{30}
\end{aligned}$$

where the second inequality follows as $\eta \le \frac{1}{2L}$ and therefore $\frac{1}{2\eta} \ge L$. This completes the proof of Part (a) of the lemma. Now, taking the sum of the preceding relation from $k = 0$ to $N-1$, we obtain

$$\begin{aligned}
\sum_{k=0}^{N-1} F(z_{k+1})^\top(z_{k+1} - z) &\le \frac{1}{2\eta}\|z_0 - z\|^2 - \frac{1}{2\eta}\|z_N - z\|^2 - \frac{L}{2}\|z_N - z_{N-1}\|^2 + \frac{L}{2}\|z_0 - z_{-1}\|^2\\
&\quad + (F(z_N) - F(z_{N-1}))^\top(z_N - z) - (F(z_0) - F(z_{-1}))^\top(z_0 - z). \tag{31}
\end{aligned}$$

Now set $z = z^*$, where $z^*\in Z^*$ is a saddle point, to obtain

$$\begin{aligned}
\sum_{k=0}^{N-1} F(z_{k+1})^\top(z_{k+1} - z^*) &\le \frac{1}{2\eta}\|z_0 - z^*\|^2 - \frac{1}{2\eta}\|z_N - z^*\|^2 - \frac{L}{2}\|z_N - z_{N-1}\|^2 + \frac{L}{2}\|z_0 - z_{-1}\|^2\\
&\quad + (F(z_N) - F(z_{N-1}))^\top(z_N - z^*) - (F(z_0) - F(z_{-1}))^\top(z_0 - z^*). \tag{32}
\end{aligned}$$

Note that each term of the summand on the left-hand side is nonnegative due to monotonicity of $F$ and the fact that $F(z^*) = 0$ (Lemma 1(c)), and therefore the sum is also nonnegative. Further, we know that $z_{-1} = z_0$. Using these observations we can write

$$0 \le \frac{1}{2\eta}\|z_0 - z^*\|^2 - \frac{1}{2\eta}\|z_N - z^*\|^2 - \frac{L}{2}\|z_N - z_{N-1}\|^2 + (F(z_N) - F(z_{N-1}))^\top(z_N - z^*). \tag{33}$$

Using Lipschitz continuity of the operator $F$ (Lemma 1(b)) and Young's inequality in the preceding relation, we have

$$\begin{aligned}
0 &\le \frac{1}{2\eta}\|z_0 - z^*\|^2 - \frac{1}{2\eta}\|z_N - z^*\|^2 - \frac{L}{2}\|z_N - z_{N-1}\|^2 + L\|z_N - z_{N-1}\|\,\|z_N - z^*\|\\
&\le \frac{1}{2\eta}\|z_0 - z^*\|^2 - \frac{1}{2\eta}\|z_N - z^*\|^2 - \frac{L}{2}\|z_N - z_{N-1}\|^2 + \frac{L}{2}\|z_N - z_{N-1}\|^2 + \frac{L}{2}\|z_N - z^*\|^2. \tag{34}
\end{aligned}$$

Regrouping the terms gives us

$$\|z_N - z^*\|^2 \le \frac{1}{1 - \eta L}\|z_0 - z^*\|^2. \tag{35}$$

Using the condition $\eta \le \frac{1}{2L}$, it follows that for any iterate $N \ge 0$ we have

$$\|z_N - z^*\|^2 \le 2\|z_0 - z^*\|^2, \tag{36}$$

and the claim in Part (b) follows. ∎

According to Lemma 2, the sequence of iterates generated by the OGDA method stays within a closed, bounded, and convex set. We use this result to prove a sublinear convergence rate of $O(1/k)$ for the function value of the averaged iterates generated by OGDA to the function value at a saddle point, for smooth convex-concave functions, in the following theorem.

###### Theorem 2.

Suppose Assumptions 1, 2, and 3 hold. Let $\{x_k, y_k\}$ be the iterates generated by the OGDA updates in (21). Let the initial conditions satisfy $x_{-1} = x_0$ and $y_{-1} = y_0$. Consider the definition of the averaged iterates in (3) and the compact convex set $\mathcal{D}$ in (26). If the stepsize satisfies the condition $\eta \le \frac{1}{2L}$, then for all $N \ge 1$, we have

$$\left[\max_{y:(\hat{x}_N, y)\in\mathcal{D}} f(\hat{x}_N, y) - f^\star\right] + \left[f^\star - \min_{x:(x, \hat{y}_N)\in\mathcal{D}} f(x, \hat{y}_N)\right] \le \frac{D\left(8L + \frac{1}{2\eta}\right)}{N}, \tag{37}$$

where $f^\star := f(x^*, y^*)$ and $D := \|x_0 - x^*\|^2 + \|y_0 - y^*\|^2$.

###### Proof.

From Lemma 2(a), we have that the iterates generated by the OGDA method satisfy Equation (25). On taking the sum of this relation from $k = 0$ to $N-1$, we obtain for any $z\in\mathbb{R}^{m+n}$

$$\begin{aligned}
\sum_{k=0}^{N-1} F(z_{k+1})^\top(z_{k+1} - z) &\le \frac{1}{2\eta}\|z_0 - z\|^2 - \frac{1}{2\eta}\|z_N - z\|^2 - \frac{L}{2}\|z_N - z_{N-1}\|^2 + \frac{L}{2}\|z_0 - z_{-1}\|^2\\
&\quad + (F(z_N) - F(z_{N-1}))^\top(z_N - z) - (F(z_0) - F(z_{-1}))^\top(z_0 - z). \tag{38}
\end{aligned}$$

Note that for any $z_1, z_2\in\mathcal{D}$, we have

$$\|z_1 - z_2\|^2 \le 2\|z_1 - z^*\|^2 + 2\|z_2 - z^*\|^2 \le 4\|z_0 - z^*\|^2 + 4\|z_0 - z^*\|^2 \le 8D, \tag{39}$$

where we have used the fact that $\|a - b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ along with the fact that $z_1, z_2\in\mathcal{D}$ and the definition of $D$. As $z_{-1} = z_0$ and $z_k\in\mathcal{D}$ for all $k \ge 0$, for any $z\in\mathcal{D}$ we have

$$\frac{1}{N}\sum_{k=0}^{N-1} F(z_{k+1})^\top(z_{k+1} - z)$$