On the Convergence of Single-Call Stochastic Extra-Gradient Methods
Variational inequalities have recently attracted considerable interest in machine learning as a flexible paradigm for models that go beyond ordinary loss function minimization (such as generative adversarial networks and related deep learning systems). In this setting, the optimal convergence rate for solving smooth monotone variational inequalities is achieved by the Extra-Gradient (EG) algorithm and its variants. Aiming to alleviate the cost of an extra gradient step per iteration (which can become quite substantial in deep learning applications), several algorithms have been proposed as surrogates to Extra-Gradient with a single oracle call per iteration. In this paper, we develop a synthetic view of such algorithms, and we complement the existing literature by showing that they retain an $\mathcal{O}(1/t)$ ergodic convergence rate in smooth, deterministic problems. Subsequently, beyond the monotone deterministic case, we also show that the last iterate of single-call, stochastic extra-gradient methods still enjoys a local convergence rate to solutions of non-monotone variational inequalities that satisfy a second-order sufficient condition.
Deep learning is arguably the fastest-growing field in artificial intelligence: its applications range from image recognition and natural language processing to medical anomaly detection, drug discovery, and most fields where computers are required to make sense of massive amounts of data. In turn, this has spearheaded a prolific research thrust in optimization theory with the twofold aim of demystifying the successes of deep learning models and of providing novel methods to overcome their failures.
Introduced by GPAM+14, generative adversarial networks have become the youngest torchbearers of the deep learning revolution and have occupied the forefront of this drive in more ways than one. First, the adversarial training of deep neural nets has given rise to new challenges regarding the efficient allocation of parallelizable resources, the compatibility of the chosen architectures, etc. Second, the loss landscape in GANs is no longer that of a minimization problem but that of a zero-sum, min-max game – or, more generally, a variational inequality (VI).
Variational inequalities are a flexible and widely studied framework in optimization which, among others, incorporates minimization, saddle-point, Nash equilibrium, and fixed point problems.
As such, there is an extensive literature devoted to solving variational inequalities in different contexts;
for an introduction, see [FP03, BC17] and references therein.
In particular, in the setting of monotone variational inequalities with Lipschitz continuous operators, it is well known that the optimal rate of convergence is $\mathcal{O}(1/t)$, and that this rate is achieved by the EG algorithm of Kor76 and its Bregman variant, the Mirror-Prox (MP) algorithm of Nem04.
These algorithms require two projections and two oracle calls per iteration, so they are more costly than standard Forward-Backward / descent methods. As a result, there are two complementary strands of literature aiming to reduce one (or both) of these cost multipliers – that is, the number of projections and/or the number of oracle calls per iteration. The first class contains algorithms like the Forward-Backward-Forward (FBF) method of Tse00, while the second focuses on gradient extrapolation mechanisms like Popov’s modified Arrow–Hurwicz algorithm [Pop80].
In deep learning, the latter direction has attracted considerably more interest than the former. The main reason for this is that neural net training often does not involve constraints (and, when it does, they are relatively cheap to handle). On the other hand, gradient calculations can become very costly, so a decrease in the number of oracle calls could offer significant practical benefits. In view of this, our aim in this paper is (i) to develop a synthetic approach to methods that retain the anticipatory properties of the Extra-Gradient algorithm while making a single oracle call per iteration; and (ii) to derive quantitative convergence results for such single-call extra-gradient (1-EG) algorithms.
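| | Monotone: Ergodic | Monotone: Last Iterate | Strongly monotone: Ergodic | Strongly monotone: Last Iterate |
| Deterministic | $\mathcal{O}(1/t)$ (this paper) | Unknown | $\mathcal{O}(1/t)$ | Geometric [GBVV+19, MOP19a, Mal15] |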
Our first contribution complements the existing literature (reviewed below and in Section 3) by showing that the class of single-call extra-gradient (1-EG) algorithms under study attains the optimal convergence rate of the two-call method in deterministic variational inequalities with a monotone, Lipschitz continuous operator. Subsequently, we show that this rate is also achieved in stochastic variational inequalities with strongly monotone operators provided that the optimizer has access to an oracle with bounded variance (but not necessarily bounded second moments).
Importantly, this stochastic result concerns both the method’s “ergodic average” (a weighted average of the sequence of points generated by the algorithm) as well as its “last iterate” (the last generated point). The reason for this dual focus is that averaging can be very useful in convex/monotone landscapes, but it is not as beneficial in non-monotone problems (where Jensen’s inequality does not apply). On that account, last-iterate convergence results comprise an essential stepping stone for venturing beyond monotone problems.
Armed with these encouraging results, we then focus on non-monotone problems and show that, with high probability, the method’s last iterate exhibits a local convergence rate to solutions of non-monotone variational inequalities that satisfy a second-order sufficient condition. To the best of our knowledge, this is the first convergence rate guarantee of this type for stochastic, non-monotone variational inequalities.
The prominence of Extra-Gradient/Mirror-Prox methods in solving variational inequalities and saddle-point problems has given rise to a vast corpus of literature to which we cannot hope to do justice here. Especially in the context of adversarial networks, there has been a flurry of recent activity relating variants of the Extra-Gradient algorithm to GAN training, see e.g., [DISZ18, YSXJ+18, GBVV+19, GHPL+19, MLZF+19, CGFLJ19, LS19] and references therein. For concreteness, we focus here on algorithms with a single-call structure and refer the reader to Sections 3, 4 and 5 for additional details.
The first variant of Extra-Gradient with a single oracle call per iteration dates back to Pop80.
This algorithm was subsequently studied by, among others, CYLM+12, RS13-NIPS, RS13-COLT and GBVV+19;
see also [Mal15, CS16] for a “reflected” variant, [DISZ18, PDZC19, MOP19a, MOP19b] for an “optimistic” one, and Section 3 for a discussion of the differences between these variants.
In the context of deterministic, strongly monotone variational inequalities with Lipschitz continuous operators, the last iterate of the method was shown to exhibit a geometric convergence rate [Tse95, GBVV+19, Mal15, MOP19a];
similar geometric convergence results also extend to bilinear saddle-point problems [Tse95, GBVV+19, PDZC19], even though the operator involved is not strongly monotone.
In turn, this implies the convergence of the method’s ergodic average, but only at an $\mathcal{O}(1/t)$ rate (because of the hysteresis of the average).
In view of this, the fact that 1-EG methods retain the optimal convergence rate in deterministic variational inequalities without strong monotonicity assumptions closes an important gap in the literature.
At the local level, the geometric convergence results discussed above echo a surge of interest in local convergence guarantees of optimization algorithms applied to games and saddle-point problems, see e.g., [LS19, pmlr-v89-adolphs19a, daskalakis2018limit, pmlr-v80-balduzzi18a] and references therein. In more detail, LS19 proved local geometric convergence for several algorithms in possibly non-monotone saddle-point problems under a local smoothness condition. In a similar vein, daskalakis2018limit analyzed the limit points of (optimistic) gradient descent, and showed that local saddle points are stable stationary points; subsequently, pmlr-v89-adolphs19a and mazumdar2019finding proposed a class of algorithms that eliminate stationary points which are not local Nash equilibria.
Geometric convergence results of this type are inherently deterministic because they rely on an associated resolvent operator being firmly nonexpansive – or, equivalently, rely on the use of the center manifold theorem. In a stochastic setting, these techniques are no longer applicable because the contraction property cannot be maintained in the presence of noise; in fact, unless the problem at hand is amenable to variance reduction – e.g., as in [IJOT17, BMSV19, CGFLJ19] – geometric convergence is not possible if the noise process is even weakly isotropic. Instead, for monotone problems, CS16 and GBVV+19 showed that the ergodic average of the method attains an $\mathcal{O}(1/\sqrt{t})$ convergence rate. Our global convergence results for stochastic variational inequalities improve this rate to $\mathcal{O}(1/t)$ in strongly monotone variational inequalities for both the method’s ergodic average and its last iterate. In the same light, our local convergence results for non-monotone variational inequalities provide a key extension of local, deterministic convergence results to a fully stochastic setting, all the while retaining the fastest convergence rate for monotone variational inequalities.
For convenience, our contributions relative to the state of the art are summarized in Table 1.
2 Problem setup and blanket assumptions
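We begin by presenting the basic variational inequality framework that we will consider throughout the sequel. To that end, let $\mathcal{X}$ be a nonempty closed convex subset of $\mathbb{R}^d$, and let $V\colon \mathbb{R}^d \to \mathbb{R}^d$ be a single-valued operator on $\mathbb{R}^d$. In its most general form, the variational inequality (VI) problem associated to $V$ and $\mathcal{X}$ can be stated as:
\[ \text{Find } x^* \in \mathcal{X} \text{ such that } \langle V(x^*), x - x^* \rangle \ge 0 \text{ for all } x \in \mathcal{X}. \tag{VI} \]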
To provide some intuition about (VI), we discuss two important examples below:
Example 1 (Loss minimization).
Suppose that $V = \nabla f$ for some smooth loss function $f$ on $\mathcal{X} = \mathbb{R}^d$. Then, $x^* \in \mathcal{X}$ is a solution to (VI) if and only if $\nabla f(x^*) = 0$, i.e., if and only if $x^*$ is a critical point of $f$. Of course, if $f$ is convex, any such solution is a global minimizer.∎
Example 2 (Min-max optimization).
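Suppose that $\mathcal{X}$ decomposes as $\mathcal{X} = \Theta \times \Phi$ with $\Theta = \mathbb{R}^{d_1}$, $\Phi = \mathbb{R}^{d_2}$, and assume $V = (\nabla_\theta \mathcal{L}, -\nabla_\phi \mathcal{L})$ for some smooth function $\mathcal{L}(\theta, \phi)$, $\theta \in \Theta$, $\phi \in \Phi$. As in Example 1 above, the solutions to (VI) correspond to the critical points of $\mathcal{L}$; if, in addition, $\mathcal{L}$ is convex-concave, any solution $x^* = (\theta^*, \phi^*)$ of (VI) is a global saddle-point, i.e.,
\[ \mathcal{L}(\theta^*, \phi) \le \mathcal{L}(\theta^*, \phi^*) \le \mathcal{L}(\theta, \phi^*) \quad \text{for all } \theta \in \Theta,\ \phi \in \Phi. \]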
Given the original formulation of GANs as (stochastic) saddle-point problems [GPAM+14], this observation has been at the core of a vigorous literature at the interface between optimization, game theory, and deep learning, see e.g., [DISZ18, YSXJ+18, MLZF+19, GBVV+19, PDZC19, LS19, CGFLJ19] and references therein.∎
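The operator analogue of convexity for a function is monotonicity, i.e.,
\[ \langle V(x') - V(x), x' - x \rangle \ge 0 \quad \text{for all } x, x' \in \mathbb{R}^d. \]
Specifically, when $V = \nabla f$ for some sufficiently smooth function $f$, this condition is equivalent to $f$ being convex [BC17]. In this case, following Nes07, Nes09 and JNT11, the quality of a candidate solution $\hat{x} \in \mathcal{X}$ can be assessed via the so-called error (or merit) function
\[ \operatorname{Err}(\hat{x}) = \sup_{x \in \mathcal{X}} \langle V(x), \hat{x} - x \rangle \]
and/or its restricted variant
\[ \operatorname{Err}_R(\hat{x}) = \max_{x \in \mathcal{X}_R} \langle V(x), \hat{x} - x \rangle, \]
where $\mathcal{X}_R \equiv \{x \in \mathcal{X} : \|x\| \le R\}$ denotes the “restricted domain” of the problem. More precisely, we have the following basic result.
Lemma 1 (Nes07).
In light of this result, $\operatorname{Err}$ and $\operatorname{Err}_R$ will be among our principal measures of convergence in the sequel.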
With all this in hand, we present below the main assumptions that will underlie the bulk of the analysis to follow.
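Assumption 1. The solution set of (VI) is nonempty.
Assumption 2. The operator $V$ is $\beta$-Lipschitz continuous, i.e.,
\[ \|V(x') - V(x)\| \le \beta \|x' - x\| \quad \text{for all } x, x' \in \mathbb{R}^d. \]
Assumption 3. The operator $V$ is monotone.
In some cases, we will also strengthen Assumption 3 to:
Assumption 3(s). The operator $V$ is $\alpha$-strongly monotone, i.e.,
\[ \langle V(x') - V(x), x' - x \rangle \ge \alpha \|x' - x\|^2 \quad \text{for some } \alpha > 0 \text{ and all } x, x' \in \mathbb{R}^d. \]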
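Throughout our paper, we will be interested in sequences of points $X_t \in \mathcal{X}$ generated by algorithms that can access the operator $V$ via a stochastic oracle [Nes04]. Specifically, when called at $X_t$, the oracle returns a vector of the form
\[ V_t = V(X_t) + Z_t, \tag{7} \]
where $Z_t$ is an additive noise variable satisfying the following hypotheses:
a) Zero-mean: $\mathbb{E}[Z_t \mid \mathcal{F}_t] = 0$.
b) Finite variance: $\mathbb{E}[\|Z_t\|^2 \mid \mathcal{F}_t] \le \sigma^2$.
In the above, $\mathcal{F}_t$ denotes the history (natural filtration) of $X_t$, so $X_t$ is adapted to $\mathcal{F}_t$ by definition; on the other hand, since the $t$-th instance of $Z_t$ is generated randomly from $X_t$, $Z_t$ is not adapted to $\mathcal{F}_t$. Obviously, if $\sigma^2 = 0$, we have the deterministic, perfect feedback case $V_t = V(X_t)$.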
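To make the oracle model concrete, the following is a minimal Python sketch of a stochastic oracle with zero-mean, bounded-variance noise. The helper names (`make_oracle`, `V_bilinear`), the Gaussian noise model, and the bilinear test operator are illustrative assumptions and are not taken from the paper; any noise distribution satisfying the two hypotheses above would serve equally well.

```python
import math
import random
from typing import Callable, List, Sequence

Operator = Callable[[Sequence[float]], List[float]]


def make_oracle(V: Operator, sigma: float, seed: int = 0) -> Operator:
    """Return a noisy oracle x -> V(x) + Z with E[Z] = 0 and E[||Z||^2] = sigma^2."""
    rng = random.Random(seed)

    def oracle(x: Sequence[float]) -> List[float]:
        exact = V(x)
        # Split the variance budget evenly across coordinates so that the
        # total second moment of the (assumed Gaussian) noise equals sigma^2.
        scale = sigma / math.sqrt(len(exact))
        return [v + rng.gauss(0.0, scale) for v in exact]

    return oracle


def V_bilinear(x: Sequence[float]) -> List[float]:
    """Hypothetical test operator: L(theta, phi) = theta * phi."""
    theta, phi = x
    return [phi, -theta]


# sigma = 0 recovers the deterministic, perfect-feedback case.
exact_feedback = make_oracle(V_bilinear, sigma=0.0)([1.0, 2.0])
noisy_feedback = make_oracle(V_bilinear, sigma=1.0)([1.0, 2.0])
```

Because the noise is drawn independently of the query point, it is automatically conditionally unbiased given the past, which is all that the zero-mean hypothesis requires.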
The Extra-Gradient algorithm.
In the general framework outlined in the previous section, the Extra-Gradient (EG) algorithm of Kor76 can be stated in recursive form as
\[ X_{t+1/2} = \Pi_{\mathcal{X}}(X_t - \gamma_t V_t), \qquad X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t V_{t+1/2}), \tag{EG} \]
where $\Pi_{\mathcal{X}}(y) = \arg\min_{x \in \mathcal{X}} \|y - x\|$ denotes the Euclidean projection of $y \in \mathbb{R}^d$ onto the closed convex set $\mathcal{X}$, $V_t$ and $V_{t+1/2}$ denote the oracle’s output at $X_t$ and $X_{t+1/2}$ respectively, and $\gamma_t > 0$ is a variable step-size sequence. Using this formulation as a starting point, the main idea behind the method can be described as follows: at each $t$, the oracle is called at the algorithm’s current – or base – state $X_t$ to generate an intermediate – or leading – state $X_{t+1/2}$; subsequently, the base state $X_t$ is updated to $X_{t+1}$ using gradient information from the leading state $X_{t+1/2}$, and the process repeats. Heuristically, the extra oracle call allows the algorithm to “anticipate” the landscape of $V$ and, in so doing, to achieve improved convergence results relative to standard projected gradient / forward-backward methods; for a detailed discussion, we refer the reader to [FP03, Bub15] and references therein.
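The anticipation effect described above can be seen on a toy problem. The sketch below runs the two-call Extra-Gradient recursion and the plain gradient (forward) recursion on the bilinear saddle-point problem $\mathcal{L}(\theta, \phi) = \theta\phi$ in the unconstrained case, where the projection is the identity. The test problem, the step-size 0.1, the iteration count, and all function names are illustrative assumptions rather than choices made in the paper.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def V(x: Vec) -> Vec:
    """Operator of the bilinear saddle problem L(theta, phi) = theta * phi."""
    theta, phi = x
    return (phi, -theta)  # (dL/dtheta, -dL/dphi)


def step(x: Vec, v: Vec, gamma: float) -> Vec:
    """Unconstrained update x - gamma * v (the projection is the identity)."""
    return (x[0] - gamma * v[0], x[1] - gamma * v[1])


def extra_gradient(x: Vec, gamma: float, iters: int) -> Vec:
    for _ in range(iters):
        lead = step(x, V(x), gamma)  # leading state X_{t+1/2}: first oracle call
        x = step(x, V(lead), gamma)  # base update from X_t: second oracle call
    return x


def plain_gradient(x: Vec, gamma: float, iters: int) -> Vec:
    for _ in range(iters):
        x = step(x, V(x), gamma)  # single call, no anticipation
    return x


x0 = (1.0, 1.0)
eg_dist = math.hypot(*extra_gradient(x0, 0.1, 500))  # distance to the solution (0, 0)
gd_dist = math.hypot(*plain_gradient(x0, 0.1, 500))
```

On this example each Extra-Gradient step multiplies the squared distance to the solution by $1 - \gamma^2 + \gamma^4 < 1$, whereas each plain gradient step multiplies it by $1 + \gamma^2 > 1$, so the former contracts and the latter spirals outwards.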
Single-call variants of the Extra-Gradient algorithm.
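Given the significant computational overhead of gradient calculations, a key desideratum is to drop the second oracle call in (EG) while retaining the algorithm’s “anticipatory” properties. In light of this, we will focus on methods that perform a single oracle call at the leading state $X_{t+1/2}$, but replace the update rule for $X_{t+1/2}$ (and, possibly, $X_{t+1}$ as well) with a proxy that compensates for the missing gradient. Concretely, we will examine the following family of single-call extra-gradient (1-EG) algorithms:
Past Extra-Gradient (PEG) [Pop80, CYLM+12, GBVV+19]:
\[ X_{t+1/2} = \Pi_{\mathcal{X}}(X_t - \gamma_t V_{t-1/2}), \qquad X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t V_{t+1/2}) \tag{PEG} \]
[Proxy: use $V_{t-1/2}$ instead of $V_t$ in the calculation of $X_{t+1/2}$]
Reflected Gradient (RG):
\[ X_{t+1/2} = X_t - (X_{t-1} - X_t), \qquad X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t V_{t+1/2}) \tag{RG} \]
[Proxy: use $(X_{t-1} - X_t)/\gamma_t$ instead of $V_t$ in the calculation of $X_{t+1/2}$; no projection]
Optimistic Gradient (OG):
\[ X_{t+1/2} = \Pi_{\mathcal{X}}(X_t - \gamma_t V_{t-1/2}), \qquad X_{t+1} = X_{t+1/2} + \gamma_t V_{t-1/2} - \gamma_t V_{t+1/2} \tag{OG} \]
[Proxy: use $V_{t-1/2}$ instead of $V_t$ in the calculation of $X_{t+1/2}$; use $X_{t+1/2} + \gamma_t V_{t-1/2}$ instead of $X_t$ in the calculation of $X_{t+1}$; no projection]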
These are the main algorithmic schemes that we will consider, so a few remarks are in order. First, given the extensive literature on the subject, this list is not exhaustive; see e.g., [MOP19a, MOP19b, PDZC19] for a generalization of (OG), [Mal19] for a variant that employs averaging to update the algorithm’s base state $X_t$, and [GHPL+19] for a proxy defined via “negative momentum”. Nevertheless, the algorithms presented above appear to be the most widely used single-call variants of (EG), and they illustrate very clearly the two principal mechanisms for approximating missing gradients: (i) using past gradients (as in the Past Extra-Gradient (PEG) and Optimistic Gradient (OG) variants); and/or (ii) using a difference of successive states (as in the Reflected Gradient (RG) variant).
We also take this opportunity to provide some background and clear up some issues on terminology regarding the methods presented above. First, the idea of using past gradients dates back at least to Pop80, who introduced (PEG) as a “modified Arrow–Hurwicz” method a few years after the original paper of Kor76; the same algorithm is called “meta” in [CYLM+12] and “extrapolation from the past” in [GBVV+19] (but see also the note regarding optimism below). The terminology “Reflected Gradient” and the precise formulation that we use here for (RG) is due to Mal15. The well-known primal-dual algorithm of ChaPoc11 can be seen as a one-sided, alternating variant of the method for saddle-point problems; see also [YSXJ+18] for a more recent take.
Finally, the terminology “optimistic” is due to RS13-COLT, RS13-NIPS, who provided a unified view of (PEG) and (EG) based on the sequence of oracle vectors used to update the algorithm’s leading state $X_{t+1/2}$.
The above shows that there can be a broad array of single-call extra-gradient methods depending on the specific proxy used to estimate the missing gradient, whether it is applied to the algorithm’s base or leading state, when (or where) a projection operator is applied, etc. The contact point of all these algorithms is the unconstrained setting ($\mathcal{X} = \mathbb{R}^d$) where they are exactly equivalent:
Suppose that the 1-EG methods presented above share the same initialization, $X_0 = X_1 \in \mathcal{X}$, $V_{1/2} = 0$, and are run with the same, constant step-size $\gamma_t \equiv \gamma$ for all $t \ge 1$. If $\mathcal{X} = \mathbb{R}^d$, the generated iterates $X_t$ coincide for all $t \ge 1$.
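The equivalence in the unconstrained setting can be checked numerically. The sketch below implements the three single-call recursions without projections, using the standard forms of the updates (restated in the comments), a common initialization with $X_0 = X_1$ and a zero initial past gradient, and a constant step-size. The bilinear test operator, the step-size 0.3, and the function names are illustrative assumptions.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def V(x: Vec) -> Vec:
    """Hypothetical test operator: bilinear saddle L(theta, phi) = theta * phi."""
    return (x[1], -x[0])


def axpy(a: float, x: Vec, y: Vec) -> Vec:
    """Return a * x + y."""
    return (a * x[0] + y[0], a * x[1] + y[1])


def peg(x1: Vec, gamma: float, steps: int) -> List[Vec]:
    x, v_prev, out = x1, (0.0, 0.0), [x1]
    for _ in range(steps):
        lead = axpy(-gamma, v_prev, x)  # X_{t+1/2} = X_t - gamma * V_{t-1/2}
        v_prev = V(lead)                # the single oracle call of the iteration
        x = axpy(-gamma, v_prev, x)     # X_{t+1} = X_t - gamma * V_{t+1/2}
        out.append(x)
    return out


def rg(x1: Vec, gamma: float, steps: int) -> List[Vec]:
    x_prev, x, out = x1, x1, [x1]       # X_0 = X_1
    for _ in range(steps):
        lead = (2 * x[0] - x_prev[0], 2 * x[1] - x_prev[1])  # X_t - (X_{t-1} - X_t)
        x_prev, x = x, axpy(-gamma, V(lead), x)
        out.append(x)
    return out


def og(x1: Vec, gamma: float, steps: int) -> List[Vec]:
    x, v_prev, out = x1, (0.0, 0.0), [x1]
    for _ in range(steps):
        lead = axpy(-gamma, v_prev, x)
        v_new = V(lead)
        # X_{t+1} = X_{t+1/2} + gamma * V_{t-1/2} - gamma * V_{t+1/2}
        x = axpy(-gamma, v_new, axpy(gamma, v_prev, lead))
        v_prev = v_new
        out.append(x)
    return out


def max_gap(a: List[Vec], b: List[Vec]) -> float:
    """Largest coordinate-wise difference between two trajectories."""
    return max(abs(p[i] - q[i]) for p, q in zip(a, b) for i in range(2))


start = (1.0, 1.0)
peg_path, rg_path, og_path = (f(start, 0.3, 50) for f in (peg, rg, og))
gap_rg, gap_og = max_gap(peg_path, rg_path), max_gap(peg_path, og_path)
```

The three trajectories agree up to floating-point rounding. Once a nontrivial projection is introduced the recursions generally differ, because the variants project at different points of the update.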
4 Deterministic analysis
We begin with the deterministic analysis, i.e., when the optimizer receives oracle feedback of the form (7) with $\sigma = 0$. In terms of presentation, we keep the global and local cases separated and we interleave our results for the generated sequence $X_t$ and its ergodic average. To streamline our presentation, we defer the details of the proofs to the paper’s supplement and only discuss here the main ideas.
4.1 Global convergence
Our first result below shows that the algorithms under study achieve the optimal ergodic convergence rate in monotone problems with Lipschitz continuous operators.
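Theorem 1. Suppose that $V$ satisfies Assumptions 1, 2 and 3. Assume further that a 1-EG algorithm is run with perfect oracle feedback and a constant step-size $\gamma < 1/(c\beta)$, where $c = 1 + \sqrt{2}$ for the RG variant and $c = 2$ for the PEG and OG variants. Then, for all $R > 0$, we have $\operatorname{Err}_R(\bar{X}_t) = \mathcal{O}(1/t)$,
where $\bar{X}_t = t^{-1} \sum_{s=1}^{t} X_{s+1/2}$ is the ergodic average of the algorithm’s sequence of leading states.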
This result shows that the EG and 1-EG algorithms share the same convergence rate guarantees, so we can safely drop one gradient calculation per iteration in the monotone case. The proof of the theorem is based on the following technical lemma which enables us to treat the different variants of the 1-EG method in a unified way.
Lemma 2. Assume that $V$ satisfies Assumption 3 (monotonicity). Suppose further that the sequence $X_t$ of points in $\mathbb{R}^d$ satisfies the following “quasi-descent” inequality with :
for all and all . Then,
The use of Lemma 2 is tailored to time-averaged sequences like $\bar{X}_t$, and relies on establishing a suitable “quasi-descent inequality” of the form (10) for the iterates of 1-EG. Doing this requires in turn a careful comparison of successive iterates of the algorithm via the Lipschitz continuity assumption for $V$; we defer the precise treatment of this argument to the paper’s supplement.
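For intuition, the averaging step behind an argument of this type can be sketched as follows. The sketch assumes that the quasi-descent inequality takes the standard form written in the first line below, with a nonnegative sequence $\mu_t$ and a constant $\lambda > 0$; these symbols and the exact form of the inequality are assumptions made for illustration, and the precise statement and constants are those of Lemma 2 and the supplement.

```latex
% Assumed quasi-descent inequality, for every p in the restricted domain X_R:
\|X_{t+1} - p\|^2 + \mu_{t+1}
  \le \|X_t - p\|^2 - 2\lambda \langle V(X_{t+1/2}), X_{t+1/2} - p \rangle + \mu_t .

% Monotonicity of V gives
\langle V(X_{s+1/2}), X_{s+1/2} - p \rangle \ge \langle V(p), X_{s+1/2} - p \rangle ,

% so summing over s = 1, ..., t and telescoping yields
2\lambda \sum_{s=1}^{t} \langle V(p), X_{s+1/2} - p \rangle
  \le \|X_1 - p\|^2 + \mu_1 - \|X_{t+1} - p\|^2 - \mu_{t+1}
  \le \|X_1 - p\|^2 + \mu_1 .

% Dividing by 2 * lambda * t and using the definition of the ergodic average:
\langle V(p), \bar{X}_t - p \rangle \le \frac{\|X_1 - p\|^2 + \mu_1}{2\lambda t},
\qquad
\bar{X}_t = \frac{1}{t} \sum_{s=1}^{t} X_{s+1/2} .
```

Taking the maximum over $p$ in the restricted domain then bounds the restricted error of $\bar{X}_t$ by a quantity of order $1/t$. The argument uses averaging in an essential way, since the inner product is linear in $X_{s+1/2}$ only after $V$ has been evaluated at the fixed point $p$.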
On the other hand, because the role of averaging is essential in this argument, the convergence of the algorithm’s last iterate requires significantly different techniques. To the best of our knowledge, there are no comparable convergence rate guarantees for $X_t$ under Assumptions 1, 2 and 3; however, if Assumption 3 is strengthened to Assumption 3(s), the convergence of $X_t$ to the (necessarily unique) solution of (VI) occurs at a geometric rate. For completeness, we state here a consolidated version of the geometric convergence results of Mal15, GBVV+19, and MOP19a.
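As a numerical illustration of the contrast between last-iterate and ergodic behavior in the strongly monotone case, the sketch below runs the unconstrained PEG recursion on a hypothetical two-dimensional operator, namely $\alpha x$ plus a rotation, which is $\alpha$-strongly monotone and $\sqrt{1 + \alpha^2}$-Lipschitz. The operator, the value $\alpha = 0.5$, the step-size 0.2, and the horizon of 200 iterations are assumptions chosen for illustration only.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]
ALPHA = 0.5  # assumed strong monotonicity modulus of the test operator


def V(x: Vec) -> Vec:
    """alpha * x plus a rotation: <V(x) - V(y), x - y> = alpha * ||x - y||^2."""
    return (ALPHA * x[0] + x[1], ALPHA * x[1] - x[0])


def peg_run(x1: Vec, gamma: float, iters: int) -> Tuple[Vec, Vec]:
    """Run unconstrained PEG; return the last iterate and the ergodic average."""
    x, v_prev = x1, (0.0, 0.0)
    sum_lead = (0.0, 0.0)
    for _ in range(iters):
        lead = (x[0] - gamma * v_prev[0], x[1] - gamma * v_prev[1])  # X_{t+1/2}
        v_prev = V(lead)                                             # single oracle call
        x = (x[0] - gamma * v_prev[0], x[1] - gamma * v_prev[1])     # X_{t+1}
        sum_lead = (sum_lead[0] + lead[0], sum_lead[1] + lead[1])
    average = (sum_lead[0] / iters, sum_lead[1] / iters)
    return x, average


last, average = peg_run((1.0, 1.0), gamma=0.2, iters=200)
last_dist = math.hypot(*last)     # distance of the last iterate to the solution 0
avg_dist = math.hypot(*average)   # distance of the ergodic average to the solution 0
```

In this run the last iterate is within numerical noise of the solution while the ergodic average is still several orders of magnitude further away: the average keeps a memory of the early iterates, so its error can only decay at a rate of order $1/t$.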
4.2 Local convergence
We continue by presenting a local convergence result for deterministic, non-monotone problems. To state it, we will employ the following notion of regularity in lieu of Assumptions 1, 2, 3 and 3(s).
We say that $x^*$ is a regular solution of (VI) if $V$ is $C^1$-smooth in a neighborhood of $x^*$ and the Jacobian $\operatorname{Jac}_V(x^*)$ is positive-definite along rays emanating from $x^*$, i.e.,
\[ z^\top \operatorname{Jac}_V(x^*) z > 0 \]
for all $z \in \mathbb{R}^d \setminus \{0\}$ that are tangent to $\mathcal{X}$ at $x^*$.
This notion of regularity is an extension of similar conditions that have been employed in the local analysis of loss minimization and saddle-point problems. More precisely, if $V = \nabla f$ for some loss function $f$, this definition is equivalent to positive-definiteness of the Hessian along qualified constraints [bertsekas1997nonlinear, Chap. 3.2]. As for saddle-point problems and smooth games, variants of this condition can be found in several different sources, see e.g., [Ros65, FK07, MZ19, ratliff2013characterization, LS19] and references therein.
Under this condition, we obtain the following local geometric convergence result for 1-EG methods.
Theorem 4. Let $x^*$ be a regular solution of (VI). If a 1-EG method is run with perfect oracle feedback and is initialized sufficiently close to $x^*$ with a sufficiently small constant step-size, we have $\|X_t - x^*\| = \mathcal{O}(\exp(-\rho t))$ for some $\rho > 0$.
The proof of this theorem relies on showing that (i) $V$ essentially behaves like a smooth, strongly monotone operator close to $x^*$; and (ii) if the method is initialized in a small enough neighborhood of $x^*$, it will remain in said neighborhood for all $t$. As a result, Theorem 4 essentially follows by “localizing” Theorem 2 to this neighborhood.
As a preamble to our stochastic analysis in the next section, we should state here that, albeit straightforward, the proof strategy outlined above breaks down if we have access to $V$ only via a stochastic oracle. In this case, a single “bad” realization of the feedback noise could drive the process away from the attraction region of any local solution of (VI). For this reason, the stochastic analysis requires significantly different tools and techniques and is considerably more intricate.
5 Stochastic analysis
We now present our analysis for stochastic variational inequalities with oracle feedback of the form (7). For concreteness, given that the PEG variant of the 1-EG method employs the most straightforward proxy mechanism, we will focus on this variant throughout; for the other variants, the proofs and corresponding explicit expressions follow from the same rationale (as in the case of Theorem 1).
5.1 Global convergence
As we mentioned in the introduction, under Assumptions 1, 2 and 3, CS16 and GBVV+19 showed that 1-EG methods attain an $\mathcal{O}(1/\sqrt{t})$ ergodic convergence rate. By strengthening Assumption 3 to Assumption 3(s), we show that this result can be augmented in two synergistic ways: under Assumptions 1, 2 and 3(s), both the last iterate and the ergodic average of 1-EG achieve an $\mathcal{O}(1/t)$ convergence rate.
Regarding our proof strategy for the last iterate of the process, we can no longer rely either on a contraction argument or the averaging mechanism that yields the ergodic convergence rate. Instead, we show in the appendix that $X_t$ is (stochastically) quasi-Fejér in the sense of [Com01, CP15]; then, leveraging the method’s specific step-size, we employ successive numerical sequence estimates to control the summability error and obtain the $\mathcal{O}(1/t)$ rate.
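The last-iterate behavior described above can be explored with a small simulation. The sketch below runs stochastic PEG on the scalar, strongly monotone operator $V(x) = x$ (solution at the origin) with Gaussian oracle noise and a decreasing step-size of the form $\gamma/(t + b)$, and estimates the mean squared error of the last iterate by averaging over independent runs. The test operator, the noise model, the constants $\gamma = 2$ and $b = 8$, the step-size schedule itself, and the function names are all assumptions made for illustration; they are not the paper's experimental setup.

```python
import random


def stochastic_peg(T: int, gamma: float = 2.0, b: float = 8.0,
                   sigma: float = 1.0, x1: float = 1.0, seed: int = 0) -> float:
    """Run T steps of stochastic PEG on V(x) = x and return the last iterate."""
    rng = random.Random(seed)
    x, v_prev = x1, 0.0
    for t in range(1, T + 1):
        step = gamma / (t + b)                 # assumed 1/t-type step-size schedule
        lead = x - step * v_prev               # X_{t+1/2}, built from the past oracle call
        v_prev = lead + rng.gauss(0.0, sigma)  # noisy oracle call V(X_{t+1/2}) + Z
        x = x - step * v_prev                  # X_{t+1}
    return x


def mean_squared_error(T: int, runs: int = 200) -> float:
    """Monte Carlo estimate of E[|X_T - x*|^2] with x* = 0."""
    return sum(stochastic_peg(T, seed=s) ** 2 for s in range(runs)) / runs


mse_100 = mean_squared_error(100)
mse_1000 = mean_squared_error(1000)
```

In this simulation a tenfold increase of the horizon reduces the estimated mean squared error by roughly one order of magnitude, which is the kind of behavior one would expect from an $\mathcal{O}(1/t)$ last-iterate rate; a Monte Carlo estimate of this kind is only a sanity check and not a substitute for the analysis.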
5.2 Local convergence
We proceed to examine the convergence of the method in the stochastic, non-monotone case. Our main result in this regard is the following.
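Theorem 6. Let $x^*$ be a regular solution of (VI) and fix a tolerance level $\delta > 0$.
Suppose further that (PEG) is run with
(a) stochastic oracle feedback of the form (7); and
(b) a variable step-size of the form $\gamma_t = \gamma/(t + b)$ for some $\gamma > 1/\alpha$ and large enough $b$.
Then:
There are neighborhoods $U$ and $U_1$ of $x^*$ in $\mathcal{X}$ such that, if $X_{1/2} \in U$ and $X_1 \in U_1$, the event
occurs with probability at least $1 - \delta$.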
Conditioning on the above, we have:
where and .
The finiteness of $\beta$ and the positivity of $\alpha$ are both consequences of the regularity of $x^*$ and their values only depend on the size of the neighborhood $U$. Taking a larger $U$ would increase the algorithm’s certified initialization basin but it would also negatively impact its convergence rate (since $\beta$ would increase while $\alpha$ would decrease). Likewise, the neighborhood $U_1$ only depends on the size of $U$ and, as we explain in the appendix, it suffices to take $U_1$ to be “one fourth” of $U$.
From the above, it becomes clear that the situation is significantly more involved than the corresponding deterministic analysis. This is also reflected in the proof of Theorem 6 which requires completely new techniques, well beyond the straightforward localization scheme underlying Theorem 4. More precisely, a key step in the proof (which we detail in the appendix) is to show that the iterates of the method remain close to $x^*$ for all $t$ with arbitrarily high probability. In turn, this requires showing that the probability of getting a string of “bad” noise realizations of arbitrary length is controllably small. Even then, however, the global analysis still cannot be localized because conditioning changes the probability law under which the oracle noise is unbiased. Accounting for this conditional bias requires a surprisingly delicate probabilistic argument which we also detail in the supplement.
6 Concluding remarks
Our aim in this paper was to provide a synthetic view of single-call surrogates to the Extra-Gradient algorithm, and to establish optimal convergence rates in a range of different settings – deterministic, stochastic, and/or non-monotone. Several interesting avenues open up as a result, from extending the theory to more general Bregman proximal settings, to developing an adaptive version as in the recent work [BL19] for two-call methods. We defer these research directions to future work.
This work benefited from financial support by MIAI Grenoble Alpes (Multidisciplinary Institute in Artificial Intelligence). P. Mertikopoulos was partially supported by the French National Research Agency (ANR) grant ORACLESS (ANR–16–CE33–0004–01) and the EU COST Action CA16228 “European Network for Game Theory” (GAMENET).
Appendix A Technical lemmas
Let and be a closed convex set. We set . For all , we have
Since , we have the following property , leading to
Let and be two closed convex sets. We set and .
If , for all , it holds
If , for all , it holds
Lemma A.3 (Chu54).
Let be a sequence of real numbers and such that for all ,
where and . Then,
For the sake of completeness, we provide a basic proof for the above lemma (which is a direct corollary of Chu54). Let and , we have
This shows that for any
Let us define . (A.12) becomes
This inequality holds for all . Then, either:
• becomes non-positive for some , and (A.13) implies that this is also the case for all subsequent , which leads to
• or is positive for all and we get
In both cases, (A.9) is verified. ∎
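Let $x^*$ be a regular solution of (VI). Then, there exist a neighborhood $U$ of $x^*$ and constants $\alpha, \beta > 0$ such that $V$ is $\beta$-Lipschitz continuous on $U$ and $\langle V(x), x - x^* \rangle \ge \alpha \|x - x^*\|^2$ for all $x \in U$.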
The Lipschitz continuity is straightforward: a $C^1$-smooth operator is necessarily locally Lipschitz and thus Lipschitz on every compact set. The proof consists in establishing the existence of . To this end, we consider the following function:
where denotes the tangent cone to at . The function is concave as it is defined as a pointwise minimum over a set of linear functions. This in turn implies the continuity because every concave function is continuous on the interior of its effective domain. The solution being regular, we have . Combined with the continuity of in a neighborhood of , we deduce the existence of such that for all . Now let . It holds:
Consequently, writing , , we have
Finally, since is a solution of (VI), we have and
This ends the proof. ∎
Appendix B Proofs for the deterministic setting
B.1 Proof of Lemma 2
B.2 Proof of Theorem 1
To facilitate analysis and presentation of our results, (PEG) and (OG) are initialized with random $X_1$ and $X_{1/2}$ in $\mathcal{X}$ while for (RG) we start with $X_0$ and $X_1$. We are constrained to have different initial states in (RG) due to its specific formulation.
The theorem is immediate from Lemma 2 if we know that (10) is verified by the generated iterates for some . Below, we show it separately for PEG, OG and RG under Assumption 2 and with selected as per the theorem statement. Moreover, we have and for all methods, hence the corresponding bound in our statement. The arguments used in the proof are inspired by [Tse00, Mal15, GBVV+19] but we emphasize the relation between the analyses of these algorithms by putting forward the technical Lemma A.2.
Past Extra-Gradient (PEG).
where we used the fact that is -Lipschitz continuous for the second inequality.
Now, let us use Young’s inequality to get
and the non-expansiveness of the projection to get for any ,
where we used the fact that in the last inequality; and in order to display a telescopic term, we reformulate (B.8) as