The Approximate Gap Technique: A Unified Approach to Optimal First-Order Methods¹

¹ Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing. It was partially supported by NSF grant #CCF-1718342 and by the DIMACS/Simons Collaboration on Bridging Continuous and Discrete Optimization through NSF grant #CCF-1740425.

Jelena Diakonikolas and Lorenzo Orecchia
Computer Science Department, Boston University
Abstract

We present a general technique for the analysis of first-order methods. The technique relies on the construction of a duality gap for an appropriate approximation of the objective function, where the function approximation improves as the algorithm converges. We show that, in continuous time, enforcing the invariant that this approximate duality gap decreases at a certain rate exactly recovers a wide range of first-order continuous-time methods. We characterize the discretization errors incurred by different discretization methods, and show how iteration-complexity-optimal methods for various classes of problems cancel out the discretization error. The techniques are illustrated on various classes of problems – including solving variational inequalities for smooth monotone operators, convex minimization for Lipschitz-continuous objectives, smooth convex minimization, composite minimization, and smooth and strongly convex minimization – and naturally extend to other settings.

1 Introduction

First-order optimization methods have recently gained high popularity due to their applicability to large-scale problem instances arising from modern datasets, their relatively low computational complexity, and their potential for parallelizing computation [26]. Moreover, such methods have also been successfully applied in discrete optimization leading to faster numerical methods [25, 9], graph algorithms [8, 24, 12], and submodular optimization methods [7].

Most first-order optimization methods can be obtained from the discretization of continuous-time dynamical systems that converge to optimal solutions. In the case of mirror descent, the continuous-time view was the original motivation for the algorithm [15], while more recent work has focused on deducing continuous-time interpretations of accelerated methods [29, 30, 11, 27, 23].

Motivated by these works, we focus on providing a unifying theory of first-order methods as discretizations of continuous-time dynamical systems. We term this general framework the Approximate Duality Gap Technique (ADGT). In addition to providing an intuitive and unified convergence analysis of various first-order methods that is often only a few lines long, ADGT is also valuable in developing new first-order methods with tight convergence bounds [4].

Unlike traditional approaches that start from an algorithm description and then use arguments such as Lyapunov stability criteria to prove convergence bounds [15, 29, 30, 11, 27], our approach takes the opposite direction: continuous-time methods are obtained from the analysis, using purely optimization-motivated arguments. In particular, we present a general approach for constructing an approximate duality gap, based on the properties of the optimization problem under consideration. For convex optimization problems, the approximate duality gap is obtained as the difference of an upper bound and a lower bound on the optimal objective value. The upper bound $U^{(t)}$ at time $t$ is obtained directly from the function values at points constructed by an algorithm. The lower bound $L^{(t)}$, on the other hand, can be interpreted as a Fenchel dual of an approximation to the objective function, giving rise to a primal-dual interpretation of the methods. For a convergence rate $1/\alpha^{(t)}$, the approximation error of the lower bound reduces at the same rate. The continuous-time methods are obtained from enforcing the invariant $\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) = 0$, where $G^{(t)} = U^{(t)} - L^{(t)}$; that is, the approximate duality gap reduces at rate $1/\alpha^{(t)}$.

To illustrate the power and generality of the technique, we show how to obtain and analyze several well-known first-order methods, such as dual averaging [19], mirror-prox/extra-gradient method [10, 18, 14], accelerated methods [16, 17], composite minimization methods [6, 22], and Frank-Wolfe methods [21]. The same ideas naturally extend to other classes of convex optimization problems and their corresponding optimal first-order methods. Here, “optimal” is in the sense that the methods yield worst-case iteration complexity bounds for which there is a matching lower bound (i.e., “optimal” is in terms of iteration complexity).

1.1 Related Work

There exists a large body of research in optimization and first-order methods in particular, and, while we cannot provide a thorough literature review, we refer the reader to recent books [26, 3, 2, 20].

Multiple approaches to unifying analyses of first-order methods have been developed, with a particular focus on explaining the acceleration phenomenon. Tseng gives a formal framework that unifies all the different instantiations of accelerated gradient methods [28]. More recently, Allen-Zhu and Orecchia [1] interpret acceleration as coupling of mirror descent and gradient descent steps. Bubeck et al. provide an elegant geometric interpretation of the Euclidean instantiation of Nesterov’s method [3]. Drusvyatskiy et al. [5] interpret the geometric descent of Bubeck et al. [3] as a sequence minimizing quadratic lower-models of the objective function and obtain limited-memory extensions with improved performance. Lin et al. [13] provide a universal scheme for accelerating non-accelerated first-order methods.

Su et al. [27] and Krichene et al. [11] interpret Nesterov’s accelerated method as a discretization of a certain continuous-time dynamics and analyze it using Lyapunov stability criteria. Wibisono et al. [29] and Wilson et al. [30] interpret accelerated methods using Lyapunov stability analysis and drawing ideas from Lagrangian mechanics. Scieur et al. [23] interpret acceleration as a multi-step integration method from numerical analysis applied to the gradient flow differential equation.

The approximate duality gap presented here is closely related to Nesterov’s estimate sequence (see, e.g., [20]); in particular, Nesterov’s estimate sequence is equivalent to the Fenchel dual of the objective function approximation used in constructing the lower bound $L^{(t)}$.

1.2 Notation

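We use non-boldface letters to denote scalars and boldface letters to denote vectors. The superscript index $(\cdot)^{(t)}$ denotes the value of $(\cdot)$ at time $t$. The “dot” notation is used to denote the time derivative, i.e., $\dot{x} = \frac{dx}{dt}$. Given a measure $\alpha^{(\tau)}$ defined on $\tau \in [t_0, t]$, we use the Lebesgue-Stieltjes notation for the integral. In particular, given $\phi^{(\tau)}$ defined on $\tau \in [t_0, t]$:

$$\int_{t_0}^{t} \phi^{(\tau)} \dot{\alpha}^{(\tau)} \, d\tau = \int_{t_0}^{t} \phi^{(\tau)} \, d\alpha^{(\tau)}.$$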

When is a discrete measure and both and are defined on , we have that , where . We denote , so that . We assume throughout the paper that , where is the initial point of the (continuous or discrete) dynamics, and use the following notation for the aggregated gradients:

$$\mathbf{z}^{(t)} \stackrel{\mathrm{def}}{=} -\int_{t_0}^{t} \nabla f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)}. \tag{1.1}$$

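For all considered problems, we assume that the feasible region is a closed convex set $X \subseteq \mathbb{R}^n$, for a finite $n$. We assume that there is a (fixed) norm $\|\cdot\|$ associated with $X$ and define its dual norm in a standard way: $\|\mathbf{z}\|_* = \sup\{\langle \mathbf{z}, \mathbf{x} \rangle : \|\mathbf{x}\| \le 1\}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product.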

1.3 Preliminaries

We focus on minimizing a continuous and differentiable² convex function $f$ defined on a convex set $X$, and we let $\mathbf{x}^*$ denote the minimizer of $f$ on $X$. The following definitions will be useful in our analysis, and thus we state them here for completeness.

² The differentiability assumption is not always necessary and can be relaxed to subdifferentiability in the case of dual averaging/mirror descent methods. Nevertheless, we will assume differentiability throughout the paper to keep the presentation simple.

Definition 1.1.

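A function $f: X \to \mathbb{R}$ is convex on $X$, if for all $\mathbf{x}, \hat{\mathbf{x}} \in X$: $f(\hat{\mathbf{x}}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \hat{\mathbf{x}} - \mathbf{x} \rangle$.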

Definition 1.2.

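A function $f: X \to \mathbb{R}$ is smooth on $X$ with smoothness parameter $L$ and with respect to a norm $\|\cdot\|$, if for all $\mathbf{x}, \hat{\mathbf{x}} \in X$: $f(\hat{\mathbf{x}}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \hat{\mathbf{x}} - \mathbf{x} \rangle + \frac{L}{2}\|\hat{\mathbf{x}} - \mathbf{x}\|^2$. Equivalently: $\|\nabla f(\mathbf{x}) - \nabla f(\hat{\mathbf{x}})\|_* \le L \|\mathbf{x} - \hat{\mathbf{x}}\|$.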

Definition 1.3.

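A function $f: X \to \mathbb{R}$ is strongly convex on $X$ with strong convexity parameter $\sigma$ and with respect to a norm $\|\cdot\|$, if for all $\mathbf{x}, \hat{\mathbf{x}} \in X$: $f(\hat{\mathbf{x}}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \hat{\mathbf{x}} - \mathbf{x} \rangle + \frac{\sigma}{2}\|\hat{\mathbf{x}} - \mathbf{x}\|^2$. Equivalently: $\langle \nabla f(\mathbf{x}) - \nabla f(\hat{\mathbf{x}}), \mathbf{x} - \hat{\mathbf{x}} \rangle \ge \sigma \|\mathbf{x} - \hat{\mathbf{x}}\|^2$.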

Definition 1.4.

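(Bregman Divergence) The Bregman divergence of a function $\psi$ is defined as: $D_\psi(\mathbf{x}, \hat{\mathbf{x}}) = \psi(\mathbf{x}) - \psi(\hat{\mathbf{x}}) - \langle \nabla \psi(\hat{\mathbf{x}}), \mathbf{x} - \hat{\mathbf{x}} \rangle$.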
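As a small illustration of Definition 1.4, the following Python sketch evaluates the Bregman divergence for two common choices of the generating function. The helper names (`bregman`, `sq_norm`, `neg_entropy`) and the test vectors are illustrative assumptions, not part of the paper.

```python
import math

def bregman(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_psi(y), x, y))
    return psi(x) - psi(y) - inner

# psi(x) = 1/2 ||x||_2^2 gives D_psi(x, y) = 1/2 ||x - y||_2^2.
def sq_norm(x):
    return 0.5 * sum(v * v for v in x)

def sq_norm_grad(x):
    return list(x)

# psi(x) = sum_i x_i log x_i (negative entropy) gives the KL divergence
# when x and y both lie in the interior of the probability simplex.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)

def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]

p = [0.2, 0.3, 0.5]   # hypothetical points on the simplex
q = [0.4, 0.4, 0.2]

d_euclid = bregman(sq_norm, sq_norm_grad, p, q)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Both divergences are non-negative because the generating functions are convex; the entropy case coincides with the KL divergence only because `p` and `q` sum to one.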

Definition 1.5.

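(Convex Conjugate) Function $\psi^*$ is the convex conjugate of $\psi: X \to \mathbb{R}$, if $\psi^*(\mathbf{z}) = \sup_{\mathbf{x} \in X}\{\langle \mathbf{z}, \mathbf{x} \rangle - \psi(\mathbf{x})\}$, $\forall \mathbf{z} \in \mathbb{R}^n$.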

As $X$ is assumed to be closed, $\sup$ in Definition 1.5 can be replaced by $\max$.

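We will assume that there is a differentiable function $\phi: X \to \mathbb{R}$, possibly dependent on $t$ (in which case we denote it as $\phi_t$), such that $\max_{\mathbf{x} \in X}\{\langle \mathbf{z}, \mathbf{x} \rangle - \phi(\mathbf{x})\}$ is easily solvable, possibly in a closed form. Notice that this problem defines the convex conjugate of $\phi$, i.e., $\phi^*(\mathbf{z}) = \max_{\mathbf{x} \in X}\{\langle \mathbf{z}, \mathbf{x} \rangle - \phi(\mathbf{x})\}$. The following standard fact will be extremely useful in carrying out the analysis of the algorithms in this paper.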

Fact 1.6.

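Let $\phi: X \to \mathbb{R}$ be a differentiable strongly convex function. Then:

$$\nabla \phi^*(\mathbf{z}) = \arg\max_{\mathbf{x} \in X}\{\langle \mathbf{z}, \mathbf{x} \rangle - \phi(\mathbf{x})\} = \arg\min_{\mathbf{x} \in X}\{-\langle \mathbf{z}, \mathbf{x} \rangle + \phi(\mathbf{x})\}.$$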

In particular, Fact 1.6 implies:

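$$\nabla \phi^*(\mathbf{z}^{(t)}) = \arg\min_{\mathbf{x} \in X}\Big\{\int_{t_0}^{t} \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x} - \mathbf{x}^{(\tau)} \rangle \, d\alpha^{(\tau)} + \phi(\mathbf{x})\Big\}. \tag{1.2}$$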

Some other useful properties of Bregman divergence can be found in Appendix A.
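To make Fact 1.6 concrete, the sketch below takes $X$ to be the probability simplex and $\phi$ the negative entropy (an assumed example, not one singled out in the text above). In that case the maximizer of $\langle \mathbf{z}, \mathbf{x} \rangle - \phi(\mathbf{x})$ is the softmax of $\mathbf{z}$ and the conjugate value is the log-sum-exp of $\mathbf{z}$. The function names and the vector `z` are hypothetical.

```python
import math
import random

def neg_entropy(x):
    # phi(x) = sum_i x_i log x_i, with the convention 0 * log 0 = 0.
    return sum(v * math.log(v) for v in x if v > 0.0)

def mirror_map(z):
    """grad phi^*(z) for negative entropy on the simplex, i.e. softmax(z)."""
    m = max(z)                                # shift for numerical stability
    w = [math.exp(v - m) for v in z]
    s = sum(w)
    return [v / s for v in w]

def objective(z, x):
    # The quantity maximized in Fact 1.6: <z, x> - phi(x).
    return sum(zi * xi for zi, xi in zip(z, x)) - neg_entropy(x)

random.seed(0)
z = [0.5, -1.0, 2.0]                          # hypothetical dual vector
x_star = mirror_map(z)
best = objective(z, x_star)                   # equals phi^*(z)

# Compare against random feasible points: none should do better.
no_point_beats_softmax = True
for _ in range(1000):
    r = [random.random() for _ in z]
    s = sum(r)
    x = [v / s for v in r]
    if objective(z, x) > best + 1e-12:
        no_point_beats_softmax = False
```

The random comparison is only a sanity check, not a proof; the closed form follows from the first-order optimality conditions of the maximization problem.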

Overview of Continuous-Time Operations

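In continuous time, changes in the variables are described by differential equations. Of particular interest are (weighted) aggregation and averaging. Aggregation of a function $\mathbf{g}^{(t)}$ is $\dot{\mathbf{z}}^{(t)} = \dot{\alpha}^{(t)} \mathbf{g}^{(t)}$. Observe that, by integrating both sides from $t_0$ to $t$, this is equivalent to: $\mathbf{z}^{(t)} = \mathbf{z}^{(t_0)} + \int_{t_0}^{t} \mathbf{g}^{(\tau)} \, d\alpha^{(\tau)}$. Averaging of a function $\mathbf{x}^{(t)}$ is $\dot{\hat{\mathbf{x}}}^{(t)} = \dot{\alpha}^{(t)} \frac{\mathbf{x}^{(t)} - \hat{\mathbf{x}}^{(t)}}{\alpha^{(t)}}$. This can be equivalently written as $\frac{d}{dt}\big(\alpha^{(t)} \hat{\mathbf{x}}^{(t)}\big) = \dot{\alpha}^{(t)} \mathbf{x}^{(t)}$, implying $\hat{\mathbf{x}}^{(t)} = \frac{\alpha^{(t_0)}}{\alpha^{(t)}} \hat{\mathbf{x}}^{(t_0)} + \frac{1}{\alpha^{(t)}} \int_{t_0}^{t} \mathbf{x}^{(\tau)} \, d\alpha^{(\tau)}$.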
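The aggregation and averaging operations have a direct analogue for a discrete measure with weights $a_i$. The following Python sketch, for scalar variables and with hypothetical names and data, shows that the incremental averaging step reproduces the weighted average of all points seen so far. It assumes the aggregate starts from zero and that all weight is placed on the listed points.

```python
def aggregate_and_average(points, grads, weights):
    """Discrete-measure analogue of aggregation and averaging (scalar case).

    z accumulates the negated weighted gradients, as in (1.1), and
    x_hat is updated incrementally so that it always equals
    sum_i a_i x_i / sum_i a_i.
    """
    z, x_hat, A = 0.0, 0.0, 0.0
    for x, g, a in zip(points, grads, weights):
        z -= a * g                        # aggregation step
        A += a                            # total weight so far
        x_hat += (a / A) * (x - x_hat)    # averaging step toward x
    return z, x_hat, A

# Hypothetical data for illustration.
points = [1.0, 2.0, 4.0]
grads = [0.5, -1.0, 2.0]
weights = [1.0, 2.0, 3.0]

z, x_hat, A = aggregate_and_average(points, grads, weights)
```

The averaging step `x_hat += (a / A) * (x - x_hat)` mirrors the continuous-time form in which the average moves toward the current point at a rate proportional to the new weight over the total weight.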

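The following simple proposition (based on Danskin’s Theorem) will be very useful in our analysis.

Proposition 1.7.

$$\frac{d}{dt} \min_{\mathbf{x} \in X}\Big\{\int_{t_0}^{t} \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x} - \mathbf{x}^{(\tau)} \rangle \, d\alpha^{(\tau)} + \phi(\mathbf{x})\Big\} = \big\langle \nabla f(\mathbf{x}^{(t)}), \nabla \phi^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)} \big\rangle \dot{\alpha}^{(t)}.$$

Proof.

Observe that, by the definition (1.1) of $\mathbf{z}^{(t)}$, the minimum on the left-hand side equals $-\phi^*(\mathbf{z}^{(t)}) - \int_{t_0}^{t} \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x}^{(\tau)} \rangle \, d\alpha^{(\tau)}$. By Fact 1.6, the minimum is attained at $\nabla \phi^*(\mathbf{z}^{(t)})$. Using Danskin’s theorem (which allows us to differentiate inside the min), we can compute $\frac{d}{dt} \phi^*(\mathbf{z}^{(t)})$ by differentiating with respect to $t$ and using the chain rule: $\frac{d}{dt} \phi^*(\mathbf{z}^{(t)}) = \langle \nabla \phi^*(\mathbf{z}^{(t)}), \dot{\mathbf{z}}^{(t)} \rangle = -\dot{\alpha}^{(t)} \langle \nabla f(\mathbf{x}^{(t)}), \nabla \phi^*(\mathbf{z}^{(t)}) \rangle$. Combining this with the derivative of the integral term, $\dot{\alpha}^{(t)} \langle \nabla f(\mathbf{x}^{(t)}), \mathbf{x}^{(t)} \rangle$, yields the claimed identity. ∎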

2 The Approximate Duality Gap Technique

Underlying the analysis of all first-order methods considered here is the notion of a lower bound $L^{(t)}$ and an upper bound $U^{(t)}$ of the optimal objective value $f(\mathbf{x}^*)$, together with the approximate duality gap defined as $G^{(t)} = U^{(t)} - L^{(t)}$. The explicit construction of the upper and lower bounds allows us to take a unified primal-dual view of the methods and quantify the convergence rate as the rate at which the approximate gap decreases as a function of time $t$. We refer to this general framework of constructing and quantifying the optimality gap as the Approximate Duality Gap Technique (ADGT).

2.1 Upper Bound

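Since we are interested in minimizing a function $f$, any point $\mathbf{x} \in X$ leads to a valid upper bound, as $f(\mathbf{x}) \ge f(\mathbf{x}^*)$. Suppose that a minimization algorithm constructs a sequence of points $\{\mathbf{x}^{(\tau)}\}_{\tau \in [t_0, t]}$. To keep the gap between upper and lower bounds as small as possible, a natural choice of $U^{(t)}$ at time $t$ is the function value at the best seen point, that is, $U^{(t)} = \min_{\tau \in [t_0, t]} f(\mathbf{x}^{(\tau)})$. However, different choices of upper bounds will turn out to be more useful for the analysis, such as the function value at the last seen point, $U^{(t)} = f(\mathbf{x}^{(t)})$, or a convex combination of the function values at all seen points $\mathbf{x}^{(\tau)}$ for $\tau \in [t_0, t]$, i.e., $U^{(t)} = \frac{1}{A^{(t)}} \int_{t_0}^{t} f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)}$.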

2.2 Lower Bound

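The construction of the lower bound is more interesting, since it allows us to take a dual viewpoint of the algorithm behavior. Suppose that we are given a sequence of points $\mathbf{x}^{(\tau)}$, for $\tau \in [t_0, t]$. The convexity of $f$ leads to the lower-bounding hyperplanes, as $f(\mathbf{x}^*) \ge f(\mathbf{x}^{(\tau)}) + \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x}^* - \mathbf{x}^{(\tau)} \rangle$, $\forall \tau \in [t_0, t]$. Given a non-negative measure $\alpha^{(\tau)}$ and $A^{(t)} = \int_{t_0}^{t} d\alpha^{(\tau)}$, we have the following natural lower bound:

$$f(\mathbf{x}^*) \ge \frac{1}{A^{(t)}} \int_{t_0}^{t} \Big(f(\mathbf{x}^{(\tau)}) + \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x}^* - \mathbf{x}^{(\tau)} \rangle\Big) \, d\alpha^{(\tau)}. \tag{2.1}$$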

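However, there are major issues in using (2.1) as the lower bound. The first issue is that it is not defined at the initial time $t_0$ in the continuous-time domain, since $A^{(t_0)} = 0$. To circumvent this issue, we can just mix in the trivial lower bound $f(\mathbf{x}^*) \ge f(\mathbf{x}^*)$, which gives us:

$$f(\mathbf{x}^*) \ge \frac{1}{\alpha^{(t)}} \int_{t_0}^{t} \Big(f(\mathbf{x}^{(\tau)}) + \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x}^* - \mathbf{x}^{(\tau)} \rangle\Big) \, d\alpha^{(\tau)} + \frac{\alpha^{(t)} - A^{(t)}}{\alpha^{(t)}} f(\mathbf{x}^*). \tag{2.2}$$

Observe that in discrete time $\alpha^{(t)} = A^{(t)}$, and so the term $f(\mathbf{x}^*)$ does not appear in the lower bound, while in continuous time $\alpha^{(t)} - A^{(t)} = \alpha^{(t_0)}$.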

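While (2.2) is a valid lower bound, it is not particularly useful, since we do not know $\mathbf{x}^*$. This issue can be resolved by replacing the terms that depend on $\mathbf{x}^*$ with their minimum over $\mathbf{x} \in X$. However, this would in general lead to a lower bound that is not differentiable and that in some cases is trivially equal to $-\infty$, which is still not very useful. The solution is to regularize the right-hand side of (2.2) before taking the minimum. This is achieved by adding a strictly convex function $\phi$, evaluated at $\mathbf{x}^*$ and scaled by $1/\alpha^{(t)}$, to both sides.³ We thus finally obtain the following lower bound:

$$L^{(t)} \stackrel{\mathrm{def}}{=} \frac{\int_{t_0}^{t} f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)} + \min_{\mathbf{x} \in X}\Big\{\int_{t_0}^{t} \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x} - \mathbf{x}^{(\tau)} \rangle \, d\alpha^{(\tau)} + \phi(\mathbf{x})\Big\}}{\alpha^{(t)}} + \frac{(\alpha^{(t)} - A^{(t)}) f(\mathbf{x}^*) - \phi(\mathbf{x}^*)}{\alpha^{(t)}}. \tag{2.3}$$

³ This approach is similar to the Moreau-Yosida regularization.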
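The regularized lower bound (2.3) can be evaluated in closed form in simple cases. The sketch below is a toy instance under stated assumptions: scalar $x$, $X = \mathbb{R}$, $\phi(x) = x^2/2$, a discrete measure with weights $a_i$, and $\alpha^{(t)} - A^{(t)}$ represented by the hypothetical parameter `alpha0`. The objective $f(x) = (x-1)^2/2$ and the sample points are likewise illustrative. The bound uses $f(x^*)$ and $\phi(x^*)$, which are known here only because the toy problem is solved by hand.

```python
def lower_bound(points, weights, f, grad_f, f_opt, phi_opt, alpha0=0.0):
    """Evaluate (2.3) for scalar x, X = R, phi(x) = x^2 / 2.

    alpha0 plays the role of alpha - A; f_opt = f(x*), phi_opt = phi(x*).
    """
    A = sum(weights)
    f_sum = sum(a * f(x) for a, x in zip(weights, points))
    g = sum(a * grad_f(x) for a, x in zip(weights, points))       # equals -z
    c = sum(a * grad_f(x) * x for a, x in zip(weights, points))
    # min over x of { g * x - c + x^2 / 2 } is attained at x = -g.
    min_val = -0.5 * g * g - c
    return (f_sum + min_val + alpha0 * f_opt - phi_opt) / (A + alpha0)

f = lambda x: 0.5 * (x - 1.0) ** 2     # toy objective, minimized at x* = 1
grad_f = lambda x: x - 1.0
f_opt, phi_opt = 0.0, 0.5              # f(x*) and phi(x*) = 1/2

L = lower_bound([0.0, 0.5, 2.0], [1.0, 1.0, 1.0], f, grad_f,
                f_opt, phi_opt, alpha0=1.0)
```

By construction the returned value never exceeds `f_opt`, whatever points and non-negative weights are supplied, which is the property the analysis relies on.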

Dual View of the Lower Bound

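Let $\psi_t(\mathbf{x}) = -\frac{1}{A^{(t)}} \phi(\mathbf{x})$, $\mathbf{u}^{(t)} = \frac{1}{A^{(t)}} \int_{t_0}^{t} \nabla f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)}$. Then, the minimum in the lower bound defines a concave conjugate of $\psi_t$ at $\mathbf{u}^{(t)}$, thus:

$$L^{(t)} = \frac{\int_{t_0}^{t} \big(f(\mathbf{x}^{(\tau)}) - \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x}^{(\tau)} \rangle\big) \, d\alpha^{(\tau)}}{\alpha^{(t)}} + \frac{A^{(t)} \psi_t^*(\mathbf{u}^{(t)}) + (\alpha^{(t)} - A^{(t)}) f(\mathbf{x}^*) + A^{(t)} \psi_t(\mathbf{x}^*)}{\alpha^{(t)}}.$$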

By the definition of a convex conjugate and Fact 1.6, . Applying Jensen’s inequality and using that , the lower bound implies:

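$$f(\mathbf{x}^*) - \psi_t(\mathbf{x}^*) \ge -\frac{A^{(t)} \big(f^*(\mathbf{u}^{(t)}) - \psi_t^*(\mathbf{u}^{(t)})\big)}{\alpha^{(t)}} + \frac{(\alpha^{(t)} - A^{(t)}) \big(f(\mathbf{x}^*) - \psi_t(\mathbf{x}^*)\big)}{\alpha^{(t)}} \;\Leftrightarrow\; f(\mathbf{x}^*) - \psi_t(\mathbf{x}^*) \ge -\big(f^*(\mathbf{u}^{(t)}) - \psi_t^*(\mathbf{u}^{(t)})\big).$$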

That is, the lower bound is exactly obtained from the Fenchel dual of the function $f - \psi_t$ evaluated at the average gradient $\mathbf{u}^{(t)}$, and corrected by the approximation error $\psi_t(\mathbf{x}^*)$.

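We note that we have not used anything special about $\mathbf{x}^*$: in fact, using the same arguments for an arbitrary point $\mathbf{x} \in X$ would lead to the inequality $f(\mathbf{x}) - \psi_t(\mathbf{x}) \ge -\big(f^*(\mathbf{u}^{(t)}) - \psi_t^*(\mathbf{u}^{(t)})\big)$. Therefore, the lower bound is constructed from the Fenchel dual of the approximation $f - \psi_t$ of our objective function.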

Extension to Strongly Convex Objectives

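When the objective is $\sigma$-strongly convex for some $\sigma > 0$, we can use $\sigma$-strong convexity (instead of just regular convexity) in the construction of the lower bound. This will generally give us a better lower bound, which will lead to better convergence guarantees in the discrete-time domain. It is not hard to verify (by repeating the same arguments as above) that in this case we have the following lower bound:

$$L^{(t)} = \frac{\int_{t_0}^{t} f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)}}{\alpha^{(t)}} + \frac{\min_{\mathbf{x} \in X}\Big\{\int_{t_0}^{t} \big(\langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x} - \mathbf{x}^{(\tau)} \rangle + \frac{\sigma}{2}\|\mathbf{x} - \mathbf{x}^{(\tau)}\|^2\big) \, d\alpha^{(\tau)} + \phi(\mathbf{x})\Big\}}{\alpha^{(t)}} + \frac{(\alpha^{(t)} - A^{(t)}) f(\mathbf{x}^*) - \phi(\mathbf{x}^*)}{\alpha^{(t)}}. \tag{2.4}$$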

Extension to Composite Objectives

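Suppose now that we have a composite objective $\bar{f}(\mathbf{x}) = f(\mathbf{x}) + \psi(\mathbf{x})$. Then, we can choose to apply the convexity argument only to $f$ and use $\psi$ as a regularizer (this will be particularly useful in the discrete-time domain in the settings where $f$ has some smoothness properties while $\psi$ is generally non-smooth). Therefore, we could start with $\bar{f}(\mathbf{x}^*) \ge \frac{1}{A^{(t)}} \int_{t_0}^{t} \big(f(\mathbf{x}^{(\tau)}) + \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x}^* - \mathbf{x}^{(\tau)} \rangle\big) \, d\alpha^{(\tau)} + \psi(\mathbf{x}^*)$. Repeating the same arguments as in the general construction of the lower bound presented earlier in this subsection:

$$L^{(t)} \stackrel{\mathrm{def}}{=} \frac{\int_{t_0}^{t} f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)} + (\alpha^{(t)} - A^{(t)}) \bar{f}(\mathbf{x}^*) - \phi(\mathbf{x}^*)}{\alpha^{(t)}} + \frac{\min_{\mathbf{x} \in X}\Big\{\int_{t_0}^{t} \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x} - \mathbf{x}^{(\tau)} \rangle \, d\alpha^{(\tau)} + A^{(t)} \psi(\mathbf{x}) + \phi(\mathbf{x})\Big\}}{\alpha^{(t)}}. \tag{2.5}$$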

2.3 The Approximate Gap

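The gap is simply defined as $G^{(t)} = U^{(t)} - L^{(t)}$. The main idea is to show that $\alpha^{(t)} G^{(t)}$ is a non-increasing function of time $t$, implying that $G^{(t)} \le \frac{\alpha^{(t_0)}}{\alpha^{(t)}} G^{(t_0)}$. In some discrete-time cases, we will relax this requirement and allow $\alpha^{(t)} G^{(t)}$ to increase by a sufficiently small error.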

Extension to Monotone Operators and Saddle-Point Formulations

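The notion of the approximate gap can be defined for problem classes beyond convex minimization. Examples are monotone operators and convex-concave saddle-point problems. Given a monotone operator $F: X \to \mathbb{R}^n$, the goal is to find a point $\mathbf{x}^* \in X$ such that $\langle F(\mathbf{u}), \mathbf{x}^* - \mathbf{u} \rangle \le 0$, $\forall \mathbf{u} \in X$. The approximate version of this problem is to find a point $\hat{\mathbf{x}} \in X$ such that:

$$\langle F(\mathbf{u}), \hat{\mathbf{x}} - \mathbf{u} \rangle \le \epsilon, \quad \forall \mathbf{u} \in X, \tag{2.6}$$

and we can think of $\epsilon$ on the right-hand side of (2.6) as the optimality gap.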

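The property of monotone operators useful for the approximate gap analysis is, $\forall \mathbf{x}, \mathbf{u} \in X$: $\langle F(\mathbf{u}), \mathbf{x} - \mathbf{u} \rangle \le \langle F(\mathbf{x}), \mathbf{x} - \mathbf{u} \rangle$. The approximate gap can be constructed using the same ideas as in the case of a convex function, which, letting $\hat{\mathbf{x}}^{(t)} = \frac{1}{\alpha^{(t)}} \Big(\int_{t_0}^{t} \mathbf{x}^{(\tau)} \, d\alpha^{(\tau)} + (\alpha^{(t)} - A^{(t)}) \mathbf{x}^{(0)}\Big)$, gives, $\forall \mathbf{u} \in X$:

$$\langle F(\mathbf{u}), \hat{\mathbf{x}}^{(t)} - \mathbf{u} \rangle \le G^{(t)} + \frac{\alpha^{(t)} - A^{(t)}}{\alpha^{(t)}} \max_{\mathbf{v} \in X} \langle F(\mathbf{v}), \mathbf{x}^{(0)} - \mathbf{v} \rangle. \tag{2.7}$$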

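Now assume that we want to find a saddle point of a function $\Phi(\mathbf{v}, \mathbf{w})$ that is convex in $\mathbf{v}$ and concave in $\mathbf{w}$. By convexity in $\mathbf{v}$ and concavity in $\mathbf{w}$, we have that, for all $\mathbf{v} \in V$ and all $\mathbf{w} \in W$:

$$\Phi(\mathbf{v}, \mathbf{w}^{(\tau)}) - \Phi(\mathbf{v}^{(\tau)}, \mathbf{w}^{(\tau)}) \ge \langle \nabla_{\mathbf{v}} \Phi(\mathbf{v}^{(\tau)}, \mathbf{w}^{(\tau)}), \mathbf{v} - \mathbf{v}^{(\tau)} \rangle, \tag{2.8}$$

$$\Phi(\mathbf{v}^{(\tau)}, \mathbf{w}) - \Phi(\mathbf{v}^{(\tau)}, \mathbf{w}^{(\tau)}) \le \langle \nabla_{\mathbf{w}} \Phi(\mathbf{v}^{(\tau)}, \mathbf{w}^{(\tau)}), \mathbf{w} - \mathbf{w}^{(\tau)} \rangle, \tag{2.9}$$

where $\nabla_{\mathbf{v}}$ (resp. $\nabla_{\mathbf{w}}$) denotes the gradient w.r.t. $\mathbf{v}$ (resp. $\mathbf{w}$).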

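Combining (2.8) and (2.9), it follows that, $\forall \mathbf{v} \in V$, $\forall \mathbf{w} \in W$:

$$\Phi(\mathbf{v}^{(\tau)}, \mathbf{w}) - \Phi(\mathbf{v}, \mathbf{w}^{(\tau)}) \le \langle \nabla_{\mathbf{w}} \Phi(\mathbf{v}^{(\tau)}, \mathbf{w}^{(\tau)}), \mathbf{w} - \mathbf{w}^{(\tau)} \rangle - \langle \nabla_{\mathbf{v}} \Phi(\mathbf{v}^{(\tau)}, \mathbf{w}^{(\tau)}), \mathbf{v} - \mathbf{v}^{(\tau)} \rangle.$$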

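Let $\mathbf{x} = [\mathbf{v}, \mathbf{w}]^T$, $F(\mathbf{x}) = [\nabla_{\mathbf{v}} \Phi(\mathbf{v}, \mathbf{w}), -\nabla_{\mathbf{w}} \Phi(\mathbf{v}, \mathbf{w})]^T$, and $\bar{\mathbf{v}} = \frac{1}{A^{(t)}} \int_{t_0}^{t} \mathbf{v}^{(\tau)} \, d\alpha^{(\tau)}$, $\bar{\mathbf{w}} = \frac{1}{A^{(t)}} \int_{t_0}^{t} \mathbf{w}^{(\tau)} \, d\alpha^{(\tau)}$. Then, we have, $\forall \mathbf{x} \in V \times W$:

$$\Phi(\bar{\mathbf{v}}, \mathbf{w}) - \Phi(\mathbf{v}, \bar{\mathbf{w}}) \le \frac{\int_{t_0}^{t} \langle F(\mathbf{x}^{(\tau)}), \mathbf{x}^{(\tau)} - \mathbf{x} \rangle \, d\alpha^{(\tau)}}{A^{(t)}},$$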

and using the same arguments as before, we obtain the same bound for the gap as (2.7). Therefore, we can focus only on analyzing the decrease of $G^{(t)}$ from (2.7) as a function of $t$, and the same result will follow for the gap of convex-concave saddle-point problems.

3 First-Order Methods in Continuous Time

We now show how different choices of the upper bound and the lower bound (and, consequently, the gap) lead to different first-order methods.

3.1 Mirror Descent/Dual Averaging Methods

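Recall the expression for the general lower bound (2.3). Since $\frac{1}{\alpha^{(t)}} \int_{t_0}^{t} f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)} + \frac{\alpha^{(t)} - A^{(t)}}{\alpha^{(t)}} f(\mathbf{x}^*)$ is a convex combination of the objective function values evaluated at feasible points, it would be a natural choice of an upper bound. However, since we do not know $\mathbf{x}^*$ and we would like to have a point $\hat{\mathbf{x}}^{(t)}$ such that $f(\hat{\mathbf{x}}^{(t)}) \le U^{(t)}$ (this would be the point that the algorithm returns at time $t$), we can choose:

$$U^{(t)} = \frac{\int_{t_0}^{t} f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)}}{\alpha^{(t)}} + \frac{\alpha^{(t)} - A^{(t)}}{\alpha^{(t)}} f(\mathbf{x}^{(0)}),$$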

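where $\mathbf{x}^{(0)}$ is an arbitrary initial point. Then, the gap becomes:

$$G^{(t)} = -\frac{\min_{\mathbf{x} \in X}\Big\{\int_{t_0}^{t} \langle \nabla f(\mathbf{x}^{(\tau)}), \mathbf{x} - \mathbf{x}^{(\tau)} \rangle \, d\alpha^{(\tau)} + \phi(\mathbf{x})\Big\}}{\alpha^{(t)}} + \frac{(\alpha^{(t)} - A^{(t)}) \big(f(\mathbf{x}^{(0)}) - f(\mathbf{x}^*)\big) + \phi(\mathbf{x}^*)}{\alpha^{(t)}}. \tag{3.1}$$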

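Observe that $f(\hat{\mathbf{x}}^{(t)}) - f(\mathbf{x}^*) \le G^{(t)}$. Thus, if we show that $\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) \le 0$, this would immediately imply:

$$f(\hat{\mathbf{x}}^{(t)}) - f(\mathbf{x}^*) \le \frac{\alpha^{(t_0)} G^{(t_0)}}{\alpha^{(t)}}.$$

To ensure that $G^{(t_0)}$ is bounded, a common choice is $\phi(\mathbf{x}) = D_\psi(\mathbf{x}, \mathbf{x}^{(0)})$ for some convex function $\psi$.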

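Now we show that ensuring the invariance $\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) = 0$ exactly produces the continuous-time mirror descent dynamics. Recall that $\mathbf{z}^{(t)} = -\int_{t_0}^{t} \nabla f(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)}$. Using (1.2) and Proposition 1.7:

$$\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) = -\big\langle \nabla f(\mathbf{x}^{(t)}), \nabla \phi^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)} \big\rangle \dot{\alpha}^{(t)}.$$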

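Thus, to have $\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) = 0$, we can set $\mathbf{x}^{(t)} = \nabla \phi^*(\mathbf{z}^{(t)})$, which is precisely the mirror descent dynamics from [15]:

$$\begin{aligned} &\dot{\mathbf{z}}^{(t)} = -\dot{\alpha}^{(t)} \nabla f(\mathbf{x}^{(t)}), \\ &\mathbf{x}^{(t)} = \nabla \phi^*(\mathbf{z}^{(t)}), \\ &\dot{\hat{\mathbf{x}}}^{(t)} = \dot{\alpha}^{(t)} \frac{\mathbf{x}^{(t)} - \hat{\mathbf{x}}^{(t)}}{\alpha^{(t)}}, \\ &\mathbf{z}^{(t_0)} = \mathbf{0}, \ \hat{\mathbf{x}}^{(t_0)} = \mathbf{x}^{(0)}, \text{ for arbitrary initial point } \mathbf{x}^{(0)} \in X. \end{aligned} \tag{CT-MD}$$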

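Observe that, by replacing $\nabla f(\mathbf{x}^{(t)})$ by $F(\mathbf{x}^{(t)})$ and by the same arguments, (CT-MD) also ensures $\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) = 0$ for the gap (2.7) derived for monotone operators and saddle-point problems.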

The described results are summarized in the following lemma:

Lemma 3.1.

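Let $\mathbf{x}^{(t)}, \hat{\mathbf{x}}^{(t)}, \mathbf{z}^{(t)}$ evolve according to (CT-MD) for some convex function $\phi$. Then, $\forall t \ge t_0$:

$$f(\hat{\mathbf{x}}^{(t)}) - f(\mathbf{x}^*) \le \frac{\alpha^{(t_0)} \big(f(\mathbf{x}^{(0)}) - f(\mathbf{x}^*)\big) + \phi(\mathbf{x}^*)}{\alpha^{(t)}}.$$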

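If instead of convex minimization we are given a variational inequality problem with monotone operator $F$, then a version of (CT-MD) that replaces $\nabla f(\mathbf{x}^{(t)})$ by $F(\mathbf{x}^{(t)})$ ensures that, $\forall t \ge t_0$, $\forall \mathbf{u} \in X$:

$$\langle F(\mathbf{u}), \hat{\mathbf{x}}^{(t)} - \mathbf{u} \rangle \le \frac{\max_{\mathbf{x}' \in X} \phi(\mathbf{x}') + \alpha^{(t_0)} \max_{\mathbf{x}'' \in X} \langle F(\mathbf{x}''), \mathbf{x}^{(0)} - \mathbf{x}'' \rangle}{\alpha^{(t)}}.$$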

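Moreover, for a convex-concave saddle-point problem $\min_{\mathbf{v} \in V} \max_{\mathbf{w} \in W} \Phi(\mathbf{v}, \mathbf{w})$, taking $\mathbf{x} = [\mathbf{v}, \mathbf{w}]^T$, $X = V \times W$, the version of (CT-MD) that uses the monotone operator $F(\mathbf{x}) = [\nabla_{\mathbf{v}} \Phi(\mathbf{v}, \mathbf{w}), -\nabla_{\mathbf{w}} \Phi(\mathbf{v}, \mathbf{w})]^T$ ensures that, $\forall t \ge t_0$, $\forall \mathbf{v} \in V$, $\forall \mathbf{w} \in W$:

$$\Phi(\hat{\mathbf{v}}^{(t)}, \mathbf{w}) - \Phi(\mathbf{v}, \hat{\mathbf{w}}^{(t)}) \le \frac{\max_{\mathbf{x}' \in X} \phi(\mathbf{x}') + \alpha^{(t_0)} \max_{\mathbf{v}'' \in V, \mathbf{w}'' \in W} \big\{\Phi(\hat{\mathbf{v}}^{(t_0)}, \mathbf{w}'') - \Phi(\mathbf{v}'', \hat{\mathbf{w}}^{(t_0)})\big\}}{\alpha^{(t)}}.$$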
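The convex-minimization bound of Lemma 3.1 can be checked numerically on a toy problem. The Python sketch below integrates (CT-MD) with a forward-Euler step, which is only an approximation of the continuous-time dynamics. All concrete choices are assumptions made for illustration: scalar $x$, $X = \mathbb{R}$, $f(x) = (x-1)^2/2$, $\phi(x) = x^2/2$ (so $\nabla \phi^*(z) = z$), $\alpha^{(t)} = 1 + t$, $t_0 = 0$, and $x^{(0)} = 0$.

```python
def simulate_ct_md(T=10.0, h=1e-3):
    """Forward-Euler integration of (CT-MD) on a toy scalar problem."""
    f = lambda x: 0.5 * (x - 1.0) ** 2      # minimized at x* = 1, f(x*) = 0
    grad_f = lambda x: x - 1.0
    alpha = lambda t: 1.0 + t               # assumed rate function
    alpha_dot = 1.0
    f_opt, phi_opt = 0.0, 0.5               # f(x*) and phi(x*) = 1/2

    x0 = 0.0
    z, x_hat, t = 0.0, x0, 0.0              # z(t0) = 0, x_hat(t0) = x(0)
    for _ in range(int(round(T / h))):
        x = z                               # x = grad phi^*(z) for phi = x^2/2
        z += h * (-alpha_dot * grad_f(x))   # aggregation of negated gradients
        x_hat += h * alpha_dot * (x - x_hat) / alpha(t)  # averaging
        t += h

    gap = f(x_hat) - f_opt
    bound = (alpha(0.0) * (f(x0) - f_opt) + phi_opt) / alpha(t)
    return gap, bound

gap, bound = simulate_ct_md()
```

With these choices the optimality gap of the averaged point stays below the bound of the lemma by a comfortable margin, so the small error introduced by the Euler step does not affect the comparison.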

3.2 Accelerated Convex Minimization

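The previous subsection considered the choice of the upper bound that takes all seen points into account. We may hope, however, that there is an algorithm whose objective function value at the last seen point is “good enough”. Interestingly, we will see that choosing $U^{(t)} = f(\mathbf{x}^{(t)})$ and $L^{(t)}$ from (2.3) results in the accelerated dynamics. To do so, as before, we need to show that the invariance condition $\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) = 0$ is satisfied for some choice of $\mathbf{x}^{(t)}$. We have:

$$\begin{aligned} \frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) &= \frac{d}{dt}\big(\alpha^{(t)} f(\mathbf{x}^{(t)})\big) - \dot{\alpha}^{(t)} \Big(f(\mathbf{x}^{(t)}) + \big\langle \nabla f(\mathbf{x}^{(t)}), \nabla \phi^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)} \big\rangle\Big) \\ &= \alpha^{(t)} \frac{d}{dt}\big(f(\mathbf{x}^{(t)})\big) - \dot{\alpha}^{(t)} \big\langle \nabla f(\mathbf{x}^{(t)}), \nabla \phi^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)} \big\rangle \\ &= \big\langle \nabla f(\mathbf{x}^{(t)}), \alpha^{(t)} \dot{\mathbf{x}}^{(t)} - \dot{\alpha}^{(t)} \big(\nabla \phi^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)}\big) \big\rangle, \end{aligned}$$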

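where we have used $\frac{d}{dt} f(\mathbf{x}^{(t)}) = \langle \nabla f(\mathbf{x}^{(t)}), \dot{\mathbf{x}}^{(t)} \rangle$. Choosing $\dot{\mathbf{x}}^{(t)} = \dot{\alpha}^{(t)} \frac{\nabla \phi^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)}}{\alpha^{(t)}}$, we get $\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) = 0$. This is precisely the accelerated mirror descent dynamics [11, 29], and we immediately get the convergence result stated as Lemma 3.2 below.

$$\begin{aligned} &\dot{\mathbf{z}}^{(t)} = -\dot{\alpha}^{(t)} \nabla f(\mathbf{x}^{(t)}), \\ &\dot{\mathbf{x}}^{(t)} = \dot{\alpha}^{(t)} \frac{\nabla \phi^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)}}{\alpha^{(t)}}, \\ &\mathbf{z}^{(t_0)} = \mathbf{0}, \ \mathbf{x}^{(t_0)} = \mathbf{x}^{(0)}, \text{ for arbitrary initial point } \mathbf{x}^{(0)} \in X. \end{aligned} \tag{CT-AMD}$$

Lemma 3.2.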

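Let $\mathbf{x}^{(t)}, \mathbf{z}^{(t)}$ evolve according to (CT-AMD), for some convex function $\phi$. Then, $\forall t \ge t_0$:

$$f(\mathbf{x}^{(t)}) - f(\mathbf{x}^*) \le \frac{\alpha^{(t_0)} \big(f(\mathbf{x}^{(0)}) - f(\mathbf{x}^*)\big) + \phi(\mathbf{x}^*)}{\alpha^{(t)}}.$$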
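The accelerated dynamics can be explored in the same way. The sketch below integrates (CT-AMD) with forward Euler on an assumed toy instance: scalar $x$, $X = \mathbb{R}$, $f(x) = (x-1)^2/2$, $\phi(x) = x^2/2$, $\alpha^{(t)} = (1+t)^2$, $t_0 = 0$, and $x^{(0)} = 0$. The faster-growing $\alpha^{(t)}$ is a choice made here to show a $1/t^2$-type rate for the last iterate; it is not prescribed by the lemma.

```python
def simulate_ct_amd(T=5.0, h=1e-3):
    """Forward-Euler integration of (CT-AMD) on a toy scalar problem."""
    f = lambda x: 0.5 * (x - 1.0) ** 2      # minimized at x* = 1, f(x*) = 0
    grad_f = lambda x: x - 1.0
    f_opt, phi_opt = 0.0, 0.5               # f(x*) and phi(x*) = 1/2

    x0 = 0.0
    z, x, t = 0.0, x0, 0.0                  # z(t0) = 0, x(t0) = x(0)
    for _ in range(int(round(T / h))):
        a = (1.0 + t) ** 2                  # alpha(t), an assumed choice
        a_dot = 2.0 * (1.0 + t)             # its time derivative
        dz = -a_dot * grad_f(x)
        dx = a_dot * (z - x) / a            # grad phi^*(z) = z for phi = x^2/2
        z += h * dz
        x += h * dx
        t += h

    gap = f(x) - f_opt                      # last-iterate optimality gap
    bound = (1.0 * (f(x0) - f_opt) + phi_opt) / (1.0 + t) ** 2
    return gap, bound

gap_amd, bound_amd = simulate_ct_amd()
```

Unlike the mirror descent sketch, the guarantee here is for the last point $x^{(t)}$ rather than for an average, which is the distinction the choice $U^{(t)} = f(\mathbf{x}^{(t)})$ is meant to capture.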

3.3 Accelerated Strongly Convex Minimization

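We can also get a similar accelerated dynamics when the function $f$ is, in addition, strongly convex. In that case, we use $U^{(t)} = f(\mathbf{x}^{(t)})$ and the lower bound from (2.4). Let $\phi_t(\mathbf{x}) = \int_{t_0}^{t} \frac{\sigma}{2}\|\mathbf{x} - \mathbf{x}^{(\tau)}\|^2 \, d\alpha^{(\tau)} + \phi(\mathbf{x})$. Observe that $\frac{d}{d\tau} \phi_\tau(\mathbf{x}) \big|_{\tau = t} = \dot{\alpha}^{(t)} \frac{\sigma}{2}\|\mathbf{x} - \mathbf{x}^{(t)}\|^2 \ge 0$, $\forall \mathbf{x} \in X$. Then, we have the following result for the change in the gap:

$$\begin{aligned} \frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) &= \frac{d}{dt}\big(\alpha^{(t)} f(\mathbf{x}^{(t)})\big) - \dot{\alpha}^{(t)} \Big(f(\mathbf{x}^{(t)}) + \big\langle \nabla f(\mathbf{x}^{(t)}), \nabla \phi_t^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)} \big\rangle\Big) - \frac{d}{d\tau}\Big(\phi_\tau\big(\nabla \phi_t^*(\mathbf{z}^{(t)})\big)\Big)\Big|_{\tau = t} \\ &\le \big\langle \nabla f(\mathbf{x}^{(t)}), \alpha^{(t)} \dot{\mathbf{x}}^{(t)} - \dot{\alpha}^{(t)} \big(\nabla \phi_t^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)}\big) \big\rangle. \end{aligned}$$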

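Therefore, choosing $\dot{\mathbf{x}}^{(t)} = \dot{\alpha}^{(t)} \frac{\nabla \phi_t^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)}}{\alpha^{(t)}}$ gives $\frac{d}{dt}\big(\alpha^{(t)} G^{(t)}\big) \le 0$, and the convergence result stated as Lemma 3.3 below follows.

$$\begin{aligned} &\dot{\mathbf{z}}^{(t)} = -\dot{\alpha}^{(t)} \nabla f(\mathbf{x}^{(t)}), \\ &\dot{\mathbf{x}}^{(t)} = \dot{\alpha}^{(t)} \frac{\nabla \phi_t^*(\mathbf{z}^{(t)}) - \mathbf{x}^{(t)}}{\alpha^{(t)}}, \\ &\mathbf{z}^{(t_0)} = \mathbf{0}, \ \mathbf{x}^{(t_0)} = \mathbf{x}^{(0)}, \text{ for arbitrary initial point } \mathbf{x}^{(0)} \in X. \end{aligned} \tag{CT-ASC}$$

Lemma 3.3.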

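Let $\mathbf{x}^{(t)}, \mathbf{z}^{(t)}$ evolve according to (CT-ASC), for an arbitrary initial point $\mathbf{x}^{(0)} \in X$ and $\phi_t$ defined as above. Then, $\forall t \ge t_0$:

$$f(\mathbf{x}^{(t)}) - f(\mathbf{x}^*) \le \frac{\alpha^{(t_0)} \big(f(\mathbf{x}^{(0)}) - f(\mathbf{x}^*)\big) + \phi(\mathbf{x}^*)}{\alpha^{(t)}}.$$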

We note that, while there is no difference in the algorithm or in the convergence bound for (CT-AMD) and (CT-ASC) in the continuous-time domain, in the discrete time these two algorithms will lead to very different convergence bounds, due to the different discretization errors they incur.

3.4 Composite Mirror Descent

Now assume that the objective is composite: , where is convex and is easily computable, for . Then, we can use the lower bound for composite functions (2.5). Let upper bound be:

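$$\begin{aligned} U^{(t)} &= \frac{1}{\alpha^{(t)}} \int_{t_0}^{t} \bar{f}(\mathbf{x}^{(\tau)}) \, d\alpha^{(\tau)} + \frac{\alpha^{(t_0)}}{\alpha^{(t)}} \bar{f}(\mathbf{x}^{(0)}) \\ &= \frac{1}{\alpha^{(t)}} \int_{t_0}^{t} \big(f(\mathbf{x}^{(\tau)}) + \psi(\mathbf{x}^{(\tau)})\big) \, d\alpha^{(\tau)} + \frac{\alpha^{(t_0)}}{\alpha^{(t)}} \big(f(\mathbf{x}^{(0)}) + \psi(\mathbf{x}^{(0)})\big). \end{aligned}$$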

Then, the change in the gap is:

 ddt(α(t)G(t))=˙α(t)ψ(x(t))−˙α(t)⟨∇f(x(t))