Understanding the Acceleration Phenomenon via High-Resolution Differential Equations

Abstract

Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms—Nesterov’s accelerated gradient method for strongly convex functions (NAG-SC) and Polyak’s heavy-ball method—we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak’s heavy-ball method, but they allow the identification of a term that we refer to as “gradient correction” that is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov’s accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result—that NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.

Keywords. Convex optimization, first-order method, Polyak’s heavy ball method, Nesterov’s accelerated gradient methods, ordinary differential equation, Lyapunov function, gradient minimization, dimensional analysis, phase space representation, numerical stability

1 Introduction

Machine learning has become one of the major application areas for optimization algorithms during the past decade. While there have been many kinds of applications, to a wide variety of problems, the most prominent applications have involved large-scale problems in which the objective function is the sum over terms associated with individual data, such that stochastic gradients can be computed cheaply, while gradients are much more expensive and the computation (and/or storage) of Hessians is often infeasible. In this setting, simple first-order gradient descent algorithms have become dominant, and the effort to make these algorithms applicable to a broad range of machine learning problems has triggered a flurry of new research in optimization, both methodological and theoretical.

We will be considering unconstrained minimization problems,

(1.1)    min_{x ∈ ℝⁿ} f(x),

where f is a smooth convex function. Perhaps the simplest first-order method for solving this problem is gradient descent. Taking a fixed step size s > 0, gradient descent is implemented as the recursive rule

x_{k+1} = x_k − s∇f(x_k),

given an initial point x₀ ∈ ℝⁿ.

As has been known at least since the advent of conjugate gradient algorithms, improvements to gradient descent can be obtained within a first-order framework by using the history of past gradients. Modern research on such extended first-order methods arguably dates to Polyak [Pol64, Pol87], whose heavy-ball method incorporates a momentum term into the gradient step. This approach allows past gradients to influence the current step, while avoiding the complexities of conjugate gradients and permitting a stronger theoretical analysis. Explicitly, starting from initial points x₀ and x₁, the heavy-ball method updates the iterates according to

(1.2)    x_{k+1} = x_k + α(x_k − x_{k−1}) − s∇f(x_k),

where α > 0 is the momentum coefficient. While the heavy-ball method provably attains a faster rate of local convergence than gradient descent near a minimum of f, it does not come with global guarantees. Indeed, [LRP16] demonstrate that even for strongly convex functions the method can fail to converge for some choices of the step size.1
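To make the recursion concrete, the following minimal Python sketch runs the heavy-ball update (1.2) on a simple quadratic. The objective, step size, and momentum coefficient below are illustrative choices for this sketch only, not the specific settings analyzed later in the paper.

    import numpy as np

    def heavy_ball(grad_f, x0, x1, s, alpha, num_iters):
        # Heavy-ball recursion (1.2): x_{k+1} = x_k + alpha*(x_k - x_{k-1}) - s*grad_f(x_k).
        xs = [np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)]
        for _ in range(num_iters):
            x_prev, x = xs[-2], xs[-1]
            xs.append(x + alpha * (x - x_prev) - s * grad_f(x))
        return xs

    # Illustrative ill-conditioned quadratic: f(x) = 0.5 * x^T diag(1, 100) x.
    A = np.diag([1.0, 100.0])
    grad_f = lambda x: A @ x
    x0 = np.array([1.0, 1.0])
    iterates = heavy_ball(grad_f, x0, x0, s=1e-2, alpha=0.9, num_iters=200)
    print(iterates[-1])  # should be close to the minimizer (0, 0)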

The next major development in first-order methodology was due to Nesterov, who discovered a class of accelerated gradient methods that have a faster global convergence rate than gradient descent [Nes83, Nes13]. For a μ-strongly convex objective f with L-Lipschitz gradients, Nesterov’s accelerated gradient method (NAG-SC) involves the following pair of update equations:

(1.3)    y_{k+1} = x_k − s∇f(x_k),
         x_{k+1} = y_{k+1} + ((1 − √(μs))/(1 + √(μs)))(y_{k+1} − y_k),

given an initial point y₀ = x₀ ∈ ℝⁿ. Equivalently, NAG-SC can be written in a single-variable form that is similar to the heavy-ball method:

(1.4)    x_{k+1} = x_k + ((1 − √(μs))/(1 + √(μs)))(x_k − x_{k−1}) − s∇f(x_k) − ((1 − √(μs))/(1 + √(μs))) · s(∇f(x_k) − ∇f(x_{k−1})),

starting from x₀ and x₁ = x₀ − (2s∇f(x₀))/(1 + √(μs)). Like the heavy-ball method, NAG-SC blends gradient and momentum contributions into its update direction, but defines a specific momentum coefficient (1 − √(μs))/(1 + √(μs)). Nesterov also developed the estimate sequence technique to prove that NAG-SC achieves an accelerated linear convergence rate:

f(x_k) − f(x⋆) ≤ O((1 − √(μs))^k),

if the step size satisfies s ≤ 1/L. Moreover, for a (weakly) convex objective f with L-Lipschitz gradients, Nesterov defined a related accelerated gradient method (NAG-C) that takes the following form:

(1.5)    y_{k+1} = x_k − s∇f(x_k),
         x_{k+1} = y_{k+1} + (k/(k + 3))(y_{k+1} − y_k),

with y₀ = x₀ ∈ ℝⁿ. The choice of momentum coefficient k/(k + 3), which tends to one, is fundamental to the estimate-sequence-based argument used by Nesterov to establish the following inverse quadratic convergence rate:

(1.6)    f(x_k) − f(x⋆) ≤ O(‖x₀ − x⋆‖²/(sk²)),

for any step size s ≤ 1/L. Under an oracle model of optimization complexity, the convergence rates achieved by NAG-SC and NAG-C are optimal for smooth strongly convex functions and smooth convex functions, respectively [NY83].
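For concreteness, both Nesterov schemes can be implemented directly from the two-variable forms (1.3) and (1.5). The quadratic objective, the values of μ and L, and the iteration counts below are illustrative choices only.

    import numpy as np

    def nag_sc(grad_f, x0, s, mu, num_iters):
        # NAG-SC in the two-variable form (1.3), starting from y0 = x0.
        coef = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
        x = np.asarray(x0, dtype=float)
        y = x.copy()
        for _ in range(num_iters):
            y_next = x - s * grad_f(x)
            x = y_next + coef * (y_next - y)
            y = y_next
        return x

    def nag_c(grad_f, x0, s, num_iters):
        # NAG-C in the two-variable form (1.5), with momentum coefficient k/(k+3).
        x = np.asarray(x0, dtype=float)
        y = x.copy()
        for k in range(num_iters):
            y_next = x - s * grad_f(x)
            x = y_next + (k / (k + 3)) * (y_next - y)
            y = y_next
        return x

    # Illustrative strongly convex quadratic: f(x) = 0.5 * x^T diag(mu, L) x.
    mu, L = 0.1, 10.0
    A = np.diag([mu, L])
    grad_f = lambda x: A @ x
    x0 = np.array([1.0, 1.0])
    print(nag_sc(grad_f, x0, s=1.0 / L, mu=mu, num_iters=100))  # near the minimizer (0, 0)
    print(nag_c(grad_f, x0, s=1.0 / L, num_iters=100))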

1.1 Gradient Correction: Small but Essential

Throughout the present paper, we let α = (1 − √(μs))/(1 + √(μs)) and x₁ = x₀ − (2s∇f(x₀))/(1 + √(μs)) to define a specific implementation of the heavy-ball method in (1.2). This choice of the momentum coefficient and the second initial point renders the heavy-ball method and NAG-SC identical except for the last (small) term in (1.4). Despite their close resemblance, however, the two methods are in fact fundamentally different, with contrasting convergence results (see, for example, [Bub15]). Notably, the former algorithm in general only achieves local acceleration, while the latter achieves acceleration for all initial values of the iterate [LRP16]. As a numerical illustration, Figure 1 presents the trajectories that arise from the two methods when minimizing an ill-conditioned convex quadratic function. We see that the heavy-ball method exhibits pronounced oscillations throughout the iterations, whereas NAG-SC is monotone in the function value once the iteration counter exceeds a certain value.

This striking difference between the two methods can only be attributed to the last term in (1.4):

(1.7)    ((1 − √(μs))/(1 + √(μs))) · s(∇f(x_k) − ∇f(x_{k−1})),

which we refer to henceforth as the gradient correction2. This term corrects the update direction in NAG-SC by contrasting the gradients at consecutive iterates. Although an essential ingredient in NAG-SC, the effect of the gradient correction is unclear from the vantage point of the estimate-sequence technique used in Nesterov’s proof. Accordingly, while the estimate-sequence technique delivers a proof of acceleration for NAG-SC, it does not explain why the absence of the gradient correction prevents the heavy-ball method from achieving acceleration for strongly convex functions.

Figure 1: A numerical comparison between NAG-SC and the heavy-ball method on an ill-conditioned convex quadratic objective.
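The qualitative contrast in Figure 1 is easy to reproduce. Since the exact objective and initial iterate used in the figure are not reproduced here, the ill-conditioned quadratic below is a stand-in with the same flavor; the heavy-ball and NAG-SC iterations follow (1.2) and (1.4) with the choices of α and x₁ made in Section 1.1.

    import numpy as np

    # Stand-in ill-conditioned quadratic: f(x) = 0.5 * (mu * x1^2 + L * x2^2).
    mu, L = 0.01, 1.0
    f = lambda x: 0.5 * (mu * x[0] ** 2 + L * x[1] ** 2)
    grad_f = lambda x: np.array([mu * x[0], L * x[1]])

    s = 1.0 / L
    alpha = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    x0 = np.array([1.0, 1.0])
    x1 = x0 - 2 * s * grad_f(x0) / (1 + np.sqrt(mu * s))  # second initial point from Section 1.1

    hb, nag = [x0, x1], [x0, x1]
    for _ in range(200):
        xp, x = hb[-2], hb[-1]
        hb.append(x + alpha * (x - xp) - s * grad_f(x))                      # heavy-ball (1.2)
        xp, x = nag[-2], nag[-1]
        nag.append(x + alpha * (x - xp) - s * grad_f(x)
                   - alpha * s * (grad_f(x) - grad_f(xp)))                   # NAG-SC (1.4)

    # Heavy-ball errors tend to oscillate, while NAG-SC decays much more smoothly.
    print([float(f(hb[k])) for k in (50, 100, 150)])
    print([float(f(nag[k])) for k in (50, 100, 150)])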

A recent line of research has taken a different point of view on the theoretical analysis of acceleration, formulating the problem in continuous time and obtaining algorithms via discretization [SBC14, KBB15, WWJ16]. This can be done by taking continuous-time limits of existing algorithms to obtain ordinary differential equations (ODEs) that can be analyzed using the rich toolbox associated with ODEs, including Lyapunov functions3. For instance, [SBC16] shows that

(1.8)    Ẍ(t) + (3/t)Ẋ(t) + ∇f(X(t)) = 0,

with initial conditions X(0) = x₀ and Ẋ(0) = 0, is the exact limit of NAG-C (1.5) obtained by taking the step size s → 0. Alternatively, the starting point may be a Lagrangian or Hamiltonian framework [WWJ16]. In either case, the continuous-time perspective not only provides analytical power and intuition, but it also provides design tools for new accelerated algorithms.

Unfortunately, existing continuous-time formulations of acceleration stop short of differentiating between the heavy-ball method and NAG-SC. In particular, these two methods have the same limiting ODE (see, for example, [WRJ16]):

(1.9)    Ẍ(t) + 2√μ Ẋ(t) + ∇f(X(t)) = 0,

and, as a consequence, this ODE does not provide any insight into the stronger convergence results for NAG-SC as compared to the heavy-ball method. As will be shown in Section 2, this is because the gradient correction is an order-of-magnitude smaller than the other terms in (1.4) if s is small. Consequently, the gradient correction is not reflected in the low-resolution ODE (1.9) associated with NAG-SC, which is derived by simply taking s → 0 in both (1.2) and (1.4).
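As a sanity check on this point, one can integrate the shared low-resolution ODE (1.9) numerically; under the identification t = k√s, the single trajectory below serves as the continuous-time surrogate for both the heavy-ball method and NAG-SC, which is precisely why (1.9) cannot tell them apart. The quadratic objective, horizon, and step size are illustrative choices.

    import numpy as np
    from scipy.integrate import solve_ivp

    # Illustrative strongly convex quadratic: grad f(x) = A x.
    mu, L = 0.1, 10.0
    A = np.diag([mu, L])
    grad_f = lambda x: A @ x

    # Low-resolution ODE (1.9), X'' + 2*sqrt(mu)*X' + grad_f(X) = 0, as a first-order system.
    def rhs(t, z):
        x, v = z[:2], z[2:]
        return np.concatenate([v, -2 * np.sqrt(mu) * v - grad_f(x)])

    x0 = np.array([1.0, 1.0])
    sol = solve_ivp(rhs, (0.0, 20.0), np.concatenate([x0, np.zeros(2)]),
                    dense_output=True, rtol=1e-8, atol=1e-10)

    # Under t = k*sqrt(s), this single curve is the low-resolution surrogate for BOTH
    # the heavy-ball method and NAG-SC, so it cannot distinguish between them.
    s = 1.0 / L
    for k in (10, 50, 100):
        print(k, sol.sol(k * np.sqrt(s))[:2])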

1.2 Overview of Contributions

Just as there is not a single preferred way to discretize a differential equation, there is not a single preferred way to take a continuous-time limit of a difference equation. Inspired by dimensional-analysis strategies widely used in fluid mechanics, in which physical phenomena are investigated at multiple scales via the inclusion of various orders of perturbations [Ped13], we propose to incorporate O(√s) terms into the limiting process for obtaining an ODE, including a (Hessian-driven) term that derives from the gradient correction (1.7). This will yield high-resolution ODEs that differentiate between the NAG methods and the heavy-ball method.

We list the high-resolution ODEs that we derive in the paper here4:

  • The high-resolution ODE for the heavy-ball method (1.2):

    (1.10)    Ẍ(t) + 2√μ Ẋ(t) + (1 + √(μs))∇f(X(t)) = 0,

    with X(0) = x₀ and Ẋ(0) = −(2√s ∇f(x₀))/(1 + √(μs)).

  • The high-resolution ODE for NAG-SC (1.3):

    (1.11)    Ẍ(t) + 2√μ Ẋ(t) + √s ∇²f(X(t))Ẋ(t) + (1 + √(μs))∇f(X(t)) = 0,

    with X(0) = x₀ and Ẋ(0) = −(2√s ∇f(x₀))/(1 + √(μs)).

  • The high-resolution ODE for NAG-C (1.5):

    (1.12)    Ẍ(t) + (3/t)Ẋ(t) + √s ∇²f(X(t))Ẋ(t) + (1 + 3√s/(2t))∇f(X(t)) = 0,

    for t ≥ 3√s/2, with X(3√s/2) = x₀ and Ẋ(3√s/2) = −√s ∇f(x₀).

High-resolution ODEs are more accurate continuous-time counterparts for the corresponding discrete algorithms than low-resolution ODEs, thus allowing for a better characterization of the accelerated methods. This is illustrated in Figure 2, which presents trajectories and convergence of the discrete methods, and the low- and high-resolution ODEs. For both NAGs, the high-resolution ODEs are in much better agreement with the discrete methods than the low-resolution ODEs5. Moreover, for NAG-SC, its high-resolution ODE captures the non-oscillation pattern while the low-resolution ODE does not.

Figure 2: Top left and bottom left: trajectories and errors of NAG-SC and the heavy-ball method, in the same setting as Figure 1. Top right and bottom right: trajectories and errors of NAG-C. For the two bottom plots, we use the identification t = k√s between continuous time and the iteration counter for the x-axis.

The three new ODEs include O(√s) terms that are not present in the corresponding low-resolution ODEs (compare, for example, (1.12) and (1.8)). Note also that if we let s → 0, each high-resolution ODE reduces to its low-resolution counterpart. Thus, the difference between the heavy-ball method and NAG-SC is reflected only in their high-resolution ODEs: the gradient correction (1.7) of NAG-SC is preserved only in its high-resolution ODE, in the form √s ∇²f(X)Ẋ. This term, which we refer to as the (Hessian-driven) gradient correction, is connected with the discrete gradient correction by the approximate identity:

∇f(x_k) − ∇f(x_{k−1}) ≈ ∇²f(X(t))(x_k − x_{k−1}) ≈ √s ∇²f(X(t))Ẋ(t),

for small s, with the identification t = k√s. The gradient correction in NAG-C arises in the same fashion6. Interestingly, although both NAGs are first-order methods, their gradient corrections bring in second-order information from the objective function.
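The approximate identity above can be checked numerically in one dimension. The function f(x) = log cosh(x), the step size, and the displacement below are arbitrary illustrative choices; a non-quadratic function is used because for quadratics the first approximation is exact.

    import numpy as np

    # f(x) = log(cosh(x)): gradient tanh(x), Hessian sech(x)^2 (non-constant).
    grad_f = lambda x: np.tanh(x)
    hess_f = lambda x: 1.0 / np.cosh(x) ** 2

    s = 1e-4
    x_prev = 1.0
    x = x_prev - 0.3 * np.sqrt(s)           # consecutive iterates roughly sqrt(s) apart
    velocity = (x - x_prev) / np.sqrt(s)    # discrete surrogate for the velocity X'(t)

    discrete = grad_f(x) - grad_f(x_prev)                 # gradient difference in (1.7)
    hessian_driven = np.sqrt(s) * hess_f(x) * velocity    # sqrt(s) * Hessian * velocity
    print(discrete, hessian_driven)  # agree up to higher-order terms in sqrt(s)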

Despite being small, the gradient correction has a fundamental effect on the behavior of both NAGs, and this effect is revealed by inspection of the high-resolution ODEs. We provide two illustrations of this.

  • Effect of the gradient correction in acceleration. Viewing the coefficient of Ẋ as a damping ratio, the damping 2√μ + √s ∇²f(X) in the high-resolution ODE (1.11) of NAG-SC is adaptive to the position X, in contrast to the fixed damping ratio 2√μ in the ODE (1.10) for the heavy-ball method. To appreciate the effect of this adaptivity, imagine that the velocity Ẋ is highly correlated with an eigenvector of ∇²f(X) with a large eigenvalue, such that the large friction effectively “decelerates” the motion along the trajectory of the ODE (1.11) of NAG-SC. This feature of NAG-SC is appealing, as taking a cautious step in the presence of high curvature generally helps avoid oscillations. Figure 1 and the left plot of Figure 2 confirm the superiority of NAG-SC over the heavy-ball method in this respect.

    If we can translate this argument to the discrete case, we can understand why NAG-SC achieves acceleration globally for strongly convex functions while the heavy-ball method does not. We will be able to make this translation by leveraging the high-resolution ODEs to construct discrete-time Lyapunov functions that allow maximal step sizes to be characterized for NAG-SC and the heavy-ball method. The detailed analysis is given in Section 3.

  • Effect of the gradient correction in gradient norm minimization. We will also show how to exploit the high-resolution ODE of NAG-C to construct a continuous-time Lyapunov function to analyze convergence in the setting of a smooth convex objective with L-Lipschitz gradients. Interestingly, the time derivative of the Lyapunov function is not only negative, but it is bounded above by a negative multiple of √s t²‖∇f(X(t))‖². This bound arises from the gradient correction and, indeed, it cannot be obtained from the Lyapunov function studied in the low-resolution case by [SBC16]. This finer characterization in the high-resolution case allows us to establish a new phenomenon:

    min_{0 ≤ i ≤ k} ‖∇f(x_i)‖² ≤ O(‖x₀ − x⋆‖²/(s²(k + 1)³)).

    That is, we discover that NAG-C achieves an inverse cubic rate for minimizing the squared gradient norm. By comparison, from (1.6) and the L-Lipschitz continuity of ∇f we can only show that ‖∇f(x_k)‖² ≤ O(‖x₀ − x⋆‖²/(s²k²)). See Section 4 for further elaboration on this cubic rate for NAG-C; a small numerical sketch probing this rate follows below.
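The inverse cubic rate can be probed empirically. The sketch below runs NAG-C, as written in (1.5), on an illustrative quadratic and reports k³ · min_{i ≤ k} ‖∇f(x_i)‖², which should remain bounded if the cubic rate holds; this is a numerical illustration under arbitrary problem choices, not a proof.

    import numpy as np

    # Illustrative convex quadratic with L-Lipschitz gradient.
    A = np.diag([1.0, 50.0])
    grad_f = lambda x: A @ x
    L = 50.0
    s = 1.0 / L

    x = np.array([1.0, 1.0])
    y = x.copy()
    best_sq_grad = np.inf
    for k in range(10000):
        y_next = x - s * grad_f(x)                    # gradient step in (1.5)
        x = y_next + (k / (k + 3)) * (y_next - y)     # momentum step in (1.5)
        y = y_next
        best_sq_grad = min(best_sq_grad, float(np.linalg.norm(grad_f(x)) ** 2))
        if k + 1 in (100, 1000, 10000):
            # Under the cubic rate, k^3 * min_i ||grad f(x_i)||^2 should stay bounded.
            print(k + 1, (k + 1) ** 3 * best_sq_grad)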

As we will see, the high-resolution ODEs are based on a phase-space representation that provides a systematic framework for translating from continuous-time Lyapunov functions to discrete-time Lyapunov functions. In sharp contrast, the process for obtaining a discrete-time Lyapunov function for low-resolution ODEs presented by [SBC16] relies on “algebraic tricks” (see, for example, Theorem 6 of [SBC16]).

1.3 Related Work

There is a long history of using ODEs to analyze optimization methods [HM12, Sch00, Fio05]. Recently, the work of [SBC14, SBC16] has sparked a renewed interest in leveraging continuous dynamical systems to understand and design first-order methods and to provide more intuitive proofs for the discrete methods. Below is a rather incomplete review of recent work that uses continuous-time dynamical systems to study accelerated methods.

In the work of [WWJ16, WRJ16, BJW18], Lagrangian and Hamiltonian frameworks are used to generate a large class of continuous-time ODEs for a unified treatment of accelerated gradient-based methods. Indeed, [WWJ16] extend NAG-C to non-Euclidean settings, mirror descent and accelerated higher-order gradient methods, all from a single “Bregman Lagrangian.” In [WRJ16], the connection between ODEs and discrete algorithms is further strengthened by establishing an equivalence between the estimate sequence technique and Lyapunov function techniques, allowing for a principled analysis of the discretization of continuous-time ODEs. Recent papers have considered symplectic [BJW18] and Runge–Kutta [ZMSJ18] schemes for discretization of the low-resolution ODEs.

An ODE-based analysis of mirror descent has been pursued in another line of work by [KBB15, KBB16, KB17], delivering new connections between acceleration and constrained optimization, averaging and stochastic mirror descent.

In addition to the perspective of continuous-time dynamical systems, there has also been work on the acceleration from a control-theoretic point of view [LRP16, HL17, FRMP18] and from a geometric point of view [BLS15, CML17]. See also [OC15, FB15, GL16, LMH18, DFR18] for a number of other recent contributions to the study of the acceleration phenomenon.

1.4 Organization and Notation

The remainder of the paper is organized as follows. In Section 2, we briefly introduce our high-resolution ODE-based analysis framework. This framework is used in Section 3 to study the heavy-ball method and NAG-SC for smooth strongly convex functions. In Section 4, we turn our focus to NAG-C for a general smooth convex objective. In Section 5 we derive some extensions of NAG-C. We conclude the paper in Section 6 with a list of future research directions. Most technical proofs are deferred to the Appendix.

We mostly follow the notation of [Nes13], with slight modifications tailored to the present paper. Let 𝓕¹_L(ℝⁿ) be the class of L-smooth convex functions defined on ℝⁿ; that is, f ∈ 𝓕¹_L(ℝⁿ) if f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for all x, y ∈ ℝⁿ and its gradient is L-Lipschitz continuous in the sense that

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,

where ‖·‖ denotes the standard Euclidean norm and L > 0 is the Lipschitz constant. (Note that this implies that ∇f is also L′-Lipschitz for any L′ ≥ L.) The function class 𝓕²_L(ℝⁿ) is the subclass of 𝓕¹_L(ℝⁿ) such that each member has a Lipschitz-continuous Hessian. For 0 < μ ≤ L, let 𝓢¹_{μ,L}(ℝⁿ) denote the subclass of 𝓕¹_L(ℝⁿ) such that each member is μ-strongly convex, and let 𝓢²_{μ,L}(ℝⁿ) = 𝓕²_L(ℝⁿ) ∩ 𝓢¹_{μ,L}(ℝⁿ). That is, f ∈ 𝓢¹_{μ,L}(ℝⁿ) if f ∈ 𝓕¹_L(ℝⁿ) and

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2)‖y − x‖²

for all x, y ∈ ℝⁿ. Note that this is equivalent to the convexity of f(x) − (μ/2)‖x − x⋆‖², where x⋆ denotes a minimizer of the objective f.

2 The High-Resolution ODE Framework

This section introduces a high-resolution ODE framework for analyzing gradient-based methods, with NAG-SC as a guiding example. Given a (discrete) optimization algorithm, the first step in this framework is to derive a high-resolution ODE using dimensional analysis; the next step is to construct a continuous-time Lyapunov function to analyze properties of the ODE; the third step is to derive a discrete-time Lyapunov function from its continuous counterpart; and the last step is to translate properties of the ODE into corresponding properties of the original algorithm. The overall framework is illustrated in Figure 3.

Figure 3: An illustration of our high-resolution ODE framework. Algorithms are mapped to high-resolution ODEs via dimensional analysis, and continuous Lyapunov functions are mapped to discrete Lyapunov functions via a phase-space representation, yielding both Nesterov's acceleration and gradient norm minimization. The three solid straight lines represent Steps 1, 2 and 3, and the two curved lines denote Step 4. The dashed line is used to emphasize that it is difficult, if not impractical, to construct discrete Lyapunov functions directly from the algorithms.

Step 1: Deriving High-Resolution ODEs

Our focus is on the single-variable form (1.4) of NAG-SC. For any nonnegative integer k, let t_k = k√s and assume x_k = X(t_k) for some sufficiently smooth curve X(t). Performing a Taylor expansion in powers of √s, we get

(2.1)    x_{k+1} = X(t_k) + √s Ẋ(t_k) + (s/2)Ẍ(t_k) + (s^{3/2}/6)X⁽³⁾(t_k) + O(s²),
         x_{k−1} = X(t_k) − √s Ẋ(t_k) + (s/2)Ẍ(t_k) − (s^{3/2}/6)X⁽³⁾(t_k) + O(s²).

We now use a Taylor expansion for the gradient correction, which gives

(2.2)    ∇f(x_k) − ∇f(x_{k−1}) = √s ∇²f(X(t_k))Ẋ(t_k) + O(s).

Dividing both sides of (1.4) by √s and rearranging the equality, we can rewrite NAG-SC as

(2.3)    (x_{k+1} − x_k)/√s = ((1 − √(μs))/(1 + √(μs)))[(x_k − x_{k−1})/√s − √s(∇f(x_k) − ∇f(x_{k−1}))] − √s ∇f(x_k).

Next, plugging (2.1) and (2.2) into (2.3), we have7

Ẋ(t_k) + (√s/2)Ẍ(t_k) + (s/6)X⁽³⁾(t_k) + O(s^{3/2}) = ((1 − √(μs))/(1 + √(μs)))[Ẋ(t_k) − (√s/2)Ẍ(t_k) + (s/6)X⁽³⁾(t_k) − s ∇²f(X(t_k))Ẋ(t_k) + O(s^{3/2})] − √s ∇f(X(t_k)),

which can be rewritten as

(2√(μs)/(1 + √(μs)))Ẋ(t_k) + (√s/(1 + √(μs)))Ẍ(t_k) + ((1 − √(μs))/(1 + √(μs))) s ∇²f(X(t_k))Ẋ(t_k) + √s ∇f(X(t_k)) = O(s^{3/2}).

Multiplying both sides of the last display by (1 + √(μs))/√s, we obtain the following high-resolution ODE of NAG-SC:

Ẍ(t) + 2√μ Ẋ(t) + √s ∇²f(X(t))Ẋ(t) + (1 + √(μs))∇f(X(t)) = 0,

where we ignore any O(s) terms but retain the O(√s) terms (note that (2√(μs)/(1 + √(μs)))(s/6)X⁽³⁾ = O(s^{3/2}) and √(μs) · √s ∇²f(X)Ẋ = O(s)).

Our analysis is inspired by dimensional analysis [Ped13], a strategy widely used in physics to construct a series of differential equations that involve increasingly high-order terms corresponding to small perturbations. In more detail, taking a small s, one first derives a differential equation that consists only of O(1) terms, then derives a differential equation consisting of both O(1) and O(√s) terms, and next, one proceeds to obtain a differential equation consisting of O(1), O(√s) and O(s) terms. Higher-order terms in powers of √s are introduced sequentially until the main characteristics of the original algorithms have been extracted from the resulting approximating differential equation. Thus, we aim to understand Nesterov acceleration by incorporating O(√s) terms into the ODE, including the (Hessian-driven) gradient correction √s ∇²f(X)Ẋ, which results from the (discrete) gradient correction (1.7) in the single-variable form (1.4) of NAG-SC. We also show (see Appendix A.1 for the detailed derivation) that this term appears in the high-resolution ODE of NAG-C, but is not found in the high-resolution ODE of the heavy-ball method.

As shown below, each ODE admits a unique global solution under mild conditions on the objective, and this holds for an arbitrary step size s > 0. The solution is accurate in approximating its associated optimization method if s is small. To state the result, we use C²(I; ℝⁿ) to denote the class of twice-continuously-differentiable maps from an interval I to ℝⁿ, where I = [0, ∞) for the heavy-ball method and NAG-SC, and I = [3√s/2, ∞) for NAG-C.

Proposition 2.1.

For any f ∈ 𝓢²_{μ,L}(ℝⁿ), each of the ODEs (1.10) and (1.11) with the specified initial conditions has a unique global solution X ∈ C²([0, ∞); ℝⁿ). Moreover, the two methods converge to their high-resolution ODEs, respectively, in the sense that

lim_{s→0} max_{0 ≤ k ≤ T/√s} ‖x_k − X(k√s)‖ = 0

for any fixed T > 0.

In fact, Proposition 2.1 holds for T = ∞ because both the discrete iterates and the ODE trajectories converge to the unique minimizer x⋆ when the objective is strongly convex.

Proposition 2.2.

For any f ∈ 𝓕²_L(ℝⁿ), the ODE (1.12) with the specified initial conditions has a unique global solution X ∈ C²([3√s/2, ∞); ℝⁿ). Moreover, NAG-C converges to its high-resolution ODE in the sense that

lim_{s→0} max_{0 ≤ k ≤ T/√s} ‖x_k − X(√s(k + 3/2))‖ = 0

for any fixed T > 0.

The proofs of these propositions are given in Appendix A.3.1 and Appendix A.3.2.
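A quick way to probe Proposition 2.1 numerically is to integrate the high-resolution ODE (1.11) with an off-the-shelf solver and compare the trajectory with the NAG-SC iterates under the identification t = k√s. The quadratic objective (for which ∇²f is a constant matrix), the step size, and the horizon below are illustrative choices.

    import numpy as np
    from scipy.integrate import solve_ivp

    # Illustrative strongly convex quadratic: grad f(x) = A x, Hessian A.
    mu, L = 0.1, 10.0
    A = np.diag([mu, L])
    grad_f = lambda x: A @ x

    s = 0.01
    root_mus = np.sqrt(mu * s)
    x0 = np.array([1.0, 1.0])

    # High-resolution ODE (1.11) of NAG-SC as a first-order system in (X, X').
    def rhs(t, z):
        x, v = z[:2], z[2:]
        acc = -2 * np.sqrt(mu) * v - np.sqrt(s) * (A @ v) - (1 + root_mus) * grad_f(x)
        return np.concatenate([v, acc])

    v0 = -2 * np.sqrt(s) * grad_f(x0) / (1 + root_mus)   # initial velocity of (1.11)
    T = 10.0
    sol = solve_ivp(rhs, (0.0, T), np.concatenate([x0, v0]),
                    dense_output=True, rtol=1e-9, atol=1e-12)

    # NAG-SC iterates in the single-variable form (1.4), with x1 as in Section 1.1.
    alpha = (1 - root_mus) / (1 + root_mus)
    xs = [x0, x0 - 2 * s * grad_f(x0) / (1 + root_mus)]
    for _ in range(int(T / np.sqrt(s))):
        xp, x = xs[-2], xs[-1]
        xs.append(x + alpha * (x - xp) - s * grad_f(x)
                  - alpha * s * (grad_f(x) - grad_f(xp)))

    for k in (10, 50, 99):
        # The discrepancy shrinks as s decreases (Proposition 2.1).
        print(k, np.linalg.norm(xs[k] - sol.sol(k * np.sqrt(s))[:2]))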

Step 2: Analyzing ODEs Using Lyapunov Functions

With these high-resolution ODEs in place, the next step is to construct Lyapunov functions for analyzing the dynamics of the corresponding ODEs, as is done in previous work [SBC16, WRJ16, LRP16]. For NAG-SC, we consider the Lyapunov function

(2.4)    E(t) = (1 + √(μs))(f(X) − f(x⋆)) + (1/4)‖Ẋ‖² + (1/4)‖Ẋ + 2√μ(X − x⋆) + √s ∇f(X)‖².

The first and second terms, (1 + √(μs))(f(X) − f(x⋆)) and ‖Ẋ‖²/4, can be regarded, respectively, as the potential energy and kinetic energy, and the last term is a mix. For the mixed term, it is interesting to note that, along trajectories of (1.11), the time derivative of Ẋ + 2√μ(X − x⋆) + √s ∇f(X) equals −(1 + √(μs))∇f(X).

The differentiability of E(t) will allow us to investigate properties of the ODE (1.11) in a principled manner. For example, we will show that E(t) decreases exponentially along the trajectories of (1.11), recovering the accelerated linear convergence rate of NAG-SC. Furthermore, a comparison between the Lyapunov function of NAG-SC and that of the heavy-ball method will explain why the gradient correction yields acceleration in the former case. This is discussed in Section 3.1.

Step 3: Constructing Discrete Lyapunov Functions

Our framework makes it possible to translate continuous Lyapunov functions into discrete Lyapunov functions via a phase-space representation (see, for example, [Arn13]). We illustrate the procedure in the case of NAG-SC. The first step is to formulate explicit position and velocity updates:

(2.5)    x_{k+1} − x_k = √s v_k,
         v_{k+1} − v_k = −(2√(μs)/(1 − √(μs))) v_{k+1} − √s(∇f(x_{k+1}) − ∇f(x_k)) − ((1 + √(μs))/(1 − √(μs))) √s ∇f(x_{k+1}),

where the velocity variable is defined as:

v_k = (x_{k+1} − x_k)/√s.

The initial velocity is v₀ = −(2√s ∇f(x₀))/(1 + √(μs)). Interestingly, this phase-space representation has the flavor of symplectic discretization, in the sense that the update for the position x_{k+1} is explicit (it only depends on the last iterate (x_k, v_k)) while the update for the velocity v_{k+1} is implicit (it depends on the current iterates x_{k+1} and v_{k+1})8.

The representation (2.5) suggests translating the continuous-time Lyapunov function (2.4) into a discrete-time Lyapunov function of the following form:

(2.6)

by replacing continuous terms (e.g., Ẋ(t)) by their discrete counterparts (e.g., v_k). Akin to the continuous (2.4), the first three terms of (2.6) correspond to potential energy, kinetic energy, and mixed energy, respectively, from a mechanical perspective. To better appreciate this translation, note that certain factors appearing in (2.6) result directly from corresponding terms in the phase-space updates (2.5). The need for the final (small) negative term is technical; we discuss it in Section 3.2.

Step 4: Analyzing Algorithms Using Discrete Lyapunov Functions

The last step is to map properties of high-resolution ODEs to corresponding properties of optimization methods. This step closely mimics Step 2 except that now the object is a discrete algorithm and the tool is a discrete Lyapunov function such as (2.6). Given that Step 2 has been performed, this translation is conceptually straightforward, albeit often calculation-intensive. For example, using the discrete Lyapunov function (2.6), we will recover the optimal linear rate of NAG-SC and gain insights into the fundamental effect of the gradient correction in accelerating NAG-SC. In addition, NAG-C is shown to minimize the squared gradient norm at an inverse cubic rate by a simple analysis of the decreasing rate of its discrete Lyapunov function.

3 Gradient Correction for Acceleration

In this section, we use our high-resolution ODE framework to analyze NAG-SC and the heavy-ball method. Section 3.1 focuses on the ODEs with an objective function f ∈ 𝓢²_{μ,L}(ℝⁿ), and in Section 3.2 we extend the results to the discrete case for f ∈ 𝓢¹_{μ,L}(ℝⁿ). Finally, in Section 3.3 we offer a comparative study of NAG-SC and the heavy-ball method from a finite-difference viewpoint.

Throughout this section, the strategy is to analyze the two methods in parallel, thereby highlighting the differences between them. In particular, the comparison will demonstrate the vital role of the gradient correction, namely (1.7) in the discrete case and √s ∇²f(X)Ẋ in the ODE case, in making NAG-SC an accelerated method.

3.1 The ODE Case

The following theorem characterizes the convergence rate of the high-resolution ODE corresponding to NAG-SC.

Theorem 1 (Convergence of the NAG-SC ODE).

Let f ∈ 𝓢²_{μ,L}(ℝⁿ). For any step size 0 < s ≤ 1/L, the solution X = X(t) of the high-resolution ODE (1.11) satisfies

f(X(t)) − f(x⋆) ≤ (2‖x₀ − x⋆‖²/s) e^{−√μ t/4}.

The theorem states that the functional value tends to the minimum at a linear rate. By setting s = 1/L, we obtain f(X(t)) − f(x⋆) ≤ 2L‖x₀ − x⋆‖² e^{−√μ t/4}.
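As a rough numerical companion to Theorem 1, the sketch below integrates the high-resolution ODE (1.11) for a one-dimensional quadratic and monitors e^{√μ t/4}(f(X(t)) − f(x⋆)), which should remain bounded if the exponential rate holds; the objective and step size are illustrative, and the constants in the theorem are not reproduced.

    import numpy as np
    from scipy.integrate import solve_ivp

    # One-dimensional illustrative objective f(x) = x^2 / 2, so mu = L = 1 and grad f(x) = x.
    mu, L = 1.0, 1.0
    s = 0.25                           # a step size with 0 < s <= 1/L
    root_mus = np.sqrt(mu * s)

    # ODE (1.11) for this f: X'' + 2*sqrt(mu)*X' + sqrt(s)*X' + (1 + sqrt(mu*s))*X = 0.
    def rhs(t, z):
        x, v = z
        return [v, -2 * np.sqrt(mu) * v - np.sqrt(s) * v - (1 + root_mus) * x]

    x0 = 1.0
    v0 = -2 * np.sqrt(s) * x0 / (1 + root_mus)   # initial velocity, since grad f(x0) = x0
    sol = solve_ivp(rhs, (0.0, 10.0), [x0, v0], dense_output=True, rtol=1e-10, atol=1e-12)

    for t in (2.0, 5.0, 10.0):
        fval = 0.5 * sol.sol(t)[0] ** 2
        # If f(X(t)) - f* = O(exp(-sqrt(mu)*t/4)), the ratio below stays bounded (here it shrinks).
        print(t, fval * np.exp(np.sqrt(mu) * t / 4))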

The proof of Theorem 1 is based on analyzing the Lyapunov function for the high-resolution ODE of NAG-SC. Recall that E(t) defined in (2.4) is

E(t) = (1 + √(μs))(f(X) − f(x⋆)) + (1/4)‖Ẋ‖² + (1/4)‖Ẋ + 2√μ(X − x⋆) + √s ∇f(X)‖².

The next lemma states the key property of this Lyapunov function that we need.

Lemma 3.1 (Lyapunov function for the NAG-SC ODE).

Let f ∈ 𝓢²_{μ,L}(ℝⁿ). For any step size 0 < s ≤ 1/L, and with X = X(t) being the solution to the high-resolution ODE (1.11), the Lyapunov function (2.4) satisfies

(3.1)

The proof of the theorem relies on Lemma 3.1 through the inequality dE(t)/dt ≤ −(√μ/4)E(t). The additional negative term in (3.1) plays no role at the moment, but Section 3.2 will shed light on its profound effect in the discretization of the high-resolution ODE of NAG-SC.

Proof of Theorem 1.

Lemma 3.1 implies dE(t)/dt ≤ −(√μ/4)E(t), which amounts to

d/dt (e^{√μ t/4} E(t)) ≤ 0.

By integrating out t, we get

(3.2)    E(t) ≤ e^{−√μ t/4} E(0).

Recognizing the initial conditions X(0) = x₀ and Ẋ(0) = −(2√s ∇f(x₀))/(1 + √(μs)), we write (3.2) as

Since , we have that and . Together with the Cauchy–Schwarz inequality, the two inequalities yield

which is valid for all . To simplify the coefficient of , note that can be replaced by in the analysis since . It follows that

Furthermore, a bit of analysis reveals that

since , and this step completes the proof of Theorem 1. ∎

We now consider the heavy-ball method (1.2). Recall that the momentum coefficient is set to α = (1 − √(μs))/(1 + √(μs)). The following theorem characterizes the rate of convergence of this method.

Theorem 2 (Convergence of the heavy-ball ODE).

Let f ∈ 𝓢²_{μ,L}(ℝⁿ). For any step size 0 < s ≤ 1/L, the solution X = X(t) of the high-resolution ODE (1.10) satisfies

As in the case of NAG-SC, the proof of Theorem 2 is based on a Lyapunov function:

(3.3)    E(t) = (1 + √(μs))(f(X) − f(x⋆)) + (1/4)‖Ẋ‖² + (1/4)‖Ẋ + 2√μ(X − x⋆)‖²,

which is the same as the Lyapunov function (2.4) for NAG-SC except for the lack of the √s ∇f(X) term in the mixed energy. In particular, (2.4) and (3.3) are identical if s = 0. The following lemma considers the decay rate of (3.3).

Lemma 3.2 (Lyapunov function for the heavy-ball ODE).

Let f ∈ 𝓢²_{μ,L}(ℝⁿ). For any step size 0 < s ≤ 1/L, the Lyapunov function (3.3) for the high-resolution ODE (1.10) satisfies

The proof of Theorem 2 follows the same strategy as the proof of Theorem 1. In brief, integrating Lemma 3.2 over the time parameter t yields a bound on E(t). Recognizing the initial conditions

X(0) = x₀,    Ẋ(0) = −(2√s ∇f(x₀))/(1 + √(μs)),

in the high-resolution ODE of the heavy-ball method and using the L-smoothness of f, Lemma 3.2 yields

if the step size . Finally, since , the coefficient satisfies

The proofs of Lemma 3.1 and Lemma 3.2 share similar ideas. In view of this, we present only the proof of the former here, deferring the proof of Lemma 3.2 to Appendix B.1.

Proof of Lemma 3.1.

Along trajectories of (1.11) the Lyapunov function (2.4) satisfies

(3.4)

Furthermore, ⟨∇f(X), X − x⋆⟩ is greater than or equal to both f(X) − f(x⋆) + (μ/2)‖X − x⋆‖² and μ‖X − x⋆‖², due to the μ-strong convexity of f. This yields

which together with (3.4) suggests that the time derivative of this Lyapunov function can be bounded as

(3.5)

Next, the Cauchy–Schwarz inequality yields

from which it follows that

(3.6)

Combining (3.5) and (3.6) completes the proof of the lemma. ∎

Remark 3.3.

The only inequality in (3.4) is due to the additional term discussed right after the statement of Lemma 3.1. This term results from the gradient correction in the NAG-SC ODE. For comparison, this term does not appear in Lemma 3.2 in the case of the heavy-ball method, as its ODE does not include the gradient correction and, accordingly, its Lyapunov function (3.3) is free of the √s ∇f(X) term.

3.2 The Discrete Case

This section carries over the results in Section 3.1 to the two discrete algorithms, namely NAG-SC and the heavy-ball method. Here we consider an objective f ∈ 𝓢¹_{μ,L}(ℝⁿ), since second-order differentiability of f is not required in the two discrete methods. Recall that both methods start with an arbitrary initial point x₀ and