
# Fast convex optimization via inertial dynamics with Hessian driven damping

Hedy Attouch (Institut Montpelliérain Alexander Grothendieck, UMR 5149 CNRS, Université Montpellier 2, place Eugène Bataillon, 34095 Montpellier cedex 5, France), Juan Peypouquet (Universidad Técnica Federico Santa María, Av. España 1680, Valparaíso, Chile) and Patrick Redont (Institut Montpelliérain Alexander Grothendieck, UMR 5149 CNRS, Université Montpellier 2, place Eugène Bataillon, 34095 Montpellier cedex 5, France)
23 October 2015
###### Abstract.

We first study the fast minimization properties of the trajectories of the second-order evolution equation

$$\ddot{x}(t)+\frac{\alpha}{t}\,\dot{x}(t)+\beta\,\nabla^{2}\Phi(x(t))\,\dot{x}(t)+\nabla\Phi(x(t))=0,$$

where Φ : H → ℝ is a smooth convex function acting on a real Hilbert space H, and α, β are positive parameters. This inertial system combines an isotropic viscous damping which vanishes asymptotically, and a geometrical Hessian driven damping, which makes it naturally related to Newton’s and Levenberg-Marquardt methods. For α ≥ 3 and β > 0, along any trajectory, fast convergence of the values

$$\Phi(x(t))-\min_{H}\Phi=O(t^{-2})$$

is obtained, together with rapid convergence of the gradients to zero. For α > 3, just assuming that argmin Φ ≠ ∅, we show that any trajectory converges weakly to a minimizer of Φ, and that Φ(x(t)) − min_H Φ = o(t⁻²). Strong convergence is established in various practical situations. In particular, for the strongly convex case, we obtain an even faster speed of convergence which can be arbitrarily fast depending on the choice of α. More precisely, we have Φ(x(t)) − min_H Φ = O(t^{−2α/3}). Then, we extend the results to the case of a general proper lower-semicontinuous convex function Φ : H → ℝ ∪ {+∞}. This is based on the crucial property that the inertial dynamic with Hessian driven damping can be equivalently written as a first-order system in time and space, which allows us to extend it by simply replacing the gradient with the subdifferential. By explicit-implicit time discretization, this opens a gate to new, possibly more rapid, inertial algorithms, expanding the field of FISTA methods for convex structured optimization problems.

###### Key words and phrases:
Convex optimization, fast convergent methods, dynamical systems, gradient flows, inertial dynamics, vanishing viscosity, Hessian-driven damping, non-smooth potential, forward-backward algorithms, FISTA
Effort sponsored by the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant number FA9550-14-1-0056. Also supported by Fondecyt Grant 1140829, Conicyt Anillo ACT-1106, ECOS-Conicyt Project C13E03, Millenium Nucleus ICM/FIC RC130003, Conicyt Project MATHAMSUD 15MATH-02, Conicyt Redes 140183, and Basal Project CMM Universidad de Chile.

## Introduction

Throughout the paper, H is a real Hilbert space endowed with scalar product ⟨·, ·⟩ and norm ∥x∥² = ⟨x, x⟩ for x ∈ H. Let Φ : H → ℝ be a twice continuously differentiable convex function (the case of a nonsmooth function will be considered later on). In view of the minimization of Φ, we study the asymptotic behaviour (as t → +∞) of the trajectories of the second-order differential equation

$$\text{(DIN-AVD)}\qquad\ddot{x}(t)+\frac{\alpha}{t}\,\dot{x}(t)+\beta\,\nabla^{2}\Phi(x(t))\,\dot{x}(t)+\nabla\Phi(x(t))=0\tag{1}$$

where α and β are positive parameters.
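To make the dynamic concrete, the following sketch (ours, not from the paper; the potential, parameter values and function names are illustrative) integrates (DIN-AVD) for a quadratic potential on ℝ² by rewriting the equation as a first-order system in the state (x, ẋ):

```python
import numpy as np

# Illustrative sketch: integrate (DIN-AVD) for Phi(x) = 1/2 <Ax, x> on R^2
# by rewriting the second-order equation as a first-order system in (x, v).
def simulate_din_avd(grad, hess, x0, v0, alpha=3.1, beta=1.0,
                     t0=1.0, T=100.0, h=1e-3):
    t, x, v = t0, np.array(x0, dtype=float), np.array(v0, dtype=float)
    while t < T:
        # x'' = -(alpha/t) x' - beta Hess(x) x' - grad(x)
        a = -(alpha / t) * v - beta * hess(x) @ v - grad(x)
        x = x + h * v
        v = v + h * a
        t += h
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])           # positive definite, min at 0
x_final = simulate_din_avd(lambda x: A @ x,      # gradient of Phi
                           lambda x: A,          # Hessian of Phi
                           x0=[2.0, -1.0], v0=[0.0, 0.0])
```

For this strongly convex test potential the combined viscous and Hessian-driven damping drives the trajectory to the minimizer very quickly; the explicit Euler step is a crude but sufficient discretization for illustration.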

This inertial system combines two types of damping:

In the first place, the term (α/t)ẋ(t) furnishes an isotropic linear damping with a viscous coefficient α/t which vanishes asymptotically, but not too slowly. The asymptotic behavior of the inertial gradient-like system

$$\text{(AVD)}\qquad\ddot{x}(t)+a(t)\,\dot{x}(t)+\nabla\Phi(x(t))=0,\tag{2}$$

with Asymptotic Vanishing Damping ((AVD) for short), has been studied by Cabot, Engler and Gadat in [21]-[22]. They proved that, under moderate decrease of a(t) to zero, namely, that lim_{t→+∞} a(t) = 0 and ∫₀^{+∞} a(t) dt = +∞, every solution x(·) of (2) satisfies Φ(x(t)) → inf_H Φ.

Interestingly, with the specific choice a(t) = α/t:

$$\ddot{x}(t)+\frac{\alpha}{t}\,\dot{x}(t)+\nabla\Phi(x(t))=0,\tag{3}$$

Su, Boyd and Candès in [36] proved the fast convergence property

$$\Phi(x(t))-\min_{H}\Phi=O(t^{-2}),\tag{4}$$

provided α ≥ 3. In the same article, the authors show that, for α = 3, (3) can be seen as a continuous-time version of the fast convergent method of Nesterov [29]-[30]-[31]-[32]. In [13], Attouch, Peypouquet and Redont showed that, for α > 3, each trajectory of (3) converges weakly to an element of argmin Φ. This result is a continuous-time counterpart to the Chambolle-Dossal algorithm [23], which is a modified Nesterov algorithm specially designed to obtain the convergence of the iterates.

In the second place, a geometrical damping, attached to the term β∇²Φ(x(t))ẋ(t), has a natural link with Newton’s method. It gives rise to the so-called Dynamical Inertial Newton system ((DIN) for short)

$$\text{(DIN)}\qquad\ddot{x}(t)+\gamma\,\dot{x}(t)+\beta\,\nabla^{2}\Phi(x(t))\,\dot{x}(t)+\nabla\Phi(x(t))=0,\tag{5}$$

which has been introduced by Alvarez, Attouch, Bolte and Redont in [6] (γ is a fixed positive parameter). Interestingly, (5) can be equivalently written as a first-order system involving only the gradient of Φ, which allows its extension to the case of a proper lower-semicontinuous convex function Φ : H → ℝ ∪ {+∞}. This led to applications ranging from optimization algorithms [12] to unilateral mechanics and partial differential equations [11].
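The equivalence can be checked numerically. In the sketch below (our illustration; the quadratic test potential and variable names are ours), one convenient form of the Hessian-free reformulation is integrated side by side with the second-order (DIN) dynamic from matching initial conditions, and the two trajectories agree up to discretization error:

```python
import numpy as np

# Numerical check (our illustration, quadratic Phi(x) = 1/2 <Ax, x>):
# the second-order (DIN) dynamic
#     x'' + gamma x' + beta Hess(x) x' + grad(x) = 0
# matches the gradient-only first-order system
#     x' + beta grad(x) + (gamma - 1/beta) x + (1/beta) y = 0,
#     y' + (gamma - 1/beta) x + (1/beta) y = 0,
# one form of the reformulation discussed in [6].
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda x: A @ x
gamma, beta, h, n = 1.5, 0.8, 1e-4, 50000        # integrate on [0, 5]

x2, v2 = np.array([1.0, -2.0]), np.array([0.5, 0.0])            # (x, x') state
x1 = x2.copy()
y1 = -beta * (v2 + beta * grad(x2) + (gamma - 1 / beta) * x2)   # y(0) matching x'(0)

for _ in range(n):
    acc = -gamma * v2 - beta * A @ v2 - grad(x2)                # uses the Hessian A
    x2, v2 = x2 + h * v2, v2 + h * acc
    dx = -(beta * grad(x1) + (gamma - 1 / beta) * x1 + y1 / beta)  # gradient only
    dy = -((gamma - 1 / beta) * x1 + y1 / beta)
    x1, y1 = x1 + h * dx, y1 + h * dy

gap = float(np.linalg.norm(x1 - x2))
```

Subtracting the two first-order equations gives ẏ = ẋ + β∇Φ(x); differentiating the first then recovers (5), which is why the Hessian never needs to be evaluated.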

As we shall see, (DIN-AVD) inherits the convergence properties of both (AVD) and (DIN), but exhibits other important features, namely (see Theorems 1.10, 1.14, 1.15, 3.1, 4.8, 4.11, 4.12):

• Assuming α ≥ 3, β > 0 and argmin Φ ≠ ∅, we show the fast convergence property of the values (4), together with the fast convergence to zero of the gradients

$$\int_{t_0}^{+\infty}t^{2}\,\|\nabla\Phi(x(t))\|^{2}\,dt<+\infty.\tag{6}$$
• For α > 3, we complete these results by showing that every trajectory converges weakly, with its limit belonging to argmin Φ. Moreover, we obtain a faster order of convergence: Φ(x(t)) − min_H Φ = o(t⁻²).

• Also for α > 3, strong convergence is established in various practical situations. In particular, for the strongly convex case, we obtain an even faster speed of convergence, which can be arbitrarily fast according to the choice of α. More precisely, we have Φ(x(t)) − min_H Φ = O(t^{−2α/3}).

• A remarkable property of the system (DIN-AVD) is that these results can be naturally generalized to the non-smooth convex case. The key argument is that it can be reformulated as a first-order system (both in time and space) involving only the gradient and not the Hessian!

Time discretization of (DIN-AVD) provides new ideas for the design of innovative fast converging algorithms, expanding the field of rapid methods for structured convex minimization of Nesterov [29, 30, 31, 32], Beck-Teboulle [16], and Chambolle-Dossal [23]. This study, however, goes beyond the scope of this paper, and will be carried out in future research. As briefly evoked above, the continuous (DIN-AVD) system is also linked to the modeling of non-elastic shocks in unilateral mechanics, and the geometric damping of nonlinear oscillators. These are important areas for applications, which are not considered in this paper.
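Although the algorithmic study is deferred to future work, the flavor of the family of schemes in question can be conveyed by a standard inertial forward-backward iteration with the Chambolle-Dossal extrapolation coefficient (k−1)/(k+α−1). The sketch below is a generic illustration of this family (the function names and toy problem are ours), not the discretization of (DIN-AVD) itself:

```python
import numpy as np

# Generic inertial forward-backward sketch with the (k-1)/(k+alpha-1)
# extrapolation of Chambolle-Dossal type; an illustration of the family of
# methods alluded to above, not the discretization of (DIN-AVD) itself.
def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def inertial_prox_grad(grad_f, prox_g, x0, step, alpha=3.1, iters=2000):
    x_prev = x = np.array(x0, dtype=float)
    for k in range(1, iters + 1):
        y = x + (k - 1) / (k + alpha - 1) * (x - x_prev)    # inertial extrapolation
        x_prev, x = x, prox_g(y - step * grad_f(y), step)   # forward-backward step
    return x

# Toy composite problem: min 1/2 ||x - b||^2 + mu ||x||_1,
# whose solution is the soft-thresholding of b at level mu.
b, mu = np.array([3.0, -0.2, 1.5]), 0.5
x_star = inertial_prox_grad(lambda x: x - b,
                            lambda z, s: soft_threshold(z, mu * s),
                            np.zeros(3), step=0.5)
```

Here the smooth part is handled by an explicit gradient step and the nonsmooth ℓ₁ part by its proximal (implicit) step, which is precisely the explicit-implicit splitting pattern mentioned above.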

## 1. Smooth potential

The following minimal hypotheses are in force in this section, and are always tacitly assumed:

• α > 0 and β > 0;

• Φ : H → ℝ is a twice continuously differentiable convex function; and

• x : [t₀, +∞) → H is a solution of (DIN-AVD), with t₀ > 0. (Taking t₀ > 0 comes from the singularity of the damping coefficient α/t at zero. Since we are only concerned about the asymptotic behaviour of the trajectories, the time origin is unimportant. If one insists in starting from t₀ = 0, then all the results remain valid with α/t replaced by α/(1+t).)

In view of minimizing Φ, we study the asymptotic behaviour, as t → +∞, of a solution x(·) to the (DIN-AVD) second-order evolution equation (1). We will successively examine the following points:

• existence and uniqueness of a solution to (DIN-AVD) with Cauchy data x(t₀) = x₀ and ẋ(t₀) = v₀;

• minimizing properties of the trajectory and convergence of Φ(x(t)) towards inf_H Φ, whenever α > 0;

• fast convergence of Φ(x(t)) towards min_H Φ, when the latter is attained and α ≥ 3;

• weak convergence of x(t) towards a minimizer of Φ, and faster convergence of the values, when α > 3;

• some cases of strong convergence of x(t), and faster convergence of the values.

### 1.1. Existence and uniqueness of solution

The following result will be derived in Section 4 from a more general result concerning a convex lower semicontinuous function Φ : H → ℝ ∪ {+∞} (see Corollary 4.6 below):

###### Theorem 1.1.

For any Cauchy data (x₀, v₀) ∈ H × H, (DIN-AVD) admits a unique twice continuously differentiable global solution x : [t₀, +∞) → H verifying x(t₀) = x₀ and ẋ(t₀) = v₀.

### 1.2. Lyapunov analysis and minimizing properties of the solutions for α>0

In this section, we present a family of Lyapunov functions for (DIN-AVD), and use them to derive the main properties of the solutions to this system. As we shall see, the fact that we have more than one (essentially different) such function will play a crucial role in establishing that the gradient vanishes as t → +∞.

Let x satisfy (DIN-AVD) with Cauchy data x(t₀) = x₀ and ẋ(t₀) = v₀, and let θ ∈ [0, β]. Define W_θ : [t₀, +∞) → ℝ by

$$W_{\theta}(t)=\Phi(x(t))+\frac{1}{2}\|\dot{x}(t)+\theta\nabla\Phi(x(t))\|^{2}+\frac{\theta(\beta-\theta)}{2}\|\nabla\Phi(x(t))\|^{2}.\tag{7}$$

Observe that, for θ = 0, we obtain

$$W_{0}(t)=\Phi(x(t))+\frac{1}{2}\|\dot{x}(t)\|^{2},$$

which is the usual global mechanical energy of the system. We shall see that, for each θ ∈ [0, β], W_θ is a strict Lyapunov function for (DIN-AVD).

In order to simplify the notation, write

$$u_{\theta}(t)=x(t)+\theta\int_{t_{0}}^{t}\nabla\Phi(x(s))\,ds,\tag{8}$$

so that u₀(t) = x(t) and, for each θ ∈ [0, β],

$$\dot{u}_{\theta}(t)=\dot{x}(t)+\theta\nabla\Phi(x(t))\qquad\text{and}\qquad W_{\theta}(t)=\Phi(x(t))+\frac{1}{2}\|\dot{u}_{\theta}(t)\|^{2}+\frac{\theta(\beta-\theta)}{2}\|\nabla\Phi(x(t))\|^{2}.\tag{9}$$

Using (9) and (DIN-AVD), elementary computations yield

$$\ddot{u}_{\theta}(t)=\ddot{x}(t)+\theta\nabla^{2}\Phi(x(t))\dot{x}(t)=-\frac{\alpha}{t}\dot{x}(t)-\nabla\Phi(x(t))+(\theta-\beta)\nabla^{2}\Phi(x(t))\dot{x}(t).\tag{10}$$

We have the following:

###### Proposition 1.2.

Let α > 0 and β > 0, and suppose x is a solution of (DIN-AVD). Then, for each θ ∈ [0, β] and t ≥ αθ/2, we have

$$\dot{W}_{\theta}(t)\le-\frac{\alpha}{2t}\|\dot{x}(t)\|^{2}-\frac{\alpha}{2t}\|\dot{u}_{\theta}(t)\|^{2}.$$
###### Proof.

First observe that

$$\dot{W}_{\theta}(t)=\langle\nabla\Phi(x(t)),\dot{x}(t)\rangle+\langle\ddot{u}_{\theta}(t),\dot{u}_{\theta}(t)\rangle+\theta(\beta-\theta)\langle\nabla^{2}\Phi(x(t))\dot{x}(t),\nabla\Phi(x(t))\rangle.\tag{11}$$

Next, use (9) and (10) to obtain

$$\begin{aligned}\langle\ddot{u}_{\theta}(t),\dot{u}_{\theta}(t)\rangle&=\Big\langle-\frac{\alpha}{t}\dot{x}(t)-\nabla\Phi(x(t))+(\theta-\beta)\nabla^{2}\Phi(x(t))\dot{x}(t),\ \dot{x}(t)+\theta\nabla\Phi(x(t))\Big\rangle\\ &=-\frac{\alpha}{t}\|\dot{x}(t)\|^{2}-\Big(\frac{\alpha\theta}{t}+1\Big)\langle\nabla\Phi(x(t)),\dot{x}(t)\rangle-\theta\|\nabla\Phi(x(t))\|^{2}+(\theta-\beta)\langle\nabla^{2}\Phi(x(t))\dot{x}(t),\dot{x}(t)\rangle\\ &\qquad+\theta(\theta-\beta)\langle\nabla^{2}\Phi(x(t))\dot{x}(t),\nabla\Phi(x(t))\rangle\\ &\le-\frac{\alpha}{t}\|\dot{x}(t)\|^{2}-\Big(\frac{\alpha\theta}{t}+1\Big)\langle\nabla\Phi(x(t)),\dot{x}(t)\rangle-\theta\|\nabla\Phi(x(t))\|^{2}+\theta(\theta-\beta)\langle\nabla^{2}\Phi(x(t))\dot{x}(t),\nabla\Phi(x(t))\rangle,\end{aligned}$$

since θ ≤ β and ⟨∇²Φ(x(t))ẋ(t), ẋ(t)⟩ ≥ 0 by convexity. From (11), we obtain

$$\dot{W}_{\theta}(t)\le-\frac{\alpha}{t}\|\dot{x}(t)\|^{2}-\frac{\alpha\theta}{t}\langle\nabla\Phi(x(t)),\dot{x}(t)\rangle-\theta\|\nabla\Phi(x(t))\|^{2}.\tag{12}$$

On the one hand, when θ = 0, it immediately follows that

$$\dot{W}_{0}(t)\le-\frac{\alpha}{t}\|\dot{x}(t)\|^{2}.$$

On the other hand, if θ ∈ (0, β], we use (9) in (12) to deduce that

$$\dot{W}_{\theta}(t)\le\Big(\frac{2}{\theta}-\frac{\alpha}{t}\Big)\langle\dot{u}_{\theta}(t),\dot{x}(t)\rangle-\frac{1}{\theta}\|\dot{x}(t)\|^{2}-\frac{1}{\theta}\|\dot{u}_{\theta}(t)\|^{2}\le-\frac{\alpha}{2t}\|\dot{x}(t)\|^{2}-\frac{\alpha}{2t}\|\dot{u}_{\theta}(t)\|^{2},$$

since t ≥ αθ/2 by hypothesis, and the fact that 2ab ≤ a² + b² for all a, b ∈ ℝ. ∎

###### Theorem 1.3.

Let α > 0 and β > 0, and suppose x is a solution of (DIN-AVD). Then

$$\lim_{t\to+\infty}W_{0}(t)=\lim_{t\to+\infty}W_{\beta}(t)=\lim_{t\to+\infty}\Phi(x(t))=\inf_{H}\Phi\in\mathbb{R}\cup\{-\infty\}.$$
###### Proof.

Since we are interested in asymptotic properties of x, we can assume t₀ ≥ αβ throughout the proof. Take θ = β, so that the last term in the definition (7) of W_θ vanishes. Given z ∈ H, we define h : [t₀, +∞) → ℝ by

$$h(t)=\frac{1}{2}\|u_{\theta}(t)-z\|^{2}.$$

By the Chain Rule, we have

$$\dot{h}(t)=\langle u_{\theta}(t)-z,\dot{u}_{\theta}(t)\rangle\qquad\text{and}\qquad\ddot{h}(t)=\langle u_{\theta}(t)-z,\ddot{u}_{\theta}(t)\rangle+\|\dot{u}_{\theta}(t)\|^{2}.$$

On the other hand, from (9) and (10), we obtain

$$\ddot{u}_{\theta}(t)+\frac{\alpha}{t}\dot{u}_{\theta}(t)=\Big(\frac{\alpha\theta}{t}-1\Big)\nabla\Phi(x(t))+(\theta-\beta)\nabla^{2}\Phi(x(t))\dot{x}(t).\tag{13}$$

Set

$$I(t):=\frac{1}{2}\left\|\int_{t_{0}}^{t}\nabla\Phi(x(s))\,ds\right\|^{2}\qquad\text{and}\qquad J(t):=\langle x(t)-z,\nabla\Phi(x(t))\rangle-\Phi(x(t)),$$

and observe that

$$\dot{I}(t)=\left\langle\int_{t_{0}}^{t}\nabla\Phi(x(s))\,ds,\nabla\Phi(x(t))\right\rangle\qquad\text{and}\qquad\dot{J}(t)=\langle x(t)-z,\nabla^{2}\Phi(x(t))\dot{x}(t)\rangle.$$

Next, since $u_{\theta}(t)-z=(x(t)-z)+\theta\int_{t_0}^{t}\nabla\Phi(x(s))\,ds$, we can write

$$\begin{aligned}\ddot{h}(t)+\frac{\alpha}{t}\dot{h}(t)&=\|\dot{u}_{\theta}(t)\|^{2}+\Big(\frac{\alpha\theta}{t}-1\Big)\langle u_{\theta}(t)-z,\nabla\Phi(x(t))\rangle+(\theta-\beta)\langle u_{\theta}(t)-z,\nabla^{2}\Phi(x(t))\dot{x}(t)\rangle\\ &=\|\dot{u}_{\theta}(t)\|^{2}+\Big(\frac{\alpha\theta}{t}-1\Big)\langle x(t)-z,\nabla\Phi(x(t))\rangle+\theta\Big(\frac{\alpha\theta}{t}-1\Big)\dot{I}(t)+(\theta-\beta)\dot{J}(t)\\ &\le\|\dot{u}_{\theta}(t)\|^{2}+\Big(\frac{\alpha\theta}{t}-1\Big)(\Phi(x(t))-\Phi(z))+\theta\Big(\frac{\alpha\theta}{t}-1\Big)\dot{I}(t)+(\theta-\beta)\dot{J}(t),\end{aligned}$$

where the last inequality follows from the convexity of Φ and the fact that αθ/t ≤ 1. Using the definition (7) of W_θ, and Proposition 1.2, we get

$$\begin{aligned}\ddot{h}(t)+\frac{\alpha}{t}\dot{h}(t)&\le\Big(\frac{3}{2}-\frac{\alpha\theta}{2t}\Big)\|\dot{u}_{\theta}(t)\|^{2}+\Big(\frac{\alpha\theta}{t}-1\Big)(W_{\theta}(t)-\Phi(z))+\theta\Big(\frac{\alpha\theta}{t}-1\Big)\dot{I}(t)+(\theta-\beta)\dot{J}(t)\\ &\le-\Big(\frac{3t}{\alpha}-\theta\Big)\dot{W}_{\theta}(t)+\Big(\frac{\alpha\theta}{t}-1\Big)(W_{\theta}(t)-\Phi(z))+\theta\Big(\frac{\alpha\theta}{t}-1\Big)\dot{I}(t)+(\theta-\beta)\dot{J}(t).\end{aligned}$$

Dividing by t and rearranging the terms, we can integrate the resulting inequality from t₁ to t (recall that h, I and J are bounded from below), and use Lemma 7.3 to obtain a constant C such that

$$\frac{1}{t}\dot{h}(t)+\int_{t_1}^{t}\Big(\frac{1}{s}-\frac{\alpha\theta}{s^{2}}\Big)(W_{\theta}(s)-\Phi(z))\,ds\le-\int_{t_1}^{t}\Big(\frac{3}{\alpha}-\frac{\theta}{s}\Big)\dot{W}_{\theta}(s)\,ds+C.\tag{14}$$

Since is nonincreasing, we have

$$\int_{t_1}^{t}\Big(\frac{1}{s}-\frac{\alpha\theta}{s^{2}}\Big)(W_{\theta}(s)-\Phi(z))\,ds\ \ge\ (W_{\theta}(t)-\Phi(z))\int_{t_1}^{t}\Big(\frac{1}{s}-\frac{\alpha\theta}{s^{2}}\Big)ds\ =\ (W_{\theta}(t)-\Phi(z))\Big(\ln t-\ln t_1+\frac{\alpha\theta}{t}-\frac{\alpha\theta}{t_1}\Big).\tag{15}$$

In turn,

$$\begin{aligned}-\int_{t_1}^{t}\Big(\frac{3}{\alpha}-\frac{\theta}{s}\Big)\dot{W}_{\theta}(s)\,ds&=\Big(\frac{3}{\alpha}-\frac{\theta}{t_1}\Big)(W_{\theta}(t_1)-\Phi(z))-\Big(\frac{3}{\alpha}-\frac{\theta}{t}\Big)(W_{\theta}(t)-\Phi(z))+\theta\int_{t_1}^{t}\frac{W_{\theta}(s)-\Phi(z)}{s^{2}}\,ds\\ &\le\Big(\frac{3}{\alpha}-\frac{\theta}{t}\Big)(W_{\theta}(t_1)-\Phi(z))-\Big(\frac{3}{\alpha}-\frac{\theta}{t}\Big)(W_{\theta}(t)-\Phi(z))\\ &\le\frac{3}{\alpha}\big|W_{\theta}(t_1)-\Phi(z)\big|-\Big(\frac{3}{\alpha}-\frac{\theta}{t}\Big)(W_{\theta}(t)-\Phi(z)),\end{aligned}\tag{16}$$

since W_θ is nonincreasing and 3/α − θ/t ≥ 0.

Combining (14), (15) and (16), we deduce that

$$\frac{1}{t}\dot{h}(t)+(W_{\theta}(t)-\Phi(z))\Big(\ln t+D+\frac{E}{t}\Big)\le C'$$

for appropriate constants C′, D, E ∈ ℝ.

Now, take t₂ ≥ t₁ such that ln t + D + E/t ≥ 0 for all t ≥ t₂, and integrate from t₂ to t to obtain

$$\frac{h(t)}{t}-\frac{h(t_2)}{t_2}+\int_{t_2}^{t}\frac{h(s)}{s^{2}}\,ds+(W_{\theta}(t)-\Phi(z))\int_{t_2}^{t}\Big(\ln s+D+\frac{E}{s}\Big)ds\le C'(t-t_2).$$

Since h is nonnegative, this implies

$$-\frac{h(t_2)}{t_2}+(W_{\theta}(t)-\Phi(z))\Big(t\ln t-t_2\ln t_2+(D-1)(t-t_2)+E(\ln t-\ln t_2)\Big)\le C'(t-t_2),$$

and so,

$$(W_{\theta}(t)-\Phi(z))\big(t\ln t+(D-1)t+E\ln t+F\big)\le C't+G,\tag{17}$$

for some other constants F, G ∈ ℝ. As t → +∞, we obtain lim_{t→+∞} W_θ(t) ≤ Φ(z) (the limit is in ℝ ∪ {−∞}). Since z ∈ H is arbitrary, and Φ(x(t)) ≤ W_θ(t) for all t, the result follows. ∎

By the weak lower-semicontinuity of , Theorem 1.3 immediately yields the following:

###### Corollary 1.4.

Let α > 0 and β > 0, and suppose x is a solution of (DIN-AVD). As t → +∞, every sequential weak cluster point of x(t) belongs to argmin Φ. In particular, if ∥x(t)∥ does not tend to +∞ as t → +∞, then argmin Φ ≠ ∅.

If the function Φ is bounded from below, we have the following stability result:

###### Proposition 1.5.

Let α > 0 and β > 0, and suppose x is a solution of (DIN-AVD). If inf_H Φ > −∞, then

$$\lim_{t\to+\infty}\|\dot{x}(t)\|=\lim_{t\to+\infty}\|\nabla\Phi(x(t))\|=0,\qquad\int_{t_0}^{\infty}\frac{1}{t}\|\dot{x}(t)\|^{2}\,dt<+\infty,\quad\text{and}\quad\int_{t_0}^{\infty}\frac{1}{t}\|\nabla\Phi(x(t))\|^{2}\,dt<+\infty.$$
###### Proof.

Theorem 1.3 establishes that lim_{t→+∞} W₀(t) = lim_{t→+∞} W_β(t) = lim_{t→+∞} Φ(x(t)) = inf_H Φ. If Φ is bounded below, the limits belong to ℝ. We deduce that

$$\begin{cases}\displaystyle\lim_{t\to+\infty}\|\dot{x}(t)\|^{2}=2\lim_{t\to+\infty}\big(W_{0}(t)-\Phi(x(t))\big)=0\\[6pt]\displaystyle\lim_{t\to+\infty}\|\dot{u}_{\beta}(t)\|^{2}=2\lim_{t\to+\infty}\big(W_{\beta}(t)-\Phi(x(t))\big)=0.\end{cases}$$

By definition (8), we have $\dot{u}_{\beta}(t)-\dot{x}(t)=\beta\nabla\Phi(x(t))$, and so, lim_{t→+∞} ∥∇Φ(x(t))∥ = 0. Finally, Proposition 1.2 gives

$$\int_{t_1}^{\infty}\frac{\alpha}{2s}\Big(\|\dot{x}(s)\|^{2}+\|\dot{u}_{\beta}(s)\|^{2}\Big)ds\le W_{\beta}(t_1)-\inf_{H}\Phi<+\infty.$$

It suffices to use $\dot{u}_{\beta}(t)=\dot{x}(t)+\beta\nabla\Phi(x(t))$ again to complete the proof. ∎

###### Proposition 1.6.

Let α > 0 and β > 0, and suppose x is a solution of (DIN-AVD). If argmin Φ ≠ ∅, then

• i) W_θ(t) − min_H Φ = O(1/ln t); in particular, Φ(x(t)) − min_H Φ = O(1/ln t); and

• ii) $\int_{t_0}^{+\infty}\frac{W_{0}(s)-\min_{H}\Phi}{s}\,ds<+\infty$; in particular, $\int_{t_0}^{+\infty}\frac{\Phi(x(s))-\min_{H}\Phi}{s}\,ds<+\infty$.

###### Proof.

Fix ẑ ∈ argmin Φ.

For i), observe that Φ(ẑ) = min_H Φ. Next, use (17) with z = ẑ to conclude.

For ii), use θ = 0 in inequalities (14) and (16), and combine them to deduce that

$$\frac{1}{2t}\frac{d}{dt}\|x(t)-\hat{z}\|^{2}+\Big(1-\frac{\alpha\theta}{t_1}\Big)\int_{t_1}^{t}\frac{W_{0}(s)-\Phi(\hat{z})}{s}\,ds\le C''\tag{18}$$

for t ≥ t₁ and some other constant C″. On the other hand, since lim_{t→+∞} ∥ẋ(t)∥ = 0 by Proposition 1.5, we have ∥ẋ∥_∞ := sup_{t≥t₁} ∥ẋ(t)∥ < +∞. It follows that

$$\frac{1}{t}\frac{d}{dt}\|x(t)-\hat{z}\|^{2}\le\frac{2}{t}\|\dot{x}\|_{\infty}\|x(t)-\hat{z}\|\le2\|\dot{x}\|_{\infty}\Big(\frac{\|x(t_1)-\hat{z}\|}{t_1}+\|\dot{x}\|_{\infty}\Big)$$

by the Mean Value Theorem. From (18), we deduce that

$$\int_{t_1}^{+\infty}\frac{W_{0}(s)-\Phi(\hat{z})}{s}\,ds<+\infty,$$

which yields the result. ∎

###### Remark 1.7.

Most of the results in this section can be established without using the differentiability of Φ and ∇Φ independently, but only that of the function t ↦ ∇Φ(x(t)), along with relations (9) and (10), and the chain rule $\frac{d}{dt}\Phi(x(t))=\langle\nabla\Phi(x(t)),\dot{x}(t)\rangle$. We shall develop these arguments in Section 4, when we deal with a nonsmooth potential.

### 1.3. Fast convergence of the values for α≥3

In this part we mainly analyze the fast convergence of the values of Φ along a trajectory of (DIN-AVD). The value α = 3 plays a special role: to our knowledge, it is the smallest for which fast convergence results are proved to hold.

Suppose α ≥ 3 and argmin Φ ≠ ∅. Let x be a solution of (DIN-AVD) with Cauchy data x(t₀) = x₀ and ẋ(t₀) = v₀. For λ ≥ 0 we define the function E_λ : [t₀, +∞) → ℝ by

$$E_{\lambda}(t)=t\big(t-\beta(\lambda+2-\alpha)\big)(\Phi(x(t))-\min\Phi)+\frac{1}{2}\|\lambda(x(t)-x^{*})+t\dot{u}_{\beta}(t)\|^{2}+\frac{\lambda(\alpha-\lambda-1)}{2}\|x(t)-x^{*}\|^{2},\tag{19}$$

where u_β is given by (8) with θ = β, and x* ∈ argmin Φ. To compute E′_λ(t) we first differentiate each term of (19) in turn (we use (10) when differentiating the second term).

$$\frac{d}{dt}\Big[t\big(t-\beta(\lambda+2-\alpha)\big)(\Phi(x(t))-\min\Phi)\Big]=\big(2t-\beta(\lambda+2-\alpha)\big)(\Phi(x(t))-\min\Phi)+t\big(t-\beta(\lambda+2-\alpha)\big)\langle\dot{x}(t),\nabla\Phi(x(t))\rangle;$$

$$\begin{aligned}\frac{d}{dt}\,\frac{1}{2}\|\lambda(x(t)-x^{*})+t\dot{u}_{\beta}(t)\|^{2}&=\langle\lambda(x(t)-x^{*})+t\dot{u}_{\beta}(t),\ \lambda\dot{x}(t)+\dot{u}_{\beta}(t)+t\ddot{u}_{\beta}(t)\rangle\\ &=\langle\lambda(x(t)-x^{*})+t\dot{u}_{\beta}(t),\ (\lambda+1-\alpha)\dot{x}(t)-(t-\beta)\nabla\Phi(x(t))\rangle\\ &=\lambda(\lambda+1-\alpha)\langle x(t)-x^{*},\dot{x}(t)\rangle-t(\alpha-\lambda-1)\|\dot{x}(t)\|^{2}-\beta t(t-\beta)\|\nabla\Phi(x(t))\|^{2}\\ &\qquad-\lambda(t-\beta)\langle x(t)-x^{*},\nabla\Phi(x(t))\rangle-t\big(t-\beta(\lambda+2-\alpha)\big)\langle\dot{x}(t),\nabla\Phi(x(t))\rangle;\end{aligned}$$

$$\frac{d}{dt}\,\frac{\lambda(\alpha-\lambda-1)}{2}\|x(t)-x^{*}\|^{2}=\lambda(\alpha-\lambda-1)\langle x(t)-x^{*},\dot{x}(t)\rangle.$$

Whence

$$\begin{aligned}\frac{d}{dt}E_{\lambda}(t)&=\big(2t-\beta(\lambda+2-\alpha)\big)(\Phi(x(t))-\min\Phi)-\lambda(t-\beta)\langle x(t)-x^{*},\nabla\Phi(x(t))\rangle\\ &\qquad-t(\alpha-\lambda-1)\|\dot{x}(t)\|^{2}-\beta t(t-\beta)\|\nabla\Phi(x(t))\|^{2}.\end{aligned}\tag{20}$$

Now, ⟨x(t) − x*, ∇Φ(x(t))⟩ ≥ Φ(x(t)) − min Φ, by convexity. If t ≥ β, we deduce from the above that

$$\frac{d}{dt}E_{\lambda}(t)\le-\big((\lambda-2)t-\beta(\alpha-2)\big)(\Phi(x(t))-\min\Phi)-t(\alpha-\lambda-1)\|\dot{x}(t)\|^{2}-\beta t(t-\beta)\|\nabla\Phi(x(t))\|^{2}.\tag{21}$$
###### Remark 1.8.

Recall that Φ(x(t)) − min Φ is nonnegative. Let us give a closer look at the coefficients on the right-hand side: First, (λ − 2)t − β(α − 2) ≥ 0 for t ≥ β(α − 2)/(λ − 2), provided λ > 2. Next, α − λ − 1 ≥ 0 whenever λ ≤ α − 1. A compatibility condition for these two relations to hold is that 2 < λ ≤ α − 1, thus α > 3. The limiting case λ = 2 (thus α ≥ 3) will be included in Lemma 1.9 below. Finally, βt(t − β) ≥ 0 for t ≥ β. Summarizing, if α > 3 and 2 < λ ≤ α − 1, we immediately deduce that E_λ is nonincreasing on the interval [max{β, β(α − 2)/(λ − 2)}, +∞), and lim_{t→+∞} E_λ(t) exists.

###### Lemma 1.9.

Let α ≥ 3 and β > 0. Suppose x is a solution of (DIN-AVD). If λ = 2, then the function

$$t\mapsto\Big(\frac{t}{t-\beta}\Big)^{\alpha-2}E_{\lambda}(t)$$

is nonincreasing, and lim_{t→+∞} E_λ(t) exists.

###### Proof.

Since we are interested in asymptotic properties of x, we can assume t₀ > β. From (21), since λ = 2 and α ≥ 3, we deduce

$$\frac{d}{dt}E_{\lambda}(t)\le\beta(\alpha-2)(\Phi(x(t))-\min\Phi).$$

Multiplying by t(t − β), and noticing that t − β ≤ t − β(λ + 2 − α) (because λ + 2 − α = 4 − α ≤ 1), we obtain

$$t(t-\beta)\frac{d}{dt}E_{\lambda}(t)\ \le\ \beta(\alpha-2)t(t-\beta)(\Phi(x(t))-\min\Phi)\ \le\ \beta(\alpha-2)t\big(t-\beta(\lambda+2-\alpha)\big)(\Phi(x(t))-\min\Phi)\ \le\ \beta(\alpha-2)E_{\lambda}(t).$$

Now, multiplying by $\frac{t^{\alpha-3}}{(t-\beta)^{\alpha-1}}>0$, we obtain

$$\Big(\frac{t}{t-\beta}\Big)^{\alpha-2}\frac{d}{dt}E_{\lambda}(t)\le\beta(\alpha-2)\frac{t^{\alpha-3}}{(t-\beta)^{\alpha-1}}E_{\lambda}(t),$$

whence we deduce

$$\frac{d}{dt}\left[\Big(\frac{t}{t-\beta}\Big)^{\alpha-2}E_{\lambda}(t)\right]\le0.$$

Therefore, the function t ↦ (t/(t − β))^{α−2} E_λ(t) is nonincreasing. Since it is nonnegative, it has a limit as t → +∞, and, clearly, so does E_λ. ∎

An important consequence is the following:

###### Theorem 1.10.

Let α ≥ 3 and β > 0. Suppose x is a solution of (DIN-AVD). Then, t²(Φ(x(t)) − min Φ) is bounded. More precisely, set λ = 2 and fix s ≥ t₀ with s > β. For all t ≥ s, we have

$$\Phi(x(t))-\min\Phi\le\frac{1}{t^{2}}\Big(\frac{s}{s-\beta}\Big)^{\alpha-2}E_{\lambda}(s)=O(t^{-2}).$$
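As a numerical illustration of this theorem (the test problem is ours, not the paper's), for a convex but not strongly convex potential Φ(x) = x⁴/4 on ℝ, the rescaled gap t²(Φ(x(t)) − min Φ) stays bounded along an integrated trajectory with α = 3:

```python
# Illustration: for the convex, non-strongly-convex potential
# Phi(x) = x^4/4 on R (min Phi = 0), the rescaled gap
# t^2 * Phi(x(t)) stays bounded along a (DIN-AVD) trajectory, alpha = 3.
alpha, beta = 3.0, 1.0
phi = lambda x: 0.25 * x ** 4
grad = lambda x: x ** 3
hess = lambda x: 3.0 * x ** 2

t, x, v, h = 1.0, 1.5, 0.0, 1e-4
samples = []
for k in range(990000):                # integrate on [1, 100]
    if k % 5000 == 0:
        samples.append(t * t * phi(x))
    a = -(alpha / t) * v - beta * hess(x) * v - grad(x)
    x, v = x + h * v, v + h * a
    t += h
```

A rough bookkeeping with E₂ for this data gives t²Φ(x(t)) ≲ 50 along the whole trajectory, and Φ(x(t)) of order 10⁻³ by t = 100, consistent with the O(t⁻²) rate.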