Variable metric inexact line–search based methods for nonsmooth optimization This work has been partially supported by MIUR under the two projects FIRB - Futuro in Ricerca 2012, contract RBFR12M3AC and PRIN 2012, contract 2012MTE38N. Ignace Loris is a Research Associate of the Fonds de la Recherche Scientifique - FNRS. The Italian GNCS - INdAM is also acknowledged.

# Variable metric inexact line–search based methods for nonsmooth optimization ††thanks: This work has been partially supported by MIUR under the two projects FIRB - Futuro in Ricerca 2012, contract RBFR12M3AC and PRIN 2012, contract 2012MTE38N. Ignace Loris is a Research Associate of the Fonds de la Recherche Scientifique - FNRS. The Italian GNCS - INdAM is also acknowledged.

S. Bonettini111Dipartimento di Matematica e Informatica, Università di Ferrara, Via Saragat 1, 44122 Ferrara, Italy (silvia.bonettini@unife.it,federica.porta@unife.it).    I. Loris222Département de Mathématique, Université Libre de Bruxelles, Boulevard du Triomphe, 1050 Bruxelles, Belgium (igloris@ulb.ac.be).    F. Porta111Dipartimento di Matematica e Informatica, Università di Ferrara, Via Saragat 1, 44122 Ferrara, Italy (silvia.bonettini@unife.it,federica.porta@unife.it).    M. Prato333Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, Via Campi 213/b, 41125 Modena, Italy (marco.prato@unimore.it).
###### Abstract

We develop a new proximal–gradient method for minimizing the sum of a differentiable, possibly nonconvex, function plus a convex, possibly non differentiable, function. The key features of the proposed method are the definition of a suitable descent direction, based on the proximal operator associated to the convex part of the objective function, and an Armijo–like rule to determine the step size along this direction ensuring the sufficient decrease of the objective function. In this frame, we especially address the possibility of adopting a metric which may change at each iteration and an inexact computation of the proximal point defining the descent direction. For the more general nonconvex case, we prove that all limit points of the iterates sequence are stationary, while for convex objective functions we prove the convergence of the whole sequence to a minimizer, under the assumption that a minimizer exists. In the latter case, assuming also that the gradient of the smooth part of the objective function is Lipschitz, we also give a convergence rate estimate, showing the complexity with respect to the function values. We also discuss verifiable sufficient conditions for the inexact proximal point and we present the results of a numerical experience on a convex total variation based image restoration problem, showing that the proposed approach is competitive with another state-of-the-art method.

Key words. Proximal algorithms, nonsmooth optimization, generalized projection, nonconvex optimization.

AMS subject classifications. 65K05, 90C30

## 1 Introduction

In this paper we consider the problem

 minx∈Rnf(x)≡f0(x)+f1(x) \hb@xt@.01(1.1)

where is a proper, convex, lower semicontinuous function and is smooth, i.e. continuously differentiable, on an open subset of containing .

We also assume that is bounded from below and that is non-empty and closed. Formulation (LABEL:minf) includes also constrained problems over convex sets, which can be introduced by adding to the indicator function of the feasible set.

When in particular reduces to the indicator function of a convex set , i.e. with

 ιΩ(x)={0 if x∈Ω+∞ if x∉Ω.,

a simple and well studied algorithm for the solution of (LABEL:minf) is the gradient projection (GP) method, which is particularly appealing for large scale problems. In the last years, several variants of such method have been proposed [7, 10, 18, 21], with the aim to accelerate the convergence which, for the basic implementation, can be very slow. In particular, reliable acceleration techniques have been proposed for the so called gradient projection method with line–search along the feasible direction [6, Chapter 2], whose iteration consists in

 x(k+1)=x(k)+λ(k)(y(k)−x(k)), \hb@xt@.01(1.2)

where is the Euclidean projection of the point onto the feasible set and is a steplength parameter ensuring the sufficient decrease of the objective function. Typically, is determined by means of a backtracking loop until an Armijo-type inequality is satisfied. Variants of the basic scheme are obtained by introducing a further variable stepsize parameter , which controls the step along the gradient, in combination with a variable choice of the underlying metric. In practice, the point can be defined as

 y(k)=argminy∈Ω∇f0(x(k))T(y−x(k))+12αk(y−x(k))TDk(y−x(k)) \hb@xt@.01(1.3)

where is a positive parameter and is a symmetric positive definite matrix. The stepsizes and the matrices have to be considered as “free” parameters of the method and a clever choice of them can lead to significant improvements in the practical convergence behaviour [7, 8, 10].

In this paper we generalize the GP scheme (LABEL:intro0)–(LABEL:intro1), by introducing the concept of descent direction for the case where is a general convex function and we propose a suitable variant of the Armijo rule for the nonsmooth problem (LABEL:minf). In particular, we focus on the case when the descent direction has the form , with

 y(k)=argminy∈Rn∇f0(x(k))T(y−x(k))+dσ(k)(y,x(k))+f1(y)−f1(x(k)), \hb@xt@.01(1.4)

where plays the role of a distance function, depending on the parameter . Clearly, (LABEL:intro2) is a generalization of (LABEL:intro1), which is recovered when , by setting , with .

Formally, the scheme (LABEL:intro0)-(LABEL:intro2) is a forward–backward (or proximal gradient) method [15, 16] depending on the parameters , .

In particular, we deeply investigate the variant of the scheme (LABEL:intro0)–(LABEL:intro2) where the minimization problem in (LABEL:intro2) is solved inexactly and we devise two types of admissible approximations. We show that both approximation types can be practically computed when , where and is a proper, convex, lower semicontinuous function with an easy-to-compute resolvent operator. In this case, our scheme consists in a double loop method, where the inner loop is provided by an implementable stopping criterion. For general , we are able to prove that any limit point of the sequence generated by our inexact scheme is stationary for problem (LABEL:minf). The proof of this fact is essentially based on the properties of the Armijo-type rule adopted for computing and it does not require any Lipschitz property of the gradient of . When is convex, we prove a stronger result, showing that the iterates converge to a minimizer of (LABEL:minf), if it exists. In the latter case, under the further assumption that is Lipschitz continuous, we give a convergence rate estimate for the objective function values. Our analysis includes as special cases several state-of-the-art methods, as those in [7, 9, 10, 26, 32].

Forward–backward algorithms based on a variable metric have been recently studied also in [14] for the convex case and in [13] for the nonconvex case under the Kurdyka-Łojasiewicz assumption (see also [20]). Even if our scheme is formally very similar to those in [13, 14], the involved parameters have a substantially different meaning. In our case, the theoretical convergence is ensured by the Armijo parameter in combination with the descent direction properties; this results in an almost complete freedom to choose the other algorithm parameters (e.g. and ), without necessarily relating them to the Lipschitz constant of (actually, our analysis, except the convergence rate estimate, is performed without this assumption). We believe that this is also one of the main strength of our method, since acceleration techniques based on suitable choices of and , originally proposed for smooth optimization, can be adopted, leading to an improvement of the practical performances. The other crucial ingredient of our method is the inexact computation of the minimizer in (LABEL:intro2): this issue has been considered in several papers in the context of proximal and proximal gradient methods (see for example [1, 13, 31, 33] and references therein). The approach we follow in this paper is more similar to the one proposed in [33] and has the advantage to provide an implementable condition for the approximate computation of the proximal point. Moreover, we also generalize the ideas proposed in [7] for the inexact computation of the projection onto a convex set. Finally, we also mention the papers [2, 3, 4, 19] for the use of non Euclidean distances in the context of forward–backward and proximal methods.

The paper is organized as follows: some background material is collected in Section LABEL:sec:definitions, while the concept of descent direction for problem (LABEL:minf) is presented and developed in Section LABEL:sec:descent. In Section LABEL:sec:algorithm, the modified Armijo rule is discussed. Then, a general convergence result for line–search descent algorithms based on this rule is proved, in the nonconvex case. Two different inexactness criteria, called of -type and -type are proposed in Sections LABEL:sec:epsilon-approx and LABEL:sec:eta-approx, and the related implementation is discussed in Sections LABEL:sec:eta-implementation and LABEL:sec:eps-implementation. Section LABEL:sec:convergence deals with the convex case, where the convergence of an -approximation based algorithm is proved and the related convergence rate is analyzed. The results of a numerical experience on a total variation based image restoration problem are presented in Section LABEL:sec:numerical while our conclusions are given in Section LABEL:sec:conclusions.

##### Notation

We denote the extended real numbers set as and by , the set of non-negative and positive real numbers, respectively. The scaled Euclidean norm of an -vector , associated to a symmetric positive definite matrix is . Given , we denote by the set of all symmetric positive definite matrices with all eigenvalues contained in the interval . For any we have that also belongs to and

 1μ∥x∥2≤∥x∥2D≤μ∥x∥2 \hb@xt@.01(1.5)

for any .

## 2 Definitions and basic properties

We recall the following definitions.

###### Definition 2.1

[29, p.213] Let be any function from to . The one sided directional derivative of at with respect to a vector is defined as

 f′(x;d)=limλ↓0f(x+λd)−f(x)λ \hb@xt@.01(2.1)

if the limit on the right-hand side exists in .

When is smooth at , then . When is convex, its directional derivative has the following property.

###### Theorem 2.1

[29, Theorem 23.1] If is convex and , then for any the limit at the right-hand side of (LABEL:dir-der) exists and .

As a consequence of the previous theorem, for any convex function we have that exists for any , and

 f′(x;d)≤f(x+d)−f(x). \hb@xt@.01(2.2)
###### Definition 2.2

[32, p. 394] A point is stationary for problem (LABEL:minf) if and

 f′(x;d)≥0   ∀d∈Rn. \hb@xt@.01(2.3)
###### Definition 2.3

[20, §2.3] The proximity or resolvent operator associated to a convex function in the metric induced by a symmetric positive definite matrix is defined as

 proxDf(x)=argminz∈Rnf(z)+12∥z−x∥2D,   ∀x∈Rn.

We remark that is a Lipschitz continuous function whose Lipschitz constant is .

###### Definition 2.4

Let be a convex function. The conjugate function of is the function defined as .

The following proposition states a useful property of the conjugate.

###### Proposition 2.1

Let , be two convex functions, .If , then .

Proof. By Definition LABEL:def:dual we have

 f∗(ATy)=supx∈Rn xTATy−f(x)=supx∈Rn (Ax)Ty−g(Ax)=supz∈Rm,z=AxzTy−g(z)≤supz∈RmzTy−g(z)=g∗(y).

###### Definition 2.5

[35, p. 82] Given , the -subdifferential of a convex function at a point is the set

 ∂ϵf(z)={w∈Rn:f(x)≥f(z)+(x−z)Tw−ϵ,  ∀x∈Rn}. \hb@xt@.01(2.4)

If , then . For the usual subdifferential set is recovered. A useful property of the -subdifferential is the following one.

###### Proposition 2.2

[35, Theorem 2.4.4 (iv)] Let be a convex, proper, lower semicontinuous function. Then for any and for any we have .

## 3 A family of descent directions

When is smooth, a vector is said a descent direction for at when . In the nonsmooth case (LABEL:minf), we give the following definition, based on the directional derivative.

###### Definition 3.1

A vector is a descent direction for at if .

Thanks to Theorem LABEL:teo:directional-convex, the previous definition is well posed. In this section we define a family of descent directions for problem (LABEL:minf). To this end, we define the following set of non–negative functions.

Given a convex set and a set of parameters , we denote by the set of any distance–like function continuously depending on such that for all we have:

1. is continuous in ;

2. is smooth w.r.t. ;

3. is strongly convex w.r.t. :

 dσ(z2,x)≥dσ(z1,x)+∇1dσ(z1,x)T(z2−z1)+m2∥z2−z1∥2∀z1,z2∈Ω,

where does not depend on or (here denotes the gradient with respect to the first argument of a function);

4. if and only if (which implies that for all ).

The scaled Euclidean distance

 dσ(x,y)=12α∥x−y∥2D \hb@xt@.01(3.1)

with , where and is a symmetric positive definite matrix, is an interesting example of a function in . Other examples of distance–like functions can be obtained by considering Bregman distances associated to a strongly convex function.

For a given array of parameters , let us introduce the function defined as

 hσ(z,x)=∇f0(x)T(z−x)+dσ(z,x)+f1(z)−f1(x)  ∀z,x∈Rn, \hb@xt@.01(3.2)

where and . We remark that depends continuously on , as does. Moreover, since and are convex, proper and lower semicontinuous, is also convex, proper and lower semicontinuous for all . Finally, for any point and for any we have

 h′σ(x,x;d)=f′(x;d), \hb@xt@.01(3.3)

where denotes the directional derivative of at the point with respect to . From assumption LABEL:diststrongcvx, it follows that is strongly convex and admits a unique minimum point for any .

Now we introduce the following operator associated to any function of the form (LABEL:hxy)

 p(x;hσ)=argminz∈Rnhσ(z,x). \hb@xt@.01(3.4)

When is chosen as in (LABEL:scaled-euclidean), the operator (LABEL:proj) becomes

 p(x;hσ)=proxDαf1(x−αD−1∇f0(x)).

Under assumption LABEL:diststrongcvx, one can show that depends continuously on .

###### Proposition 3.1

Let and be defined as in (LABEL:hxy). Then depends continuously on .

Proof. Let . Then is characterized by the equation , where . It follows that for all or:

 f1(u)≥f1(y)−(∇f0(x)+∇1dσ(y,x))T(u−y)∀u∈Rn.

Assumption LABEL:diststrongcvx expressed in and gives:

 dσ(u,x)≥dσ(y,x)+∇1dσ(y,x)T(u−y)+m2∥y−u∥2∀u∈Rn.

Together, these two inequalities yield:

 m2∥y−u∥2≤f1(u)−f1(y)+dσ(u,x)−dσ(y,x)+∇f0(x)T(u−y)∀u∈Rn.

Let and . Adding the previous inequality for (resp. ) and choosing (resp. ), one finds:

 m∥y1−y2∥2≤dσ1(y2,x1)−dσ1(y1,x1)+dσ2(y1,x2)−dσ2(y2,x2)+(∇f0(x1)−∇f0(x2))T(y2−y1)

and hence:

 m∥y1−y2∥2≤dσ2(y1,x2)−dσ1(y1,x1)+dσ1(y2,x1)−dσ2(y2,x2)+∥∇f0(x1)−∇f0(x2)∥∥y2−y1∥.

It follows that where and . As is , one has . As is continuous in , one also has that . This shows then that , in other words is continuous in .
Given a function , we introduce also the function defined as

 ~hσ,γ(z,x)=∇f0(x)T(z−x)+γdσ(z,x)+f1(z)−f1(x)  ∀z,x∈Rn \hb@xt@.01(3.5)

for some . We have

 ~hσ,γ(y,x)≤hσ(y,x)   ∀x,y∈Rn \hb@xt@.01(3.6)

and when . In the following we will show that

• the stationarity condition (LABEL:stationary) can be reformulated in terms of fixed points of the operator ;

• the negative sign of detects a descent direction.

To this purpose, we collect in the following proposition some properties of the function and the associated operator .

###### Proposition 3.2

Let , , and , be defined as in (LABEL:hxy), (LABEL:thxy), where . If and , then:

• ;

• if and , then ;

• and if and only if ;

• and the equality holds if and only if (if and only if ).

Proof. (a) is a direct consequence of definition (LABEL:thxy) and condition on .

(b) If , we have

 0≥−γdσ(z,x)>∇f0(x)T(z−x)+f1(z)−f1(x)≥∇f0(x)T(z−x)+f′1(x;z−x)=f′(x;z−x),

where the second inequality follows from definition (LABEL:thxy) of and the third one from (LABEL:ine_dir_der).

(c) Since is the minimum point of , part (a) with yields which, in view of (LABEL:IL1), gives . If , part (a) implies . Conversely, assume . From inequality (LABEL:IL1) we have . On the other side, since is the minimum point of , part (a) with implies . Thus and since is the unique minimizer of , we can conclude that .

(d) From (c) we have . When then part (b) implies . When , from (c) we obtain and, therefore, . Conversely, assume . This implies

 0=∇f0(x)T(y−x)+f′1(x;y−x)≤∇f0(x)T(y−x)+f1(y)−f1(x)≤~hσ,γ(y,x).

Since , we necessarily have .
The following proposition completely characterizes the stationary points of (LABEL:minf) in two equivalent ways, as fixed points of the operator , i.e. the solutions of the equation , or as roots of the composite function .

###### Proposition 3.3

Let , , , be defined as in (LABEL:hxy), , and . The following statements are equivalent:

• is stationary for problem (LABEL:minf);

• ;

• .

Proof. (a) (b) Assume that . Then, achieves its minimum at and inequality (LABEL:stationary) applied to it yields . Recalling (LABEL:H2) we have , hence is a stationary point for problem (LABEL:minf).

Conversely, let be a stationary point of (LABEL:minf) and assume by contradiction that . Then, by Proposition LABEL:Proinexact1 (d) we obtain , which contradicts the stationarity assumption on .

(b) (c) See Proposition LABEL:Proinexact1 (c).

## 4 A line–search algorithm based on a modified Armijo rule

In this section we consider the modified Armijo rule described in Algorithm LABEL:Algo2, which is a generalization of the one in [32]. Indeed the rule proposed in [32] is recovered when is chosen as in (LABEL:scaled-euclidean) and . In the following we will prove that Algorithm LABEL:Algo2 is well defined and classical properties of the Armijo condition still hold for this modified case.

Here and in the following we will define the function as in (LABEL:hxy) and, for sake of simplicity, we will make the following assumption

• , where and is a compact set.

###### Proposition 4.1

Let , be two sequences of points in , a sequence of parameters in and . Assume that

 ~hσ(k),γ(~y(k),x(k))<0 \hb@xt@.01(4.3)

for all . Then, the line–search Algorithm LABEL:Algo2 is well defined, i.e. for each the loop at step 2 terminates in a finite number of steps. If, in addition, we assume that and are bounded sequences and , then we have that is bounded. Assuming also that

 limk→∞f(x(k))−f(x(k)+λ(k)d(k))=0, \hb@xt@.01(4.4)

where and are computed with Algorithm LABEL:Algo2, then we have

 limk→∞~hσ(k),γ(~y(k),x(k))=0.

Proof. We prove first that the loop at step 2 of Algorithm LABEL:Algo2 terminates in a finite number of steps for any . Assume by contradiction that there exists a such that Algorithm LABEL:Algo2 performs an infinite number of reductions, thus, for any , we have

 βΔ(k) < f(x(k)+δjd(k))−f(x(k))δj = f0(x(k)+δjd(k))−f0(x(k))δj+f1(x(k)+δjd(k))−f1(x(k))δj ≤ f0(x(k)+δjd(k))−f0(x(k))δj+δjf1(x(k)+d(k))+(1−δj)f1(x(k))−f1(x(k))δj = f0(x(k)+δjd(k))−f0(x(k))δj+f1(~y(k))−f1(x(k)),

where the second inequality is obtained by means of the Jensen inequality applied to the convex function . Taking limits on the right hand side for we obtain

 βΔ(k) ≤ ∇f0(x(k))Td(k)+f1(~y(k))−f1(x(k)) ≤ ∇f0(x(k))Td(k)+f1(~y(k))−f1(x(k))+γdσ(k)(~y(k),x(k)) = Δ(k)<0,

where the second inequality follows from the non–negativity of and the last one from (LABEL:A2). Since , this is an absurdum.

Assume now that , are bounded sequences and that . We show that is bounded. By assumption (LABEL:A2), is bounded from above. We show that it is also bounded from below. Indeed we have

 ~hσ(k),γ(~y(k),x(k)) = ∇f0(x(k))T(~y(k)−x(k))+γdσ(k)(~y(k),x(k))+f1(~y(k))−f1(x(k)) ≥ ∇f0(x(k))T(~y(k)−x(k))+f1(~y(k))−f1(x(k)) = ∇f0(x(k))T(~y(k)−x(k))+f1(~y(k))−f(x(k))+f0(x(k)) ≥ ∇f0(x(k))T(~y(k)−x(k))+f1(~y(k))−f(x(0))+f0(x(k)),

where the first inequality follows from the non–negativity of , the second one is obtained by adding and subtracting and the last one is a consequence of .

As is proper and convex, there exists a supporting hyperplane, i.e. such that for all . Thus:

 ~hσ(k),γ(~y(k),x(k))≥∇f0(x(k))T(~y(k)−x(k))+aT~y(k)+b−f(x(0))+f0(x(k)).

The right hand side is a continuous function of and . As these are assumed to lie on a closed and bounded set, the left hand side is bounded (from below) as well.

Let us show that the only limit point of is zero. We observe that from (LABEL:A2) and (LABEL:A3) we obtain

 0=limk→∞f(x(k))−f(x(k)+λ(k)d(k))=βlimk→∞Δ(k)λ(k). \hb@xt@.01(4.5)

Assume that there exists a subset of indices such that