Variable metric inexact line–search based methods for nonsmooth optimization ^{†}^{†}thanks: This work has been partially supported by MIUR under the two projects FIRB  Futuro in Ricerca 2012, contract RBFR12M3AC and PRIN 2012, contract 2012MTE38N. Ignace Loris is a Research Associate of the Fonds de la Recherche Scientifique  FNRS. The Italian GNCS  INdAM is also acknowledged.
Abstract
We develop a new proximal–gradient method for minimizing the sum of a differentiable, possibly nonconvex, function plus a convex, possibly non differentiable, function. The key features of the proposed method are the definition of a suitable descent direction, based on the proximal operator associated to the convex part of the objective function, and an Armijo–like rule to determine the step size along this direction ensuring the sufficient decrease of the objective function. In this frame, we especially address the possibility of adopting a metric which may change at each iteration and an inexact computation of the proximal point defining the descent direction. For the more general nonconvex case, we prove that all limit points of the iterates sequence are stationary, while for convex objective functions we prove the convergence of the whole sequence to a minimizer, under the assumption that a minimizer exists. In the latter case, assuming also that the gradient of the smooth part of the objective function is Lipschitz, we also give a convergence rate estimate, showing the complexity with respect to the function values. We also discuss verifiable sufficient conditions for the inexact proximal point and we present the results of a numerical experience on a convex total variation based image restoration problem, showing that the proposed approach is competitive with another stateoftheart method.
Key words. Proximal algorithms, nonsmooth optimization, generalized projection, nonconvex optimization.
AMS subject classifications. 65K05, 90C30
1 Introduction
In this paper we consider the problem
\hb@xt@.01(1.1) 
where is a proper, convex, lower semicontinuous function and is smooth, i.e. continuously differentiable, on an open subset of containing .
We also assume that is bounded from below and that is nonempty and closed. Formulation (LABEL:minf) includes also constrained problems over convex sets, which can be introduced by adding to the indicator function of the feasible set.
When in particular reduces to the indicator function of a convex set , i.e. with
a simple and well studied algorithm for the solution of (LABEL:minf) is the gradient projection (GP) method, which is particularly appealing for large scale problems. In the last years, several variants of such method have been proposed [7, 10, 18, 21], with the aim to accelerate the convergence which, for the basic implementation, can be very slow. In particular, reliable acceleration techniques have been proposed for the so called gradient projection method with line–search along the feasible direction [6, Chapter 2], whose iteration consists in
\hb@xt@.01(1.2) 
where is the Euclidean projection of the point onto the feasible set and is a steplength parameter ensuring the sufficient decrease of the objective function. Typically, is determined by means of a backtracking loop until an Armijotype inequality is satisfied. Variants of the basic scheme are obtained by introducing a further variable stepsize parameter , which controls the step along the gradient, in combination with a variable choice of the underlying metric. In practice, the point can be defined as
\hb@xt@.01(1.3) 
where is a positive parameter and is a symmetric positive definite matrix. The stepsizes and the matrices have to be considered as “free” parameters of the method and a clever choice of them can lead to significant improvements in the practical convergence behaviour [7, 8, 10].
In this paper we generalize the GP scheme (LABEL:intro0)–(LABEL:intro1), by introducing the concept of descent direction for the case where is a general convex function and we propose a suitable variant of the Armijo rule for the nonsmooth problem (LABEL:minf). In particular, we focus on the case when the descent direction has the form , with
\hb@xt@.01(1.4) 
where plays the role of a distance function, depending on the parameter . Clearly, (LABEL:intro2) is a generalization of (LABEL:intro1), which is recovered when , by setting , with .
Formally, the scheme (LABEL:intro0)(LABEL:intro2) is a forward–backward (or proximal gradient) method [15, 16] depending on the parameters , .
In particular, we deeply investigate the variant of the scheme (LABEL:intro0)–(LABEL:intro2) where the minimization problem in (LABEL:intro2) is solved inexactly and we devise two types of admissible approximations. We show that both approximation types can be practically computed when , where and is a proper, convex, lower semicontinuous function with an easytocompute resolvent operator. In this case, our scheme consists in a double loop method, where the inner loop is provided by an implementable stopping criterion. For general , we are able to prove that any limit point of the sequence generated by our inexact scheme is stationary for problem (LABEL:minf). The proof of this fact is essentially based on the properties of the Armijotype rule adopted for computing and it does not require any Lipschitz property of the gradient of . When is convex, we prove a stronger result, showing that the iterates converge to a minimizer of (LABEL:minf), if it exists. In the latter case, under the further assumption that is Lipschitz continuous, we give a convergence rate estimate for the objective function values. Our analysis includes as special cases several stateoftheart methods, as those in [7, 9, 10, 26, 32].
Forward–backward algorithms based on a variable metric have been recently studied also in [14] for the convex case and in [13] for the nonconvex case under the KurdykaŁojasiewicz assumption (see also [20]). Even if our scheme is formally very similar to those in [13, 14], the involved parameters have a substantially different meaning. In our case, the theoretical convergence is ensured by the Armijo parameter in combination with the descent direction properties; this results in an almost complete freedom to choose the other algorithm parameters (e.g. and ), without necessarily relating them to the Lipschitz constant of (actually, our analysis, except the convergence rate estimate, is performed without this assumption). We believe that this is also one of the main strength of our method, since acceleration techniques based on suitable choices of and , originally proposed for smooth optimization, can be adopted, leading to an improvement of the practical performances. The other crucial ingredient of our method is the inexact computation of the minimizer in (LABEL:intro2): this issue has been considered in several papers in the context of proximal and proximal gradient methods (see for example [1, 13, 31, 33] and references therein). The approach we follow in this paper is more similar to the one proposed in [33] and has the advantage to provide an implementable condition for the approximate computation of the proximal point. Moreover, we also generalize the ideas proposed in [7] for the inexact computation of the projection onto a convex set. Finally, we also mention the papers [2, 3, 4, 19] for the use of non Euclidean distances in the context of forward–backward and proximal methods.
The paper is organized as follows: some background material is collected in Section LABEL:sec:definitions, while the concept of descent direction for problem (LABEL:minf) is presented and developed in Section LABEL:sec:descent. In Section LABEL:sec:algorithm, the modified Armijo rule is discussed. Then, a general convergence result for line–search descent algorithms based on this rule is proved, in the nonconvex case. Two different inexactness criteria, called of type and type are proposed in Sections LABEL:sec:epsilonapprox and LABEL:sec:etaapprox, and the related implementation is discussed in Sections LABEL:sec:etaimplementation and LABEL:sec:epsimplementation. Section LABEL:sec:convergence deals with the convex case, where the convergence of an approximation based algorithm is proved and the related convergence rate is analyzed. The results of a numerical experience on a total variation based image restoration problem are presented in Section LABEL:sec:numerical while our conclusions are given in Section LABEL:sec:conclusions.
Notation
We denote the extended real numbers set as and by , the set of nonnegative and positive real numbers, respectively. The scaled Euclidean norm of an vector , associated to a symmetric positive definite matrix is . Given , we denote by the set of all symmetric positive definite matrices with all eigenvalues contained in the interval . For any we have that also belongs to and
\hb@xt@.01(1.5) 
for any .
2 Definitions and basic properties
We recall the following definitions.
Definition 2.1
[29, p.213] Let be any function from to . The one sided directional derivative of at with respect to a vector is defined as
\hb@xt@.01(2.1) 
if the limit on the righthand side exists in .
When is smooth at , then . When is convex, its directional derivative has the following property.
Theorem 2.1
[29, Theorem 23.1] If is convex and , then for any the limit at the righthand side of (LABEL:dirder) exists and .
As a consequence of the previous theorem, for any convex function we have that exists for any , and
\hb@xt@.01(2.2) 
Definition 2.2
[32, p. 394] A point is stationary for problem (LABEL:minf) if and
\hb@xt@.01(2.3) 
Definition 2.3
[20, §2.3] The proximity or resolvent operator associated to a convex function in the metric induced by a symmetric positive definite matrix is defined as
We remark that is a Lipschitz continuous function whose Lipschitz constant is .
Definition 2.4
Let be a convex function. The conjugate function of is the function defined as .
The following proposition states a useful property of the conjugate.
Proposition 2.1
Let , be two convex functions, .If , then .
Proof. By Definition LABEL:def:dual we have
Definition 2.5
[35, p. 82] Given , the subdifferential of a convex function at a point is the set
\hb@xt@.01(2.4) 
If , then . For the usual subdifferential set is recovered. A useful property of the subdifferential is the following one.
Proposition 2.2
[35, Theorem 2.4.4 (iv)] Let be a convex, proper, lower semicontinuous function. Then for any and for any we have .
3 A family of descent directions
When is smooth, a vector is said a descent direction for at when . In the nonsmooth case (LABEL:minf), we give the following definition, based on the directional derivative.
Definition 3.1
A vector is a descent direction for at if .
Thanks to Theorem LABEL:teo:directionalconvex, the previous definition is well posed. In this section we define a family of descent directions for problem (LABEL:minf). To this end, we define the following set of non–negative functions.
Given a convex set and a set of parameters , we denote by the set of any distance–like function continuously depending on such that for all we have:

is continuous in ;

is smooth w.r.t. ;

is strongly convex w.r.t. :
where does not depend on or (here denotes the gradient with respect to the first argument of a function);

if and only if (which implies that for all ).
The scaled Euclidean distance
\hb@xt@.01(3.1) 
with , where and is a symmetric positive definite matrix, is an interesting example of a function in . Other examples of distance–like functions can be obtained by considering Bregman distances associated to a strongly convex function.
For a given array of parameters , let us introduce the function defined as
\hb@xt@.01(3.2) 
where and . We remark that depends continuously on , as does. Moreover, since and are convex, proper and lower semicontinuous, is also convex, proper and lower semicontinuous for all . Finally, for any point and for any we have
\hb@xt@.01(3.3) 
where denotes the directional derivative of at the point with respect to . From assumption LABEL:diststrongcvx, it follows that is strongly convex and admits a unique minimum point for any .
Now we introduce the following operator associated to any function of the form (LABEL:hxy)
\hb@xt@.01(3.4) 
When is chosen as in (LABEL:scaledeuclidean), the operator (LABEL:proj) becomes
Under assumption LABEL:diststrongcvx, one can show that depends continuously on .
Proposition 3.1
Let and be defined as in (LABEL:hxy). Then depends continuously on .
Proof. Let . Then is characterized by the equation , where . It follows that for all or:
Assumption LABEL:diststrongcvx expressed in and gives:
Together, these two inequalities yield:
Let and . Adding the previous inequality for (resp. ) and choosing (resp. ), one finds:
and hence:
It follows that where and . As is , one has . As is continuous in , one also has that . This shows then that , in other words is continuous in .
Given a function , we introduce also the function defined as
\hb@xt@.01(3.5) 
for some . We have
\hb@xt@.01(3.6) 
and when . In the following we will show that

the stationarity condition (LABEL:stationary) can be reformulated in terms of fixed points of the operator ;

the negative sign of detects a descent direction.
To this purpose, we collect in the following proposition some properties of the function and the associated operator .
Proposition 3.2
Let , , and , be defined as in (LABEL:hxy), (LABEL:thxy), where . If and , then:

;

if and , then ;

and if and only if ;

and the equality holds if and only if (if and only if ).
Proof. (a) is a direct consequence of definition (LABEL:thxy) and condition on .
(b) If , we have
where the second inequality follows from definition (LABEL:thxy) of and the third one from (LABEL:ine_dir_der).
(c) Since is the minimum point of , part (a) with yields which, in view of (LABEL:IL1), gives . If , part (a) implies . Conversely, assume . From inequality (LABEL:IL1) we have . On the other side, since is the minimum point of , part (a) with implies . Thus and since is the unique minimizer of , we can conclude that .
(d) From (c) we have . When then part (b) implies . When , from (c) we obtain and, therefore, . Conversely, assume . This implies
Since , we necessarily have .
The following proposition completely characterizes the stationary points of (LABEL:minf) in two equivalent ways, as fixed points of the operator , i.e. the solutions of the equation , or as roots of the composite function .
Proposition 3.3
Let , , , be defined as in (LABEL:hxy), , and . The following statements are equivalent:

is stationary for problem (LABEL:minf);

;

.
Proof. (a) (b) Assume that . Then, achieves its minimum at and inequality (LABEL:stationary) applied to it yields . Recalling (LABEL:H2) we have , hence is a stationary point for problem (LABEL:minf).
Conversely, let be a stationary point of (LABEL:minf) and assume by contradiction that . Then, by Proposition LABEL:Proinexact1 (d) we obtain , which contradicts the stationarity assumption on .
(b) (c) See Proposition LABEL:Proinexact1 (c).
4 A line–search algorithm based on a modified Armijo rule
In this section we consider the modified Armijo rule described in Algorithm LABEL:Algo2, which is a generalization of the one in [32]. Indeed the rule proposed in [32] is recovered when is chosen as in (LABEL:scaledeuclidean) and . In the following we will prove that Algorithm LABEL:Algo2 is well defined and classical properties of the Armijo condition still hold for this modified case.
Let , be two sequences of points in , and be a sequence of parameters in . Choose some , . For all compute as follows:

Set and .

If
\hb@xt@.01(4.1) where
\hb@xt@.01(4.2) Then go to step 3.
Else set and go to step 2. 
End
Here and in the following we will define the function as in (LABEL:hxy) and, for sake of simplicity, we will make the following assumption

, where and is a compact set.
Proposition 4.1
Let , be two sequences of points in , a sequence of parameters in and . Assume that
\hb@xt@.01(4.3) 
for all . Then, the line–search Algorithm LABEL:Algo2 is well defined, i.e. for each the loop at step 2 terminates in a finite number of steps. If, in addition, we assume that and are bounded sequences and , then we have that is bounded. Assuming also that
\hb@xt@.01(4.4) 
where and are computed with Algorithm LABEL:Algo2, then we have
Proof. We prove first that the loop at step 2 of Algorithm LABEL:Algo2 terminates in a finite number of steps for any . Assume by contradiction that there exists a such that Algorithm LABEL:Algo2 performs an infinite number of reductions, thus, for any , we have
where the second inequality is obtained by means of the Jensen inequality applied to the convex function . Taking limits on the right hand side for we obtain
where the second inequality follows from the non–negativity of and the last one from (LABEL:A2). Since , this is an absurdum.
Assume now that , are bounded sequences and that . We show that is bounded. By assumption (LABEL:A2), is bounded from above. We show that it is also bounded from below. Indeed we have
where the first inequality follows from the non–negativity of , the second one is obtained by adding and subtracting and the last one is a consequence of .
As is proper and convex, there exists a supporting hyperplane, i.e. such that for all . Thus:
The right hand side is a continuous function of and . As these are assumed to lie on a closed and bounded set, the left hand side is bounded (from below) as well.
Let us show that the only limit point of is zero. We observe that from (LABEL:A2) and (LABEL:A3) we obtain
\hb@xt@.01(4.5) 
Assume that there exists a subset of indices such that