Forward-backward truncated Newton methods for convex composite optimization1
This paper proposes two proximal Newton-CG methods for convex nonsmooth optimization problems in composite form. The algorithms are based on a a reformulation of the original nonsmooth problem as the unconstrained minimization of a continuously differentiable function, namely the forward-backward envelope (FBE). The first algorithm is based on a standard line search strategy, whereas the second one combines the global efficiency estimates of the corresponding first-order methods, while achieving fast asymptotic convergence rates. Furthermore, they are computationally attractive since each Newton iteration requires the approximate solution of a linear system of usually small dimension.
The focus of this work is on efficient Newton-like algorithms for convex optimization problems in composite form, i.e.,
where 222: class of twice continuously differentiable, strongly convex functions with modulus of strong convexity , whose gradient is Lipschitz continuous with constant . and 333: class of proper, lower semicontinuous, convex functions from to . has a cheaply computable proximal mapping . Problems of the form (1.1) are abundant in many scientific areas such as control, signal processing, system identification, machine learning and image analysis, to name a few. For example, when is the indicator of a convex set then (1.1) becomes a constrained optimization problem, while for and it becomes the -regularized least-squares problem which is the main building block of compressed sensing. When is equal to the nuclear norm, then problem (1.1) can model low-rank matrix recovery problems. Finally, conic optimization problems such as LPs, SOCPs and SPDs can be brought into the form of (1.1), see .
Perhaps the most well known algorithm for problems in the form (1.1) is the forward-backward splitting (FBS) or proximal gradient method [4, 5], a generalization of the classical gradient and gradient projection methods to problems involving a nonsmooth term. Accelerated versions of FBS, based on the work of Nesterov [6, 7, 8], have also gained popularity. Although these algorithms share favorable global convergence rate estimates of order or (where is the solution accuracy), they are first-order methods and therefore usually effective at computing solutions of low or medium accuracy only. An evident remedy is to include second-order information by replacing the Euclidean norm in the proximal mapping with the -norm, where is the Hessian of at or some approximation of it, mimicking Newton or quasi-Newton methods for unconstrained problems. This route is followed in the recent work of [9, 10]. However, a severe limitation of the approach is that, unless has a special structure, the linearized subproblem is very hard to solve. For example, if models a QP, the corresponding subproblem is as hard as the original problem.
In this paper we follow a different approach by defining a function, which we call forward-backward envelope (FBE), that has favorable properties and can serve as a real-valued, smooth, exact penalty function for the original problem. Our approach combines and extends ideas stemming from the literature on merit functions for variational inequalities (VIs) and complementarity problems (CPs), specifically the reformulation of a VI as a constrained continuously differentiable optimization problem via the regularized gap function  and as an unconstrained continuously differentiable optimization problem via the D-gap function  (see [13, Ch. 10] for a survey and ,  for applications to constrained optimization and model predictive control of dynamical systems).
Next, we show that one can design Newton-like methods to minimize the FBE by using tools from nonsmooth analysis. Unlike the approaches of [9, 10], where the corresponding subproblems are expensive to solve, our algorithms require only the solution of a usually small linear system to compute the Newton direction. However, this work focuses on devising algorithms that have good complexity guarantees provided by a global (non-asymptotic) convergence rate while achieving -superlinear or -quadratic444A sequence converges to with -superlinear rate if . It converges to with -quadratic rate if there exists a such that , for some and all . asymptotic convergence rates in the nondegenerate cases. We show that one can achieve this goal by interleaving Newton-like iterations on the FBE and FBS iterations. This is possible by relating directions of descent for the considered penalty function with those for the original nonsmooth function.
The main contributions of the paper can be summarized as follows. We show how Problem (1.1) can be reformulated as the unconstrained minimization of a real-valued, continuously differentiable function, the FBE, providing a framework that allows to extend classical algorithms for smooth unconstrained optimization to nonsmooth or constrained problems in composite form (1.1). Moreover, based on this framework, we devise efficient proximal Newton algorithms with -superlinear or -quadratic asymptotic convergence rate to solve (1.1), with global complexity bounds. The conjugate gradient (CG) method is employed to compute efficiently an approximate Newton direction at every iteration. Therefore our algorithms are able to handle large-scale problems since they require only the calculation of matrix-vector products and there is no need to form explicitly the generalized Hessian matrix.
The outline of the paper is as follows. In Section 2 we introduce the FBE, a continuously differentiable penalty function for (1.1), and discuss some of its properties. In Section 3 we discuss the generalized differentiability properties of the gradient of the FBE and introduce a linear Newton approximation (LNA) for it, which plays a role similar to that of the Hessian in the classical Newton method. Section 4 is the core of the paper, presenting two algorithms for solving Problem (1.1) and discussing their local and global convergence properties. In Section 5 we consider some examples of and discuss the generalized Jacobian of their proximal operator, on which the LNA is based. Finally, in Section 6, we consider some practical problems and show how the proposed methods perform in solving them.
2. Forward-backward envelope
In the following we indicate by and , respectively, the set of solutions of problem (1.1) and its optimal objective value. Forward-backward splitting for solving (1.1) relies on computing, at every iteration, the following update
where the proximal mapping  of is defined by
The value function of the optimization problem (2.2) defining the proximal mapping is called the Moreau envelope and is denoted by , i.e.,
Properties of the Moreau envelope and the proximal mapping are well documented in the literature [16, 17, 18, 5]. For example, the proximal mapping is single-valued, continuous and nonexpansive (Lipschitz continuous with Lipschitz ) and the envelope function is convex, continuously differentiable, with -Lipschitz continuous gradient
Let , where , . The forward-backward envelope of is given by
Alternatively, one can express as the value function of the minimization problem that yields forward-backward splitting. In fact
One distinctive feature of is the fact that it is real-valued despite the fact that can be extended-real-valued. In addition, enjoys favorable properties, summarized in the next theorem.
The following properties of hold:
is continuously differentiable with
If then the set of stationary points of equals .
For any ,
For any ,
In particular, if then
If then .
Part (i) has already been proven. Regarding (ii), from the optimality condition for the problem defining the proximal mapping we have
i.e., is a subgradient of at . From the subgradient inequality
Adding to both sides proves the claim. For part (iii), we have
where the inequality follows by Lipschitz continuity of and the descent lemma, see e.g. [20, Prop. A.24]. For part (iv), putting in (2.10) and (2.11) and using we obtain . Now, for any we have , where the first inequality follows by optimality of for , while the second inequality follows by (2.11). This shows that every is also a (global) minimizer of . The proof finishes by recalling that the set of minimizers of are a subset of the set of its stationary points, which by (i) is equal to . ∎
Parts (i) and (iv) of Theorem (2.2) show that if , the nonsmooth problem (1.1) is completely equivalent to the unconstrained minimization of the continuously differentiable function , in the sense that the sets of minimizers and optimal values are equal. In other words we have
Part (ii) shows that an -optimal solution of is automatically -optimal for , while part (iii) implies that from an -optimal for we can directly obtain an -optimal solution for if is chosen sufficiently small, i.e.,
Notice that part (iv) of Theorem 2.2 states that if , then not only do the stationary points of agree with (cf. Theorem 2.2(i)), but also that its set of minimizers agrees with , i.e., although may not be convex, the set of stationary points turns out to be equal to the set of its minimizers. However, in the particular but important case where is convex quadratic, the FBE is convex with Lipschitz continuous gradient, as the following theorem shows.
If and , then , where
and , .
Due to Lemma A.1 (in the Appendix), is strongly convex with modulus . Function is convex, as the composition of the convex function with the linear mapping . Therefore, is strongly convex with convexity parameter . On the other hand, for every
where the second inequality is due to Lemma A.2 in the Appendix. ∎
Notice that if and we choose , then and , so . In other words the condition number of is roughly double compared to that of .
It is apparent from (2.1) and (2.5) that FBS is a Picard iteration for computing a fixed point of the nonexpansive mapping . It is well known that fixed-point iterations may exhibit slow asymptotic convergence. On the other hand, Newton methods achieve much faster asymptotic convergence rates. However, in order to devise globally convergent Newton-like methods one needs a merit function on which to perform a line search, in order to determine a step size that guarantees sufficient decrease and damps the Newton steps when far from the solution. This is exactly the role that FBE plays in this paper.
Another interesting observation is that the FBE provides a link between gradient methods and FBS, just like the Moreau envelope (2.3) does for the proximal point algorithm . To see this, consider the problem
where . The proximal point algorithm for solving (2.14) is
It is well known that the proximal point algorithm can be interpreted as a gradient method for minimizing the Moreau envelope of , cf. (2.3). Indeed, due to (2.4), iteration (2.15) can be expressed as
This simple idea provides a link between nonsmooth and smooth optimization and has led to the discovery of a variety of algorithms for problem (2.14), such as semismooth Newton methods , variable-metric  and quasi-Newton methods , and trust-region methods , to name a few. However, when dealing with composite problems, even if and are cheaply computable, computing proximal mapping and Moreau envelope of is usually as hard as solving (1.1) itself. On the other hand, forward-backward splitting takes advantage of the structure of the problem by operating separately on the two summands, cf. (2.1). The question that naturally arises is the following:
Is there a continuously differentiable function that provides an interpretation of FBS as a gradient method, just like the Moreau envelope does for the proximal point algorithm and problem (2.14)?
The forward-backward envelope provides an affirmative answer. Specifically, FBS can be interpreted as the following (variable metric) gradient method on the FBE:
Furthermore, the following properties holding for
correspond to Theorem 2.2(iii) and Theorem 2.2(iv) for the FBE. The relationship between Moreau envelope and forward-backward envelope is then apparent. This opens the possibility of extending FBS and devising new algorithms for problem (1.1) by simply reconsidering and appropriately adjusting methods for unconstrained minimization of continuously differentiable functions, the most well studied problem in optimization. In this work we exploit one of the numerous alternatives, by devising Newton-like algorithms that are able to achieve fast asymptotic convergence rates. The next section deals with the other obstacle that needs to be overcome, i.e., constructing a second-order expansion for the (but not ) function around any optimal solution, that behaves similarly to the Hessian for functions and allows us to devise algorithms with fast local convergence.
3. Second-order Analysis of
As it was shown in Section 2, is continuously differentiable over . However fails to be in most cases: since is nonsmooth, its Moreau envelope is hardly ever . For example, if is real-valued then is and is if and only if is . Therefore, we hardly ever have the luxury of assuming continuous differentiability of and we must resort into generalized notions of differentiability stemming from nonsmooth analysis. Specifically, our analysis is largely based upon generalized differentiability properties of which we study next.
3.1. Generalized Jacobians of proximal mappings
Since is globally Lipschitz continuous, by Rademacher’s theorem [17, Th. 9.60] it is almost everywhere differentiable. Recall that Rademacher’s theorem asserts that if a mapping is locally Lipschitz continuous on , then it is almost everywhere differentiable, i.e., the set has measure zero, where is the subset of points in for which is differentiable. Hence, although the Jacobian of in the classical sense might not exist everywhere, generalized differentiability notions, such as the -subdifferential and the generalized Jacobian of Clarke, can be employed to provide a local first-order approximation of .
Let be locally Lipschitz continuous at . The B-subdifferential (or limiting Jacobian) of at is
whereas the (Clarke) generalized Jacobian of at is
If is locally Lipschitz on then is a nonempty, convex and compact subset of by matrices, and as a set-valued mapping it is outer-semicontinuous at every . The next theorem shows that the elements of the generalized Jacobian of the proximal mapping are symmetric and positive semidefinite. Furthermore, it provides a bound on the magnitude of their eigenvalues.
Suppose that and . Every is a symmetric positive semidefinite matrix that satisfies .
Since is convex, its Moreau envelope is a convex function as well, therefore every element of is a symmetric positive semidefinite matrix (see e.g. [13, Sec. 8.3.3]). Due to (2.4), we have that , therefore
The last relation holds with equality (as opposed to inclusion in the general case) due to the fact that one of the summands is continuously differentiable. Now from (3.1) we easily infer that every element of is a symmetric matrix. Since is Lipschitz continuous with Lipschitz constant , using [27, Prop. 2.6.2(d)], we infer that every satisfies . Now, according to (3.1), it holds
On the other hand, since is Lipschitz continuous with Lipschitz constant 1, using [27, Prop. 2.6.2(d)] we obtain that , for all . ∎
An interesting property of , documented in the following proposition, is useful whenever is (block) separable, i.e., , , . In such cases every is a (block) diagonal matrix. This has favorable computational implications especially for large-scale problems. For example, if is the norm or the indicator function of a box, then the elements of (or ) are diagonal matrices with diagonal elements in (or in ).
Proposition 3.3 (separability).
If is (block) separable then every element of and is (block) diagonal.
Since is block separable, its proximal mapping has the form
The result follows directly by Definition 3.1. ∎
The following proposition provides a connection between the generalized Jacobian of the proximal mapping for a convex function and that of its conjugate, stemming from the celebrated Moreau’s decomposition [16, Th. 14.3].
Proposition 3.4 (Moreau’s decomposition).
Suppose that . Then
Using Moreau’s decomposition we have
The first result follows directly by Definition 3.1, since is expressed as the difference of two functions, one of which is continuously differentiable. The second result follows from the fact that, with a little abuse of notation,
Semismooth mappings  are precisely Lipschitz continuous mappings for which the generalized Jacobian (and consequenlty the -subdifferential) furnishes a first-order approximation.
Let be locally Lipschitz continuous at . We say that is semismooth at if
whereas is said to be strongly semismooth if can be replaced with .
We remark that the original definition of semismoothness given by  requires to be directionally differentiable at . The definition given here is the one employed by . Another worth spent remark is that can be replaced with the smaller set in Definition 3.5.
Fortunately, the class of semismooth mappings is rich enough to include proximal mappings of most of the functions arising in interesting applications. For example piecewise smooth () mappings are semismooth everywhere. Recall that a continuous mapping is if there exists a finite collection of smooth mappings , such that
The definition of mappings given here is less general than the one of, e.g., [31, Ch. 4] but it suffices for our purposes. For every we introduce the set of essentially active indices
In other words, contains only indices of the pieces for which there exists a full-dimensional set on which agrees with . In accordance to Definition 3.1, the generalized Jacobian of at is the convex hull of the Jacobians of the essentially active pieces, i.e., [31, Prop. 4.3.1]
A special but important class of convex functions whose proximal mapping is are piecewise quadratic (PWQ) functions. A convex function is called PWQ if can be represented as the union of finitely many polyhedral sets, relative to each of which is given by an expression of the form ( must necessarily be symmetric positive semidefinite) [17, Def. 10.20]. The class of PWQ functions is quite general since it includes e.g. polyhedral norms, indicators and support functions of polyhedral sets, and it is closed under addition, composition with affine mappings, conjugation, inf-convolution and inf-projection [17, Prop. 10.22, Proposition 11.32]. It turns out that the proximal mapping of a PWQ function is piecewise affine (PWA) [17, 12.30] ( is partitioned in polyhedral sets relative to each of which is an affine mapping), hence strongly semismooth [13, Prop. 7.4.7]. Another example of a proximal mapping that it is strongly semismooth is the projection operator over symmetric cones . We refer the reader to [33, 34, 35, 36] for conditions that guarantee semismoothness of the proximal mapping for more general convex functions.
3.2. Approximate generalized Hessian for
Having established properties of generalized Jacobians for proximal mappings, we are now in position to construct a generalized Hessian for that will allow the development of Newton-like methods with fast asymptotic convergence rates. The obvious route to follow is to assume that is semismooth and employ as a generalized Hessian for . However, semismoothness would require extra assumptions on . Furthermore, the form of is quite complicated involving third-order partial derivatives of . On the other hand, what is really needed to devise Newton-like algorithms with fast local convergence rates is a linear Newton approximation (LNA), cf. Definition 3.6, at some stationary point of , which by Theorem 2.2(iv) is also a minimizer of , provided that . The approach we follow is largely based on , [13, Prop. 10.4.4]. The following definition is taken from [13, Def. 7.5.13].
Let be continuous on . We say that admits a linear Newton approximation at a vector if there exists a set-valued mapping that has nonempty compact images, is upper semicontinuous at and for any
then we say that admits a strong linear Newton approximation at .
Arguably the most notable example of a LNA for semismooth mappings is the generalized Jacobian, cf. Definition 3.1. However, semismooth mappings can admit LNAs different from the generalized Jacobian. More importantly, mappings that are not semismooth may also admit a LNA. It turns out that we can define a LNA for at any stationary point, whose elements have a simpler form than those of , without assuming semismoothness of . We call it approximate generalized Hessian and it is given by
The key idea in the definition of , reminiscent to the Gauss-Newton method for nonlinear least-squares problems, is to omit terms vanishing at that contain third-order derivatives of . The following proposition shows that is indeed a LNA of at any .
Let , and . Then
if is semismooth at , then is a LNA for at ,
if is strongly semismooth at , and is locally Lipschitz around , then is a strong LNA for at .
See Appendix. ∎
The next proposition shows that every element of is a symmetric positive semidefinite matrix, whose eigenvalues are lower and upper bounded uniformly over all .
Any is symmetric positive semidefinite and satisfies
where , .
See Appendix. ∎
The next lemma shows uniqueness of the solution of (1.1) under a nonsingularity assumption on the elements of . Its proof is similar to [13, Lem. 7.2.10], however is not required to be locally Lipschitz around .
Let . Suppose that , is semismooth at and every element of is nonsingular. Then is the unique solution of (1.1). In fact, there exist positive constants and such that
See Appendix. ∎
4. Forward-Backward Newton-CG Methods
Having established the equivalence between minimizing and , as well as a LNA for , it is now very easy to design globally convergent Newton-like algorithms with fast asymptotic convergence rates, for computing a . Algorithm 1 is a standard line-search method for minimizing , where a conjugate gradient method is employed to solve (approximately) the corresponding regularized Newton system. Therefore our algorithm does not require to form an element of the generalized Hessian of explicitly. It only requires the computation of the corresponding matrix-vector product and is thus suitable for large-scale problems. Similarly, there is no need to form explicitly the Hessian of , in order to compute the directional derivative needed in the backtracking procedure for computing the stepsize (4.4); only matrix-vector products with are required. Under nonsingularity of the elements of , eventually the stepsize becomes equal to 1 and Algorithm 1 reduces to a regularized version of the (undamped) linear Newton method [13, Alg. 7.5.14] for solving .
The next theorem delineates the basic convergence properties of Algorithm 1.
Every accumulation point of the sequence generated by Algorithm 1 belongs to .
We will first show that the sequence is gradient related to [20, Sec. 1.2]. That is, for any subsequence that converges to a nonstationary point of , i.e.,
the corresponding subsequence is bounded and satisfies
Without loss of generality we can restrict to subsequences for which , for all . Suppose that is one such subsequence. Due to (4.3a), we have for all . Matrix is positive semidefinite due to Proposition 3.8, therefore is nonsingular for all and
Now, direction satisfies
where . Therefore
As , the right hand side of (4.9) converges to , which is either a finite negative number (if is finite) or . In any case, this together with (4.9) confirm that (4.6) is valid as well, proving that is gradient related to . All the assumptions of [20, Prop. 1.2.1] hold, therefore every accumulation point of converges to a stationary point of , which by Theorem 2.2(iv) is also a minimizer of . ∎
The next theorem shows that under a nonsingularity assumption on , the asymptotic rate of convergence of the sequence generated by Algorithm 1 is at least superlinear.
Suppose that is an accumulation point of the sequence generated by Algorithm 1. If is semismooth at and every element of is nonsingular, then the entire sequence converges to and the convergence rate is Q-superlinear. Furthermore, if is strongly semismooth at and is locally Lipschitz continuous around then converges to with Q-order at least .
Theorem 4.1 asserts that must be a stationary point for . Due to Proposition 3.7, is a LNA of at . Due to Lemma 3.9, is the globally unique minimizer of . Therefore, by Theorem 4.1 every subsequence must converge to this unique accumulation point, implying that the entire sequence converges to . Furthermore, for any
where the second inequality follows from and Lemma A.2 (in the Appendix).
We know that satisfies