Local and Global Convergence of an Inertial Version of Forward-Backward Splitting

Note: The proofs of Theorems 4.1, 5.1, 5.2, and 5.6 of this manuscript contain several errors. These errors have been fixed in a revised and rewritten manuscript entitled "Local and Global Convergence of a General Inertial Proximal Splitting Scheme", arXiv id 1602.02726. We recommend reading the updated manuscript.

Patrick R. Johnstone and Pierre Moulin, Beckman Institute, University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801, USA (contact: prjohns2@illinois.edu)
Abstract

A problem of great interest in optimization is to minimize the sum of two closed, proper, and convex functions where one is smooth and the other has a computationally inexpensive proximal operator. In this paper we analyze a family of Inertial Forward-Backward Splitting (I-FBS) algorithms for solving this problem. We first apply a global Lyapunov analysis to I-FBS and prove weak convergence of the iterates to a minimizer in a real Hilbert space. We then show that the algorithms achieve local linear convergence for "sparse optimization", the important special case in which the nonsmooth term is the $\ell_1$-norm. This result holds under either a restricted strong convexity assumption or a strict complementarity condition, and we do not require the objective to be strictly convex. For certain parameter choices we determine an upper bound on the number of iterations until the iterates are confined to a manifold containing the solution set and linear convergence holds.

The local linear convergence result for sparse optimization holds for the Fast Iterative Shrinkage and Soft Thresholding Algorithm (FISTA) of Beck and Teboulle, which corresponds to a particular parameter choice for I-FBS. In spite of its optimal global objective-function convergence rate, we show that FISTA is not optimal for sparse optimization with respect to the local convergence rate. We determine the locally optimal parameter choice for the I-FBS family. Finally, we propose a method which inherits the excellent global rate of FISTA but also has an excellent local rate.

Key words. proximal gradient methods, forward-backward splitting, inertial methods, $\ell_1$-regularization, local linear convergence

AMS subject classifications. 65K05, 65K15, 90C06, 90C25

1 Introduction

We are concerned with the following important problem:

\min_{x \in \mathcal{H}} F(x) := f(x) + g(x),   (1.1)

where $\mathcal{H}$ is a Hilbert space over the real numbers, the functions $f$ and $g$ are proper, convex, and closed, and in addition $f$ is Gâteaux differentiable and has a Lipschitz continuous gradient. Problems of this form have received considerable attention in recent years in applications such as machine learning [1, 2], compressed sensing [3, 4] and image processing [5, 6], among many other examples. Of particular interest in this paper is the special case in which $g$ is a multiple of the $\ell_1$-norm, which we will call sparse optimization (SO):

\min_{x \in \mathbb{R}^n} f(x) + \lambda \|x\|_1,   (SO)

where $f$ is as in (1.1), $\lambda > 0$, and $\mathcal{H} = \mathbb{R}^n$. We refer to this problem as "sparse optimization" because the $\ell_1$-norm encourages sparse solutions. When $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ with $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, Problem SO is often referred to as sparse least squares (Problem $\ell_1$-LS), basis pursuit denoising, or the LASSO. This problem is of central importance in compressed sensing and also has applications in machine learning [7] and image processing [8]. Other important instances of Problem (1.1) include least squares with a total-variation [9] or nuclear-norm [10] regularizer, and smooth minimization constrained to a closed and convex set.
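To make the objective concrete, the following minimal Python/numpy sketch evaluates the $\ell_1$-LS cost on a random instance; the matrix A, vector b and weight lam are illustrative placeholders, not data used in this paper.

```python
import numpy as np

def l1_ls_objective(x, A, b, lam):
    """Evaluate 0.5*||A x - b||^2 + lam*||x||_1 (the l1-LS cost)."""
    r = A @ x - b
    return 0.5 * r @ r + lam * np.abs(x).sum()

# Illustrative random instance (not from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
print(l1_ls_objective(np.zeros(50), A, b, lam=0.1))
```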

1.1 Background

In this paper we focus on first-order splitting methods for solving Problem (1.1). These methods use evaluations of the objective, gradients of the smooth part $f$, and evaluations of the proximal operator of the nonsmooth part $g$. In particular we focus on the forward-backward splitting algorithm (FBS), which is a classical first-order splitting approach to solving Problem (1.1) [11, 12]. In fact FBS was developed for the more general monotone inclusion problem, which includes Problem (1.1) as a special case. FBS involves a "forward" step, which is an explicit gradient step with respect to the differentiable component $f$, and a "backward" step, which is an implicit, proximal step with respect to $g$. For many popular instances of $g$ this proximal step is computationally inexpensive [13]. The convergence rate of the objective function to the infimum is $O(1/k)$, which is better than the $O(1/\sqrt{k})$ rate achieved by the "black-box" subgradient method, and is the same as if the possibly nonsmooth component $g$ were not present. Weak convergence of the iterates is also guaranteed, and linear convergence occurs on strongly convex problems [14]. FBS is also commonly referred to as the proximal gradient method [15], and for the special case of Problem SO it is known as the iterative shrinkage and soft thresholding algorithm (ISTA), owing to the form of the proximal step w.r.t. the $\ell_1$-norm [16, 17, 18]. Other first-order splitting methods include ADMM [19], linearized and preconditioned ADMM [20], primal-dual methods [21], Bregman iterations [22] and generalized FBS [23]. These methods can deal with more complicated situations, such as when $g$ is composed with a bounded linear operator or when a sum of proximable (i.e., possessing a simple proximal operator) functions is present.
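As an illustration of the forward-backward step just described, here is a minimal Python sketch; grad_f and prox_g are assumed user-supplied oracles (the names and the interface prox_g(v, step) are ours, not the paper's).

```python
def fbs_step(x, grad_f, prox_g, step):
    """One forward-backward (proximal gradient / ISTA) step."""
    y = x - step * grad_f(x)   # forward: explicit gradient step on the smooth part
    return prox_g(y, step)     # backward: implicit proximal step on the nonsmooth part
```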

Nesterov developed several methods for minimizing a convex function with Lipschitz gradient ([24], [25] Chapter 2). These methods obtain the best objective-function convergence rate possible by any first-order method. Specifically, they guarantee a convergence rate of $O(1/k^2)$ for the objective function, which is optimal in the worst-case sense for convex functions with Lipschitz gradient. Note that this improves the $O(1/k)$ rate achieved by classical gradient descent.

In [16], Beck and Teboulle extended Nesterov's method of [25] to Problem (1.1), allowing for the presence of the possibly nonsmooth function $g$. Their method, FISTA, combines Nesterov's inertial update with an FBS framework using the same sequence of "momentum" parameters. FISTA corresponds to a particular parameter choice for the following suite of algorithms, which we will call Inertial Forward-Backward Splitting (I-FBS):

y_k = x_k + \alpha_k (x_k - x_{k-1}), \qquad x_{k+1} = \mathrm{prox}_{\lambda_k g}\big( y_k - \lambda_k \nabla f(y_k) \big),

with the initial points chosen arbitrarily. The sequences $\{\alpha_k\}$ and $\{\lambda_k\}$ are the momentum and step-size parameters of the method. The proximal operator will be properly defined in Section 2.2. Beck and Teboulle showed that for a specific choice of $\alpha_k$ and $\lambda_k$, I-FBS obtains the optimal $O(1/k^2)$ rate in terms of the objective function; however, they did not prove convergence of the iterates to a minimizer, which is also unknown for Nesterov's method. Tseng [26] showed that other choices also achieve the $O(1/k^2)$ rate. Recently in [27], Chambolle and Dossal considered a choice of parameters very similar to that of Beck and Teboulle which obtains the $O(1/k^2)$ rate in the objective function and also weak convergence of the iterates to a minimizer. Throughout the rest of the paper we will refer to all parameter choices for I-FBS that obtain the $O(1/k^2)$ objective-function rate as FISTA-like choices. Note that FBS corresponds to I-FBS with the momentum parameters $\alpha_k$ set to $0$ for all $k$ and step-sizes $\lambda_k$ in the range $(0, 2/L)$, where $L$ is the Lipschitz constant of $\nabla f$. Nesterov's method of [25] corresponds to I-FBS with the same parameter choice as FISTA and with $g \equiv 0$.
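A minimal sketch of the I-FBS recursion as written above, with the gradient evaluated at the extrapolated point; grad_f and prox_g are assumed oracles, and the variable and parameter names follow our notation rather than anything prescribed by the paper.

```python
def ifbs(x0, grad_f, prox_g, steps, momenta, n_iter):
    """Inertial forward-backward splitting (sketch).

    steps[k] and momenta[k] are the step-size and momentum parameters at
    iteration k; x0 is the arbitrary starting point (x_{-1} = x_0 in this sketch).
    """
    x_prev, x = x0, x0
    for k in range(n_iter):
        y = x + momenta[k] * (x - x_prev)                       # inertial extrapolation
        x_prev, x = x, prox_g(y - steps[k] * grad_f(y), steps[k])
    return x
```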

One of the aims of this paper is to establish broad conditions for the convergence of the iterates of I-FBS to a minimizer of Problem (1.1). A generalization of the I-FBS family has been studied previously in [28] in the setting of monotone operator inclusion problems; however, our global analysis proves convergence for a wider range of parameter choices than was established there. An algorithm similar to I-FBS was developed in [29] for the more general problem of finding a fixed point of a nonexpansive operator; however, the conditions for convergence given there are far more strict than those developed in this paper. To the best of our knowledge, the conditions for weak convergence of the iterates of I-FBS developed in this paper are novel in the literature. A more detailed comparison with the existing literature is given in Section 3.

It has been observed that, for the special case of Problem SO, FBS exhibits local linear convergence (see e.g. [17, 18, 30, 31]), elsewhere called eventual linear convergence [32]. By this it is meant that after a finite number of iterations the iterates are confined to a manifold containing the solution set and convergence to a solution is linear. It is not known whether I-FBS (including the FISTA-like choices) obtains local linear convergence for Problem SO, although recently [33] has made progress for the special case of Problem $\ell_1$-LS. In this paper we address this question by establishing local linear convergence of I-FBS for Problem SO for a broad range of parameter choices, including the FISTA-like choices. Of course, local linear convergence implies convergence of the entire sequence of iterates.

1.2 Contributions of this Paper

In the first part of the paper, we analyze I-FBS with an appropriate multi-step Lyapunov function. This approach allows us to develop novel conditions on the algorithmic parameters that imply convergence of the iterates to a minimizer (weak convergence in a real Hilbert space, ordinary convergence in $\mathbb{R}^n$). This widens the range of possible parameter choices beyond those proposed in prior art such as [28].

In the second part of the paper, we consider in detail the behavior of I-FBS applied to Problem SO. We show that after a finite number of iterations I-FBS reduces to minimizing a local function on a reduced support subject to an orthant constraint. This result holds for the FISTA-like choices along with a wide range of other parameter choices. Next we show that a simple "locally optimal" parameter choice for I-FBS obtains a local linear convergence rate with the best asymptotic iteration complexity. This asymptotically optimal iteration complexity is better than that obtained by the FISTA-like choices and by ISTA. The improvement gained by I-FBS over ISTA when the correct amount of momentum is added is equivalent to the improvement that Nesterov's accelerated method [25] achieves over gradient descent for strongly convex functions with Lipschitz gradients. As a corollary of our analysis, we show that the adaptive momentum restart scheme proposed in [34] achieves the optimal iteration complexity. In contrast, the analysis in [34] is only valid for strongly convex quadratic functions. Finally, for parameter choices in which the momentum parameter is bounded away from $1$, we determine an explicit upper bound on the number of iterations until convergence to the optimal manifold.

With little effort, our analysis of I-FBS for Problem SO can be adapted to the splitting inertial proximal method (SIPM) proposed by Moudafi and Oliny [35]. This method is a direct generalization of the heavy ball with friction method (HBF) [36] to proximal splitting problems and differs from I-FBS in that the gradient of $f$ is computed at $x_k$ rather than at the extrapolated point $y_k$. We show that SIPM also achieves local linear convergence for this problem under appropriate parameter constraints.
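The distinction between SIPM and I-FBS is only in where the gradient is evaluated; a one-line sketch (same assumed oracles and notation as in the sketches above) makes this explicit.

```python
def sipm_step(x, x_prev, grad_f, prox_g, step, momentum):
    """One SIPM (heavy-ball-type) step: inertia in the proximal argument,
    but the gradient is taken at the current iterate x, not at the
    extrapolated point used by I-FBS."""
    y = x + momentum * (x - x_prev)
    return prox_g(y - step * grad_f(x), step)
```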

The paper is organized as follows. In Section 2, notation and assumptions are discussed. In Section 3, we precisely define the I-FBS family and discuss known convergence results in more detail. In Section 4, we apply our Lyapunov analysis to I-FBS. In Section 5, we derive convergence results for Problem SO. Finally, numerical experiments are presented in Section 6.

2 Preliminaries

2.1 Notation and Definitions

Throughout the paper, $\mathcal{H}$ is a Hilbert space over the field of real numbers, $\langle \cdot, \cdot \rangle$ is the inner product and $\| \cdot \|$ is the associated norm. Let $\Gamma(\mathcal{H})$ be the set of all closed, convex and proper functions whose domain is a subset of $\mathcal{H}$ and whose range is a subset of $(-\infty, +\infty]$. For any $h \in \Gamma(\mathcal{H})$ and point $x \in \mathcal{H}$, we denote by $\partial_{\epsilon} h(x)$ the $\epsilon$-enlargement of the subdifferential, defined as the set

\partial_{\epsilon} h(x) := \{ u \in \mathcal{H} : h(z) \geq h(x) + \langle u, z - x \rangle - \epsilon \ \text{ for all } z \in \mathcal{H} \},   (2.1)

which is always convex and closed and may be empty. We will use $\partial h(x)$ to denote $\partial_{0} h(x)$. When this set is a singleton we will call its element the gradient at $x$, denoted by $\nabla h(x)$.

For nonnegative sequences, the notation $a_k = O(b_k)$ (resp. $a_k = \Omega(b_k)$) means there exists a constant $C > 0$ such that $a_k \leq C b_k$ (resp. $a_k \geq C b_k$) for all sufficiently large $k$. The notation $a_k = \Theta(b_k)$ means $a_k = O(b_k)$ and $a_k = \Omega(b_k)$. We will say a sequence converges linearly to its limit with rate of convergence $c \in (0,1)$ if the ratio of successive distances to the limit is asymptotically bounded above by $c$. To be precise, we will occasionally refer to this as asymptotic, or local, linear convergence. Note that this is different from nonasymptotic, or global, linear convergence with rate $c$, in which case the distance to the limit is bounded by a constant multiple of $c^k$ for all $k$. In contrast, local linear convergence allows for a finite number of iterations where such a relationship does not hold.

Define the optimal value of Problem (1.1) as $F^\star := \inf_{x \in \mathcal{H}} F(x)$,

and the solution set as $X^\star := \{ x \in \mathcal{H} : F(x) = F^\star \}$.

Given a function $F$, we say that the iteration complexity of a method for minimizing $F$ is $N(\epsilon)$ if $k \geq N(\epsilon)$ implies $F(x_k) - F^\star \leq \epsilon$. To be precise, we will occasionally refer to this as the asymptotic iteration complexity.

For a matrix $A$ and a set of indices $T$, $A_T$ will denote the matrix formed by taking the columns of $A$ corresponding to the elements of $T$. For a vector $x$, $x_T$ will denote the vector with entries given by the entries of $x$ on the indices corresponding to the elements of $T$, and $[x]_T$ will denote the vector of the same length as $x$ equal to $x$ on the indices corresponding to $T$ and equal to zero everywhere else. For a nonzero scalar $t$, $\mathrm{sgn}(t)$ is defined as $1$ if $t > 0$ and $-1$ if $t < 0$; for vectors, $\mathrm{sgn}$ is applied element-wise. We will use the notation $\mathrm{supp}(x)$ for the set of indices on which $x$ is nonzero.
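In numpy terms, the indexing and sign notation above corresponds to the following illustrative sketch with placeholder data (note that numpy's sign convention assigns 0 to zero entries).

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)        # a 3x4 matrix
x = np.array([1.5, 0.0, -2.0, 0.0])
T = [0, 2]                               # an index set

A_T = A[:, T]                            # columns of A indexed by T
x_T = x[T]                               # entries of x on T
x_on_T = np.zeros_like(x)
x_on_T[T] = x[T]                         # x on T, zero elsewhere
signs = np.sign(x)                       # elementwise sign
support = np.flatnonzero(x)              # indices where x is nonzero
print(A_T.shape, x_T, x_on_T, signs, support)
```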

2.2 Proximal Operators

The proximal operator w.r.t. a function $h \in \Gamma(\mathcal{H})$ and a parameter $\lambda > 0$ is defined implicitly by $\mathrm{prox}_{\lambda h} := (I + \lambda \partial h)^{-1}$,

and explicitly by

\mathrm{prox}_{\lambda h}(y) = \operatorname*{arg\,min}_{x \in \mathcal{H}} \Big\{ h(x) + \frac{1}{2\lambda} \| x - y \|^2 \Big\}.   (2.2)

Since the function being minimized in (2.2) is strongly convex and in $\Gamma(\mathcal{H})$, the minimizer exists and is unique for every $y \in \mathcal{H}$; thus $\mathrm{prox}_{\lambda h}$ is a well-defined mapping with domain equal to $\mathcal{H}$. To be more general, we will actually use the $\epsilon$-enlarged proximal operator, which is the set

\mathrm{prox}_{\lambda h}^{\epsilon}(y) := \{ x \in \mathcal{H} : y - x \in \lambda \partial_{\epsilon} h(x) \},

which is not necessarily uniquely defined (except when $\epsilon = 0$) and contains $\mathrm{prox}_{\lambda h}(y)$ for all $y \in \mathcal{H}$. The use of $\epsilon > 0$ allows for some approximation error in the computation of the proximal operator.
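The minimization in (2.2) can be carried out numerically to sanity-check a closed-form proximal operator. The sketch below uses scipy's general-purpose minimizer with a placeholder function g (here a scaled $\ell_1$-norm); it only illustrates the definition and is not how the proximal step would be computed in practice.

```python
import numpy as np
from scipy.optimize import minimize

def prox_numeric(g, y, step=1.0):
    """Approximate prox_{step*g}(y) = argmin_x { g(x) + ||x - y||^2 / (2*step) }."""
    obj = lambda x: g(x) + np.sum((x - y) ** 2) / (2.0 * step)
    return minimize(obj, y, method="Nelder-Mead").x

# Placeholder g: a scaled l1-norm.  The result should be close to
# soft-thresholding y at level 0.5 (see Section 2.4).
g = lambda x: 0.5 * np.abs(x).sum()
print(prox_numeric(g, np.array([1.0, -0.2, 0.7])))   # approx [0.5, 0.0, 0.2]
```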

2.3 Cocoercivity and Convexity

We say that a Gâteaux differentiable and convex function $f$ has a $1/L$-cocoercive gradient with $L > 0$, if

\langle \nabla f(x) - \nabla f(y), x - y \rangle \geq \tfrac{1}{L} \| \nabla f(x) - \nabla f(y) \|^2 \quad \text{for all } x, y \in \mathcal{H}.   (2.3)

Note this is equivalent to the gradient being $L$-Lipschitz continuous, i.e.

\| \nabla f(x) - \nabla f(y) \| \leq L \| x - y \| \quad \text{for all } x, y \in \mathcal{H}.   (2.4)

For a proof see [37] Lemma 1.4 and the Baillon-Haddad Theorem [38]. We will need the following two standard properties of such a function. For all $x, y \in \mathcal{H}$:

f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{L}{2} \| y - x \|^2,   (2.5)

and (by convexity)

f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle.   (2.6)

We are now ready to formally state our assumptions for Problem (1.1).

Assumption 1. The functions $f$ and $g$ are in $\Gamma(\mathcal{H})$, and $f$ is Gâteaux differentiable everywhere and has a $1/L$-cocoercive gradient with $L > 0$.

2.4 Properties of Sparse Optimization

We now outline our assumptions for Problem SO and discuss some of its properties.

Assumption SO. $\mathcal{H} = \mathbb{R}^n$, $f$ is twice differentiable everywhere and has a $1/L$-cocoercive gradient with $L > 0$, $\lambda > 0$, and the solution set $X^\star$ is non-empty.

The main difference between Assumption SO and Assumption 1 is that we additionally assume that $f$ is twice differentiable. Let $\nabla^2 f(x)$ denote the Hessian of $f$ at $x$. Then the Lipschitz constant $L$ of the gradient is equal to the supremum of the largest eigenvalue of $\nabla^2 f(x)$ over all $x \in \mathbb{R}^n$. Furthermore, note that $\lambda \| \cdot \|_1$ is in $\Gamma(\mathbb{R}^n)$. Finally, note that whenever the objective $F$ is coercive, the solution set $X^\star$ is non-empty.

Problem SO includes Problem $\ell_1$-LS, defined as

\min_{x \in \mathbb{R}^n} \frac{1}{2} \| A x - b \|_2^2 + \lambda \| x \|_1,

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. The solution set of Problem $\ell_1$-LS is always non-empty. The function $f(x) = \frac{1}{2} \| A x - b \|_2^2$ has gradient equal to $A^\top (A x - b)$, which is Lipschitz continuous with Lipschitz constant equal to the largest eigenvalue of $A^\top A$.
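In code, this Lipschitz constant is the squared spectral norm of A (equivalently, the largest eigenvalue of $A^\top A$); a one-line numpy check with a placeholder matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))   # placeholder data matrix
L = np.linalg.norm(A, 2) ** 2       # largest eigenvalue of A^T A
print(L)
```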

The proximal operator associated with $\lambda \| \cdot \|_1$ is the shrinkage and soft-thresholding operator $S_{\lambda}$, applied element-wise. It is defined for scalars as $S_{\lambda}(t) := \mathrm{sgn}(t) \max(|t| - \lambda, 0)$, and thus

\mathrm{prox}_{\lambda \| \cdot \|_1}(y) = S_{\lambda}(y) \quad \text{for all } y \in \mathbb{R}^n.   (2.7)
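A direct numpy implementation of the shrinkage and soft-thresholding operator in (2.7):

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise shrinkage: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Entries with magnitude at most tau are set exactly to zero,
# which is what drives the sparsity of the iterates.
print(soft_threshold(np.array([1.0, -0.2, 0.7]), 0.5))   # [0.5, 0.0, 0.2]
```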

In the analysis of I-FBS applied to Problem SO we will need the following result proved in [17].

Theorem 2.1 (Theorem 2.1 [17])

For Problem SO, suppose Assumption SO holds. Then there exists a vector $\bar{v}$ such that $\nabla f(x^\star) = \bar{v}$ for all $x^\star \in X^\star$. Furthermore, for all $x^\star \in X^\star$,

The following two sets, also used in [17], will be crucial to our analysis. Let and Note that and . By Theorem 2.1, we can infer that for all . Finally, define

We will need the following Lemma proved in [17].

Lemma 2.2 (Lemma 4.1 [17])

Under Assumption SO, if ,

An alternative definition of cocoercivity is to say that if an operator $B$ is $1/L$-cocoercive then $\frac{1}{L} B$ is firmly nonexpansive. Thus Lemma 2.2 is just an elementary property of firmly nonexpansive operators (see Proposition 4.2 (iii) and Proposition 4.33 of [39]).

Finally, the following properties of the soft-thresholding operator will be useful.

Lemma 2.3 (Lemma 3.2 [17])

Fix any and in , and :

  • The soft-thresholding operator is nonexpansive. That is,

  • If and then

  • If then and

    (2.8)

3 I-FBS

To be more general, our global analysis will apply to the following I-FBS-$\epsilon$ family.

with chosen arbitrarily. Note that for any ,

We will refer to $\{\alpha_k\}$ as the "momentum" parameters and $\{\lambda_k\}$ as the "step-size" parameters. The algorithm differs from I-FBS in that it uses the $\epsilon$-enlarged subdifferential, allowing for some error in the computation of the proximal operator.

3.1 Known Convergence Results

Beck and Teboulle [16] proposed the following choice of parameters for I-FBS (that is, I-FBS-$\epsilon$ with the errors set to zero for all iterations):

t_1 = 1, \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \qquad \alpha_k = \frac{t_k - 1}{t_{k+1}}, \qquad \lambda_k = \frac{1}{L}.   (3.1)

The method is known as FISTA. With this choice of parameters, Beck and Teboulle showed that the objective function converges to the minimum at the worst-case optimal rate of $O(1/k^2)$. In fact the rate holds for a variety of choices of the sequence $\{t_k\}$ [26]. However, the choice in (3.1) guarantees the largest possible decrease in a given upper bound on the objective at each iteration. Chambolle and Dossal [27] considered I-FBS with a choice of momentum parameters similar to that proposed by Beck and Teboulle. They investigated, for some $a > 2$,

t_k = \frac{k + a - 1}{a}, \qquad \alpha_k = \frac{t_k - 1}{t_{k+1}} = \frac{k - 1}{k + a}, \qquad \lambda_k = \frac{1}{L}.   (3.2)

With this choice of parameters, the authors showed that the objective function achieves the optimal $O(1/k^2)$ convergence rate and, in addition, the iterate sequence weakly converges to a minimizer.
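For illustration, the following sketch generates the two momentum sequences, the FISTA recursion (3.1) and the rule in (3.2); both tend to $1$ as $k$ grows, which is the behavior discussed in Section 3.2. The formulas are as reconstructed above, and the function name and interface are ours.

```python
import numpy as np

def momentum_sequence(n, a=None):
    """Return alpha_1..alpha_n for FISTA (a=None) or the rule (k-1)/(k+a)."""
    alphas = []
    t = 1.0
    for k in range(1, n + 1):
        if a is None:                                     # FISTA recursion (3.1)
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            alphas.append((t - 1.0) / t_next)
            t = t_next
        else:                                             # rule (3.2)
            alphas.append((k - 1.0) / (k + a))
    return np.array(alphas)

print(momentum_sequence(5))          # 0, 0.28, 0.43, ... -> tends to 1
print(momentum_sequence(5, a=3.0))   # 0, 0.20, 0.33, ... -> tends to 1
```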

In contrast to [16] and [27], our analysis establishes weak convergence of the iterates for a wide range of parameter choices. Indeed, the momentum sequence is not constrained to follow a particular recursion, but instead only needs to satisfy the conditions given in Section 4. However, we do not guarantee the $O(1/k^2)$ objective-function rate.

Lorenz and Pock [28] generalized I-FBS to the problem of finding a zero of the sum of two maximal monotone operators, one of which is cocoercive. Setting these operators to $\nabla f$ and $\partial g$ recovers Problem (1.1). They also replaced the scalar step-size with a general positive definite operator. Lorenz and Pock proved weak convergence of the iterates to a solution provided certain restrictions on the momentum and step-size parameters hold. The restrictions on the momentum parameters are stronger than those derived in our global analysis: for a fixed step-size, the momentum parameter is restricted to a smaller range than the one our Lyapunov analysis of Section 4 allows. For the step-size, their conditions are less restrictive than ours, allowing larger values than our analysis does; however, in their analysis larger step-sizes lead to a smaller range of feasible momentum values, which shrinks to zero as the step-size approaches its upper limit.

In [29], an inertial version of the classical Krasnosel’skiĭ-Mann (KM) algorithm was analyzed. The KM algorithm finds the fixed points of a nonexpansive operator. Setting this operator to the forward-backward operator $\mathrm{prox}_{\lambda g}(I - \lambda \nabla f)$ in the inertial KM method of [29] recovers I-FBS, since a point is a fixed point of this operator if and only if it is a solution of Problem (1.1). The analysis of [29] proves weak convergence of the iterates to a fixed point, but relies on verifying a summability condition that involves the iterates themselves. In general this condition must be enforced online, restricting the range of possible choices for the sequence of momentum parameters. However, it was shown in [40] that choosing the momentum sequence to be nondecreasing and bounded above by a sufficiently small constant suffices to ensure the condition is satisfied and thus prove weak convergence. This condition is more restrictive than the ones derived in this paper for the special case of Problem (1.1).

3.2 Known Convergence Results for Sparse Optimization

The FISTA-like momentum sequences defined in (3.1) and (3.2) both converge to $1$. As we will see in Section 5, this is not desirable for Problem SO. In the language of dynamical systems, when the momentum is too high the iterates move into an "underdamped regime", leading to oscillations in the objective function and slow convergence (see [34] for an analysis in the strongly convex quadratic case). We will show that for Problem SO the FISTA-like choices are not optimal from the viewpoint of the asymptotic rate of convergence, under either a local strong-convexity assumption or a strict complementarity condition (see the corollaries in Section 5).

In [33], the behavior of ISTA and FISTA (i.e. I-FBS with parameter choice (3.1)) applied to Problem $\ell_1$-LS was investigated through a spectral analysis. The authors show that both algorithms obtain local linear convergence for this problem, under the condition that the minimizer is unique, but without an estimate for the number of iterations until convergence to the optimal manifold. Furthermore, they determine that the local rate of convergence of FISTA is worse than that of ISTA, while the transient behavior of FISTA is better than that of ISTA. Therefore they suggest switching from FISTA to ISTA once the optimal manifold has been identified. Our contribution differs in several ways. We note that the poor local performance of the FISTA-like choices is due to the momentum parameter converging to $1$. We therefore determine the optimal value of the momentum parameter to be used in the asymptotic regime, which allows for a better asymptotic rate than both ISTA and FISTA, and we suggest a heuristic method for estimating the optimal momentum. We also show that the adaptive restart method of [34] achieves the $O(1/k^2)$ rate in the transient regime and the optimal asymptotic rate. Furthermore, our analysis holds for Problem SO, with Problem $\ell_1$-LS as a special case, and we do not require the minimizer to be unique. Finally, in the case where the momentum parameters are bounded away from $1$, we provide explicit upper bounds on the number of iterations until I-FBS has converged to the optimal manifold.
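The adaptive momentum restart scheme of [34] referenced above can be sketched as follows; this is our paraphrase of the common function-value restart test (reset the momentum whenever the objective increases), with grad_f, prox_g and the objective F assumed oracles, not the authors' code.

```python
def ifbs_with_restart(x0, grad_f, prox_g, F, step, n_iter):
    """I-FBS with FISTA momentum and function-value adaptive restart (sketch)."""
    x_prev, x, t = x0, x0, 1.0
    for _ in range(n_iter):
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_new = prox_g(y - step * grad_f(y), step)
        if F(x_new) > F(x):                              # objective went up: restart momentum
            t_next = 1.0
            x_new = prox_g(x - step * grad_f(x), step)   # plain FBS step instead
        x_prev, x, t = x, x_new, t_next
    return x
```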

In [41] a method was developed for solving Problem (1.1) when $f$ is strongly convex. The method is equivalent to I-FBS with the same prescription for the momentum parameter as determined by Nesterov for his method for minimizing strongly convex functions (constant scheme 2.2.8 of [25]). However, it also includes a backtracking procedure for adjusting the parameters when the strong convexity and Lipschitz gradient constants are not known. The authors of [41] also extended their method to Problem $\ell_1$-LS, including the case where $f$ is not strongly convex. The authors showed that under conditions on the matrix $A$ related to the Restricted Isometry Property (RIP) used in compressed sensing, their algorithm obtains nonasymptotic (global) linear convergence, so long as the initial vector is sufficiently sparse. However, as the authors note, the RIP-like conditions are much stronger than those typically found in the literature. Indeed, the conditions are much stronger than those required in our proof of local linear convergence. We establish that I-FBS obtains local linear convergence regardless of the initialization point. Furthermore, no RIP-like assumptions are necessary. Local linear convergence can be proved under the mild condition that the smallest eigenvalue of the Hessian restricted to the support of a minimizer is non-zero at that minimizer, or, if this does not hold, under a common strict-complementarity condition (see Section 5). That being said, it should be noted that local linear convergence is not as strong a statement as global linear convergence.

4 A Global Analysis of I-FBS

This section derives conditions on the momentum parameters, the step-sizes, and the error sequence which imply weak convergence of the iterates of I-FBS to a minimizer of Problem (1.1). Throughout the rest of the paper, let denote . Given , define .

Theorem 4.1

Suppose that Assumption 1 holds. Assume is non-decreasing and satisfies for all , and satisfies for all and . If for all and , then for the iterates of I-FBS-, we have

  1. .

  2. .

  3. If, in addition, $X^\star$ is non-empty, then the sequence of iterates converges weakly to some element of $X^\star$.

Proof. The proof consists of two parts. In the first, we prove statements 1 and 2 using arguments inspired by Alvarez's analysis of the inertial proximal method in [42]. In the second part, we invoke Opial's Lemma [43] to prove statement 3. The second part is inspired by the analysis of the splitting inertial proximal algorithm by Moudafi and Oliny in [35].

Proof of statements 1 and 2.

Define the Lyapunov function, or discrete energy, to be

Note that this is the same energy function used by Alvarez [44]. The three inequalities established earlier imply

(4.1)

Note that the existence of a subgradient is guaranteed because the $\epsilon$-enlarged proximal operator has domain equal to $\mathcal{H}$. Using (4.1), the update equations of I-FBS, and the fact that , we infer that

Moving terms to the other side and summing implies, for all ,

(4.2)

Inequality (4.2), along with the assumptions on the parameter sequences, implies statement 1. Statement 1 implies , therefore via the update of I-FBS. This implies that , because . Finally, using the update of I-FBS we infer that

(4.3)

Proof of statement 3.

If is a subsequence which weakly converges to , then the -update of I-FBS implies also weakly converges to . This, combined with the -update implies that . Suppose that for any , the sequence has a limit. This implies the sequence is bounded and therefore it has at least one weakly-convergent subsequence, (ordinary convergence in ). By the above reasoning the limit of this subsequence, must be in . Furthermore exists. Consider another subsequence which converges to . By considering the fact that and the corresponding statement for , one can see that . Therefore the set of weakly convergent subsequences is the singleton . Thus weakly converges to (This is Opial’s Lemma [43]).

Assume $X^\star$ is non-empty. We now proceed to show that, for any , the sequence has a limit. Our proof closely follows Moudafi and Oliny's analysis [35], and is similar to the later variants [27, 28]. The main difference is that we allow for to be for a finite number of iterations. Fix and define . Now

(4.4)

Since

and , it follows that

(4.5)

Combining (4.4) and (4.5) we obtain

(4.6)

Now

(4.7)

Combining (4.6) and (4.7) yields

(4.8)

Now we use the fact that $\nabla f$ is cocoercive, as follows. Inequality (2.3) implies

(4.9)

where (4.9) follows by completing the square. Combining (4.8) and (4.9) we infer

(4.10)

Note that the coefficients of and are non-positive. Set and

(4.11)

and note that .

The argument from now on is basically identical to [35], except that we allow for sequences which are equal to for a finite number of iterations. Restate (4.10) as

(4.12)

Since , there exists an integer and such that for all . This and (4.12) imply that, for

Thus for

Careful examination of this expression yields

(4.13)

Set . Since and , it is bounded from below. It is non-increasing, and therefore it converges. Therefore converges for every