Local and Global Convergence of an Inertial Version of Forward-Backward Splitting
†The proofs of Thms. 4.1, 5.1, 5.2, and 5.6 of this manuscript contain several errors. These errors have been fixed in a revised and rewritten manuscript entitled “Local and Global Convergence of a General Inertial Proximal Splitting Scheme”, arXiv id 1602.02726. We recommend reading this updated manuscript.
Abstract
A problem of great interest in optimization is to minimize a sum of two closed, proper, and convex functions where one is smooth and the other has a computationally inexpensive proximal operator. In this paper we analyze a family of Inertial Forward-Backward Splitting (IFBS) algorithms for solving this problem. We first apply a global Lyapunov analysis to IFBS and prove weak convergence of the iterates to a minimizer in a real Hilbert space. We then show that the algorithms achieve local linear convergence for “sparse optimization”, which is the important special case where the nonsmooth term is the $\ell_1$-norm. This result holds under either a restricted strong convexity or a strict complementarity condition, and we do not require the objective to be strictly convex. For certain parameter choices we determine an upper bound on the number of iterations until the iterates are confined to a manifold containing the solution set and linear convergence holds.
The local linear convergence result for sparse optimization holds for the Fast Iterative Shrinkage and Soft Thresholding Algorithm (FISTA) due to Beck and Teboulle, which is a particular parameter choice for IFBS. In spite of its optimal global objective function convergence rate, we show that FISTA is not optimal for sparse optimization with respect to the local convergence rate. We determine the locally optimal parameter choice for the IFBS family. Finally, we propose a method which inherits the excellent global rate of FISTA but also has an excellent local rate.
Key words. proximal gradient methods, forward-backward splitting, inertial methods, $\ell_1$-regularization, local linear convergence
AMS subject classifications. 65K05, 65K15, 90C06, 90C25
1 Introduction
We are concerned with the following important problem:
$$\min_{x \in \mathcal{H}} F(x) := f(x) + g(x), \qquad (1.1)$$
where $\mathcal{H}$ is a Hilbert space over the real numbers, the functions $f$ and $g$ are proper, convex and closed, and in addition $f$ is Gâteaux differentiable and has a Lipschitz continuous gradient. Problems of this form have received considerable attention in recent years in applications such as machine learning [1, 2], compressed sensing [3, 4] and image processing [5, 6] among many other examples. Of particular interest in this paper will be the special case which we will call sparse optimization (SO):
$$\min_{x \in \mathbb{R}^n} F(x) := f(x) + \rho \|x\|_1, \qquad \text{(SO)}$$
where $\mathcal{H} = \mathbb{R}^n$ and $\rho \geq 0$. We refer to this problem as “sparse optimization” because the $\ell_1$-norm encourages sparse solutions. When $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ with $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, Problem SO is often referred to as sparse least squares (Problem LS), basis pursuit denoising or LASSO. This problem is of central importance in compressed sensing and also has applications in machine learning [7] and image processing [8]. Other important instances of Problem (1.1) include least squares with a total-variation [9] or nuclear-norm [10] regularizer, and minimization constrained to a closed and convex set.
1.1 Background
In this paper we focus on first-order splitting methods for solving Problem (1.1). These methods use evaluations of $F$, gradients of the smooth part $f$, and evaluations of the proximal operator of the nonsmooth part $g$. In particular we focus on the forward-backward splitting algorithm (FBS), which is a classical first-order splitting approach to solving Problem (1.1) [11, 12]. In fact FBS was developed for the more general monotone inclusion problem, which includes Problem (1.1) as a special case. FBS involves a “forward” step, which is an explicit gradient step with respect to the differentiable component $f$, and a “backward” step, which is an implicit, proximal step with respect to $g$. For many popular instances of $g$ this proximal step is computationally inexpensive [13]. The convergence rate of the objective function to the infimum is $O(1/k)$, which is better than the $O(1/\sqrt{k})$ rate achieved by the “black-box” subgradient method, and is the same as if the possibly nonsmooth component $g$ were not present. Weak convergence of the iterates is also guaranteed and linear convergence occurs on strongly convex problems [14]. FBS is also commonly referred to as the proximal gradient method [15] and for the special case of Problem SO, it is known as the iterative shrinkage and soft thresholding algorithm (ISTA) owing to the form of the proximal step w.r.t. the $\ell_1$-norm [16, 17, 18]. Other first-order splitting methods include ADMM [19], linearized and preconditioned ADMM [20], primal-dual methods [21], Bregman iterations [22] and generalized FBS [23]. These methods can deal with more complicated situations such as when $g$ is composed with a bounded linear operator or when the sum of several proximable functions (functions possessing a simple proximal operator) is present.
Nesterov developed several methods for minimizing a convex function with Lipschitz gradient ([24], [25] chapter 2). These methods obtain the best objective function convergence rate possible by any first-order method. Specifically, they guarantee a convergence rate of $O(1/k^2)$ for the objective function, which is optimal in the worst-case sense for convex functions with Lipschitz gradient. Note that this improves the $O(1/k)$ rate achieved by classical gradient descent.
In [16], Beck and Teboulle extended Nesterov’s method of [25] to Problem (1.1), allowing for the presence of the possibly nonsmooth function $g$. Their method, FISTA, combines Nesterov’s inertial update into an FBS framework using the same sequence of “momentum” parameters. FISTA corresponds to a particular parameter choice for the following suite of algorithms, which we will call Inertial Forward-Backward Splitting (IFBS):
$$y_{k+1} = x_k + \alpha_k (x_k - x_{k-1}), \qquad x_{k+1} = \mathrm{prox}_{\lambda_k g}\big(y_{k+1} - \lambda_k \nabla f(y_{k+1})\big), \qquad \text{(IFBS)}$$
with $x_0, x_1 \in \mathcal{H}$ chosen arbitrarily (typically $x_1 = x_0$). The sequences $\{\alpha_k\}$ and $\{\lambda_k\}$ are a subset of $[0, \infty)$. The proximal operator will be properly defined in Section 2.2. Beck and Teboulle showed that for a specific choice of $\{\alpha_k\}$ and $\{\lambda_k\}$, IFBS obtains the optimal $O(1/k^2)$ rate in terms of the objective function; however, they did not prove convergence of the iterates to a minimizer, which is also unknown for Nesterov’s method. Tseng [26] showed that other choices also achieve the $O(1/k^2)$ rate. Recently in [27], Chambolle and Dossal considered a very similar choice of the parameters to Beck and Teboulle which obtains the $O(1/k^2)$ rate in the objective function and also weak convergence of the iterates to a minimizer. Throughout the rest of the paper we will refer to all these parameter choices for IFBS that obtain the $O(1/k^2)$ objective function rate as FISTA-like choices. Note that FBS corresponds to IFBS with $\alpha_k$ set to $0$ for all $k$ and $\lambda_k$ in the range $(0, 2/L)$, where $L$ is the Lipschitz constant of $\nabla f$. Nesterov’s method of [25] corresponds to IFBS with the same parameter choice as FISTA and with $g = 0$.
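The inertial iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ifbs`, the callback signatures, the constant momentum value 0.3 and the toy separable problem are all hypothetical choices made here, and the update is written in the common two-sequence form (extrapolate, then take a gradient step followed by a proximal step).

```python
import numpy as np

def ifbs(grad_f, prox_g, x0, step, momentum, num_iters):
    """Sketch of inertial forward-backward splitting (names are hypothetical).

    grad_f(y)      : gradient of the smooth term at y
    prox_g(v, lam) : proximal operator of lam * g evaluated at v
    momentum(k)    : momentum parameter for iteration k (0 gives plain FBS)
    """
    x_prev = np.asarray(x0, dtype=float).copy()
    x = x_prev.copy()  # start with two identical iterates
    for k in range(1, num_iters + 1):
        y = x + momentum(k) * (x - x_prev)           # inertial extrapolation
        x_next = prox_g(y - step * grad_f(y), step)  # forward step, then backward step
        x_prev, x = x, x_next
    return x

# Toy problem (an assumption for illustration):
# f(x) = 0.5 * sum(q * x**2) - c @ x and g(x) = rho * ||x||_1.
q = np.array([1.0, 4.0])
c = np.array([3.0, -0.5])
rho = 1.0
L = q.max()  # Lipschitz constant of grad f for this diagonal quadratic

grad_f = lambda y: q * y - c
prox_l1 = lambda v, lam: np.sign(v) * np.maximum(np.abs(v) - lam * rho, 0.0)

x_hat = ifbs(grad_f, prox_l1, np.zeros(2), 1.0 / L, lambda k: 0.3, 200)
```

Because the toy problem is separable, its minimizer can be computed by hand as the soft-thresholded vector `c` divided elementwise by `q`, namely `(2, 0)`, which gives a simple way to sanity-check the sketch.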
One of the aims of this paper is to establish broad conditions for the convergence of the iterates of IFBS to a minimizer of Problem (1.1). A generalization of the IFBS family has been studied previously in [28] in the setting of monotone operator inclusion problems. However our global analysis proves convergence for a wider range of parameter choices than was proved there. An algorithm similar to IFBS was developed in [29] for the more general problem of finding a fixed point of a nonexpansive operator. However the conditions for convergence are far more strict than those developed in this paper. To the best of our knowledge the conditions for weak convergence of the iterates of IFBS developed in this paper are novel in the literature. A more detailed comparison with existing literature is given in Section 3.
It has been observed that for the special case of Problem SO, FBS exhibits local linear convergence (see e.g. [17, 18, 30, 31]), elsewhere called eventual linear convergence [32]. By this it is meant that there exists some $K$ such that for all $k > K$ the iterates are confined to a manifold containing the solution set and convergence to a solution is linear. It is not known whether IFBS (including the FISTA-like choices) obtains local linear convergence for Problem SO; however, [33] has recently made progress for the special case of Problem LS. In this paper, we address this by establishing local linear convergence of IFBS for Problem SO for a broad range of parameter choices including the FISTA-like choices. Of course local linear convergence of the sequence implies convergence of the entire sequence.
1.2 Contributions of this Paper
In the first part of the paper, we analyze IFBS with an appropriate multi-step Lyapunov function. This approach allows us to develop novel conditions on the algorithmic parameters that imply convergence of the iterates to a minimizer (weak convergence in a real Hilbert space, ordinary convergence in $\mathbb{R}^n$). This widens the range of possible parameter choices beyond those proposed in prior art such as [28].
In the second part of the paper, we consider in detail the behavior of IFBS applied to Problem SO. We show that after a finite number of iterations IFBS reduces to minimizing a local function on a reduced support subject to an orthant constraint. This result holds for the FISTA-like choices along with a wide range of other parameter choices. Next we show that a simple “locally optimal” parameter choice for IFBS obtains a local linear convergence rate with the best asymptotic iteration complexity. The asymptotically optimal iteration complexity is better than that obtained by the FISTA-like choices and by ISTA. The improvement gained by IFBS over ISTA when the correct amount of momentum is added is equivalent to the improvement that Nesterov’s accelerated method [25] achieves over gradient descent for strongly convex functions with Lipschitz gradients. As a corollary of our analysis, we show that the adaptive momentum restart scheme proposed in [34] achieves the optimal iteration complexity. In contrast, the analysis in [34] is only valid for strongly convex quadratic functions. Finally, for parameter choices for which the “momentum parameter” is bounded away from $1$, we determine an explicit upper bound on the number of iterations until convergence to the optimal manifold.
With little effort our analysis of IFBS for Problem SO can be adapted to apply to the splitting inertial proximal method (SIPM) proposed by Moudafi and Oliny [35]. This method is a direct generalization of the heavy ball with friction method (HBF) [36] to proximal splitting problems and differs from IFBS in that the gradient of $f$ is computed at $x_k$ rather than $y_{k+1}$. We show that SIPM also achieves local linear convergence for this problem under appropriate parameter constraints.
The paper is organized as follows. In Section 2, notation and assumptions are discussed. In Section 3, we precisely define the IFBS family and discuss known convergence results in more detail. In Section 4 we apply our Lyapunov analysis to IFBS. In Section 5 we derive convergence results for Problem SO. Finally, we present numerical experiments.
2 Preliminaries
2.1 Notation and Definitions
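Throughout the paper, $\mathcal{H}$ is a Hilbert space over the field of real numbers, $\langle \cdot, \cdot \rangle$ is the inner product and $\|\cdot\|$ is the associated norm. Let $\Gamma_0(\mathcal{H})$ be the set of all closed, convex and proper functions whose domain is a subset of $\mathcal{H}$ and range is a subset of $\mathbb{R} \cup \{+\infty\}$. For any $g \in \Gamma_0(\mathcal{H})$ and point $x \in \mathcal{H}$, we denote by $\partial_\epsilon g(x)$ for $\epsilon \geq 0$ the $\epsilon$-enlargement of the subdifferential, defined as the set
$$\partial_\epsilon g(x) := \{ v \in \mathcal{H} : g(y) \geq g(x) + \langle v, y - x \rangle - \epsilon, \ \forall y \in \mathcal{H} \}, \qquad (2.1)$$
which is always convex and closed and may be empty. We will use $\partial g(x)$ to denote $\partial_0 g(x)$. When $\partial g(x)$ is a singleton we will call it the gradient at $x$, denoted by $\nabla g(x)$.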
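For nonnegative sequences $\{a_k\}$ and $\{b_k\}$, the notation $a_k = O(b_k)$ (resp. $a_k = \Omega(b_k)$) means there exists a constant $C > 0$ such that $a_k \leq C b_k$ (resp. $a_k \geq C b_k$) for all sufficiently large $k$. The notation $a_k = \Theta(b_k)$ means $a_k = O(b_k)$ and $a_k = \Omega(b_k)$. We will say a sequence $\{x_k\}$ converges linearly to $x^*$ with rate of convergence $q \in (0, 1)$, if $\|x_k - x^*\| = O(q^k)$. To be precise we will occasionally refer to this as asymptotic or local linear convergence. Note that this is different from nonasymptotic, or global, linear convergence with rate $q$, in which case there exists a $C > 0$ such that $\|x_k - x^*\| \leq C q^k$ for all $k$. In contrast, local linear convergence allows for a finite number of iterations where such a relationship does not hold.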
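To make the distinction between local and global linear convergence concrete, the following sketch estimates a linear rate from the tail of an error sequence while ignoring an initial transient. It assumes that linear convergence with rate q means the errors are eventually bounded by a constant times q to the power k; the function name `estimate_linear_rate`, the `skip` argument and the synthetic error sequence are illustrative assumptions, not part of the paper.

```python
import numpy as np

def estimate_linear_rate(errors, skip):
    """Fit log(error_k) ~ intercept + k * log(q) on the tail k >= skip and return q."""
    errors = np.asarray(errors, dtype=float)
    k = np.arange(len(errors))[skip:]
    slope, _intercept = np.polyfit(k, np.log(errors[skip:]), 1)
    return float(np.exp(slope))

# Synthetic example: a flat transient for 10 iterations, then decay like 5 * 0.8**k.
k = np.arange(60)
errors = 5.0 * 0.8 ** k
errors[:10] = 1.0  # transient phase where no linear bound with this constant holds

q_hat = estimate_linear_rate(errors, skip=10)
```

Fitting only the tail mirrors the idea that a finite number of early iterations may violate the linear bound without affecting the asymptotic rate.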
Define the optimal value of Problem (1.1) as
$$F^* := \inf_{x \in \mathcal{H}} F(x),$$
and the solution set as
$$X^* := \{ x \in \mathcal{H} : F(x) = F^* \}.$$
Given a function , we say that the iteration complexity of a method for minimizing is if implies . To be precise we will occasionally refer to this as the asymptotic iteration complexity.
For a matrix and a set , will denote the matrix in formed by taking the columns corresponding to the elements of . For a vector , will denote the vector with entries given by the entries of on the indices corresponding to the elements of , and will denote the vector in equal to on the indices corresponding to and equal to zero everywhere else. Given and , is defined as if and if , is simply applying elementwise. We will use the notation .
2.2 Proximal Operators
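The proximal operator w.r.t. a function $g \in \Gamma_0(\mathcal{H})$ and parameter $\lambda > 0$ is defined implicitly by
$$x = \mathrm{prox}_{\lambda g}(y) \iff \frac{y - x}{\lambda} \in \partial g(x),$$
and explicitly by
$$\mathrm{prox}_{\lambda g}(y) := \arg\min_{x \in \mathcal{H}} \left\{ g(x) + \frac{1}{2\lambda}\|x - y\|^2 \right\}. \qquad (2.2)$$
Since the function being minimized in (2.2) is strongly convex and in $\Gamma_0(\mathcal{H})$, $\mathrm{prox}_{\lambda g}(y)$ exists and is unique for every $y \in \mathcal{H}$; thus it is a well-defined mapping with domain equal to $\mathcal{H}$. To be more general we will actually use the $\epsilon$-enlarged proximal operator, which is the set
$$\mathrm{prox}^{\epsilon}_{\lambda g}(y) := \left\{ x \in \mathcal{H} : \frac{y - x}{\lambda} \in \partial_\epsilon g(x) \right\},$$
which is not necessarily uniquely defined (except when $\epsilon = 0$). Note that $\mathrm{prox}_{\lambda g}(y) \in \mathrm{prox}^{\epsilon}_{\lambda g}(y)$ for all $\epsilon \geq 0$. The use of $\mathrm{prox}^{\epsilon}_{\lambda g}$ allows for some approximation error in the computation of the proximal operator.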
2.3 Cocoercivity and Convexity
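We say that a Gâteaux differentiable and convex function $f$ has a $1/L$-cocoercive gradient with $L > 0$, if
$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \geq \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2, \quad \forall x, y \in \mathcal{H}. \qquad (2.3)$$
Note this is equivalent to the gradient being $L$-Lipschitz continuous, i.e.
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \forall x, y \in \mathcal{H}. \qquad (2.4)$$
For a proof see [37] Lemma 1.4 and the Baillon-Haddad Theorem [38]. We will need the following two standard properties of such a function. For all $x, y \in \mathcal{H}$:
$$f(x) \leq f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|^2, \qquad (2.5)$$
and (by convexity)
$$f(x) \geq f(y) + \langle \nabla f(y), x - y \rangle. \qquad (2.6)$$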
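As a quick numerical sanity check of these properties, the sketch below tests the standard forms of cocoercivity, Lipschitz continuity of the gradient, the quadratic upper bound and the convexity lower bound on a random least-squares function. The test function, the random data and the helper name `check_pair` are assumptions made for illustration, and the inequalities are written in their usual textbook form with constant L equal to the largest eigenvalue of the Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
b = rng.standard_normal(8)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)   # smooth convex test function
grad = lambda x: A.T @ (A @ x - b)             # its gradient
L = np.linalg.eigvalsh(A.T @ A)[-1]            # largest eigenvalue of the Hessian

def check_pair(x, y, tol=1e-9):
    """Return True if all four inequalities hold at the pair (x, y)."""
    dg = grad(x) - grad(y)
    dx = x - y
    cocoercive = dg @ dx >= dg @ dg / L - tol                        # cocoercivity
    lipschitz = np.linalg.norm(dg) <= L * np.linalg.norm(dx) + tol   # Lipschitz gradient
    upper = f(x) <= f(y) + grad(y) @ dx + 0.5 * L * dx @ dx + tol    # quadratic upper bound
    lower = f(x) >= f(y) + grad(y) @ dx - tol                        # convexity
    return bool(cocoercive and lipschitz and upper and lower)

all_hold = all(
    check_pair(rng.standard_normal(5), rng.standard_normal(5)) for _ in range(100)
)
```

A small absolute tolerance is included only to absorb floating-point rounding; the inequalities themselves are exact for this quadratic.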
We are now ready to formally state our Assumptions for Problem (1.1).
Assumption 1. $f$ and $g$ are in $\Gamma_0(\mathcal{H})$, $f$ is Gâteaux differentiable everywhere and has a $1/L$-cocoercive gradient with $L > 0$, and $F^* > -\infty$.
2.4 Properties of Sparse Optimization
We now outline our assumptions for Problem SO and discuss some of its properties.
Assumption SO. $\mathcal{H} = \mathbb{R}^n$, $f$ is twice differentiable everywhere, and has a $1/L$-cocoercive gradient with $L > 0$. $\rho \geq 0$ and $X^*$ is nonempty.
The main difference between Assumption SO and Assumption 1 is that we additionally assume that $f$ is twice differentiable. Let $H(x)$ denote the Hessian of $f$ at $x$. Then the Lipschitz constant of the gradient is equal to the supremum of the largest eigenvalue of $H(x)$ over all $x \in \mathbb{R}^n$. Furthermore note that $\rho\|\cdot\|_1$ is in $\Gamma_0(\mathbb{R}^n)$. Finally note that for the function is coercive thus is nonempty.
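Problem SO includes Problem LS, defined as
$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \rho\|x\|_1, \qquad \text{(LS)}$$
where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. The solution set of Problem LS is always nonempty. The function $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ has gradient equal to $A^\top(Ax - b)$, which is Lipschitz continuous with Lipschitz constant equal to the largest eigenvalue of $A^\top A$.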
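The gradient and Lipschitz constant of the least-squares term are easy to compute in practice. The following sketch, written under the assumption that the smooth term is one half of the squared residual norm, checks the closed-form gradient against central finite differences and computes the Lipschitz constant as the largest eigenvalue of the Gram matrix. The helper names `ls_gradient` and `ls_lipschitz` and the random data are hypothetical.

```python
import numpy as np

def ls_gradient(A, b, x):
    """Gradient of 0.5 * ||A x - b||^2 with respect to x."""
    return A.T @ (A @ x - b)

def ls_lipschitz(A):
    """Lipschitz constant of the gradient: largest eigenvalue of A^T A."""
    return float(np.linalg.eigvalsh(A.T @ A)[-1])

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)
x = rng.standard_normal(4)

f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)

# Central finite differences along each coordinate direction.
h = 1e-6
fd_grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])

analytic_grad = ls_gradient(A, b, x)
L = ls_lipschitz(A)
```

The largest eigenvalue of the Gram matrix coincides with the squared spectral norm of `A`, which offers an independent way to compute the same constant.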
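The proximal operator associated with $\rho\|\cdot\|_1$ is the shrinkage and soft-thresholding operator $S_\nu$, applied elementwise. It is defined as $S_\nu(x) := \mathrm{sgn}(x)\max\{|x| - \nu, 0\}$, and thus
$$\mathrm{prox}_{\lambda \rho \|\cdot\|_1}(y) = S_{\lambda\rho}(y). \qquad (2.7)$$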
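The soft-thresholding operator has a one-line implementation. The sketch below uses the standard formula (sign times the positive part of the magnitude minus the threshold) and compares it against a brute-force minimization of the scalar proximal objective on a fine grid. The function names, the threshold value and the grid are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, nu):
    """Elementwise shrinkage: sign(x) * max(|x| - nu, 0)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - nu, 0.0)

def prox_l1_bruteforce(y, nu, grid):
    """Minimize nu * |x| + 0.5 * (x - y)^2 over a grid of candidate x values."""
    objective = nu * np.abs(grid) + 0.5 * (grid - y) ** 2
    return grid[np.argmin(objective)]

points = [3.0, -0.5, 1.0, -2.0]
shrunk = soft_threshold(points, 1.0)

grid = np.linspace(-5.0, 5.0, 100001)  # grid spacing 1e-4
brute = np.array([prox_l1_bruteforce(y, 1.0, grid) for y in points])
```

Entries whose magnitude does not exceed the threshold are set exactly to zero, which is the mechanism by which the proximal step produces sparse iterates.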
In the analysis of IFBS applied to Problem SO we will need the following result proved in [17].
Theorem 2.1 (Theorem 2.1 [17])
For problem SO suppose Assumption SO holds, then there exists a vector such that for all , . Furthermore, for all ,
The following two sets, also used in [17], will be crucial to our analysis. Let and Note that and . By Theorem 2.1, we can infer that for all . Finally, define
We will need the following Lemma proved in [17].
Lemma 2.2 (Lemma 4.1 [17])
Under Assumption SO, if ,
An alternative definition of cocoercivity is to say that if an operator $T$ is $1/L$-cocoercive then $T/L$ is firmly nonexpansive. Thus Lemma 2.2 is just an elementary property of firmly nonexpansive operators (see Proposition 4.2 (iii), and Proposition 4.33 [39]).
Finally, the following properties of will be useful.
Lemma 2.3 (Lemma 3.2 [17])
Fix any and in , and :

The function is nonexpansive. That is,

If and then

If then and
(2.8)
3 IFBS
To be more general, our global analysis will apply to the following IFBS family.
with chosen arbitrarily. Note that for any ,
We will refer to $\{\alpha_k\}$ as the “momentum” parameters and $\{\lambda_k\}$ as the “stepsize” parameters. The algorithm differs from IFBS in that it uses the enlarged subdifferential, allowing for some error in the computation of the proximal operator.
3.1 Known Convergence Results
Beck and Teboulle [16] proposed the following choice of parameters for IFBS (the family above with $\epsilon_k$ set to $0$ for all $k$),
$$t_1 = 1, \quad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \quad \alpha_k = \frac{t_k - 1}{t_{k+1}}, \quad \lambda_k = \frac{1}{L}. \qquad (3.1)$$
The method is known as FISTA. With this choice of parameters, Beck and Teboulle showed that the objective function converges to the minimum at the worst-case optimal rate of $O(1/k^2)$. In fact the $O(1/k^2)$ rate holds for a variety of other choices of the momentum sequence [26]. However the choice in (3.1) guarantees the largest possible decrease in a given upper bound of $F(x_k) - F^*$ at each iteration. Chambolle and Dossal [27] considered IFBS with a similar choice of $\{\alpha_k\}$ to what was proposed by Beck and Teboulle. They investigated, for some $a > 2$,
$$t_k = \frac{k + a - 1}{a}, \quad \alpha_k = \frac{t_k - 1}{t_{k+1}}, \quad \lambda_k = \frac{1}{L}. \qquad (3.2)$$
With this choice of parameters, the authors showed that the objective function achieves the optimal $O(1/k^2)$ convergence rate and in addition $\{x_k\}$ weakly converges to a minimizer.
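The two FISTA-like momentum sequences can be generated directly. The sketch below assumes the widely used forms of these rules: the Beck-Teboulle recursion for an auxiliary sequence t starting at 1, with momentum (t_k - 1) / t_{k+1}, and the Chambolle-Dossal rule written in closed form as (k - 1) / (k + a) with a > 2. The function names, the default a = 3 and the indexing from k = 1 are assumptions for illustration and may differ from the paper's indexing.

```python
import numpy as np

def fista_momentum(num_iters):
    """Momentum values from the recursion t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2, t_1 = 1."""
    t = 1.0
    alphas = np.empty(num_iters)
    for k in range(num_iters):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        alphas[k] = (t - 1.0) / t_next
        t = t_next
    return alphas

def chambolle_dossal_momentum(num_iters, a=3.0):
    """Momentum values (k - 1) / (k + a) for k = 1, ..., num_iters, with a > 2."""
    k = np.arange(1, num_iters + 1)
    return (k - 1.0) / (k + a)

fista_alphas = fista_momentum(10000)
cd_alphas = chambolle_dossal_momentum(10000)
```

Both sequences start at zero, stay strictly below one, and approach one as the iteration count grows, which is the behavior that matters for the discussion of local convergence rates.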
In contrast to [16] and [27], our analysis establishes weak convergence of the iterates for a wide range of parameter choices. Indeed, the momentum sequence is not constrained to follow a particular sequence relationship, but instead must be constrained to and . However we do not guarantee the objective function rate.
Lorenz and Pock [28] generalized IFBS to the problem of finding a zero of the sum of two maximal monotone operators and , one of which is cocoercive. Setting and recovers Problem (1.1). They also replaced the scalar stepsize with a general positive definite operator . Lorenz and Pock proved weak convergence of the iterates to a solution provided certain restrictions on and . The restrictions on are stronger than those derived in our global analysis. In their analysis, if the stepsize is fixed to , is restricted to be less than , whereas, as we shall see in Section 4, our Lyapunov analysis allows , so long as . For the stepsize, their conditions are less restrictive than ours, allowing for values of up to , whereas our analysis only allows up to . However in their analysis larger values of lead to a smaller range of feasible values for reducing to as approaches .
In [29], an inertial version of the classical Krasnosel’skiĭ-Mann (KM) algorithm was analyzed. The KM algorithm finds the fixed points of a nonexpansive operator . Setting the operator in the inertial KM method of [29] recovers IFBS, since a point is a fixed point of if and only if it is a solution of Problem (1.1). The analysis of [29] proves weak convergence of the iterates to a fixed point but relies on verifying conditions of the form: with equal to and . In general this condition must be enforced online, restricting the range of possible choices for the sequence of momentum parameters. However, it was shown in [40] that choosing to be nondecreasing and satisfying with suffices to ensure the condition is satisfied and thus prove weak convergence. This condition is more restrictive than the ones derived in this paper for the special case of Problem (1.1).
3.2 Known Convergence Results for Sparse Optimization
The FISTA-like sequences for $\{\alpha_k\}$ defined in (3.1) and (3.2) both converge to $1$. As we will see in Section 5, this is not desirable for Problem SO. In the language of dynamical systems, when the momentum is too high the iterates move into an “underdamped regime”, leading to oscillations in the objective function and slow convergence (see [34] for an analysis in the strongly convex quadratic case). We will show in Section 5 that for Problem SO the FISTA-like choices are not optimal from the viewpoint of asymptotic rate of convergence under a local strong-convexity assumption or a strict complementarity condition.
In [33], the behavior of ISTA and FISTA (i.e. IFBS with parameter choice (3.1)) applied to Problem LS was investigated through a spectral analysis. The authors show that both algorithms obtain local linear convergence for this problem, under the condition that the minimizer is unique, but without an estimate for the number of iterations until convergence to the optimal manifold. Furthermore they determine that the local rate of convergence of FISTA is worse than ISTA, while the transient behavior of FISTA is better than ISTA. Therefore they suggest switching from FISTA to ISTA once the optimal manifold has been identified. Our contribution differs in several ways. We note that the poor local performance of the FISTA-like choices is due to having the momentum parameter converge to $1$. Therefore we determine the optimal value for the momentum parameter that should be used in the asymptotic regime, which allows for a better asymptotic rate than both ISTA and FISTA, and suggest a heuristic method for estimating the optimal momentum. We also show that the adaptive restart method of [34] will achieve the $O(1/k^2)$ rate in the transient regime and the optimal asymptotic rate. Furthermore our analysis holds for Problem SO with Problem LS as a special case and we do not require the minimizer to be unique. Finally, in the case where the momentum parameter is bounded away from $1$, we provide explicit upper bounds on the number of iterations until IFBS has converged to the optimal manifold.
In [41] a method was developed for solving Problem (1.1) when $f$ is strongly convex. The method is equivalent to IFBS with the same prescription for $\{\alpha_k\}$ as determined by Nesterov for his method for minimizing strongly convex functions (constant scheme 2.2.8. of [25]). However it also includes a backtracking procedure for adjusting the momentum and stepsize parameters when the strong convexity and Lipschitz gradient parameters are not known. The authors of [41] also extended their method to Problem LS, including the case where $f$ is not strongly convex. The authors showed that under conditions on the matrix $A$ related to the Restricted Isometry Property (RIP) used in compressed sensing, their algorithm obtains nonasymptotic (global) linear convergence, so long as the initial vector is sufficiently sparse. However, as the authors note, the RIP-like conditions are much stronger than those typically found in the literature. Indeed the conditions are much stronger than those required in our proof of local linear convergence. We establish that IFBS obtains local linear convergence regardless of the initialization point. Furthermore no RIP-like assumptions are necessary. Local linear convergence can be proved under the mild condition that the smallest eigenvalue of the Hessian restricted to the support of a minimum is nonzero at the minimum point or, if this does not hold, under a common strict-complementarity condition (see Section 5). That being said, it should be noted that local linear convergence is not as strong a statement as global linear convergence.
4 A Global Analysis of IFBS
This section derives conditions on $\{\alpha_k\}$, $\{\lambda_k\}$ and $\{\epsilon_k\}$ which imply weak convergence of the iterates of IFBS to a minimizer of Problem (1.1). Throughout the rest of the paper, let denote . Given , define .
Theorem 4.1
Suppose that Assumption 1 holds. Assume is nondecreasing and satisfies for all , and satisfies for all and . If for all and , then for the iterates of IFBS, we have

.

.

If, in addition, is nonempty, then converges weakly to some .
Proof. The proof consists of two parts. In the first, we prove statements (i) and (ii) using arguments inspired by Alvarez’ analysis of the inertial proximal method in [42]. In the second part, we invoke Opial’s Lemma [43] to prove statement (iii). The second part is inspired by the analysis of the splitting inertial proximal algorithm by Moudafi and Oliny in [35].
Proof of statements (i) and (ii)
Define the Lyapunov function, or discrete energy, to be
Note that this is the same energy function used by Alvarez [44]. Inequalities (2.1), (2.5) and (2.6) imply
(4.1)
Note that the existence of a subgradient is guaranteed because the enlarged proximal operator has domain equal to . Using (4.1), the  - update in IFBS and the fact that , we infer that
Moving terms to the other side and summing implies, for all ,
(4.2)
Inequality (4.2) along with the assumptions on and imply statement (i). Statement (i) implies , therefore via the  - update of IFBS. This implies that , because . Finally, using the  - update of IFBS we infer that
(4.3)
Proof of statement (iii)
If is a subsequence which weakly converges to , then the update of IFBS implies also weakly converges to . This, combined with the update implies that . Suppose that for any , the sequence has a limit. This implies the sequence is bounded and therefore it has at least one weakly convergent subsequence, (ordinary convergence in ). By the above reasoning the limit of this subsequence, must be in . Furthermore exists. Consider another subsequence which converges to . By considering the fact that and the corresponding statement for , one can see that . Therefore the set of weakly convergent subsequences is the singleton . Thus weakly converges to (This is Opial’s Lemma [43]).
Assume is nonempty. We now proceed to show that, for any , the sequence has a limit. Our proof closely follows Moudafi and Oliny’s analysis [35], and is similar to the later variants [27, 28]. The main difference is we allow for to be for a finite number of iterations. Fix and define . Now
(4.4)
Since
and , it follows that
(4.5)
Combining (4.4) and (4.5) we obtain
(4.6)
Now
(4.7)
Combining (4.6) and (4.7) yields
(4.8)
Now we use the fact that is cocoercive as follows. Inequality (2.3) implies
(4.9)
where (4.9) follows by completing the square. Combining (4.8) and (4.9) we infer
(4.10)
Note that the coefficients of and are nonpositive. Set and
(4.11)
and note that .
The argument from now on is basically identical to [35] except we allow for sequences which are equal to for a finite number of . Restate (4.10) as
(4.12)
Since , there exists an integer and such that for all . This and (4.12) imply that, for
Thus for
Careful examination of this expression yields
(4.13)
Set . Since and , is bounded from below. is nonincreasing, therefore we have it converges. Therefore converges for every