iPiano: Inertial Proximal Algorithm for Nonconvex Optimization
Abstract
In this paper we study an algorithm for solving a minimization problem composed of a differentiable (possibly nonconvex) and a convex (possibly nondifferentiable) function. The algorithm iPiano combines forward-backward splitting with an inertial force. It can be seen as a nonsmooth split version of the Heavy-ball method from Polyak. A rigorous analysis of the algorithm for the proposed class of problems yields global convergence of the function values and the arguments. This makes the algorithm robust for usage on nonconvex problems. The convergence result is obtained based on the Kurdyka–Łojasiewicz inequality. This is a very weak restriction, which was used to prove convergence for several other gradient methods. First, an abstract convergence theorem for a generic algorithm is proved, and then iPiano is shown to satisfy the requirements of this theorem. Furthermore, a convergence rate is established for the general problem class. We demonstrate iPiano on computer vision problems: image denoising with learned priors and diffusion-based image compression.
Key words. nonconvex optimization, Heavy-ball method, inertial forward-backward splitting, Kurdyka–Łojasiewicz inequality, proof of convergence
1 Introduction
The gradient method is certainly one of the most fundamental but also one of the simplest algorithms for solving smooth convex optimization problems. In the last decades, the gradient method has been modified in many ways. One of those improvements is to consider so-called multi-step schemes [38, 35]. It has been shown that such schemes significantly boost the performance of the plain gradient method. Triggered by practical problems in signal processing, image processing, and machine learning, there has been an increased interest in so-called composite objective functions, where the objective function is given by the sum of a smooth function and a nonsmooth function with an easy-to-compute proximal map. This initiated the development of the so-called proximal gradient or forward-backward method [28], which combines explicit (forward) gradient steps w.r.t. the smooth part with proximal (backward) steps w.r.t. the nonsmooth part.
In this paper, we combine the concepts of multi-step schemes and the proximal gradient method to efficiently solve a certain class of nonconvex, nonsmooth optimization problems. Although the transfer of knowledge from convex optimization to nonconvex problems is very challenging, it is worthwhile to seek efficient algorithms for certain nonconvex problems. Therefore, we consider the subclass of nonconvex problems
\[ \min_{x \in \mathbb{R}^N} h(x) = f(x) + g(x), \]
where g is a convex (possibly nonsmooth) and f is a smooth (possibly nonconvex) function. The sum h = f + g comprises nonsmooth, nonconvex functions. Despite the nonconvexity, the structure of f being smooth and g being convex makes the forward-backward splitting algorithm well-defined. Additionally, an inertial force is incorporated into the design of our algorithm, which we termed iPiano. Informally, the update scheme of the algorithm that will be analyzed is
\[ x^{n+1} = (I + \alpha_n \partial g)^{-1}\big( x^n - \alpha_n \nabla f(x^n) + \beta_n (x^n - x^{n-1}) \big), \]
where α_n and β_n are the step size parameters. The gradient step −α_n ∇f(x^n) is referred to as the forward step, β_n (x^n − x^{n−1}) as the inertial term, and the application of (I + α_n ∂g)^{−1} as the backward or proximal step.
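The update scheme above can be sketched in a few lines of code. This is only an illustration under assumptions of our own: the concrete choices f(x) = 0.5·‖Ax − b‖² and g(x) = λ‖x‖₁ (whose proximal map is soft-thresholding), as well as all variable names, are hypothetical and not taken from the paper.

```python
import numpy as np

def soft_threshold(y, t):
    """Proximal map of t*||.||_1: the backward step for g = lam*||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def ipiano_step(x, x_prev, grad_f, prox_g, alpha, beta):
    """One iPiano update: prox applied to forward step plus inertial term."""
    y = x - alpha * grad_f(x) + beta * (x - x_prev)  # forward + inertial
    return prox_g(y, alpha)                          # backward (proximal) step

# toy problem (illustrative data, not from the paper)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda y, a: soft_threshold(y, a * lam)

L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f
beta = 0.5
alpha = (1.0 - beta) / L         # feasible: alpha < 2*(1 - beta)/L

h = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
x_prev = np.zeros(10)
x = np.zeros(10)
for _ in range(300):
    x, x_prev = ipiano_step(x, x_prev, grad_f, prox_g, alpha, beta), x
print(h(x) < h(np.zeros(10)))  # True: the objective decreased
```

Note that the step size choice follows the constant rule discussed later (α bounded in terms of the Lipschitz constant L and the inertial weight β).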
For g ≡ 0, the proximal step is the identity, and the update scheme is usually referred to as the Heavy-ball method. This reduced iterative scheme is an explicit finite-difference discretization of the so-called Heavy-ball with friction dynamical system
\[ \ddot{x}(t) + \gamma \dot{x}(t) + \nabla f(x(t)) = 0. \]
It arises when Newton's law is applied to a point mass subject to a constant friction γ (proportional to the velocity \dot{x}(t)) and a gravity potential f. This explains the naming "Heavy-ball method" and the interpretation of β_n (x^n − x^{n−1}) as inertial force.
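For concreteness, one standard finite-difference sketch (with time step Δt > 0 and friction parameter γ; this notation is ours) replaces the derivatives in the dynamical system and yields

```latex
\frac{x^{n+1} - 2x^n + x^{n-1}}{\Delta t^2}
+ \gamma\,\frac{x^n - x^{n-1}}{\Delta t}
+ \nabla f(x^n) = 0
\quad\Longleftrightarrow\quad
x^{n+1} = x^n - \Delta t^2\,\nabla f(x^n) + (1 - \gamma \Delta t)\,(x^n - x^{n-1}),
```

so the inertial update corresponds to the step sizes α = Δt² and β = 1 − γΔt.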
Setting β_n = 0 results in the forward-backward splitting algorithm, which has the nice property that the function value decreases in each iteration. Our convergence analysis reveals that the additional inertial term prevents our algorithm from monotonically decreasing the function values. Although this may look like a limitation at first glance, demanding monotonically decreasing function values is anyway too strict, as it does not allow for provably optimal schemes. We refer to a statement of Nesterov [35]: "In convex optimization the optimal methods never rely on relaxation. Firstly, for some problem classes this property is too expensive. Secondly, the schemes and efficiency estimates of optimal methods are derived from some global topological properties of convex functions."¹ The price for better efficiency estimates of an algorithm is usually a more involved convergence analysis. This is true even for convex functions; for nonconvex and nonsmooth functions, this problem becomes even more severe.
¹ Relaxation is to be interpreted as the property of monotonically decreasing function values in this context. Topological properties should be associated with geometrical properties.
Contributions
Despite this difficulty, we can establish convergence of the sequence of function values for the general case, where the objective function is only required to be composed of a convex and a differentiable function. Regarding the sequence of arguments generated by the algorithm, the existence of a convergent subsequence is shown. Furthermore, we show that each limit point is a critical point of the objective function.
To establish convergence of the whole sequence in the nonconvex case is very hard. However, under the slightly stronger assumption that the objective satisfies the Kurdyka–Łojasiewicz inequality [30, 31, 26], several algorithms have been shown to converge [14, 5, 3, 4]. In [5] an abstract convergence theorem for descent methods with certain properties is proved. It applies to many algorithms; however, it cannot be used for our algorithm. Based on their analysis, we prove an abstract convergence theorem for a different class of descent methods, which applies to iPiano. By verifying the requirements of this abstract convergence theorem, we manage to show the same strong convergence result. From the practical point of view of image processing, computer vision, or machine learning, the Kurdyka–Łojasiewicz inequality is almost always satisfied. For more details about properties of Kurdyka–Łojasiewicz functions and a taxonomy of functions that have this property, we refer to [5, 10, 26].
The last part of the paper is devoted to experiments. We present exemplary results on computer vision tasks, such as denoising and image compression, and show that entering the staggering world of nonconvex functions pays off in practice.
2 Related Work
Forward-backward splitting
In convex optimization, splitting algorithms usually originate from the proximal point algorithm [39]. It is a very general algorithm, and results on its convergence affect many other algorithms. Practically, however, computing one iteration of the algorithm can be as hard as solving the original problem. Among the strategies to tackle this problem are splitting approaches like Douglas–Rachford [28, 18], several primal-dual algorithms [12, 37, 23], and forward-backward splitting [28, 16, 7, 35]; see [15] for a survey.
The forward-backward splitting schemes seem especially appealing to generalize to nonconvex problems. This is due to their simplicity and to the existence of simpler formulations in some special cases, for example the gradient projection method, where the backward step is the projection onto a set [27, 22]. In [19] the classical forward-backward algorithm, where the backward step is the solution of a proximal term involving a convex function, is studied for a nonconvex problem. In fact, the same class of objective functions as in the present paper is analyzed. The algorithm presented here comprises the algorithm from [19] as a special case. Nesterov [36] also briefly discusses this algorithm in a general setting. Even the reverse setting, where the backward step is performed on a nonsmooth nonconvex function, has been generalized [5, 11].
As the amount of data to be processed is growing and algorithms are supposed to exploit all the data in each iteration, inexact methods become interesting, though we do not consider erroneous estimates in this paper. Forward-backward splitting schemes also seem to work for nonconvex problems with erroneous estimates [44, 43]. A mathematical analysis of inexact methods can be found, e.g., in [14, 5], but with the restriction that the method is explicitly required to decrease the function values in each iteration. This restriction comes with significantly improved results with regard to the convergence of the algorithm. The algorithm proposed in this paper provides strong convergence results, although it does not require the function values to decrease.
Optimization with inertial forces
In his seminal work [38], Polyak investigates multi-step schemes to accelerate the gradient method. It turns out that a particularly interesting case is given by a two-step algorithm, which has been coined the Heavy-ball method. The method owes its name to the fact that it can be interpreted as an explicit finite-difference discretization of the so-called Heavy-ball with friction dynamical system. It differs from the usual gradient method by an additional inertial term, computed as the difference of the two preceding iterates. Polyak showed that this method can speed up convergence in comparison to the standard gradient method, while the cost of each iteration stays basically unchanged.
The popular accelerated gradient method of Nesterov [35] obviously shares some similarities with the Heavy-ball method, but it differs from it in one regard: while the Heavy-ball method uses gradients based on the current iterate, Nesterov's accelerated gradient method evaluates the gradient at points that are extrapolated by the inertial force. On strongly convex functions, both methods are equally fast (up to constants), but Nesterov's accelerated gradient method converges much faster on weakly convex functions [17].
The Heavy-ball method requires knowledge about the function parameters (the Lipschitz constant of the gradient and the modulus of strong convexity) to achieve the optimal convergence rate, which can be seen as a disadvantage. Interestingly, the conjugate gradient method for minimizing strictly convex quadratic problems can be expressed as a Heavy-ball method; hence, it can be seen as a special case of the Heavy-ball method for quadratic problems. In this special case, no additional knowledge about the function parameters is required, as the algorithm parameters are computed online.
The Heavy-ball method was originally proposed for minimizing differentiable convex functions, but it has been generalized in different ways. In [45], it has been generalized to the case of smooth nonconvex functions. It is shown that, by considering an appropriate Lyapunov objective function, the iterates are attracted by the connected components of stationary points. In Section LABEL:sec:alg it will become evident that the nonconvex Heavy-ball method is a special case of our algorithm, and the convergence analysis of [45] also shows some similarities to ours.
3 An abstract convergence result
3.1 Preliminaries
We consider the Euclidean vector space ℝ^N of dimension N ≥ 1 and denote the standard inner product by ⟨·, ·⟩ and the induced norm by ‖·‖ = ⟨·, ·⟩^{1/2}. Let f : ℝ^N → ℝ ∪ {+∞} be a proper lower semicontinuous function.
Definition 3.1 (effective domain, proper)
The (effective) domain of f is defined by dom f := {x ∈ ℝ^N : f(x) < +∞}. The function f is called proper if dom f is nonempty.
In order to give a sound description of the first-order optimality condition for a nonconvex nonsmooth optimization problem, we have to introduce a generalization of the subdifferential of convex functions.
Definition 3.2 (Limiting subdifferential)
The limiting subdifferential (or simply subdifferential) is defined by (see [40, Def. 8.3])
(3.1) ∂f(x) := { g ∈ ℝ^N : ∃ x^k → x, f(x^k) → f(x), g^k ∈ ∂̂f(x^k), g^k → g },
which makes use of the Fréchet subdifferential, defined by
∂̂f(x) := { g ∈ ℝ^N : liminf_{y → x, y ≠ x} ( f(y) − f(x) − ⟨g, y − x⟩ ) / ‖y − x‖ ≥ 0 }
when x ∈ dom f, and by ∂̂f(x) := ∅ else.
The domain of the subdifferential is dom ∂f := { x ∈ ℝ^N : ∂f(x) ≠ ∅ }.
In what follows, we will consider the problem of finding a critical point x* of f, which is characterized by the necessary first-order optimality condition 0 ∈ ∂f(x*).
We state the definition of the Kurdyka–Łojasiewicz property from [4].
Definition 3.3 (Kurdyka–Łojasiewicz property)
The function f has the Kurdyka–Łojasiewicz property at x̄ ∈ dom ∂f, if there exist η ∈ (0, +∞], a neighborhood U of x̄, and a continuous concave function φ : [0, η) → ℝ₊ such that φ(0) = 0, φ is continuously differentiable on (0, η), φ′(s) > 0 for all s ∈ (0, η), and for all x ∈ U ∩ { x : f(x̄) < f(x) < f(x̄) + η } the Kurdyka–Łojasiewicz inequality holds, i.e.,
\[ φ′( f(x) − f(x̄) ) \, \operatorname{dist}( 0, ∂f(x) ) \ge 1. \]
If the function satisfies the Kurdyka–Łojasiewicz inequality at each point of dom ∂f, it is called a KL function.
Roughly speaking, this condition says that we can bound the subgradient of a function from below by a reparametrization of its function values. In the smooth case, we can also say that, up to a reparametrization, the function is sharp, meaning that any nonzero gradient can be bounded away from 0. This is sometimes called a desingularization. It has been shown in [4] that a proper lower semicontinuous extended-valued function always satisfies this inequality at each nonstationary point. For more details and other interpretations of this property, also for different formulations, we refer to [10].
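For intuition, consider a standard one-dimensional example (our illustration, not from the text): the function f(x) = |x|^p with p > 1 has the KL property at x̄ = 0 with the desingularization φ(s) = s^{1/p}, since for x ≠ 0

```latex
\varphi'\big(f(x) - f(0)\big)\,\operatorname{dist}\big(0, \partial f(x)\big)
= \tfrac{1}{p}\,|x|^{p\left(\frac{1}{p}-1\right)} \cdot p\,|x|^{p-1}
= |x|^{1-p}\,|x|^{p-1} = 1 \;\ge\; 1 .
```

The flatter the function around its critical point (large p), the stronger the reparametrization φ must bend to restore sharpness.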
A big class of functions that have the KL property is given by real semialgebraic functions [4]. Real semialgebraic functions are defined as functions whose graph is a real semialgebraic set.
Definition 3.4 (real semialgebraic set)
A subset S of ℝ^N is semialgebraic, if there exists a finite number of real polynomials P_{ij}, Q_{ij} : ℝ^N → ℝ such that
S = ⋃_{j=1}^{p} ⋂_{i=1}^{q} { x ∈ ℝ^N : P_{ij}(x) = 0, Q_{ij}(x) < 0 }.
3.2 Inexact descent convergence result for KL functions
In the following, we prove an abstract convergence result for a sequence (z^n)_{n∈ℕ} = (x^n, x^{n−1})_{n∈ℕ} in ℝ^{2N}, N ≥ 1, satisfying certain basic conditions, H1–H3 below. For convenience we use the abbreviation F_n := F(z^n) for n ∈ ℕ. We fix two positive constants a > 0 and b > 0 and consider a proper lower semicontinuous function F : ℝ^{2N} → ℝ ∪ {+∞}. Then, the conditions we require for (z^n)_{n∈ℕ} are

(H1) For each n ∈ ℕ, it holds that
F_{n+1} + a ‖x^n − x^{n−1}‖² ≤ F_n.

(H2) For each n ∈ ℕ, there exists w^{n+1} ∈ ∂F(z^{n+1}) such that
‖w^{n+1}‖ ≤ (b/2) ( ‖x^n − x^{n−1}‖ + ‖x^{n+1} − x^n‖ ).

(H3) There exists a subsequence (z^{n_j})_{j∈ℕ} such that
z^{n_j} → z̃ and F(z^{n_j}) → F(z̃), as j → ∞.
Based on these conditions, we derive the same convergence result as in [5]. The statements and proofs of the subsequent results follow the same ideas as in [5]; we merely modified the involved calculations according to our Conditions H1, H2, and H3.
Remark 1
Remark 2
Lemma 3.5
Let be a proper lower semicontinuous function which satisfies the Kurdyka–Łojasiewicz property at some point . Denote by , and the objects appearing in Definition LABEL:def:KLproperty of the KL property at . Let be such that with , where .
Furthermore, let be a sequence satisfying Conditions H1, H2, and
(3.2)
Moreover, the initial point is such that and
(3.3)
Then, the sequence satisfies
(3.4)
converges to a point such that . If, additionally, Condition H3 is satisfied, then and .
Proof. The key points of the proof are the facts that for all :
(3.5)
(3.6)
Let us first see that is well-defined. By Condition H1, is nonincreasing, which shows . Combining this with (LABEL:eq:mteqA) implies .
As for the set is nonempty (see Condition H2) every belongs to . For notational convenience, we define
Now, we want to show that for holds: if and , then
(3.7)
Obviously, we can assume that (otherwise it is trivial), and therefore H1 and (LABEL:eq:mteqA) imply . The KL inequality shows and H2 shows . Since , using KL inequality and H2, we obtain
As is concave and increasing (), Condition H1 and (LABEL:eq:mteqA) yield
Combining both inequalities results in
which by applying establishes (LABEL:eq:mteqB).
As (LABEL:eq:mteqA) only implies , , we cannot use (LABEL:eq:mteqB) directly for the whole sequence. However, (LABEL:eq:keyA) and (LABEL:eq:keyB) can be shown by induction on . For , (LABEL:eq:mteqA) yields and . From Condition H1 with , and , we infer
(3.8)
which combined with (LABEL:eq:mteqD) leads to
and therefore . Direct use of (LABEL:eq:mteqB) with shows that (LABEL:eq:keyB) holds with .
Suppose (LABEL:eq:keyA) and (LABEL:eq:keyB) are satisfied for . Then, using the triangle inequality and (LABEL:eq:keyB), we have
which shows, using and (LABEL:eq:mteqD), that . As a consequence (LABEL:eq:mteqB), with , can be added to (LABEL:eq:keyB) and we can conclude (LABEL:eq:keyB) with . This shows the desired induction on .
Now, the finiteness of the length of the sequence , i.e., , is a consequence of the following estimation, which is implied by (LABEL:eq:keyB),
Therefore, converges to some as , and converges to . As is concave, is decreasing. Using this and Condition H2 yields and . Suppose we have , then the KL inequality reads for all , which contradicts .
Note that, in general, is not a critical point of , because the limiting subdifferential requires as . When the sequence additionally satisfies Condition H3, then , and is a critical point of , because .
Remark 3
The only difference to [5] with respect to the assumptions is (LABEL:eq:mteqA). In [5], implies , whereas we require and . However, as Theorem LABEL:thm:convabstract shows, this does not weaken the convergence result compared to [5]. In fact, Corollary LABEL:cor:seqballcondition, which assumes for all and which is also used in [5], is key in Theorem LABEL:thm:convabstract.
The next corollary and the subsequent theorem follow as in [5] by replacing the calculation with our conditions.
Corollary 3.6
Lemma LABEL:lem:maintheoremconvergence holds true, if we replace (LABEL:eq:mteqA) by
Proof. By Condition H1, for , we have
Using the triangle inequality on shows that , which implies (LABEL:eq:mteqA) and concludes the proof.
The work that is done in Lemma LABEL:lem:maintheoremconvergence and Corollary LABEL:cor:seqballcondition allows us to formulate an abstract convergence theorem for sequences satisfying the Conditions H1, H2, and H3. It follows, with a few modifications, as in [5].
Theorem 3.7 (Convergence to a critical point)
Let be a proper lower semicontinuous function and a sequence that satisfies H1, H2, and H3. Moreover, let have the Kurdyka–Łojasiewicz property at the cluster point specified in H3.
Then, the sequence has finite length, i.e., , and converges to as , where is a critical point of .
Proof. By Condition H3, we have and for a subsequence . This, together with the nonincreasingness of (by Condition H1), implies that and for all . The KL property around states the existence of quantities , , and as in Definition LABEL:def:KLproperty. Let be such that and . Shrink such that (if necessary). As is continuous, there exists such that for all and
Then, the sequence defined by satisfies the conditions in Corollary LABEL:cor:seqballcondition, which concludes the proof.
4 The proposed algorithm: iPiano
4.1 The optimization problem
We consider a structured nonsmooth nonconvex optimization problem with a proper lower semicontinuous extended-valued function h : ℝ^N → ℝ ∪ {+∞}, N ≥ 1:
(4.1) min_{x ∈ ℝ^N} h(x) = f(x) + g(x),
which is composed of a smooth (possibly nonconvex) function f with an L-Lipschitz continuous gradient on dom g, L > 0, and a convex (possibly nonsmooth) function g. Furthermore, we require h to be coercive, i.e., ‖x‖ → +∞ implies h(x) → +∞, and to be bounded from below by some value h̲ > −∞.
The proposed algorithm, which is stated in Subsection LABEL:subsec:genericalg, seeks a critical point x* of h, which is characterized by the necessary first-order optimality condition 0 ∈ ∂h(x*). In our case, this is equivalent to
−∇f(x*) ∈ ∂g(x*).
This equivalence is explicitly verified in the next subsection, where we collect some details and state some basic properties, which are used in the convergence analysis in Subsection LABEL:subsec:ncconvana.
4.2 Preliminaries
Consider the function f first. It is required to be smooth with Lipschitz continuous gradient on dom g, i.e., there exists a constant L > 0 such that
(4.2) ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖ for all x, y ∈ dom g.
This directly implies that is a nonempty convex set, as . This property of plays a crucial role in our convergence analysis due to the following lemma (stated as in [5]).
Lemma 4.1 (descent lemma)
Let f be a function with L-Lipschitz continuous gradient on ℝ^N. Then for any x, y ∈ ℝ^N it holds that
(4.3) f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖².
Proof. See for example [35].
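The descent lemma can be checked numerically on a simple instance. The quadratic f(x) = 0.5·xᵀQx below is our own illustrative choice (its gradient Qx is Lipschitz with constant L = ‖Q‖₂, the spectral norm); this is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Q = M.T @ M                      # symmetric positive semidefinite Hessian
L = np.linalg.norm(Q, 2)         # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

# check f(y) <= f(x) + <grad f(x), y - x> + (L/2)*||y - x||^2 on random pairs
ok = True
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    bound = f(x) + grad(x) @ (y - x) + 0.5 * L * np.dot(y - x, y - x)
    ok = ok and (f(y) <= bound + 1e-9)
print(ok)  # True: the quadratic upper bound holds for every sampled pair
```

For this quadratic the gap between the two sides is exactly 0.5·(y−x)ᵀ(LI − Q)(y−x) ≥ 0, which is why the bound is tight in the direction of Q's largest eigenvector.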
We assume that the function g is a proper lower semicontinuous convex function with an efficiently computable proximal map.
Definition 4.2 (proximal map)
Let g be a proper lower semicontinuous convex function. Then, we define the proximal map
(I + α ∂g)^{−1}(y) := arg min_{x ∈ ℝ^N} { ‖x − y‖²/2 + α g(x) },
where α > 0 is a given parameter, I is the identity map, and y ∈ ℝ^N.
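Two textbook closed-form proximal maps illustrate the definition; the concrete choices of g below (a box indicator and a squared norm) are our own examples, not taken from the paper.

```python
import numpy as np

def prox_box(y, lo, hi):
    """Prox of the indicator of [lo, hi]: the projection (clipping);
    it is independent of the parameter alpha."""
    return np.clip(y, lo, hi)

def prox_sq(y, alpha):
    """Prox of g(x) = 0.5*||x||^2: shrinkage toward the origin."""
    return y / (1.0 + alpha)

# Numerically confirm the argmin property of prox_sq for a scalar y:
# the prox minimizes g(x) + ||x - y||^2 / (2*alpha).
y, alpha = 2.0, 0.5
obj = lambda x: 0.5 * x ** 2 + (x - y) ** 2 / (2 * alpha)
grid = np.linspace(-4.0, 4.0, 100001)
x_star = grid[np.argmin(obj(grid))]
print(abs(x_star - prox_sq(y, alpha)) < 1e-3)  # True
```

The grid search lands on x ≈ 4/3, matching the closed form y/(1 + α) = 2/1.5.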
An important (basic) property that the convex function g contributes to the convergence analysis is the following:
Lemma 4.3
Let g be a proper lower semicontinuous convex function. Then it holds for any x, y ∈ ℝ^N and u ∈ ∂g(x) that
(4.4) g(y) ≥ g(x) + ⟨u, y − x⟩.
Proof. This result follows directly from the convexity of .
Finally, let us consider the optimality condition in more detail. The following proposition proves the equivalence to 0 ∈ ∂h(x). The proof is mainly based on Definition LABEL:def:differential of the limiting subdifferential.
Proposition 4.4
Let f, g, and h be as before, i.e., let h = f + g with f continuously differentiable and g convex. Sometimes, h is then called a perturbation of a convex function. Then, for x ∈ dom h it holds that
∂h(x) = ∇f(x) + ∂g(x).
Proof. We first prove "⊆". Let , i.e., there is a sequence such that , , and , where . We want to show that . As and , we have
It remains to show that . First, remember that the limit inferior is superadditive, i.e., for two sequences (a_k)_{k∈ℕ}, (b_k)_{k∈ℕ} in ℝ ∪ {±∞} it holds that liminf_{k→∞}(a_k + b_k) ≥ liminf_{k→∞} a_k + liminf_{k→∞} b_k; moreover, if one of the sequences converges, equality holds. Using this fact and again thanks to , we conclude
where and are over . Therefore, .
The other inclusion "⊇" is trivial.
As a consequence, a critical point can also be characterized by the following definition.
Definition 4.5 (proximal residual)
Let f and g be as before. Then, we define the proximal residual
r(x) := x − (I + ∂g)^{−1}( x − ∇f(x) ).
It can be easily seen that r(x) = 0 is equivalent to 0 ∈ ∂h(x) and −∇f(x) ∈ ∂g(x), which is the first-order optimality condition. The proximal residual is defined with respect to a fixed step size of 1. The rationale behind this becomes obvious when g is the indicator function of a convex set: in this case, a small residual could be caused solely by small step sizes, as the reprojection onto the convex set is independent of the step size.
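A minimal sketch of the residual check, under illustrative assumptions of our own: f(x) = 0.5·‖x − c‖² and g the indicator of the box [0, 1]ⁿ, whose prox (with any step size) is the projection, i.e., clipping.

```python
import numpy as np

def prox_residual(x, grad_f, prox_g):
    """r(x) = x - (I + dg)^{-1}(x - grad_f(x)), fixed step size 1;
    r(x) = 0 exactly at critical points of f + g."""
    return x - prox_g(x - grad_f(x))

c = np.array([0.5, 2.0, -1.0])
grad_f = lambda x: x - c               # gradient of 0.5*||x - c||^2
prox_g = lambda y: np.clip(y, 0.0, 1.0)  # projection onto the box [0, 1]^3

# the projection of c onto the box is the constrained minimizer
x_crit = np.clip(c, 0.0, 1.0)
print(np.linalg.norm(prox_residual(x_crit, grad_f, prox_g)))  # 0.0
```

At any other point, for example x = 0, the residual is nonzero, signaling that the first-order optimality condition is violated there.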
4.3 The generic algorithm
In this paper, we propose an algorithm, iPiano, with the generic formulation in Algorithm LABEL:alg:ipianointro. It is a forward-backward splitting algorithm incorporating an inertial force. In the forward step, α_n determines the step size in the direction of the gradient of the differentiable function f. The step in the gradient direction is aggregated with the inertial force from the previous iteration, weighted by β_n. Then, the backward step is the solution of the proximity operator for the function g with the weight α_n.
In order to make the algorithm specific and convergent, the step size parameters must be chosen appropriately. What "appropriately" means will be specified in Subsection LABEL:subsec:strategies and proved in Subsection LABEL:subsec:ncconvana.
4.4 Rules for choosing the step size
In this subsection, we propose several strategies for choosing the step sizes. This will make it easier to implement the algorithm. One may choose among the following variants of step size rules depending on the knowledge about the objective function.
Constant step size scheme
The simplest rule, which requires the most knowledge about the objective function, is outlined in Algorithm LABEL:alg:ipianoconststep. All step size parameters are chosen a priori and are constant.
Remark 4
Observe that our law on is equivalent to the law found in [45] for minimizing a smooth nonconvex function. Hence, our result can be seen as an extension of their work to the presence of an additional nonsmooth convex function.
Backtracking
The case where we have only limited knowledge about the objective function occurs more frequently. It can be very challenging to estimate the Lipschitz constant of ∇f beforehand. Using backtracking, the Lipschitz constant can be estimated automatically. A sufficient condition that the estimate L_n of the Lipschitz constant at iteration n must satisfy is
(4.7) f(x^{n+1}) ≤ f(x^n) + ⟨∇f(x^n), x^{n+1} − x^n⟩ + (L_n/2) ‖x^{n+1} − x^n‖².
Although there are different strategies to determine L_n, the most common one is to define an increment variable η > 1 and to look for the minimal i ∈ ℕ such that L_n = η^i L_{n−1} satisfies (LABEL:eq:BTupcondlip). Sometimes, it is also feasible to decrease the estimated Lipschitz constant after a few iterations. A possible strategy is as follows: if , then search for the minimal satisfying (LABEL:eq:BTupcondlip).
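The increase-only part of this backtracking rule can be sketched as follows. The function, the factor η = 2, and all names are illustrative assumptions; the sufficient condition tested is the quadratic upper bound (4.7) between the current and the trial iterate.

```python
import numpy as np

def estimate_lipschitz(f, grad_f, x, x_new, L_prev, eta=2.0, max_iter=50):
    """Increase L = eta^i * L_prev until
    f(x_new) <= f(x) + <grad f(x), x_new - x> + (L/2)*||x_new - x||^2."""
    L = L_prev
    d = x_new - x
    lin = f(x) + grad_f(x) @ d           # linearization at x
    for _ in range(max_iter):
        if f(x_new) <= lin + 0.5 * L * (d @ d) + 1e-12:
            return L
        L *= eta                          # backtrack: enlarge the estimate
    return L

# toy check: f(x) = 2*||x||^2 has gradient 4*x, so the true constant is 4
f = lambda x: 2.0 * (x @ x)
grad_f = lambda x: 4.0 * x
x = np.array([1.0, -1.0])
x_new = x - 0.1 * grad_f(x)
print(estimate_lipschitz(f, grad_f, x, x_new, L_prev=1.0))  # 4.0
```

Starting from L_prev = 1 and doubling, the search stops at 4, the smallest power-of-two estimate for which the bound holds on this quadratic.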
In Algorithm LABEL:alg:ipianoBT we propose an algorithm with variable step sizes. Any strategy for estimating the Lipschitz constant may be used. When the Lipschitz constant changes from one iteration to the next, all step size parameters must be adapted. The rules for adapting the step sizes will be justified in the convergence analysis in Subsection LABEL:subsec:ncconvana.
Lazy backtracking
Algorithm LABEL:alg:ipianoBTL presents another alternative to Algorithm LABEL:alg:ipianointro. It is related to Algorithms LABEL:alg:ipianoconststep and LABEL:alg:ipianoBT in the following way. Algorithm LABEL:alg:ipianoBTL makes use of the Lipschitz continuity of ∇f in the sense that the Lipschitz constant is always finite. As a consequence, using backtracking with only increasing Lipschitz constants, after a finite number of iterations the estimated Lipschitz constant does not change anymore, and from that iteration on the constant step size rules of Algorithm LABEL:alg:ipianoconststep apply. Using this strategy, the results that will be proved in the convergence analysis hold only as soon as the Lipschitz constant is high enough and does not change anymore.
General rule of choosing the step sizes
Algorithm LABEL:alg:ipianogeneral defines the general rules that the step size parameters must satisfy.
It contains the Algorithms LABEL:alg:ipianoconststep, LABEL:alg:ipianoBT, and LABEL:alg:ipianoBTL as special instances. This is easily verified for Algorithms LABEL:alg:ipianoconststep and LABEL:alg:ipianoBTL. For Algorithm LABEL:alg:ipianoBT the step size rules are derived from the proof of Lemma LABEL:lem:existrelparameter.
As Algorithm LABEL:alg:ipianogeneral is the most general one, let us now analyze its behavior.
4.5 Convergence analysis
In all that follows, let (x^n)_{n∈ℕ} be the sequence generated by Algorithm LABEL:alg:ipianogeneral with parameters satisfying the algorithm's requirements. Furthermore, for more convenient notation we abbreviate , , and . Note that for it holds .
Let us first verify that the algorithm is well-defined. We have to show that the requirements on the parameters are not contradictory, i.e., that it is possible to choose a feasible set of parameters. In the following lemma, we only show the existence of such a parameter set; however, the proof helps us to formulate specific step size rules.
Lemma 4.6
For all , there are , , and . Furthermore, given , there exists a choice of parameter and such that additionally is monotonically decreasing.
Proof. By the algorithm’s requirements it is
The upper bound for and come from rearranging
to and
, respectively.
The last statement follows by incorporating the descent property of .
Let be chosen initially. Then, the descent property of