Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria
Abstract
We consider optimization algorithms that successively minimize simple Taylor-like models of the objective function. Methods of Gauss-Newton type for minimizing the composition of a convex function and a smooth map are common examples. Our main result is an explicit relationship between the stepsize of any such algorithm and the slope of the function at a nearby point. Consequently, we (1) show that the stepsizes can be reliably used to terminate the algorithm, (2) prove that as long as the stepsizes tend to zero, every limit point of the iterates is stationary, and (3) show that conditions, akin to classical quadratic growth, imply that the stepsizes linearly bound the distance of the iterates to the solution set. The latter so-called error bound property is typically used to establish linear (or faster) convergence guarantees. Analogous results hold when the stepsize is replaced by the square root of the decrease in the model’s value. We complete the paper with extensions to when the models are minimized only inexactly.
Keywords: Taylor-like model, error-bound, slope, subregularity, Kurdyka-Łojasiewicz inequality, Ekeland’s principle
AMS 2010 Subject Classification: 65K05, 90C30, 49M37, 65K10
1 Introduction
A basic algorithmic strategy for minimizing a function $f$ on $\mathbb{R}^n$ is to successively minimize simple “models” of the function, agreeing with $f$ at least up to first-order near the current iterate. We will broadly refer to such models as “Taylor-like”. Some classical examples will help ground the exposition. When $f$ is smooth, common algorithms given a current iterate $x_k$ declare the next iterate $x_{k+1}$ to be a minimizer of the quadratic model
$$m_k(y) = f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \tfrac{1}{2}\langle B_k (y - x_k), y - x_k \rangle.$$
When the matrix $B_k$ is a multiple of the identity, the scheme reduces to gradient descent; when $B_k$ is the Hessian $\nabla^2 f(x_k)$, one recovers Newton’s method; adaptively changing $B_k$ based on accumulated information covers quasi-Newton algorithms. Higher-order models can also appear; the cubically regularized Newton’s method of Nesterov-Polyak [32] uses the models
$$m_k(y) = f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \tfrac{1}{2}\langle \nabla^2 f(x_k)(y - x_k), y - x_k \rangle + \tfrac{M}{6}\,\|y - x_k\|^3.$$
For more details on Taylor-like models in smooth minimization, see Nocedal-Wright [37].
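To make the recipe concrete, here is a minimal numerical sketch: minimizing the quadratic model above with $B_k = L I$ reproduces a gradient step, while $B_k = \nabla^2 f(x_k)$ gives a Newton step. The test function, matrix, and names below are assumptions made only for this illustration.

```python
import numpy as np

def model_step(grad, B, x):
    """Closed-form minimizer of the quadratic model
       m(y) = f(x) + <grad, y - x> + 0.5 <B (y - x), y - x>,
    namely y = x - B^{-1} grad (B is assumed positive definite)."""
    return x - np.linalg.solve(B, grad)

# Illustrative test function: f(x) = 0.5 x^T A x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f_grad = lambda x: A @ x

x = np.array([1.0, -1.0])
L = np.max(np.linalg.eigvalsh(A))                 # Lipschitz constant of the gradient
x_gd = model_step(f_grad(x), L * np.eye(2), x)    # B = L*I: a gradient-descent step
x_newton = model_step(f_grad(x), A, x)            # B = Hessian: Newton's method
print(x_newton)   # ≈ [0, 0], the minimizer of this quadratic f in one step
```

For a quadratic objective, the Newton model coincides with the function itself, so a single model-minimization step lands on the true minimizer.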
The algorithmic strategy generalizes far beyond smooth minimization. One important arena, and the motivation for the current work, is the class of convex composite problems
(1.1) $\qquad \min_x\; F(x) := g(x) + h(c(x)).$
Here $g \colon \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is a closed convex function (possibly taking infinite values), $h \colon \mathbb{R}^m \to \mathbb{R}$ is a finite-valued Lipschitz convex function, and $c \colon \mathbb{R}^n \to \mathbb{R}^m$ is a smooth map. Algorithms for this problem class have been studied extensively, notably in [40, 7, 47, 46, 20, 41] and more recently in [27, 17, 10]. Given a current iterate $x_k$, common algorithms declare the next iterate $x_{k+1}$ to be a minimizer of
(1.2) $\qquad F_{x_k}(y) := g(y) + h\big(c(x_k) + \nabla c(x_k)(y - x_k)\big) + \tfrac{1}{2}\langle B_k (y - x_k), y - x_k \rangle.$
The underlying assumption is that a minimizer of $F_{x_k}$ can be efficiently computed. This is the case, for example, when interior-point methods can be directly applied to the convex subproblem or when evaluating $c$ and $\nabla c$ is already the computational bottleneck. The latter setting is ubiquitous in derivative-free optimization; see for example the discussion in Wild [45]. The model in (1.2) is indeed Taylor-like, even when $g$ and $h$ are nonconvex: if $h$ is $\ell$-Lipschitz and $\nabla c$ is $\beta$-Lipschitz, then the inequality $|F(y) - F_{x_k}(y)| \le \tfrac{\ell\beta + \|B_k\|}{2}\,\|y - x_k\|^2$ holds for all points $y$, as the reader can readily verify. When $B_k$ is a multiple of the identity, the resulting method is called the “prox-linear algorithm” in [17, 27], and it subsumes a great variety of schemes.
In the setting $c(x) \equiv x$, the prox-linear algorithm reduces to the proximal point method on the function $g + h$ [42, 30, 31]. When $c$ maps to the real line and $h$ is the identity function, the scheme is the proximal gradient algorithm on the function $g + c$ [36, 3]. The Levenberg-Marquardt algorithm [29] for nonlinear least squares – a close variant of Gauss-Newton – corresponds to setting $g = 0$ and $h = \|\cdot\|^2$. Allowing $B_k$ to vary with accumulated information yields variable-metric variants of the aforementioned algorithms; see e.g. [44, 9, 7]. Extensions where $g$ and $h$ are not necessarily convex, but are nonetheless simple, are also important and interesting, in large part because of nonconvex penalties and regularizers common in machine learning applications. Other important variants interlace the model minimization step with inertial corrector steps, such as in accelerated gradient methods [33, 21], cubically regularized Newton [35], and convex composite algorithms [16].
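As an illustration of the Levenberg-Marquardt instance ($g = 0$, $h = \|\cdot\|^2$), the sketch below iterates the damped linearized least-squares subproblem. The toy residual map, starting point, and damping constant are assumptions made for this example, not part of the theory.

```python
import numpy as np

def levenberg_marquardt_step(c, J, x, lam):
    """One prox-linear step for min_x ||c(x)||^2 with g = 0 and h = ||.||^2:
    minimize ||c(x) + J(x) d||^2 + lam * ||d||^2 over the step d."""
    Jx, cx = J(x), c(x)
    d = np.linalg.solve(Jx.T @ Jx + lam * np.eye(x.size), -Jx.T @ cx)
    return x + d

# Toy residual map: c(x) = (x1^2 + x2 - 1, x1 + x2^2 - 1), which has exact roots.
c = lambda x: np.array([x[0] ** 2 + x[1] - 1.0, x[0] + x[1] ** 2 - 1.0])
J = lambda x: np.array([[2 * x[0], 1.0], [1.0, 2 * x[1]]])

x = np.array([2.0, 2.0])
for _ in range(50):
    x = levenberg_marquardt_step(c, J, x, lam=1e-2)
print(np.round(c(x), 6))   # residual driven to (numerically) zero
```

With a small damping constant the step is close to a Gauss-Newton step, and the residual is driven to zero rapidly on this toy instance.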
In this work, we take a broader view of nonsmooth optimization algorithms that use Taylor-like models. Setting the stage, consider the minimization problem
$$\min_x\; f(x)$$
for an arbitrary lower-semicontinuous function $f$ on $\mathbb{R}^n$. The model-based algorithms we consider simply iterate the steps: $x_{k+1}$ is a minimizer of some model $f_{x_k}$ based at $x_k$. In light of the discussion above, we assume that the models approximate $f$ (uniformly) up to first-order, meaning
$$|f(y) - f_{x_k}(y)| \le \omega(\|y - x_k\|) \qquad \text{for all indices } k \text{ and points } y,$$
where $\omega$ is any smooth function satisfying $\omega(0) = \omega'(0) = 0$. For uses of a wider class of models for bundle methods, based on cutting planes, see Noll-Prot-Rondepierre [38].
In this great generality, we begin with the following basic question.
When should one terminate an algorithm that uses Taylorlike models?
For smooth nonconvex optimization, the traditional way to reliably terminate the algorithm is to stop when the norm of the gradient at the current iterate is smaller than some tolerance. For nonsmooth problems, termination criteria based on optimality conditions along the iterates may be meaningless, as they may never be satisfied even in the limit. For example, one can easily exhibit a convex composite problem so that the iterates generated by the prox-linear algorithm described above converge to a stationary point, while the optimality conditions at the iterates are not satisfied even in the limit.
There are, on the other hand, two appealing stopping criteria one can try: terminate the algorithm when either the stepsize $d(x_k, x_{k+1})$ or the model improvement $f(x_k) - f_{x_k}(x_{k+1})$ is sufficiently small. We will prove that both of these simple termination criteria are indeed reliable in the following sense. Theorem 3.1 and Corollary 5.4 show that if either the stepsize or the model improvement is small, then there exists a point $\hat y$ close to $x_{k+1}$ in both distance and in function value, which is nearly stationary for the problem. Determining the point $\hat y$ is not important; the only role of $\hat y$ is to certify that the current iterate $x_{k+1}$ is “close to near-stationarity” in the sense above. Theorem 3.1 follows quickly from Ekeland’s variational principle [19] – a standard variational analytic tool. For other uses of the technique in variational analysis, see for example the survey [23]. Stopping criteria based on small nearby subgradients have appeared in many other contexts, such as in the descent methods of [22] and the gradient sampling schemes of [8].
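Schematically, both criteria can be monitored inside any model-based loop. The sketch below uses the proximal gradient instance on the toy function $f(x) = |x| + (x-1)^2$; the function, stepsize, and tolerance are assumptions made for the illustration.

```python
import numpy as np

def prox_abs(v, t):
    """Proximal operator of t*|.| (soft-thresholding)."""
    return np.sign(v) * max(abs(v) - t, 0.0)

# Minimize f(x) = |x| + (x - 1)^2 by successively minimizing the model
#   f_x(y) = |y| + (x - 1)^2 + 2 (x - 1)(y - x) + (y - x)^2 / (2 t).
f = lambda x: abs(x) + (x - 1.0) ** 2
grad_smooth = lambda x: 2.0 * (x - 1.0)
t, x, tol = 0.4, 5.0, 1e-10

for k in range(1000):
    x_new = prox_abs(x - t * grad_smooth(x), t)       # minimizer of the model
    step = abs(x_new - x)                              # stepsize
    model_value = (abs(x_new) + (x - 1.0) ** 2
                   + 2.0 * (x - 1.0) * (x_new - x) + (x_new - x) ** 2 / (2 * t))
    improvement = f(x) - model_value                   # model improvement (>= 0)
    x = x_new
    # Either small quantity certifies near-stationarity at a nearby point.
    if step <= tol or np.sqrt(max(improvement, 0.0)) <= tol:
        break

print(round(x, 4))   # prints 0.5, the minimizer of |x| + (x - 1)^2
```

Note that both monitored quantities are computable from the iterates alone, without any subgradient information about $f$ itself.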
Two interesting consequences for convergence analysis flow from this result.
Suppose that the models are chosen in such a way that the stepsizes $d(x_k, x_{k+1})$ tend to zero. This assumption is often enforced by ensuring that $f_{x_k}(x_{k+1})$ is smaller than $f(x_k)$ by at least a multiple of $d(x_k, x_{k+1})^2$ (a sufficient decrease condition), using a line-search procedure or by safeguarding the minimal eigenvalue of $B_k$.
Then, assuming for simplicity that $f$ is continuous on its domain, any limit point $x^*$ of the iterate sequence $\{x_k\}$ will be stationary for the problem (Corollary 3.3).
The subsequence convergence result is satisfying, since very little is assumed about the underlying algorithm. A finer analysis of linear, or faster, convergence rates relies on some regularity of the function $f$ near a limit point $x^*$ of the iterate sequence $\{x_k\}$. One of the weakest such regularity assumptions is that for all $x$ near $x^*$, the “slope” of $f$ at $x$ linearly bounds the distance of $x$ to the set of stationary points – the “error”. Here, we call this property the slope error-bound. To put it in perspective, we note that the slope error-bound always entails a classical quadratic growth condition away from the stationary points (see [13, 18]), and is equivalent to it whenever $f$ is convex (see [1, 25]). Moreover, as an aside, we observe in Theorem 3.7 and Proposition 3.8 that under mild conditions, the slope error-bound is equivalent to the “Kurdyka-Łojasiewicz inequality” with exponent $1/2$ – an influential condition also often used to prove linear convergence.
Assuming the slope error-bound, a typical convergence analysis strategy aims to deduce that the stepsizes linearly bound the distance of the iterates to the set of stationary points. Following Luo-Tseng [28], we call the latter property the stepsize error-bound. We show in Theorem 3.5 that the slope error-bound indeed always implies the stepsize error-bound, under the common assumption that the growth function $\omega$ is a quadratic. The proof is a straightforward consequence of the relationship we have established between the stepsize and the slope at a nearby point – underscoring the power of the technique.
In practice, exact minimization of the model function can be impossible. Instead, one can obtain a point $x_{k+1}$ that is only nearly optimal or nearly stationary for the problem $\min_y f_{x_k}(y)$. Section 5 shows that all the results above generalize to this more realistic setting. In particular, somewhat surprisingly, we argue that limit points of the iterates will be stationary even if the tolerances on optimality (or stationarity) and the stepsizes tend to zero at independent rates. The arguments in this inexact setting follow by applying the key result, Theorem 3.1, to small perturbations of $f$ and the models $f_{x_k}$, thus illustrating the flexibility of the theorem.
The convex composite problem (1.1) and the prox-linear algorithm (along with its variable-metric variants) form a fertile application arena for the techniques developed here. An early variant of the key Theorem 3.1 in this setting appeared recently in [17, Theorem 5.3] and was used there to establish sublinear and linear convergence guarantees for the prox-linear method. We review these results and derive extensions in Section 4, as an illustration of our techniques. An important deviation of ours from earlier work is the use of the stepsize as the fundamental analytic tool, in contrast to the measures of Burke [7] and the criticality measures of Cartis-Gould-Toint [10]. To the best of our knowledge, the derived relationship between the stepsize and stationarity at a nearby point is entirely new. The fact that the slope error-bound implies that both the stepsize and the square root of the model improvement linearly bound the distance to the solution set (stepsize and model error-bounds) is entirely new as well; previous related results have relied on polyhedrality assumptions on $h$.
Though the discussion above takes place over the Euclidean space $\mathbb{R}^n$, the most appropriate setting for most of our development is an arbitrary complete metric space, and this is the setting of the paper. The outline is as follows. In Section 2, we establish basic notation and recall Ekeland’s variational principle. Section 3 contains our main results. Section 4 instantiates the techniques for the prox-linear algorithm in composite minimization, while Section 5 explores extensions when the subproblems are solved inexactly.
2 Notation
Fix a complete metric space $\mathcal{X}$ with metric $d(\cdot, \cdot)$. We denote the open ball of radius $r > 0$ around a point $x$ by $B_r(x)$. The distance from $x$ to a set $Q \subseteq \mathcal{X}$ is
$$d(x, Q) := \inf_{y \in Q}\, d(x, y).$$
We will be interested in minimizing functions mapping $\mathcal{X}$ to the extended real line $\overline{\mathbb{R}} := \mathbb{R} \cup \{\pm\infty\}$. A function $f \colon \mathcal{X} \to \overline{\mathbb{R}}$ is called lower-semicontinuous (or closed) if the inequality $\liminf_{y \to x} f(y) \ge f(x)$ holds for all points $x \in \mathcal{X}$.
Consider a closed function $f$ and a point $x$ with $f(x)$ finite. The slope of $f$ at $x$ is simply its maximal instantaneous rate of decrease:
$$|\nabla f|(x) := \limsup_{y \to x}\, \frac{(f(x) - f(y))^+}{d(x, y)}.$$
Here, we use the notation $r^+ := \max\{0, r\}$. If $f$ is a differentiable function on a Euclidean space, the slope $|\nabla f|(x)$ simply coincides with the norm of the gradient $\|\nabla f(x)\|$, and hence the notation. For a convex function $f$, the slope equals the norm of the shortest subgradient: $|\nabla f|(x) = d(0, \partial f(x))$. For more details on the slope and its uses in optimization, see the survey [23] or the thesis [12].
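One can probe the definition numerically by sampling difference quotients on shrinking spheres. The crude sampling scheme below is only an illustration (the test function, radii, and sample counts are assumptions), but for a smooth function it recovers the gradient norm.

```python
import numpy as np

def slope_estimate(f, x, radii=np.logspace(-1, -8, 8), n=2000):
    """Crude numerical estimate of the slope
       |grad f|(x) = limsup_{y -> x} (f(x) - f(y))^+ / d(x, y)
    by sampling difference quotients on shrinking spheres around x."""
    rng = np.random.default_rng(0)
    best = 0.0
    for r in radii:
        dirs = rng.standard_normal((n, x.size))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
        vals = np.apply_along_axis(f, 1, x + r * dirs)
        best = max(best, (np.maximum(f(x) - vals, 0.0) / r).max())
    return best

# For a smooth function the slope equals the gradient norm: f(u, v) = u^2 + 3v
# has gradient (2, 3) at the point (1, 0), so the slope there is sqrt(13) = 3.606...
f = lambda z: z[0] ** 2 + 3.0 * z[1]
print(slope_estimate(f, np.array([1.0, 0.0])))
```

The estimate approaches the gradient norm from below as the sampling radius shrinks, reflecting that the limsup in the definition is attained along the steepest descent direction.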
The function $x \mapsto |\nabla f|(x)$ lacks basic lower-semicontinuity properties. As a result, when speaking of algorithms, it is important to introduce the limiting slope
$$\overline{|\nabla f|}(x) := \liminf_{\substack{y \to x,\; f(y) \to f(x)}}\, |\nabla f|(y).$$
In particular, if $f$ is continuous on its domain, then $\overline{|\nabla f|}$ is simply the lower-semicontinuous envelope of $|\nabla f|$. We say that a point $x$ is stationary for $f$ if the equality $\overline{|\nabla f|}(x) = 0$ holds.
We will be interested in locally approximating functions up to first-order. Seeking to measure the “error in approximation”, we introduce the following definition.
Definition 2.1 (Growth function).
A differentiable univariate function $\omega \colon \mathbb{R}_+ \to \mathbb{R}_+$ is called a growth function if it satisfies $\omega(0) = \omega'(0) = 0$ and $\omega' > 0$ on $(0, \infty)$. If, in addition, the equalities $\lim_{t \searrow 0} \omega'(t) = \lim_{t \searrow 0} \frac{\omega(t)}{\omega'(t)} = 0$ hold, we say that $\omega$ is a proper growth function.
The main examples of proper growth functions are $\omega(t) = \frac{\eta}{r}\, t^r$ for real $\eta > 0$ and $r > 1$.
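Indeed, for $\omega(t) = \frac{\eta}{r}\, t^r$ with $\eta > 0$ and $r > 1$, the defining conditions can be checked directly:
$$\omega(0) = 0, \qquad \omega'(t) = \eta\, t^{r-1} > 0 \ \text{ on } (0, \infty), \qquad \omega'(0) = 0,$$
and moreover
$$\lim_{t \searrow 0} \omega'(t) = \lim_{t \searrow 0} \eta\, t^{r-1} = 0, \qquad \lim_{t \searrow 0} \frac{\omega(t)}{\omega'(t)} = \lim_{t \searrow 0} \frac{t}{r} = 0,$$
so $\omega$ is indeed a proper growth function.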
The following result, proved in [19], will be our main tool. The gist of the theorem is that if a point nearly minimizes a closed function, then it is close to a true minimizer of a slightly perturbed function.
Theorem 2.2 (Ekeland’s variational principle).
Consider a closed function $f \colon \mathcal{X} \to \overline{\mathbb{R}}$ that is bounded from below. Suppose that for some $\varepsilon > 0$ and $x \in \mathcal{X}$, we have $f(x) \le \inf f + \varepsilon$. Then for any real $\rho > 0$, there exists a point $y$ satisfying
1. $f(y) \le f(x)$,
2. $d(x, y) \le \varepsilon / \rho$,
3. $y$ is the unique minimizer of the perturbed function $z \mapsto f(z) + \rho\, d(z, y)$.
Notice that property 3 in Ekeland’s principle directly implies the inequality $|\nabla f|(y) \le \rho$. Thus if a point nearly minimizes $f$, then the slope of $f$ is small at some nearby point.
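In detail, minimality of $y$ for the perturbed function gives $f(z) + \rho\, d(z, y) \ge f(y)$ for every $z$, and therefore
$$\frac{(f(y) - f(z))^+}{d(y, z)} \le \rho \qquad \text{for all } z \ne y;$$
taking the limsup as $z \to y$ yields $|\nabla f|(y) \le \rho$.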
2.1 Slope and subdifferentials
The slope is a purely metric creature. However, for a function $f$ on $\mathbb{R}^n$, the slope is closely related to “subdifferentials”, which may be more familiar to the audience. We explain the relationship here, following [23]. Since the discussion will not be used in the sequel, the reader can safely skip it and move on to Section 3.
A vector $v \in \mathbb{R}^n$ is called a Fréchet subgradient of a function $f \colon \mathbb{R}^n \to \overline{\mathbb{R}}$ at a point $x$ if the inequality
$$f(y) \ge f(x) + \langle v, y - x \rangle + o(\|y - x\|) \qquad \text{as } y \to x$$
holds. The set of all Fréchet subgradients of $f$ at $x$ is the Fréchet subdifferential and is denoted by $\hat\partial f(x)$. The connection of the slope to subgradients is immediate. A vector $v$ lies in $\hat\partial f(x)$ if and only if the slope of the linearly tilted function $y \mapsto f(y) - \langle v, y \rangle$ at $x$ is zero. Moreover, the inequality
(2.1) $\qquad |\nabla f|(x) \le d(0, \hat\partial f(x))$
holds.
The limiting subdifferential of $f$ at $x$, denoted $\partial f(x)$, consists of all vectors $v$ such that there exist sequences $x_i$ and $v_i \in \hat\partial f(x_i)$ satisfying $(x_i, f(x_i), v_i) \to (x, f(x), v)$. Assuming that $f$ is closed, a vector $v$ lies in $\partial f(x)$ if and only if the limiting slope of the linearly tilted function $y \mapsto f(y) - \langle v, y \rangle$ at $x$ is zero. Moreover, Proposition 4.6 in [14] shows that the exact equality
(2.2) $\qquad \overline{|\nabla f|}(x) = d(0, \partial f(x))$
holds. In particular, stationarity of $f$ at $x$ amounts to the inclusion $0 \in \partial f(x)$.
3 Main results
For the rest of the paper, fix a closed function $f$ on a complete metric space $\mathcal{X}$, and a point $x$ with $f(x)$ finite. The following theorem is our main result. It shows that for any function $f_x$ (the “model”) such that the error in approximation $|f(y) - f_x(y)|$ is controlled by a growth function of the distance $d(x, y)$, the distance between $x$ and the minimizer $x^+$ of $f_x$ prescribes near-stationarity of $f$ at some nearby point $\hat y$.
Theorem 3.1 (Perturbation result).
Consider a closed function $f_x \colon \mathcal{X} \to \overline{\mathbb{R}}$ such that the inequality
$$|f(y) - f_x(y)| \le \omega(d(x, y)) \qquad \text{holds for all } y \in \mathcal{X},$$
where $\omega$ is some growth function, and let $x^+$ be a minimizer of $f_x$. If $x^+$ coincides with $x$, then the slope $|\nabla f|(x)$ is zero. On the other hand, if $x^+$ and $x$ are distinct, then there exists a point $\hat y$ satisfying
1. (point proximity) $\quad d(\hat y, x^+) \le \dfrac{2\,\omega(d(x, x^+))}{\omega'(d(x, x^+))}$,
2. (value proximity) $\quad f(\hat y) \le f(x^+) + \omega(d(x, x^+))$,
3. (near-stationarity) $\quad |\nabla f|(\hat y) \le \omega'(d(x, x^+)) + \omega'(d(x, \hat y))$.
Proof.
A quick computation, using the two-sided model bound, shows for every point $y$ the inequality $f(y) \ge f_x(y) - \omega(d(x, y)) \ge f_x(x^+) - \omega(d(x, y))$. Thus if $x^+$ coincides with $x$, we deduce $f(y) \ge f(x) - 2\,\omega(d(x, y))$, and hence the slope $|\nabla f|(x) \le \limsup_{y \to x} \frac{2\,\omega(d(x, y))}{d(x, y)} = 2\,\omega'(0)$ must be zero, as claimed. Therefore, for the remainder of the proof, we will assume that $x^+$ and $x$ are distinct, and we set $T := d(x, x^+)$.
Observe now the inequality
$$f(y) + \omega(d(x, y)) \ge f_x(y) \ge f_x(x^+) \ge f(x^+) - \omega(T) \qquad \text{for all } y.$$
Define the function $g(y) := f(y) + \omega(d(x, y))$ and note $g(x^+) = f(x^+) + \omega(T)$. We deduce
(3.1) $\qquad g(x^+) \le \inf g + 2\,\omega(T).$
An easy argument now shows the inequality $|\nabla f|(y) \le |\nabla g|(y) + \omega'(d(x, y))$ for any point $y$.
Setting $\varepsilon := 2\,\omega(T)$ and applying Ekeland’s variational principle (Theorem 2.2) to $g$, we obtain for any $\rho > 0$ a point $\hat y$ satisfying
$$g(\hat y) \le g(x^+), \qquad d(\hat y, x^+) \le \frac{2\,\omega(T)}{\rho}, \qquad |\nabla g|(\hat y) \le \rho.$$
We conclude $f(\hat y) \le g(\hat y) \le f(x^+) + \omega(T)$ and $|\nabla f|(\hat y) \le \rho + \omega'(d(x, \hat y))$. Setting $\rho := \omega'(T)$ yields the result. ∎
Note that the distance $d(x, \hat y)$ appears on the right-hand side of the near-stationarity property. By the triangle inequality and point proximity, however, it can be upper bounded by $d(x, x^+) + \frac{2\,\omega(d(x, x^+))}{\omega'(d(x, x^+))}$, a quantity independent of $\hat y$.
To better internalize this result, let us look at the most important setting of Theorem 3.1, where the growth function is the quadratic $\omega(t) = \frac{\eta}{2}\, t^2$ for some real $\eta > 0$.
Corollary 3.2 (Quadratic error).
Consider a closed function $f_x \colon \mathcal{X} \to \overline{\mathbb{R}}$ and suppose that with some real $\eta > 0$ the inequality
$$|f(y) - f_x(y)| \le \frac{\eta}{2}\, d(x, y)^2 \qquad \text{holds for all } y \in \mathcal{X}.$$
Define $x^+$ to be a minimizer of $f_x$. Then there exists a point $\hat y$ satisfying
1. (point proximity) $\quad d(\hat y, x^+) \le d(x, x^+)$,
2. (value proximity) $\quad f(\hat y) \le f(x^+) + \frac{\eta}{2}\, d(x, x^+)^2$,
3. (near-stationarity) $\quad |\nabla f|(\hat y) \le \eta\,\big(d(x, x^+) + d(x, \hat y)\big) \le 3\,\eta\, d(x, x^+)$.
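For a smooth function the corollary can be sanity-checked numerically, since the slope coincides with the gradient norm and one may take the nearby point to be $x^+$ itself. In the sketch below, the test function $\cos$, the constant $\eta = 2$, and the grid of base points are assumptions made only for the illustration.

```python
import numpy as np

# Quadratic-error model for the smooth function f = cos: since f' = -sin is
# 1-Lipschitz, the model f_x(y) = f(x) + f'(x)(y - x) + (y - x)^2 / 2 satisfies
# |f(y) - f_x(y)| <= (y - x)^2, i.e. the corollary's hypothesis with eta = 2.
f, df, eta = np.cos, lambda x: -np.sin(x), 2.0

ratios = []
for x in np.linspace(-3.0, 3.0, 25):
    x_plus = x - df(x)          # closed-form minimizer of the model f_x
    step = abs(x_plus - x)
    if step > 1e-8:
        # For smooth f one may take y_hat = x_plus itself and test the
        # near-stationarity bound |f'(x_plus)| <= 3 * eta * step directly.
        ratios.append(abs(df(x_plus)) / (eta * step))
print(max(ratios) <= 3.0)   # True
```

On this example the observed ratio stays well below the guaranteed constant 3, as expected from a worst-case bound.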
An immediate consequence of Theorem 3.1 is the following subsequence convergence result.
Corollary 3.3 (Subsequence convergence to stationary points).
Consider a sequence of points $x_k$ and closed functions $f_{x_k}$ satisfying $x_{k+1} \in \operatorname*{argmin} f_{x_k}$ and $d(x_k, x_{k+1}) \to 0$, and suppose that $f$ is continuous on its domain. Suppose moreover that the inequality
$$|f(y) - f_{x_k}(y)| \le \omega(d(x_k, y)) \qquad \text{holds for all indices } k \text{ and points } y \in \mathcal{X},$$
where $\omega$ is a proper growth function. If $x^*$ is a limit point of the sequence $\{x_k\}$, then $x^*$ is stationary for $f$.
Proof.
Fix a subsequence $x_{k_i} \to x^*$, and consider the points $\hat y_{k_i}$ guaranteed to exist by Theorem 3.1, applied at $x_{k_i}$ with minimizer $x_{k_i + 1}$. By point proximity, $d(\hat y_{k_i}, x_{k_i + 1}) \le \frac{2\,\omega(d(x_{k_i}, x_{k_i+1}))}{\omega'(d(x_{k_i}, x_{k_i+1}))}$, and the fact that the right-hand side tends to zero (since $\omega$ is proper), we conclude that $\hat y_{k_i}$ converge to $x^*$. The value proximity, $f(\hat y_{k_i}) \le f(x_{k_i+1}) + \omega(d(x_{k_i}, x_{k_i+1}))$, together with continuity of $f$ implies $\limsup_i f(\hat y_{k_i}) \le f(x^*)$. Lower-semicontinuity of $f$ then implies $f(\hat y_{k_i}) \to f(x^*)$. Finally, the near-stationarity,
$$|\nabla f|(\hat y_{k_i}) \le \omega'(d(x_{k_i}, x_{k_i+1})) + \omega'(d(x_{k_i}, \hat y_{k_i})),$$
implies $|\nabla f|(\hat y_{k_i}) \to 0$. Thus $x^*$ is a stationary point of $f$. ∎
Remark 3.4 (Asymptotic convergence to critical points).
Corollary 3.3 is fairly satisfying, since very little is assumed about the model functions $f_{x_k}$. More sophisticated linear, or faster, rates of convergence rely on some regularity of the function $f$ near a limit point $x^*$ of the iterate sequence $\{x_k\}$. Let $S$ denote the set of stationary points of $f$. One of the weakest such regularity assumptions is that the slope $|\nabla f|(x)$ linearly bounds the distance $d(x, S)$ for all $x$ near $x^*$. Indeed, this property, which we call the slope error-bound, always entails a classical quadratic growth condition away from $S$ (see [13, 18]), and is equivalent to it whenever $f$ is a convex function on $\mathbb{R}^n$ (see [1, 25]).
Assuming such regularity, a typical convergence analysis strategy, thoroughly explored by Luo-Tseng [28], aims to deduce that the stepsizes $d(x_k, x_{k+1})$ linearly bound the distance $d(x_k, S)$. The latter is called the stepsize error-bound property. We now show that the slope error-bound always implies the stepsize error-bound, under the mild and natural assumption that the models deviate from $f$ by a quadratic error in the distance.
Theorem 3.5 (Slope and stepsize errorbounds).
Let $S$ be an arbitrary set and fix a point $x^* \in S$ satisfying the condition

(Slope error-bound) $\qquad d(x, S) \le \alpha \cdot |\nabla f|(x) \qquad \text{for all } x \in B_\varepsilon(x^*)$.

Consider a closed function $f_x \colon \mathcal{X} \to \overline{\mathbb{R}}$ and suppose that for some $\eta > 0$ the inequality
$$|f(y) - f_x(y)| \le \frac{\eta}{2}\, d(x, y)^2 \qquad \text{holds for all } y \in \mathcal{X}.$$
Then letting $x^+$ be any minimizer of $f_x$, the following holds whenever $x$ and $x^+$ lie in $B_{\varepsilon/3}(x^*)$:

(Stepsize error-bound) $\qquad d(x, S) \le (2 + 3\,\alpha\,\eta)\, d(x, x^+).$
Proof.
Suppose that the points $x$ and $x^+$ lie in $B_{\varepsilon/3}(x^*)$. Let $\hat y$ be the point guaranteed to exist by Corollary 3.2. We deduce
$$d(\hat y, x^*) \le d(\hat y, x^+) + d(x^+, x^*) \le d(x, x^+) + d(x^+, x^*) \le \varepsilon.$$
Thus $\hat y$ lies in $B_\varepsilon(x^*)$ and we obtain
$$d(\hat y, S) \le \alpha \cdot |\nabla f|(\hat y) \le 3\,\alpha\,\eta\, d(x, x^+).$$
Taking into account the inequality $d(x, S) \le d(x, x^+) + d(x^+, \hat y) + d(\hat y, S)$, we conclude
$$d(x, S) \le d(x, x^+) + d(x, x^+) + 3\,\alpha\,\eta\, d(x, x^+) = (2 + 3\,\alpha\,\eta)\, d(x, x^+),$$
as claimed. ∎
Remark 3.6 (Slope and subdifferential errorbounds).
It is instructive to put the slope error-bound property in perspective for those more familiar with subdifferentials. To this end, suppose that $f$ is defined on $\mathbb{R}^n$ and consider the subdifferential error-bound condition
(3.2) $\qquad d(x, S) \le \alpha \cdot d(0, \hat\partial f(x)) \qquad \text{for all } x \in B_\varepsilon(x^*).$
Clearly, in light of the inequality (2.1), the slope error-bound implies the subdifferential error-bound (3.2). Indeed, the slope and subdifferential error-bounds are equivalent. To see this, suppose (3.2) holds and consider an arbitrary point $x \in B_\varepsilon(x^*)$. Appealing to the equality (2.2), we obtain sequences $x_i \to x$ and $v_i \in \hat\partial f(x_i)$ satisfying $f(x_i) \to f(x)$ and $\|v_i\| \to \overline{|\nabla f|}(x)$. Inequality (3.2) then implies $d(x_i, S) \le \alpha\, \|v_i\|$ for each sufficiently large index $i$. Letting $i$ tend to infinity yields the inequality $d(x, S) \le \alpha \cdot \overline{|\nabla f|}(x) \le \alpha \cdot |\nabla f|(x)$, and therefore the slope error-bound is valid.
Lately, a different condition, called the Kurdyka-Łojasiewicz inequality [4, 26] with exponent 1/2, has often been used to study linear rates of convergence. The manuscripts [2, 6] are influential examples. We finish the section with the observation that the Kurdyka-Łojasiewicz inequality always implies the slope error-bound relative to a sublevel set; that is, the KŁ inequality is no more general than the slope error-bound. A different argument for (semi)convex functions based on subgradient flow appears in [5, Theorem 5] and [24]. In Proposition 3.8 we will also observe that the converse implication holds for all prox-regular functions. Henceforth, we will use the sublevel set notation $[f \le b] := \{x : f(x) \le b\}$ and similarly $[a \le f \le b] := \{x : a \le f(x) \le b\}$.
Theorem 3.7 (KŁ inequality implies the slope error-bound).
Suppose that there is a nonempty open set in such that the inequalities
where , , , and are real numbers. Then there exists a nonempty open set and a real number so that the inequalities
In the case , we can ensure and .
Proof.
Define the function . Note the inequality for all . Let be strictly smaller than the largest radius of a ball contained in and define . Define the nonempty set and fix a point .
The converse of Theorem 3.7 holds for “prox-regular functions” on $\mathbb{R}^n$, and in particular for “lower-$C^2$ functions”. The latter are functions $f$ on $\mathbb{R}^n$ such that around each point there is a neighborhood $U$ and a real $\rho > 0$ such that $f + \frac{\rho}{2}\|\cdot\|^2$ is convex on $U$.
Proposition 3.8 (Slope error-bound implies KŁ inequality under prox-regularity).
Consider a closed function . Fix a real number and a nonempty set . Suppose that there is a set , and constants , , , and such that the inequalities
hold for all , , and . Then the inequalities
hold for all where we set .
Proof.
Consider a point . Suppose first . Then we deduce , as claimed. Hence we may suppose there exists a subgradient . We deduce
Choosing , such that and attain and , respectively, we deduce . The result follows. ∎
4 Illustration: convex composite minimization
In this section, we briefly illustrate the results of the previous section in the context of composite minimization, and recall some consequences already derived in [17] from preliminary versions of the material presented in the current paper. This section will not be used in the rest of the paper, and so the reader can safely skip it if needed.
The notation and much of the discussion follows that set out in [17]. Consider the minimization problem
$$\min_x\; F(x) := g(x) + h(c(x)),$$
where $g \colon \mathbb{R}^n \to \overline{\mathbb{R}}$ is a closed convex function, $h \colon \mathbb{R}^m \to \mathbb{R}$ is a finite-valued $L$-Lipschitz convex function, and $c \colon \mathbb{R}^n \to \mathbb{R}^m$ is a smooth map with $\beta$-Lipschitz continuous Jacobian $\nabla c$. Define the model function
$$F_x(y) := g(y) + h\big(c(x) + \nabla c(x)(y - x)\big).$$
One can readily verify the inequality
$$|F(y) - F_x(y)| \le \frac{L\beta}{2}\,\|y - x\|^2 \qquad \text{for all } x, y \in \mathbb{R}^n.$$
In particular, the models are “Taylor-like”. The prox-linear algorithm iterates the steps
(4.1) $\qquad x_{k+1} := \operatorname*{argmin}_y\; \Big\{ F_{x_k}(y) + \frac{L\beta}{2}\,\|y - x_k\|^2 \Big\}.$
The following is a rudimentary convergence guarantee of the scheme [17, Section 5]:
(4.2) $\qquad \frac{L\beta}{2} \sum_{k=0}^{\infty} \|x_{k+1} - x_k\|^2 \le F(x_0) - F^*,$
where $F^*$ is the limit of the decreasing sequence $\{F(x_k)\}$. In particular, the stepsizes $\|x_{k+1} - x_k\|$ tend to zero. Moreover, one can readily verify that for any limit point $x^*$ of the iterate sequence $\{x_k\}$, the equality $F(x^*) = \lim_k F(x_k)$ holds. Consequently, by Corollary 3.3, the point $x^*$ is stationary for $F$.
We note that stationarity of the limit point is well-known and can be proved by other means; see for example the discussion in [10]. From (4.2), one also concludes the rate
$$\min_{k = 0, \dots, N} \|x_{k+1} - x_k\| \le \sqrt{\frac{2\,(F(x_0) - F^*)}{L\beta\,(N+1)}}.$$
What is the relationship of this rate to near-stationarity of the iterate $x_{k+1}$? Corollary 3.2 (applied with $\eta = 2L\beta$, accounting for the proximal term in (4.1)) shows that after $N$ iterations, one is guaranteed to find an iterate $x_k$ with $k \le N$ such that there exists a point $\hat x$ satisfying
$$\|\hat x - x_{k+1}\| \le \|x_{k+1} - x_k\|, \qquad F(\hat x) \le F(x_{k+1}) + L\beta\,\|x_{k+1} - x_k\|^2, \qquad |\nabla F|(\hat x) \le 6\,L\beta\,\|x_{k+1} - x_k\|.$$
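As a worked miniature of the scheme (4.1) with stepsize termination, consider the scalar setting $g = 0$, $h = |\cdot|$, where each subproblem has a closed-form solution by soft-thresholding. The toy map $c(x) = x^2 - 2$, the constant $t$, and the tolerance below are assumptions made for this illustration (and the closed form assumes $c'(x) \ne 0$ along the iterates).

```python
import numpy as np

def soft_threshold(a, lam):
    """Closed-form minimizer of |u| + (u - a)^2 / (2 * lam) over u."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

def prox_linear(c, dc, x, t=0.5, tol=1e-9, max_iter=100):
    """Prox-linear method for min_x |c(x)| (the case g = 0, h = |.|, scalar x).
    Each subproblem  min_y |c(x) + c'(x)(y - x)| + (1/(2t)) (y - x)^2  is solved
    in closed form by substituting u = c(x) + c'(x)(y - x), which turns it into
    a soft-thresholding step; assumes c'(x) != 0 along the iterates."""
    for _ in range(max_iter):
        a, b = c(x), dc(x)
        u = soft_threshold(a, t * b * b)   # optimal linearized residual
        y = x + (u - a) / b                # back-substitute for the next iterate
        if abs(y - x) <= tol:              # stepsize termination criterion
            return y
        x = y
    return x

# Toy instance: min |x^2 - 2|; the iterates approach sqrt(2).
x_star = prox_linear(lambda x: x * x - 2.0, lambda x: 2.0 * x, 3.0)
print(round(x_star, 6))  # 1.414214
```

The loop terminates once the stepsize falls below the tolerance, in line with the termination criteria analyzed in Section 3.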
Let us now move on to linear rates of convergence. Fix a limit point $x^*$ of the iterate sequence $\{x_k\}$ and let $S$ be the set of stationary points of $F$. Then Theorem 3.5 shows that the regularity condition

(Slope error-bound) $\qquad d(x, S) \le \alpha \cdot |\nabla F|(x) \qquad$ for all $x$ near $x^*$

implies

(Stepsize error-bound) $\qquad d(x_k, S) \le (2 + 6\,\alpha L\beta)\,\|x_{k+1} - x_k\|$ whenever $x_k$ and $x_{k+1}$ lie near $x^*$.

Additionally, in the next section (Corollary 5.7) we will show that the slope error-bound also implies

(Model error-bound) $\qquad d(x_k, S)$ is linearly bounded by $\sqrt{F(x_k) - F_{x_k}(x_{k+1})}$

whenever $x_k$ and $x_{k+1}$ lie near $x^*$.
It was, in fact, proved in [17, Theorem 5.10] that the slope and stepsize error-bounds are equivalent up to a change of constants. Moreover, as advertised in the introduction, the above implications were used in [17, Theorem 5.5] to show that if the slope error-bound holds, then the function values $F(x_k) - F^*$ converge linearly, at a rate governed by the constants $\alpha$, $L$, and $\beta$.
The rate improves under a stronger regularity condition, called tilt-stability [17, Theorem 6.3]; the argument for the better rate again crucially employs a comparison of step-lengths and subgradients at nearby points.
Our underlying assumption is that the models are easy to minimize, by an interior-point method for example. This assumption may not be realistic in some large-scale applications. Instead, one must solve the subproblems (4.1) inexactly by a first-order method. How should one choose the accuracy?
Fix a sequence of tolerances $\varepsilon_k > 0$ and suppose that each iterate $x_{k+1}$ solves the subproblem (4.1) up to accuracy $\varepsilon_k$.
Then one can establish the estimate
where the reference point is the true minimizer of the subproblem (4.1). The details will appear in a forthcoming paper. Solving (4.1) to the required accuracy can be done in multiple ways using a saddle point reformulation. We treat the simplest case here, where $h$ is a support function of a closed bounded set – a common setting in applications. We can then write
Dual to the problem is the maximization problem
For such problems, there are primal-dual methods [34, Method 2] that generate primal and dual iterates with an explicit bound on the duality gap after $i$ iterations. The cost of each iteration is dominated by a matrix-vector multiplication, a projection onto the closed bounded set defining $h$, and a proximal operation of $g$. Assuming that the primal and dual iterates remain bounded throughout the algorithm, obtaining an accurate solution to the subproblem (4.1) requires on the order of such basic operations. Setting for some , we deduce . Thus we can find a point satisfying after at most on the order of