Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria


Abstract

We consider optimization algorithms that successively minimize simple Taylor-like models of the objective function. Methods of Gauss-Newton type for minimizing the composition of a convex function and a smooth map are common examples. Our main result is an explicit relationship between the step-size of any such algorithm and the slope of the function at a nearby point. Consequently, we (1) show that the step-sizes can be reliably used to terminate the algorithm, (2) prove that as long as the step-sizes tend to zero, every limit point of the iterates is stationary, and (3) show that conditions, akin to classical quadratic growth, imply that the step-sizes linearly bound the distance of the iterates to the solution set. The latter so-called error bound property is typically used to establish linear (or faster) convergence guarantees. Analogous results hold when the step-size is replaced by the square root of the decrease in the model’s value. We complete the paper with extensions to when the models are minimized only inexactly.


Keywords: Taylor-like model, error-bound, slope, subregularity, Kurdyka-Łojasiewicz inequality, Ekeland’s principle


AMS 2010 Subject Classification: 65K05, 90C30, 49M37, 65K10

1 Introduction

A basic algorithmic strategy for minimizing a function $f$ on $\mathbb{R}^n$ is to successively minimize simple “models” of the function, agreeing with $f$ at least up to first-order near the current iterate. We will broadly refer to such models as “Taylor-like”. Some classical examples will help ground the exposition. When $f$ is smooth, common algorithms, given a current iterate $x_k$, declare the next iterate $x_{k+1}$ to be a minimizer of the quadratic model
$$x \mapsto f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \tfrac{1}{2}\langle B_k (x - x_k), x - x_k \rangle,$$
where $B_k$ is a symmetric positive definite matrix.

When the matrix $B_k$ is a multiple of the identity, the scheme reduces to gradient descent; when $B_k$ is the Hessian $\nabla^2 f(x_k)$, one recovers Newton’s method; adaptively changing $B_k$ based on accumulated information covers quasi-Newton algorithms. Higher-order models can also appear; the cubically regularized Newton’s method of Nesterov-Polyak [32] uses the models
$$x \mapsto f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \tfrac{1}{2}\langle \nabla^2 f(x_k)(x - x_k), x - x_k \rangle + \tfrac{M}{6}\|x - x_k\|^3.$$

For more details on Taylor-like models in smooth minimization, see Nocedal-Wright [37].
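To make the model-minimization viewpoint concrete, the following minimal sketch (an illustration, not code from the paper; the test function and step parameter are assumptions) minimizes the quadratic model above in closed form, recovering a gradient step when $B_k = \frac{1}{t}I$ and a Newton step when $B_k = \nabla^2 f(x_k)$.

```python
import numpy as np

def quadratic_model_step(x, grad, B):
    """Minimize the quadratic model
       y -> f(x) + <grad, y - x> + 0.5 * <B (y - x), y - x>
    in closed form: the minimizer is y = x - B^{-1} grad."""
    return x - np.linalg.solve(B, grad)

# Illustrative smooth convex test function (an assumption, not from the paper).
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda x: np.logaddexp(0.0, x[0]) + (x[1] - 1.0) ** 2
grad_f = lambda x: np.array([sig(x[0]), 2.0 * (x[1] - 1.0)])
hess_f = lambda x: np.diag([sig(x[0]) * (1.0 - sig(x[0])), 2.0])

x = np.array([2.0, -1.0])
x_gd = quadratic_model_step(x, grad_f(x), np.eye(2) / 0.4)   # B_k = (1/t) I with t = 0.4
x_newton = quadratic_model_step(x, grad_f(x), hess_f(x))     # B_k = Hessian
print(f(x), f(x_gd), f(x_newton))                            # both model steps decrease f
```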

The algorithmic strategy generalizes far beyond smooth minimization. One important arena, and the motivation for the current work, is the class of convex composite problems

$$\min_{x}\; F(x) := g(x) + h\big(c(x)\big). \tag{1.1}$$

Here $g \colon \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is a closed convex function (possibly taking infinite values), $h \colon \mathbb{R}^m \to \mathbb{R}$ is a finite-valued Lipschitz convex function, and $c \colon \mathbb{R}^n \to \mathbb{R}^m$ is a smooth map. Algorithms for this problem class have been studied extensively, notably in [40, 7, 47, 46, 20, 41] and more recently in [27, 17, 10]. Given a current iterate $x_k$, common algorithms declare the next iterate $x_{k+1}$ to be a minimizer of

$$F_{x_k}(x) := g(x) + h\big(c(x_k) + \nabla c(x_k)(x - x_k)\big) + \tfrac{1}{2}\langle B_k (x - x_k), x - x_k \rangle. \tag{1.2}$$

The underlying assumption is that the minimizer of $F_{x_k}$ can be efficiently computed. This is the case, for example, when interior-point methods can be directly applied to the convex subproblem or when evaluating $c(x)$ and $\nabla c(x)$ is already the computational bottleneck. The latter setting is ubiquitous in derivative-free optimization; see for example the discussion in Wild [45]. The model in (1.2) is indeed Taylor-like, even when $g$ and $h$ are nonconvex, since the deviation $|F_{x_k}(x) - F(x)|$ is bounded by a multiple of $\|x - x_k\|^2$ for all points $x$, as the reader can readily verify. When $B_k$ is a multiple of the identity, the resulting method is called the “prox-linear algorithm” in [17, 27], and it subsumes a great variety of schemes.

When $c$ is the identity map, the prox-linear algorithm reduces to the proximal-point method on the function $g + h$ [42, 30, 31]. When $c$ maps to the real line and $h$ is the identity function, the scheme is the proximal gradient algorithm on the function $g + c$ [36, 3]. The Levenberg-Marquardt algorithm [29] for nonlinear least squares – a close variant of Gauss-Newton – corresponds to setting $g = 0$ and $h = \|\cdot\|^2$. Allowing $B_k$ to vary with accumulated information yields variable metric variants of the aforementioned algorithms; see e.g. [44, 9, 7]. Extensions where $g$ and $h$ are not necessarily convex, but are nonetheless simple, are also important and interesting, in large part because of nonconvex penalties and regularizers common in machine learning applications. Other important variants interlace the model minimization step with inertial corrector steps, such as in accelerated gradient methods [33, 21], cubically regularized Newton [35], and convex composite algorithms [16].
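As a concrete instance of these steps, here is a minimal sketch (an illustration under stated assumptions, not code from the paper) of the prox-linear iteration in the Levenberg-Marquardt setting $g = 0$, $h = \|\cdot\|^2$: each convex subproblem is then a damped linear least-squares problem with a closed-form solution. The residual map, damping parameter, and stopping tolerance below are assumptions made for the example.

```python
import numpy as np

def prox_linear_lm_step(x, c, jac, lam):
    """One prox-linear step for min_x ||c(x)||^2 with the model
       ||c(x_k) + J(x_k)(x - x_k)||^2 + lam * ||x - x_k||^2,
    i.e. the Levenberg-Marquardt subproblem, solved in closed form."""
    r, J = c(x), jac(x)
    d = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -J.T @ r)
    return x + d

# Illustrative nonlinear residual map (an assumption for the example).
c = lambda x: np.array([x[0] ** 2 + x[1] - 1.0, x[0] - x[1] ** 2])
jac = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, -2.0 * x[1]]])

x = np.array([2.0, 2.0])
for _ in range(100):
    x_new = prox_linear_lm_step(x, c, jac, lam=1.0)
    if np.linalg.norm(x_new - x) < 1e-8:      # step-size termination (cf. Section 3)
        break
    x = x_new
print(x, np.linalg.norm(c(x)))
```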

In this work, we take a broader view of nonsmooth optimization algorithms that use Taylor-like models. Setting the stage, consider the minimization problem
$$\min_{x}\; f(x)$$

for an arbitrary lower-semicontinuous function $f$ on $\mathbb{R}^n$. The model-based algorithms we consider simply iterate the steps: $x_{k+1}$ is a minimizer of some model $f_{x_k}$ based at $x_k$. In light of the discussion above, we assume that the models approximate $f$ (uniformly) up to first-order, meaning
$$|f_{x_k}(x) - f(x)| \le \omega\big(\|x - x_k\|\big) \qquad \text{for all } x,$$

where $\omega \colon \mathbb{R}_+ \to \mathbb{R}_+$ is any $C^1$-smooth function satisfying $\omega(0) = \omega'(0) = 0$. For uses of a wider class of models in bundle methods, based on cutting planes, see Noll-Prot-Rondepierre [38].

In this great generality, we begin with the following basic question.

When should one terminate an algorithm that uses Taylor-like models?

For smooth nonconvex optimization, the traditional way to reliably terminate the algorithm is to stop when the norm of the gradient at the current iterate is smaller than some tolerance. For nonsmooth problems, termination criteria based on optimality conditions along the iterates may be meaningless, as they may never be satisfied even in the limit. For example, one can easily exhibit a convex composite problem so that the iterates generated by the prox-linear algorithm described above converge to a stationary point, while the optimality conditions at the iterates are not satisfied even in the limit. Such a lack of natural stopping criteria for nonsmooth first-order methods has often been remarked upon (and is one advantage of bundle-type methods).

There are, on the other hand, two appealing stopping criteria one can try: terminate the algorithm when either the step-size $\|x_{k+1} - x_k\|$ or the model improvement $f(x_k) - f_{x_k}(x_{k+1})$ is sufficiently small. We will prove that both of these simple termination criteria are indeed reliable in the following sense. Theorem 3.1 and Corollary 5.4 show that if either the step-size or the model improvement is small, then there exists a point $\hat x$ close to the current iterate in both distance and in function value, which is nearly stationary for the problem. Determining the point $\hat x$ is not important; the only role of $\hat x$ is to certify that the current iterate is “close to near-stationarity” in the sense above. Theorem 3.1 follows quickly from Ekeland’s variational principle [19] – a standard variational analytic tool. For other uses of the technique in variational analysis, see for example the survey [23]. Stopping criteria based on small nearby subgradients have appeared in many other contexts, such as in the descent methods of [22] and the gradient sampling schemes of [8].

Two interesting consequences for convergence analysis flow from there. Suppose that the models are chosen in such a way that the step-sizes $\|x_{k+1} - x_k\|$ tend to zero. This assumption is often enforced by ensuring that $f(x_{k+1})$ is smaller than $f(x_k)$ by at least a multiple of $\|x_{k+1} - x_k\|^2$ (a sufficient decrease condition), using a line-search procedure or by safeguarding the minimal eigenvalue of $B_k$. Then, assuming for simplicity that $f$ is continuous on its domain, any limit point of the iterate sequence $\{x_k\}$ will be stationary for the problem (Corollary 3.3). Analogous results hold with the step-size replaced by the square root of the model improvement, $\sqrt{f(x_k) - f_{x_k}(x_{k+1})}$.
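To illustrate the sufficient decrease safeguard, here is a minimal sketch (an illustration under stated assumptions, not code from the paper): the quadratic coefficient of the model is increased by backtracking until the decrease in $f$ is at least a multiple of the squared step, and the outer loop terminates once the step-size is small. The test function, constants, and tolerances are assumptions made for the example.

```python
import numpy as np

def model_step_with_sufficient_decrease(f, grad_f, x, t0=1.0, sigma=0.5, shrink=0.5):
    """Minimize the model y -> f(x) + <grad f(x), y - x> + (1/(2t)) ||y - x||^2
    and backtrack on t until f(y) <= f(x) - (sigma/t) * ||y - x||^2."""
    g, t = grad_f(x), t0
    while True:
        y = x - t * g                          # minimizer of the quadratic model
        step_sq = float(np.dot(y - x, y - x))
        if step_sq == 0.0 or f(y) <= f(x) - (sigma / t) * step_sq:
            return y
        t *= shrink                            # steepen the model and retry

# Illustrative smooth, strongly convex test function (an assumption).
f = lambda x: np.logaddexp(0.0, x[0]) + 0.5 * (x[0] ** 2 + x[1] ** 2)
grad_f = lambda x: np.array([1.0 / (1.0 + np.exp(-x[0])) + x[0], x[1]])

x = np.array([1.0, 2.0])
for k in range(100):
    x_new = model_step_with_sufficient_decrease(f, grad_f, x)
    if np.linalg.norm(x_new - x) <= 1e-6:      # small step-size: terminate
        break
    x = x_new
print(k, x, f(x))
```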

The subsequence convergence result is satisfying, since very little is assumed about the underlying algorithm. A finer analysis of linear, or faster, convergence rates relies on some regularity of the function near a limit point $\bar x$ of the iterate sequence $\{x_k\}$. One of the weakest such regularity assumptions is that for all $x$ near $\bar x$, the “slope” of $f$ at $x$ linearly bounds the distance of $x$ to the set of stationary points – the “error”. Here, we call this property the slope error-bound. To put it in perspective, we note that the slope error-bound always entails a classical quadratic growth condition away from the stationary points (see [13, 18]), and is equivalent to it whenever $f$ is convex (see [1, 25]). Moreover, as an aside, we observe in Theorem 3.7 and Proposition 3.8 that under mild conditions, the slope error-bound is equivalent to the “Kurdyka-Łojasiewicz inequality” with exponent $1/2$ – an influential condition also often used to prove linear convergence.

Assuming the slope error-bound, a typical convergence analysis strategy aims to deduce that the step-sizes $\|x_{k+1} - x_k\|$ linearly bound the distance of the iterates to the set of stationary points. Following Luo-Tseng [28], we call the latter property the step-size error-bound. We show in Theorem 3.5 that the slope error-bound indeed always implies the step-size error-bound, under the common assumption that the growth function is a quadratic. The proof is a straightforward consequence of the relationship we have established between the step-size and the slope at a nearby point – underscoring the power of the technique.

In practice, exact minimization of the model function can be impossible. Instead, one can obtain a point $x_{k+1}$ that is only nearly optimal or nearly stationary for the problem $\min f_{x_k}$. Section 5 shows that all the results above generalize to this more realistic setting. In particular, somewhat surprisingly, we argue that limit points of the iterates will be stationary even if the tolerances on optimality (or stationarity) and the step-sizes tend to zero at independent rates. The arguments in this inexact setting follow by applying the key result, Theorem 3.1, to small perturbations of the models and the iterates, thus illustrating the flexibility of the theorem.

The convex composite problem (1.1) and the prox-linear algorithm (along with its variable metric variants) is a fertile application arena for the techniques developed here. An early variant of the key Theorem 3.1 in this setting appeared recently in [17, Theorem 5.3] and was used there to establish sublinear and linear convergence guarantees for the prox-linear method. We review these results and derive extensions in Section 4, as an illustration of our techniques. An important deviation of ours from earlier work is the use of the step-size $\|x_{k+1} - x_k\|$ as the fundamental analytic tool, in contrast to the measures of Burke [7] and the criticality measures of Cartis-Gould-Toint [10]. To the best of our knowledge, the derived relationship between the step-size and stationarity at a nearby point is entirely new. The fact that the slope error-bound implies that both the step-size and the square root of the model improvement linearly bound the distance to the solution set (the step-size and model error-bounds) is entirely new as well; previous related results have relied on polyhedrality assumptions.

Though the discussion above takes place over the Euclidean space $\mathbb{R}^n$, the most appropriate setting for most of our development is an arbitrary complete metric space. This is the setting of the paper. The outline is as follows. In Section 2, we establish basic notation and recall Ekeland’s variational principle. Section 3 contains our main results. Section 4 instantiates the techniques for the prox-linear algorithm in composite minimization, while Section 5 explores extensions when the subproblems are solved inexactly.

2 Notation

Fix a complete metric space $\mathcal{X}$ with the metric $d(\cdot,\cdot)$. We denote the open ball of radius $r > 0$ around a point $x$ by $B_r(x)$. The distance from $x$ to a set $Q \subset \mathcal{X}$ is
$$\mathrm{dist}(x; Q) := \inf_{y \in Q}\, d(x, y).$$

We will be interested in minimizing functions mapping $\mathcal{X}$ to the extended real line $\overline{\mathbb{R}} := \mathbb{R} \cup \{\pm\infty\}$. A function $f \colon \mathcal{X} \to \overline{\mathbb{R}}$ is called lower-semicontinuous (or closed) if the inequality $\liminf_{y \to x} f(y) \ge f(x)$ holds for all points $x \in \mathcal{X}$.

Consider a closed function $f \colon \mathcal{X} \to \overline{\mathbb{R}}$ and a point $x$ with $f(x)$ finite. The slope of $f$ at $x$ is simply its maximal instantaneous rate of decrease:
$$|\nabla f|(x) := \limsup_{y \to x}\; \frac{\big(f(x) - f(y)\big)^+}{d(x, y)}.$$

Here, we use the notation $r^+ := \max\{r, 0\}$. If $f$ is a differentiable function on a Euclidean space, the slope $|\nabla f|(x)$ simply coincides with the norm of the gradient $\|\nabla f(x)\|$, and hence the notation. For a convex function $f$, the slope $|\nabla f|(x)$ equals the norm of the shortest subgradient: $|\nabla f|(x) = \mathrm{dist}(0; \partial f(x))$. For more details on the slope and its uses in optimization, see the survey [23] or the thesis [12].
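For intuition, the following sketch numerically estimates the slope of the convex function $f(x) = |x| + (x-1)^2$ at the origin, where $\partial f(0) = [-3,-1]$ and hence $\mathrm{dist}(0;\partial f(0)) = 1$. The function, grid, and sampling radius are assumptions made purely for illustration.

```python
import numpy as np

def slope_estimate(f, x, radius=1e-3, n=100001):
    """Estimate |grad f|(x) = limsup_{y -> x} (f(x) - f(y))^+ / |y - x|
    by sampling y on a small symmetric grid around x."""
    ys = x + np.linspace(-radius, radius, n)
    ys = ys[ys != x]                                   # avoid division by zero
    ratios = np.maximum(f(x) - f(ys), 0.0) / np.abs(ys - x)
    return ratios.max()

f = lambda x: np.abs(x) + (x - 1.0) ** 2
print(slope_estimate(f, 0.0))   # approximately 1 = dist(0, subdifferential at 0)
```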

The function $x \mapsto |\nabla f|(x)$ lacks basic lower-semicontinuity properties. As a result, when speaking of algorithms, it is important to introduce the limiting slope
$$\overline{|\nabla f|}(x) := \liminf_{\substack{y \to x \\ f(y) \to f(x)}}\; |\nabla f|(y).$$

In particular, if $f$ is continuous on its domain, then $\overline{|\nabla f|}$ is simply the lower-semicontinuous envelope of $|\nabla f|$. We say that a point $x$ is stationary for $f$ if the equality $\overline{|\nabla f|}(x) = 0$ holds.

We will be interested in locally approximating functions up to first-order. Seeking to measure the “error in approximation”, we introduce the following definition.

Definition 2.1 (Growth function).

A differentiable univariate function $\omega \colon \mathbb{R}_+ \to \mathbb{R}_+$ is called a growth function if it satisfies $\omega(0) = \omega'(0) = 0$ and $\omega' > 0$ on $(0, \infty)$. If, in addition, the equalities $\lim_{t \downarrow 0} \omega'(t) = \lim_{t \downarrow 0} \omega(t)/\omega'(t) = 0$ hold, we say that $\omega$ is a proper growth function.

The main examples of proper growth functions are $\omega(t) = \frac{\eta}{r}\, t^{r}$ for real $\eta > 0$ and $r > 1$.

The following result, proved in [19], will be our main tool. The gist of the theorem is that if a point nearly minimizes a closed function, then it is close to a true minimizer of a slightly perturbed function.

Theorem 2.2 (Ekeland’s variational principle).

Consider a closed function $f \colon \mathcal{X} \to \overline{\mathbb{R}}$ that is bounded from below. Suppose that for some $\epsilon > 0$ and $\bar x \in \mathcal{X}$, we have $f(\bar x) \le \inf f + \epsilon$. Then for any real $\rho > 0$, there exists a point $\hat x$ satisfying

  1. $f(\hat x) \le f(\bar x)$,

  2. $d(\bar x, \hat x) \le \epsilon/\rho$,

  3. $\hat x$ is the unique minimizer of the perturbed function $x \mapsto f(x) + \rho \cdot d(x, \hat x)$.

Notice that property 3 in Ekeland’s principle directly implies the inequality $|\nabla f|(\hat x) \le \rho$. Thus if a point $\bar x$ nearly minimizes $f$, then the slope of $f$ is small at some nearby point.

2.1 Slope and subdifferentials

The slope is a purely metric creature. However, for a function $f$ on $\mathbb{R}^n$, the slope is closely related to “subdifferentials”, which may be more familiar to the audience. We explain the relationship here following [23]. Since the discussion will not be used in the sequel, the reader can safely skip it and move on to Section 3.

A vector $v \in \mathbb{R}^n$ is called a Fréchet subgradient of a function $f \colon \mathbb{R}^n \to \overline{\mathbb{R}}$ at a point $\bar x$ if the inequality
$$f(x) \ge f(\bar x) + \langle v, x - \bar x \rangle + o(\|x - \bar x\|) \qquad \text{as } x \to \bar x$$
holds.

The set of all Fréchet subgradients of $f$ at $\bar x$ is the Fréchet subdifferential and is denoted by $\hat\partial f(\bar x)$. The connection of the slope to subgradients is immediate. A vector $v$ lies in $\hat\partial f(\bar x)$ if and only if the slope of the linearly tilted function $x \mapsto f(x) - \langle v, x \rangle$ at $\bar x$ is zero. Moreover, the inequality

$$|\nabla f|(\bar x) \le \mathrm{dist}\big(0;\, \hat\partial f(\bar x)\big) \tag{2.1}$$
holds.

The limiting subdifferential of $f$ at $\bar x$, denoted $\partial f(\bar x)$, consists of all vectors $v$ such that there exist sequences $x_i$ and $v_i \in \hat\partial f(x_i)$ satisfying $(x_i, f(x_i), v_i) \to (\bar x, f(\bar x), v)$. Assuming that $f$ is closed, a vector $v$ lies in $\partial f(\bar x)$ if and only if the limiting slope of the linearly tilted function $x \mapsto f(x) - \langle v, x \rangle$ at $\bar x$ is zero. Moreover, Proposition 4.6 in [14] shows that the exact equality

$$\overline{|\nabla f|}(\bar x) = \mathrm{dist}\big(0;\, \partial f(\bar x)\big) \tag{2.2}$$
holds.

In particular, stationarity of $f$ at $\bar x$ amounts to the inclusion $0 \in \partial f(\bar x)$.

3 Main results

For the rest of the paper, fix a closed function $f$ on a complete metric space $\mathcal{X}$, and a point $x$ with $f(x)$ finite. The following theorem is our main result. It shows that for any function $f_x$ (the “model”) whose error in approximating $f$ is controlled by a growth function of the distance to $x$, the distance between $x$ and the minimizer of $f_x$ prescribes near-stationarity of $f$ at some nearby point.

Theorem 3.1 (Perturbation result).

Consider a closed function $f_x \colon \mathcal{X} \to \overline{\mathbb{R}}$ such that the inequality
$$|f_x(y) - f(y)| \le \omega\big(d(x, y)\big) \qquad \text{holds for all } y \in \mathcal{X},$$

where $\omega$ is some growth function, and let $x^+$ be a minimizer of $f_x$. If $x^+$ coincides with $x$, then the slope $|\nabla f|(x)$ is zero. On the other hand, if $x^+$ and $x$ are distinct, then there exists a point $\hat x$ satisfying

  1. (point proximity) ,

  2. (value proximity) ,

  3. (near-stationarity) .

Proof.

A quick computation shows that the slopes of $f$ and $f_x$ at $x$ coincide. Thus if $x^+$ coincides with $x$, the slope $|\nabla f|(x)$ must be zero, as claimed. Therefore, for the remainder of the proof, we will assume that $x$ and $x^+$ are distinct.

Observe now the inequality

Define the function and note . We deduce

(3.1)

An easy argument now shows the inequality

Setting and applying Ekeland’s variational principle (Theorem 2.2), we obtain for any a point satisfying

We conclude that the claimed estimates hold; an appropriate choice of $\rho$ then yields the result. ∎

Note that the distance $d(x, \hat x)$ appears on the right-hand side of the near-stationarity property. By the triangle inequality and point proximity, however, it can be upper bounded by a quantity independent of $\hat x$.

To better internalize this result, let us look at the most important setting of Theorem 3.1, where the growth function is a quadratic $\omega(t) = \frac{\eta}{2}\, t^2$ for some real $\eta > 0$.

Corollary 3.2 (Quadratic error).

Consider a closed function $f_x \colon \mathcal{X} \to \overline{\mathbb{R}}$ and suppose that, for some real $\eta > 0$, the inequality
$$|f_x(y) - f(y)| \le \frac{\eta}{2}\, d^2(x, y) \qquad \text{holds for all } y \in \mathcal{X}.$$

Define $x^+$ to be the minimizer of $f_x$. Then there exists a point $\hat x$ satisfying

  1. (point proximity) ,

  2. (value proximity) ,

  3. (near-stationarity) .
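To see the flavor of this statement numerically, without reproducing its exact constants, the sketch below runs the basic quadratic-model (proximal-gradient-style) step on a smooth test function and prints, at each iteration, the step-size alongside the gradient norm at the new iterate; the two shrink in tandem, since in the smooth case the nearby near-stationary point can be taken to be the new iterate itself. The test function and step parameter are illustrative assumptions.

```python
import numpy as np

# Smooth convex test function (an illustrative assumption, not the paper's).
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda x: np.logaddexp(0.0, x[0] + x[1]) + 0.5 * ((x[0] - 1.0) ** 2 + (x[1] + 1.0) ** 2)
grad = lambda x: np.array([sig(x[0] + x[1]) + (x[0] - 1.0),
                           sig(x[0] + x[1]) + (x[1] + 1.0)])

t = 0.5                                      # model uses B = (1/t) * I
x = np.array([3.0, -2.0])
for k in range(10):
    x_plus = x - t * grad(x)                 # minimizer of the quadratic model
    step = np.linalg.norm(x_plus - x)
    # For L-Lipschitz gradients, ||grad f(x_plus)|| <= (1/t + L) * step,
    # so a small step certifies near-stationarity at the nearby point x_plus.
    print(k, step, np.linalg.norm(grad(x_plus)))
    x = x_plus
```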

An immediate consequence of Theorem 3.1 is the following subsequence convergence result.

Corollary 3.3 (Subsequence convergence to stationary points).

Consider a sequence of points $x_k$ and closed functions $f_{x_k} \colon \mathcal{X} \to \overline{\mathbb{R}}$ satisfying $x_{k+1} \in \operatorname*{argmin} f_{x_k}$ and $d(x_k, x_{k+1}) \to 0$. Suppose moreover that the inequality
$$|f_{x_k}(y) - f(y)| \le \omega\big(d(x_k, y)\big) \qquad \text{holds for all indices } k \text{ and points } y \in \mathcal{X},$$

where $\omega$ is a proper growth function. If $\bar x$ is a limit point of the sequence $\{x_k\}$, then $\bar x$ is stationary for $f$.

Proof.

Fix a subsequence $x_{k_i}$ converging to $\bar x$, and consider the points $\hat x_{k_i}$ guaranteed to exist by Theorem 3.1. By point proximity, and since the right-hand side of that estimate tends to zero, we conclude that the points $\hat x_{k_i}$ converge to $\bar x$ as well. The value proximity, combined with the lower-semicontinuity of $f$, implies $f(\hat x_{k_i}) \to f(\bar x)$. Finally, the near-stationarity estimate

implies $|\nabla f|(\hat x_{k_i}) \to 0$. Thus $\bar x$ is a stationary point of $f$. ∎

Remark 3.4 (Asymptotic convergence to critical points).

Corollary 3.3 proves something stronger than stated. An unbounded sequence of points is asymptotically critical for $f$ provided the slopes $|\nabla f|$ along it tend to zero. The proof of Corollary 3.3 shows that if the iterate sequence $\{x_k\}$ is unbounded, then there exists an asymptotically critical sequence $\{\hat x_k\}$ satisfying $d(x_k, \hat x_k) \to 0$.

Corollary 3.3 is fairly satisfying, since very little is assumed about the model functions. More sophisticated linear, or faster, rates of convergence rely on some regularity of the function $f$ near a limit point $\bar x$ of the iterate sequence $\{x_k\}$. Let $S$ denote the set of stationary points of $f$. One of the weakest such regularity assumptions is that the slope $|\nabla f|(x)$ linearly bounds the distance $\mathrm{dist}(x; S)$ for all $x$ near $\bar x$. Indeed, this property, which we call the slope error-bound, always entails a classical quadratic growth condition away from $S$ (see [13, 18]), and is equivalent to it whenever $f$ is a convex function on $\mathbb{R}^n$ (see [1, 25]).

Assuming such regularity, a typical convergence analysis strategy, thoroughly explored by Luo-Tseng [28], aims to deduce that the step-sizes $d(x_k, x_{k+1})$ linearly bound the distances $\mathrm{dist}(x_k; S)$. The latter is called the step-size error-bound property. We now show that the slope error-bound always implies the step-size error-bound, under the mild and natural assumption that the models deviate from $f$ by at most a quadratic error in the distance.

Theorem 3.5 (Slope and step-size error-bounds).

Let $S \subset \mathcal{X}$ be an arbitrary set and fix a point $x \in \mathcal{X}$ satisfying the condition

  • (Slope error-bound).

Consider a closed function $f_x \colon \mathcal{X} \to \overline{\mathbb{R}}$ and suppose that for some real $\eta > 0$ the inequality
$$|f_x(y) - f(y)| \le \frac{\eta}{2}\, d^2(x, y) \qquad \text{holds for all } y \in \mathcal{X}.$$

Then, letting $x^+$ be any minimizer of $f_x$, the following holds:

  • (Step-size error-bound)

Proof.

Suppose that the points and lie in . Let be the point guaranteed to exist by Corollary 3.2. We deduce

Thus lies in and we obtain

Taking into account the inequality , we conclude

as claimed. ∎
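As a tiny numeric illustration of the two error-bound notions (the function and prox parameter are illustrative assumptions, and the constants below are specific to this example rather than those of the theorem): for $f(x) = \frac{1}{2}x^2$ the stationary set is $S = \{0\}$, the slope is $|\nabla f|(x) = |x|$, and the proximal-point step with parameter $t$ has length $\frac{t}{1+t}|x|$, so both quantities bound $\mathrm{dist}(x;S)$ up to explicit constants.

```python
import numpy as np

t = 1.0                                      # prox parameter (an assumption)
f = lambda x: 0.5 * x ** 2                   # stationary set S = {0}
slope = lambda x: np.abs(x)                  # |grad f|(x) = |x|
prox_step = lambda x: x / (1.0 + t) - x      # step of the proximal-point model

for x in np.linspace(-2.0, 2.0, 9):
    dist_to_S = np.abs(x)
    # slope error-bound:     dist(x; S) <= 1 * |grad f|(x)
    # step-size error-bound: dist(x; S) <= ((1 + t)/t) * |step|
    assert dist_to_S <= slope(x) + 1e-12
    assert dist_to_S <= (1.0 + t) / t * np.abs(prox_step(x)) + 1e-12
print("slope and step-size error-bounds verified on the grid")
```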

Remark 3.6 (Slope and subdifferential error-bounds).

It is instructive to put the slope error-bound property in perspective for those more familiar with subdifferentials. To this end, suppose that $f$ is defined on $\mathbb{R}^n$ and consider the subdifferential error-bound condition

(3.2)

Clearly, in light of the inequality (2.1), the slope error-bound implies the subdifferential error-bound (3.2). Indeed, the slope and subdifferential error-bounds are equivalent. To see this, suppose (3.2) holds and consider an arbitrary point $x$. Appealing to the equality (2.2), we obtain sequences $x_i \to x$ and $v_i \in \hat\partial f(x_i)$ satisfying $f(x_i) \to f(x)$ and $\|v_i\| \to \mathrm{dist}(0; \partial f(x))$. Applying inequality (3.2) along this sequence and letting $i$ tend to infinity yields the desired bound on $\mathrm{dist}(x; S)$, and therefore the slope error-bound is valid.

Lately, a different condition, called the Kurdyka-Łojasiewicz inequality [4, 26] with exponent $1/2$, has often been used to study linear rates of convergence. The manuscripts [2, 6] are influential examples. We finish the section with the observation that the Kurdyka-Łojasiewicz inequality always implies the slope error-bound relative to a sublevel set; that is, the KŁ inequality is no more general than the slope error-bound. A different argument for (semi-)convex functions, based on subgradient flow, appears in [5, Theorem 5] and [24]. In Proposition 3.8 we will also observe that the converse implication holds for all prox-regular functions. Henceforth, we will use the sublevel set notation $[f \le b] := \{x \in \mathcal{X} : f(x) \le b\}$ and similarly $[a \le f \le b]$.

Theorem 3.7 (KŁ-inequality implies the slope error-bound).

Suppose that there is a nonempty open set in such that the inequalities

where , , , and are real numbers. Then there exists a nonempty open set and a real number so that the inequalities

In the case , we can ensure and .

Proof.

Define the function . Note the inequality for all . Let be strictly smaller than the largest radius of a ball contained in and define . Define the nonempty set and fix a point .

Observe now for any point with , the inclusion holds, and hence . Appealing to [14, Lemma 2.5] (or [23, Chapter 1, Basic Lemma]), we deduce the estimate

The proof is complete. ∎

The converse of Theorem 3.7 holds for “prox-regular functions” on $\mathbb{R}^n$, and in particular for “lower-$C^2$ functions”. The latter are functions $f$ on $\mathbb{R}^n$ such that around each point there is a neighborhood $U$ and a real $\rho > 0$ such that $f + \frac{\rho}{2}\|\cdot\|^2$ is convex on $U$.

Proposition 3.8 (Slope error-bound implies KŁ-inequality under prox-regularity).

Consider a closed function . Fix a real number and a nonempty set . Suppose that there is a set , and constants , , , and such that the inequalities

hold for all , , and . Then the inequalities

hold for all where we set .

Proof.

Consider a point . Suppose first . Then we deduce , as claimed. Hence we may suppose there exists a subgradient . We deduce

Choosing , such that and attain and , respectively, we deduce . The result follows. ∎

4 Illustration: convex composite minimization

In this section, we briefly illustrate the results of the previous section in the context of composite minimization, and recall some consequences already derived in [17] from preliminary versions of the material presented in the current paper. This section will not be used in the rest of the paper, and so the reader can safely skip it if needed.

The notation and much of the discussion follow that set out in [17]. Consider the minimization problem
$$\min_{x}\; F(x) := g(x) + h\big(c(x)\big),$$

where $g \colon \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is a closed convex function, $h \colon \mathbb{R}^m \to \mathbb{R}$ is a finite-valued $L$-Lipschitz convex function, and $c \colon \mathbb{R}^n \to \mathbb{R}^m$ is a $C^1$-smooth map with the Jacobian $\nabla c(\cdot)$ that is $\beta$-Lipschitz continuous. Define the model function
$$F_x(y) := g(y) + h\big(c(x) + \nabla c(x)(y - x)\big).$$

One can readily verify the inequality
$$|F(y) - F_x(y)| \le \frac{L\beta}{2}\, \|y - x\|^2 \qquad \text{for all } x, y \in \mathbb{R}^n.$$

In particular, the models are “Taylor-like”. The prox-linear algorithm iterates the steps

$$x_{k+1} := \operatorname*{argmin}_{x}\; \Big\{ F_{x_k}(x) + \frac{L\beta}{2}\, \|x - x_k\|^2 \Big\}. \tag{4.1}$$
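The displayed deviation bound can be sanity-checked numerically. In the sketch below, the choices $g = 0$, $h = \|\cdot\|_1$ (which is $\sqrt{2}$-Lipschitz with respect to the Euclidean norm on $\mathbb{R}^2$), and a quadratic map $c$ with a $2$-Lipschitz Jacobian are illustrative assumptions; the inequality $|F(y) - F_x(y)| \le \frac{L\beta}{2}\|y - x\|^2$ is then verified at random point pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: g = 0, h = ||.||_1 (L = sqrt(2) w.r.t. the Euclidean norm),
# c(x) = (x1^2 - x2, x2^2 + x1), whose Jacobian is 2-Lipschitz (beta = 2).
h = lambda z: np.abs(z).sum()
c = lambda x: np.array([x[0] ** 2 - x[1], x[1] ** 2 + x[0]])
jac = lambda x: np.array([[2.0 * x[0], -1.0], [1.0, 2.0 * x[1]]])
L, beta = np.sqrt(2.0), 2.0

F = lambda x: h(c(x))                                  # objective (with g = 0)
F_model = lambda x, y: h(c(x) + jac(x) @ (y - x))      # linearized model F_x(y)

for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    gap = abs(F(y) - F_model(x, y))
    assert gap <= L * beta / 2.0 * np.dot(y - x, y - x) + 1e-9
print("Taylor-like bound verified at 1000 random point pairs")
```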

The following is a rudimentary convergence guarantee of the scheme [17, Section 5]:

$$\min_{k=0,\ldots,N}\, \|x_{k+1} - x_k\|^2 \;\le\; \frac{2\big(F(x_0) - F^*\big)}{L\beta\,(N+1)}, \tag{4.2}$$

where $F^*$ is the limit of the decreasing sequence $\{F(x_k)\}$. In particular, the step-sizes $\|x_{k+1} - x_k\|$ tend to zero. Moreover, one can readily verify that for any limit point $\bar x$ of the iterate sequence $\{x_k\}$, the equality $F(\bar x) = \lim_k F(x_k)$ holds. Consequently, by Corollary 3.3, the point $\bar x$ is stationary for $F$:

We note that stationarity of the limit point is well-known and can be proved by other means; see for example the discussion in [10]. From (4.2), one also concludes the rate
$$\min_{k=0,\ldots,N}\, \|x_{k+1} - x_k\| \;\le\; \sqrt{\frac{2\big(F(x_0) - F^*\big)}{L\beta\,(N+1)}} \;=\; O\!\left(\frac{1}{\sqrt{N+1}}\right).$$

What is the relationship of this rate to near-stationarity of the iterates? Corollary 3.2 shows that after $N$ iterations, one is guaranteed to find an iterate $x_k$ such that there exists a point $\hat x$ satisfying

Let us now move on to linear rates of convergence. Fix a limit point $\bar x$ of the iterate sequence $\{x_k\}$ and let $S$ be the set of stationary points of $F$. Then Theorem 3.5 shows that the regularity condition

  • (Slope error-bound) .

implies

  • (Step-size error-bound)

Additionally, in the next section (Corollary 5.7) we will show that the slope error-bound also implies

  • (Model error-bound)

    whenever and lies in .

It was, in fact, proved in [17, Theorem 5.10] that the slope and step-size error bounds are equivalent up to a change of constants. Moreover, as advertised in the introduction, the above implications were used in [17, Theorem 5.5] to show that if the slope error-bound holds then the function values converge linearly:

where

The rate improves under a stronger regularity condition, called tilt-stability [17, Theorem 6.3]; the argument for the better rate again crucially employs a comparison of step-lengths and subgradients at nearby points.

Our underlying assumption is that the models are easy to minimize, by an interior point method for example. This assumption may not be realistic in some large-scale applications. Instead, one must solve the subproblems (4.1) inexactly by a first-order method. How should one choose the accuracy?

Fix a sequence of tolerances $\epsilon_k > 0$ and suppose that each iterate $x_{k+1}$ satisfies

Then one can establish the estimate

where $x_{k+1}^\ast$ denotes the true minimizer of the subproblem (4.1). The details will appear in a forthcoming paper. Solving (4.1) to accuracy $\epsilon_k$ can be done in multiple ways using a saddle point reformulation. We treat the simplest case here, where $h$ is a support function of a closed bounded set $Q$ – a common setting in applications. We can then write

Dual to the problem is the maximization problem

For such problems, there are primal-dual methods [34, Method 2] that generate iterates and satisfying

after iterations. The cost of each iteration is dominated by a matrix-vector multiplication, a projection onto $Q$, and a proximal operation of $g$. Assuming that the generated iterates are bounded throughout the algorithm, to obtain an accurate solution to the subproblem (4.1) requires on the order of such basic operations. Setting for some , we deduce . Thus we can find a point satisfying after at most on the order of