# Inexact Successive Quadratic Approximation for Regularized Optimization

Version of August 9, 2019.

## Abstract

Successive quadratic approximations, or second-order proximal methods, are useful for minimizing functions that are a sum of a smooth part and a convex, possibly nonsmooth part that promotes a regularized solution. Most analyses of iteration complexity focus on the special case of the proximal gradient method, or accelerated variants thereof. There have been only a few studies of methods that use a second-order approximation to the smooth part, due in part to the difficulty of obtaining closed-form solutions to the subproblems at each iteration. In practice, iterative algorithms need to be used to find inexact solutions to the subproblems. In this work, we present a global analysis of the iteration complexity of inexact successive quadratic approximation methods, showing that it is sufficient to obtain an inexact solution of the subproblem to fixed multiplicative precision in order to guarantee the same order of convergence rate as the exact version, with complexity related proportionally to the degree of inexactness. Our result allows flexible choices of the second-order terms, including Newton and quasi-Newton choices, and does not necessarily require more time to be spent on the subproblem solves on later iterations. For problems exhibiting a property related to strong convexity, the algorithm converges at a global linear rate. For general convex problems, the convergence rate is linear in early stages, while the overall rate is $O(1/k)$. For nonconvex problems, a first-order optimality criterion converges to zero at a rate of $O(1/\sqrt{k})$.

###### Keywords:
Convex optimization · Nonconvex optimization · Regularized optimization · Composite optimization · Variable metric · Proximal method · Second-order approximation · Inexact method

## 1 Introduction

We consider the following regularized optimization problem:

$$\min_{x} \; F(x) := f(x) + g(x), \tag{1}$$

where $f$ is $L$-Lipschitz-continuously differentiable, and $g$ is convex, extended-valued, proper, and closed, but might be nondifferentiable. Moreover, we assume that $F$ is lower-bounded and the solution set $\Omega$ of (1) is non-empty. Unlike many other works on this topic, we focus on the case in which $g$ does not necessarily have a simple structure, such as (block) separability, which allows a prox-operator to be calculated in closed form (or at least economically). Rather, we assume that subproblems involving $g$ are solved inexactly, by an iterative process.

Problems of the form (1) arise in many contexts. The function $g$ could be an indicator function for a trust region or a convex feasible set. It could be a multiple of an $\ell_1$ norm or a sum-of-$\ell_2$ norms. It could be the nuclear norm for a matrix variable, or the sum of absolute values of the elements of a matrix. It could be a smooth convex function, such as $\|x\|_2^2$ or the squared Frobenius norm of a matrix. Finally, it could be a combination of several of these elements, as happens when different types of structure are present in the solution. In several of these situations, the prox-operator involving $g$ is expensive to calculate exactly.

We propose algorithms that generate a sequence $\{x^k\}_{k=0,1,\dots}$ from some starting point $x^0$, and solve the following subproblem at each iteration, for some symmetric matrix $H_k$:

$$\arg\min_{d \in \mathbb{R}^n} \; Q^{x^k}_{H_k}(d) := \nabla f(x^k)^T d + \frac{1}{2} d^T H_k d + g(x^k + d) - g(x^k). \tag{2}$$

We abbreviate the objective in (2) as $Q_k(\cdot)$, or as $Q(\cdot)$ when we focus on the inner workings of iteration $k$. In some results, we allow $H_k$ to have zero or negative eigenvalues, provided that $Q_k$ itself is convex. (In some cases, strong convexity in $g$ may overcome lack of strong convexity in the quadratic part of (2).)

In the special case of the proximal-gradient algorithm (ComW05a; WriNF08a), where $H_k$ is a positive multiple of the identity, the subproblem (2) can often be solved cheaply, particularly when $g$ is (block) separable, by means of a prox-operator involving $g$. For more general choices of $H_k$, or for more complicated regularization functions $g$, it may make sense to solve (2) by an iterative process, such as accelerated proximal gradient or coordinate descent. Since it may be too expensive to run this iterative process to obtain a high-accuracy solution of (2), we consider the possibility of an inexact solution. In this paper, we assume that the inexact solution satisfies the following condition, for some constant $\eta \in [0,1)$:

$$Q(d) - Q^* \le \eta \left(Q(0) - Q^*\right) \quad\Leftrightarrow\quad Q(d) \le (1-\eta)\, Q^*, \tag{3}$$

where $Q^* := \inf_d Q(d)$. (The equivalence uses $Q(0) = 0$.) The choice $\eta = 0$ corresponds to exact solution of (2). Other choices of $\eta \in (0,1)$ ensure inexact solutions to within a multiplicative constant.

The condition (3) is studied in (BonLPP16a, Section 4.1), which applies a primal-dual approach to (2) to satisfy it. In this vein, if we have access to a lower bound $Q_{LB} \le Q^*$ (obtained by finding a feasible point for the dual of (2), or other means), then any inexact solution $d$ satisfying $Q(d) \le (1-\eta)\, Q_{LB}$ also satisfies (3).
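To make condition (3) concrete, the following minimal sketch works in a setting where $Q^*$ is available in closed form. It assumes, purely for illustration, that $H = cI$ and $g = \lambda\|\cdot\|_1$, so that (2) separates by coordinate and is solved by soft-thresholding. The function names and the numerical data are hypothetical and are not taken from the paper.

```python
def soft_threshold(v, t):
    """Prox-operator of t*|.| at the scalar v (soft-thresholding)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def q_value(grad, c, lam, x, d):
    """Objective of (2) with H = c*I and g = lam*||.||_1 (assumed for this sketch)."""
    lin = sum(gi * di for gi, di in zip(grad, d))
    quad = 0.5 * c * sum(di * di for di in d)
    reg = lam * sum(abs(xi + di) - abs(xi) for xi, di in zip(x, d))
    return lin + quad + reg


def exact_step(grad, c, lam, x):
    """Exact minimizer of Q in this separable case, one coordinate at a time."""
    return [soft_threshold(xi - gi / c, lam / c) - xi for xi, gi in zip(x, grad)]


def satisfies_condition_3(q_d, q_lower, eta):
    """Check Q(d) <= (1 - eta) * q_lower, where q_lower <= Q* <= 0."""
    return q_d <= (1.0 - eta) * q_lower


# Hypothetical data: current point, gradient of f there, curvature c, weight lam.
x = [1.0, -0.5, 0.0]
grad = [2.0, -0.1, 0.3]
c, lam = 2.0, 0.5

d_star = exact_step(grad, c, lam, x)
q_star = q_value(grad, c, lam, x, d_star)

# A crude inexact solution: half of the exact step.
d_half = [0.5 * di for di in d_star]
q_half = q_value(grad, c, lam, x, d_half)
```

In this toy instance the half step recovers roughly two thirds of the optimal decrease, so it passes the test for $\eta = 0.5$ but not for $\eta = 0.1$. Passing any valid lower bound in place of `q_star` gives the sufficient test described in the paragraph above.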

In practical situations, we can often be sure that our approximate solution satisfies (3) for some $\eta \in [0,1)$ even though we might not know the value of $\eta$. For instance, if $Q$ is strongly convex and we apply an iterative solver that converges at a global linear rate $(1-\tau)$ for some $\tau \in (0,1)$, then the “inner” iteration sequence $\{d^{(t)}\}$ (starting with $d^{(0)} = 0$) for solving (2) satisfies

$$Q(d^{(t)}) - Q^* \le (1-\tau)^t \left(Q(0) - Q^*\right), \quad t = 0, 1, 2, \dots.$$

If we fix the number of inner iterations at $T$ (say), then $d^{(T)}$ satisfies (3) with $\eta = (1-\tau)^T$. On the other hand, if we wish to attain a certain target accuracy $\eta$ and have an estimate of the convergence rate $\tau$, we can choose the number of iterations $T$ large enough that $(1-\tau)^T \le \eta$. Note that $\tau$ depends on the extreme eigenvalues of $H_k$ in many algorithms; we can therefore choose $H_k$ to ensure that $\tau$ is restricted to a certain range for all $k$.
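The relation between the inner iteration count and the accuracy parameter can be sketched as follows. The helper names are hypothetical, and the sketch assumes the linear-rate bound displayed above holds with a known $\tau$.

```python
import math


def eta_after(T, tau):
    """Accuracy eta = (1 - tau)**T guaranteed after T inner iterations."""
    return (1.0 - tau) ** T


def inner_iterations_for(eta, tau):
    """Smallest integer T >= 0 with (1 - tau)**T <= eta."""
    if not (0.0 < eta <= 1.0 and 0.0 < tau < 1.0):
        raise ValueError("need eta in (0, 1] and tau in (0, 1)")
    T = math.ceil(math.log(eta) / math.log(1.0 - tau))
    # Guard against floating-point rounding in the logarithms.
    while (1.0 - tau) ** T > eta:
        T += 1
    while T > 0 and (1.0 - tau) ** (T - 1) <= eta:
        T -= 1
    return T
```

For example, with $\tau = 0.1$ a target of $\eta = 0.5$ needs 7 inner iterations, since $0.9^6 \approx 0.531$ and $0.9^7 \approx 0.478$.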

Empirically, we observe that Q-linear methods for solving (2) often have rapid convergence in their early stages, with slower convergence later. (We find theoretical support for this observation in Theorems 3.3 and 3.4.) This observation suggests that a moderate value of $\eta$ may be preferable to a smaller value, because moderate accuracy is attainable in disproportionately fewer iterations than high accuracy.

In this paper, we describe algorithms based on inexact solutions of the subproblem (2) with accuracy (3) for a fixed choice of $\eta \in [0,1)$. We examine in particular the number of outer iterations (measured by the index $k$) required to solve (1) to a given accuracy $\epsilon$. We show that the effect of inexact solutions on the iteration complexity is benign, that is, the number of iterations increases by a modest factor (which depends of course on $\eta$) over approaches that require exact solution of (2) for each $k$.

### 1.1 The Algorithms

To build complete algorithms around the subproblem (2), we either do a backtracking line search along the inexact solution $d^k$, or adjust $H_k$ and recompute $d^k$, seeking in both cases to satisfy a familiar “sufficient decrease” criterion.

We present three such algorithms. The first uses a backtracking line search approach with a modified Armijo rule, an approach presented in TseY09a. Given the current point $x^k$, the update direction $d^k$ and parameters $\beta, \gamma \in (0,1)$, this procedure finds the smallest nonnegative integer $i$ such that the step size $\alpha_k = \beta^i$ satisfies

$$F(x^k + \alpha_k d^k) \le F(x^k) + \alpha_k \gamma \Delta_k, \tag{4}$$

where

$$\Delta_k := \nabla f(x^k)^T d^k + g(x^k + d^k) - g(x^k). \tag{5}$$

This version appears as Algorithm 1. The exact version of this algorithm can be considered as a special case of the block-coordinate descent algorithm of TseY09a. In BonLPP16a, Algorithm 1 (with possibly a different criterion on $d^k$) is called the “variable metric inexact line-search-based method”. We avoid the term “metric” because we consider the possibility of indefinite $H_k$ in some of our results. The paper BonLPP16a also considers more complicated metrics, not representable by a matrix norm. Since our analysis makes use only of the smallest and largest eigenvalues of $H_k$ (which directly correspond to the strong convexity and Lipschitz continuity parameters of the quadratic approximation term), we could also generalize our approach to this setting. We present only the matrix-representable case, however, as it allows a more direct comparison with the second and third algorithms presented below.
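The backtracking rule (4)-(5) can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's Algorithm 1 verbatim: the function names, the default parameters, the `max_backtracks` safeguard, and the one-dimensional toy problem ($f(x) = \tfrac12\|x-b\|^2$, $g = \|\cdot\|_1$) are all hypothetical choices.

```python
def delta_k(grad, g, x, d):
    """Delta from (5): grad^T d + g(x + d) - g(x)."""
    x_plus_d = [xi + di for xi, di in zip(x, d)]
    return sum(gi * di for gi, di in zip(grad, d)) + g(x_plus_d) - g(x)


def backtracking_step(F, x, d, delta, beta=0.5, gamma=0.1, max_backtracks=50):
    """Return (alpha, x + alpha*d) with alpha = beta**i for the smallest i satisfying (4)."""
    if delta >= 0:
        raise ValueError("Delta must be negative for d to be certified as a descent direction")
    fx = F(x)
    alpha = 1.0
    for _ in range(max_backtracks + 1):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if F(trial) <= fx + alpha * gamma * delta:
            return alpha, trial
        alpha *= beta
    raise RuntimeError("line search did not terminate within max_backtracks")


# Toy instance (an assumption for illustration only).
b = [3.0]
f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
g = lambda x: sum(abs(xi) for xi in x)
F = lambda x: f(x) + g(x)

x0 = [0.0]
grad0 = [xi - bi for xi, bi in zip(x0, b)]  # gradient of f at x0
d0 = [4.0]  # a deliberately overlong (inexact) direction
delta0 = delta_k(grad0, g, x0, d0)
alpha0, x1 = backtracking_step(F, x0, d0, delta0)
```

Here the unit step fails the test (4) because the direction overshoots, and one backtrack is enough to obtain sufficient decrease.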

The second and third algorithms use the following sufficient decrease criterion:

$$F(x) - F(x + d) \ge -\gamma\, Q^{x}_{H}(d), \tag{6}$$

for a given parameter $\gamma$. If this criterion is not satisfied, the second algorithm multiplies $H$ by a constant $\beta^{-1}$ (where again $\beta \in (0,1)$ is a given parameter) and recomputes $d$. We assume in this algorithm that the initial choice of $H$ is positive definite, so that all eigenvalues are positive and grow successively by a factor of $\beta^{-1}$ until sufficient decrease is achieved. The third algorithm uses a similar strategy, except that $H$ is modified by adding a successively larger multiple of the identity to it whenever the criterion (6) fails to hold. (This algorithm requires only positive semidefiniteness of the initial estimate of $H$.) These two approaches are defined as Algorithms 2 and 3, respectively.
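The two ways of modifying $H$ can be sketched as below. Several choices are assumptions made only so that the example runs: $H$ is diagonal and positive, $g = \lambda\|\cdot\|_1$ (so the subproblem is solved exactly by soft-thresholding), and the additive schedule $H^0 + tI$ with $t = 1, \beta^{-1}, \beta^{-2}, \dots$ is one plausible reading of "successively larger multiple of the identity". None of the names come from the paper.

```python
def soft_threshold(v, t):
    """Prox-operator of t*|.| at the scalar v."""
    return max(v - t, 0.0) - max(-v - t, 0.0)


def solve_diag_subproblem(grad, h, lam, x):
    """Exact minimizer of (2) when H = diag(h), h > 0, and g = lam*||.||_1."""
    return [soft_threshold(xi - gi / hi, lam / hi) - xi
            for xi, gi, hi in zip(x, grad, h)]


def q_value(grad, h, lam, x, d):
    """Objective of (2) for diagonal H and the l1 regularizer."""
    lin = sum(gi * di for gi, di in zip(grad, d))
    quad = 0.5 * sum(hi * di * di for hi, di in zip(h, d))
    reg = lam * sum(abs(xi + di) - abs(xi) for xi, di in zip(x, d))
    return lin + quad + reg


def modified_h_step(F, grad, h, lam, x, gamma=0.5, beta=0.5,
                    additive=False, max_tries=60):
    """Enlarge H until the sufficient decrease test (6) holds.

    additive=False: H <- H / beta               (in the style of Algorithm 2)
    additive=True:  H <- H0 + t*I, t growing    (assumed schedule, Algorithm 3 style)
    """
    h0, shift, fx = list(h), 1.0, F(x)
    for _ in range(max_tries):
        d = solve_diag_subproblem(grad, h, lam, x)
        x_new = [xi + di for xi, di in zip(x, d)]
        if fx - F(x_new) >= -gamma * q_value(grad, h, lam, x, d):
            return x_new, h
        if additive:
            h = [hi + shift for hi in h0]
            shift /= beta
        else:
            h = [hi / beta for hi in h]
    raise RuntimeError("sufficient decrease not reached within max_tries")


# Toy problem: f(x) = 2*||x - b||^2 (so L = 4), g = ||x||_1, start at the origin.
b = [1.0, -2.0]
F = lambda x: 2.0 * sum((xi - bi) ** 2 for xi, bi in zip(x, b)) + sum(abs(xi) for xi in x)
x_start = [0.0, 0.0]
grad_start = [4.0 * (xi - bi) for xi, bi in zip(x_start, b)]

x_mult, h_mult = modified_h_step(F, grad_start, [1.0, 1.0], 1.0, x_start)
x_add, h_add = modified_h_step(F, grad_start, [1.0, 1.0], 1.0, x_start, additive=True)
```

Starting from $H = I$, which underestimates the curvature of $f$, the multiplicative variant stops at $H = 4I$ while the additive variant stops at $H = 3I$ on this instance. Both accept a step that decreases $F$.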

Algorithm 3 is similar to the method proposed in SchT16a; GhaS16a, and can be seen as interpolating between the step from the original $H$ and the proximal gradient step. Rather than our multiplicative criterion (3), the works SchT16a; GhaS16a use an additive criterion to measure inexactness of the solution. This tolerance must then be reduced to zero at a certain rate as the algorithm progresses, so that in the analysis of SchT16a; GhaS16a the number of inner iterations per outer iteration grows as the algorithms progress. By contrast, we attain satisfactory performance (both in theory and practice) for a fixed value $\eta \in [0,1)$ in (3).

Algorithms 1 and 2 are direct extensions of backtracking line search in the smooth case, while Algorithm 3 is related to the trust-region approach. Which of these three algorithms is “best”? The answer depends on the circumstances. When (2) is expensive to solve, Algorithm 1 may make sense, as it requires such a solution just once on each outer iteration.

Variants and special cases of the algorithms above have been discussed extensively in the literature. Proximal gradient algorithms have $H_k = \xi_k I$ for some $\xi_k > 0$ (ComW05a; WriNF08a); proximal-Newton uses $H_k = \nabla^2 f(x^k)$ (LeeSS14a; RodK16a; LiAV17a); proximal-quasi-Newton and variable metric use quasi-Newton approximations for $H_k$ (SchT16a; GhaS16a; ChoPR14a; BonLPP16a; BonLPPR17a). The term “successive quadratic approximation” is also used by ByrNO16a. Our methods can even be viewed as a special case of block-coordinate descent (TseY09a) with a single block. The key difference in this work is the use of the inexactness criterion (3), while existing works either assume exact solution of (2), or use a different criterion that requires increasing accuracy as the number of outer iterations grows. Some of these works provide only an asymptotic convergence guarantee and a local convergence rate, with a lack of clarity about when the fast local convergence rate will take effect. An exception is BonLPP16a, in which the criterion (3) is called $\eta$-approximation (where their $\eta$ is equivalent to our $1-\eta$). However, this paper gives convergence rate analysis only for convex $f$ and requires existence of a scalar $\mu$ and a sequence $\{\zeta_k\}$ such that

$$\mu \ge 1, \quad \sum_{k=0}^{\infty} \zeta_k < \infty, \quad \zeta_k \ge 0, \quad H_{k+1} \preceq (1+\zeta_k) H_k, \quad \mu I \succeq H_k \succeq \frac{1}{\mu} I, \quad \forall k. \tag{7}$$

This condition may preclude such useful and practical choices of $H_k$ as the Hessian and quasi-Newton approximations. We believe that our setting may be more general, practical, and straightforward in some situations.

### 1.2 Contribution

This paper shows that, when the initial value of $H_k$ at all outer iterations is chosen appropriately and (3) is satisfied for all iterations, the objectives of the three algorithms converge at a global Q-linear rate under an “optimal set strong convexity” condition defined in (10), and at a sublinear rate for general convex functions. When $F$ is nonconvex, we show sublinear convergence of the first-order optimality condition. Moreover, to discuss the relation between the subproblem solution precision and the convergence rate, we show that the iteration complexity is proportional to $1/(1-\eta)$ for Algorithms 2 and 3, and proportional to $1/(1-\sqrt{\eta})$ or $1/(1-\sqrt{\eta})^2$ for Algorithm 1.

In comparison to existing works, our major contributions are as follows.

• We quantify how the inexactness criterion (3) affects the step size of Algorithm 1 and the overall iteration complexity of all our algorithms. We discuss why line search can potentially improve the convergence speed with properly selected quadratic approximations.

• We provide a global convergence rate result to a first-order optimality condition for the case of nonconvex $f$ in (1) for general choices of $H_k$, without additional assumptions other than the Lipschitz continuity of $\nabla f$.

• The global R-linear convergence result for a similar algorithm in GhaS16a, which holds when $F$ is strongly convex, is improved to a global Q-linear convergence result for a broader class of problems.

• For general convex problems, in addition to the known sublinear ($O(1/k)$) convergence rate, we show linear convergence with a rate independent of the conditioning of the problem in the early stages of the algorithm.

• Faster linear convergence in the early iterations also applies to problems with global Q-linear convergence, explaining the well-known observation that many methods converge rapidly in their early stages before settling down to a slower rate later.

• The proximal-gradient method is a special case of our approach, and it reduces to steepest-descent on smooth functions when $g$ is not present. We show that our analysis matches known convergence results in these settings. These results are improved, in that “early linear convergence” for the special cases has not been discussed (to our knowledge), and the convergence rate we obtain for the nonconvex case is sharper than existing results for proximal-gradient.

### 1.3 Related Work

Our general framework and approach, and special cases thereof, have been widely studied in the literature. We discuss some of these works and their connections to our paper.

When $g$ is the indicator function of a convex constraint set, our approach includes an inexact variant of a constrained Newton or quasi-Newton method. There are a number of papers on this approach, but their convergence results generally have a different flavor from ours. They typically show only asymptotic convergence rates, together with global convergence results without rates, under weaker smoothness and convexity assumptions on $f$ than we make here. For example, when $g$ is the indicator function of a “box” defined by bound constraints, ConGT88a applies a trust-region framework to solve (2) approximately, and shows asymptotic convergence. The paper ByrLNZ95a uses a line-search approach, with $H_k$ defined by an L-BFGS update, and omits convergence results. For constraint sets defined by linear inequalities, or general convex constraints, BurMT90a shows global convergence of a trust region method using the Cauchy point. A similar approach using the exact Hessian as $H_k$ is considered in LinM99a, proving local superlinear or quadratic convergence in the case of linear constraints.

Turning to our formulation (1) in its full generality, Algorithm 1 is analyzed in BonLPP16a, who refer to the condition (3) as “$\eta$-approximation.” (Their $\eta$ is equivalent to $1-\eta$ in our notation.) This paper shows asymptotic convergence of $Q_k(d^k)$ to zero without requiring convexity of $F$, Lipschitz continuity of $\nabla f$, or a fixed value of $\eta$. The only assumptions are that $Q_k(d^k) < 0$ for all $k$ and the objective converges to a point (which always happens when $F$ is bounded below). Under the additional assumptions that $\nabla f$ is Lipschitz continuous, $f$ is convex, (7), and (3), they showed an $O(1/k)$ convergence of the objective value. The same authors considered convergence for nonconvex functions satisfying a Kurdyka-Łojasiewicz condition in BonLPPR17a, but the exact rates are not given. Our results differ in not requiring the assumption (7), and we are more explicit about the dependence of the rates on $\eta$. Moreover, we show detailed convergence rates for several different classes of problems.

A version of Algorithm 2 without line search but directly assuming

$$F(x^k + d^k) \le F(x^k) + Q_k(d^k) \quad \text{for all } k \tag{8}$$

is considered in ChoPR14a. They showed asymptotic convergence, but no rates were given.

Convergence of an inexact proximal-gradient method (for which $H_k = LI$ for all $k$) is discussed in SchRB11a. They also discuss its accelerated version for convex and strongly convex problems. Because of this choice of $H_k$, (8) always holds. Instead of our multiplicative inexactness criterion, they assume an additive error of the form

$$Q_k(d^k) \le Q_k^* + \epsilon_k. \tag{9}$$

Their analysis allows for error in the gradient term in (2). They show that for general convex problems, the objective value converges at an $O(1/k)$ rate under the assumption that $\sum_k \sqrt{\epsilon_k}$ and $\sum_k \|e_k\|$ converge, where $e_k$ denotes the error in the gradient at iteration $k$. For strongly convex problems, they proved R-linear convergence of $\|x^k - x^*\|$, provided that both $\|e_k\|$ and $\sqrt{\epsilon_k}$ decrease linearly to zero. Our analysis, when specialized to proximal gradient and the strongly convex case, shows a Q-linear rate (rather than R-linear) and applies to the convergence of the objective value rather than the iterate.

Algorithm 3 is proposed in SchT16a (); GhaS16a () for convex and strongly convex objective, with inexactness defined additively as in (9). For convex , SchT16a () showed that if and converge then an convergence rate is achievable. The same rate can be achieved if for any . When is -strongly convex, GhaS16a () showed that if is finite (where , is the upper bound for , and is as defined in (6)), then a global R-linear convergence rate is attained. In both cases, the conditions require that decreases at least at a certain speed and, according to their analysis, this tolerance can be achieved by performing more and more inner iterations as increases. As we have noted, our multiplicative criterion can be attained with a fixed number of inner iterations. Moreover, we attain a Q-linear rather than an R-linear result.

Algorithm 1 is also considered in LeeSS14a, with $H_k$ set either to $\nabla^2 f(x^k)$ or a BFGS approximation. Global convergence and a local convergence rate are shown for the exact case. For inexact subproblem solutions, local results are proved under the assumption that the unit step size is always taken (which may not hold true for inexact steps without further precautions being taken). A variant of Algorithm 1 with a different step size criterion is discussed in ByrNO16a, for the special case of $g(x) = \|x\|_1$. Inexactness of the subproblem solution is measured by the norm of a proximal-gradient step for $Q$. By utilizing specific properties of the $\ell_1$ norm, they showed a global convergence rate on the norm of the proximal gradient step on $F$ to zero, without requiring convexity of $f$. Thus, their result is similar to ours for the nonconvex case. However, their result cannot be extended directly to the case of general $g$, and our inexactness condition does not require the additional cost of computing the proximal gradient step on $Q$. When $H_k$ is the Hessian or the BFGS approximation, they obtain for the inexact version local convergence results similar to the exact case proved in LeeSS14a.

For the case in which is convex, thrice continuously differentiable, and self-concordant, and is the indicator function of a closed convex set, TraKC14a () analyzed global and local convergence rates of inexact damped proximal Newton with a fixed step size. LiAV17a () further extends the convergence analysis to general convex . However, it does not seem possible to generalize these results to general and non-self-concordant .

### 1.4 Outline

The remainder of this paper is organized as follows. In Section 2 we introduce notation and preliminaries for further analysis. Convergence analysis appears in Section 3 for Algorithms 1, 2, and 3, covering both convex and nonconvex problems. Some interesting and practical choices of $H_k$ are discussed in Section 4. We provide preliminary numerical results in Section 5. Some final comments appear in Section 6.

## 2 Notations and Preliminaries

The norm $\|\cdot\|$, when applied to vectors, denotes the Euclidean norm, whereas when applied to a symmetric matrix $A$, it denotes the operator norm, which equals the spectral radius of $A$. For any symmetric matrix $A$, $\lambda_{\min}(A)$ denotes its smallest eigenvalue. For any two symmetric matrices $A$ and $B$, $A \succeq B$ (respectively $A \succ B$) denotes that $A - B$ is positive semidefinite (respectively positive definite). For any non-smooth function $F$, $\partial F$ denotes the set of its directional derivatives, and when $F$ is convex, this is the same as the subdifferential. When the minimum $F^*$ of $F$ is attainable, we denote the solution set by

$$\Omega := \{x \mid F(x) = F^*\},$$

and define $P_\Omega(x)$ as the (Euclidean-norm) projection of $x$ onto $\Omega$.

In some results, we use a particular strong convexity assumption to obtain a faster rate. We say that $F$ satisfies the optimal set strong convexity condition if there exists $\mu \ge 0$ such that for any $x$ and any $\lambda \in [0,1]$, we have

$$F\left(\lambda x + (1-\lambda) P_\Omega(x)\right) \le \lambda F(x) + (1-\lambda) F^* - \frac{\mu \lambda (1-\lambda)}{2} \left\|x - P_\Omega(x)\right\|^2. \tag{10}$$

This is a weaker condition than $\mu$-strong convexity. We do not require the strong convexity to hold globally, but only between the current point and its projection onto the solution set. Some examples of functions that are not strongly convex but satisfy (10) include:

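• $F(x) = h(Ax)$, where $h$ is strongly convex, and $A$ is any matrix;

• $F(x) = h(Ax) + \mathbf{1}_X(x)$, where $X$ is a polyhedron;

• Squared-hinge loss: $F(x) = \sum_i \max(0, a_i^T x - b_i)^2$.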

(Arguments to show that these functions satisfy (10) are similar to those in LiuW15b () and hence are omitted here.)

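Turning to the subproblem (2) and the definition of $\Delta$ in (5), we find a condition for $d$ to be a descent direction.

###### Lemma 1

$d$ is a descent direction for $F$ at $x$ if $\Delta < 0$.

###### Proof

From the convexity of $g$, the Lipschitz continuity of $\nabla f$, and the mean value theorem, we have for all $\lambda \in [0,1]$ that

$$
\begin{aligned}
F(x) - F(x + \lambda d) &= f(x) - f(x + \lambda d) + g(x) - g(x + \lambda d) \\
&\ge f(x) - f(x + \lambda d) + g(x) - \left((1-\lambda) g(x) + \lambda g(x + d)\right) \\
&= -\lambda \nabla f(x)^T d + \lambda \left(g(x) - g(x+d)\right) - \lambda \int_0^1 \left(\nabla f(x + t\lambda d) - \nabla f(x)\right)^T d \, dt \\
&\ge -\lambda \Delta - \frac{1}{2} \lambda^2 L \|d\|^2.
\end{aligned} \tag{11}
$$

Therefore, if $\Delta < 0$, when $\lambda$ is small enough, $F(x) - F(x + \lambda d)$ will be larger than zero, indicating that $d$ is a descent direction.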

The following lemma motivates our algorithms.

###### Lemma 2

If $Q^x_H$ is convex, then any $d$ that makes $Q^x_H(d) < 0$ is a descent direction for $F$ at $x$.

###### Proof

Note that $Q^x_H(0) = 0$. Therefore, if $Q^x_H$ is convex, we have

$$\lambda \nabla f(x)^T d + \frac{\lambda^2}{2} d^T H d + g(x + \lambda d) - g(x) = Q^x_H(\lambda d) \le \lambda Q^x_H(d) < 0,$$

for all $\lambda \in (0,1]$. It follows that $\lambda \nabla f(x)^T d + g(x + \lambda d) - g(x) < 0$ for all sufficiently small $\lambda$. Therefore, from Lemma 1, $\lambda d$ is a descent direction, and since $d$ and $\lambda d$ only differ in their lengths, so is $d$.

To ensure the convexity of $Q^x_H$, we need only positive semidefiniteness of $H$. However, Lemma 2 can be applied even when $H$ has negative eigenvalues, as $g$ may have a strong convexity property that ensures convexity of $Q^x_H$. Lemma 2 then suggests that no matter how coarse the approximate solution of (2) is, as long as it is better than $d = 0$ for a convex $Q^x_H$, it results in a descent direction. This fact implies finite termination of the backtracking line search on the step size in Algorithm 1.

## 3 Convergence Analysis

We start our analysis for all three algorithms by showing finite termination of the line search procedures. We then discuss separately three classes of problems involving different assumptions on $F$, namely, that $F$ satisfies optimal set strong convexity (10), that $F$ is convex, and that $F$ is nonconvex. Different iteration complexities are proved in each case.

### 3.1 Line Search Iteration Bound

We show that the line search procedures have finite termination. The following lemma for the backtracking line search in Algorithm 1 does not require $H$ to be positive definite, though it does require strong convexity of $Q$ defined in (2).

###### Lemma 3

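If $Q^x_H$ is $\sigma$-strongly convex for some $\sigma > 0$, and (2) is solved at least $(1-\eta)$-approximately for some $\eta \in [0,1)$, then for $\Delta$ defined in (5), we have

$$\Delta \le -\frac{1}{2} \left( \frac{(1-\sqrt{\eta})\,\sigma}{1+\sqrt{\eta}} + \lambda_{\min}(H) \right) \|d\|^2. \tag{12}$$

Moreover, if $(1-\sqrt{\eta})\,\sigma + (1+\sqrt{\eta})\,\lambda_{\min}(H) > 0$, then the backtracking line search procedure in Algorithm 1 terminates in finite steps and produces a step size $\alpha$ that satisfies the following lower bound:

$$\alpha \ge \min\left\{ 1, \; \frac{\beta (1-\gamma) \left( (1-\sqrt{\eta})\,\sigma + (1+\sqrt{\eta})\,\lambda_{\min}(H) \right)}{L (1+\sqrt{\eta})} \right\}. \tag{13}$$

###### Proof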

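From (3) and strong convexity of $Q$, we have that for any $\lambda \in [0,1]$,

$$
\begin{aligned}
\frac{1}{1-\eta} \left( Q(0) - Q(d) \right) &\ge Q(0) - Q^* \ge Q(0) - Q(\lambda d) \\
&\ge Q(0) - \left( \lambda Q(d) + (1-\lambda) Q(0) - \frac{\sigma \lambda (1-\lambda)}{2} \|d\|^2 \right).
\end{aligned}
$$

Since $Q(0) = 0$, we obtain by substituting from the definition of $Q$ that

$$\frac{1}{1-\eta} \left( \nabla f(x)^T d + \frac{1}{2} d^T H d + g(x+d) - g(x) \right) \le \lambda \left( \nabla f(x)^T d + \frac{1}{2} d^T H d + g(x+d) - g(x) \right) - \frac{\sigma \lambda (1-\lambda)}{2} \|d\|^2.$$

Since $\frac{1}{1-\eta} \ge 1 \ge \lambda$, we have

$$
\begin{aligned}
\left( \frac{1}{1-\eta} - \lambda \right) \Delta &\le -\frac{1}{2} \left( \frac{1}{1-\eta} - \lambda \right) d^T H d - \frac{\sigma \lambda (1-\lambda)}{2} \|d\|^2 \\
&\le -\frac{1}{2} \left( \left( \frac{1}{1-\eta} - \lambda \right) \lambda_{\min}(H) + \sigma \lambda (1-\lambda) \right) \|d\|^2,
\end{aligned} \tag{14}
$$

so that

$$\Delta \le -\frac{1}{2} \left( \frac{\sigma \lambda (1-\lambda)}{\frac{1}{1-\eta} - \lambda} + \lambda_{\min}(H) \right) \|d\|^2. \tag{15}$$

This bound holds for any $\lambda \in [0,1]$, so we make the following specific choice of $\lambda$ in this range:

$$\lambda = \frac{1-\sqrt{\eta}}{1-\eta}.$$

For this value of $\lambda$, we have

$$1 - \lambda = \sqrt{\eta}\, \lambda, \qquad \frac{1}{1-\eta} - \lambda = \frac{\sqrt{\eta}}{1-\eta}.$$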

The result (12) follows by substituting these identities into (15).
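The substitution step can be checked numerically. The sketch below evaluates the $\sigma$-dependent coefficient in (15) at $\lambda = (1-\sqrt{\eta})/(1-\eta)$ and compares it with the coefficient $(1-\sqrt{\eta})\sigma/(1+\sqrt{\eta})$ in (12). The function names are hypothetical. A short calculus argument suggests this $\lambda$ also maximizes the coefficient over $[0,1)$, and the grid comparison in the last helper is consistent with that. The case $\eta = 0$ is excluded because the quotient in (15) becomes $0/0$ there.

```python
import math


def lemma3_lambda(eta):
    """The specific choice lambda = (1 - sqrt(eta)) / (1 - eta), for eta in (0, 1)."""
    return (1.0 - math.sqrt(eta)) / (1.0 - eta)


def coefficient_in_15(sigma, eta, lam):
    """sigma*lam*(1 - lam) / (1/(1 - eta) - lam): the sigma-term in (15)."""
    return sigma * lam * (1.0 - lam) / (1.0 / (1.0 - eta) - lam)


def coefficient_in_12(sigma, eta):
    """(1 - sqrt(eta))*sigma / (1 + sqrt(eta)): the sigma-term in (12)."""
    r = math.sqrt(eta)
    return (1.0 - r) * sigma / (1.0 + r)


def best_grid_coefficient(sigma, eta, points=1000):
    """Largest value of the (15) coefficient over a uniform grid of lambda in [0, 1)."""
    return max(coefficient_in_15(sigma, eta, i / points) for i in range(points))
```

For instance, with $\sigma = 1$ and $\eta = 0.25$ both coefficients equal $1/3$.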

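From the Lipschitz continuity of $\nabla f$ and the convexity of $g$, if the right-hand side of (12) is negative, we have from (11) and (12) that for any scalar $\alpha \in [0,1]$,

$$F(x + \alpha d) - F(x) \le \alpha \Delta + \frac{L \alpha^2}{2} \|d\|^2 \le \alpha \Delta - \frac{L \alpha^2 (1+\sqrt{\eta})}{(1-\sqrt{\eta})\,\sigma + (1+\sqrt{\eta})\,\lambda_{\min}(H)} \Delta.$$

Therefore, (4) is satisfied if

$$\alpha \Delta - \frac{L \alpha^2 (1+\sqrt{\eta})}{(1-\sqrt{\eta})\,\sigma + (1+\sqrt{\eta})\,\lambda_{\min}(H)} \Delta \le \alpha \gamma \Delta.$$

Since $\Delta < 0$, we thus get that (4) holds whenever

$$\alpha \le \frac{(1-\gamma) \left( (1-\sqrt{\eta})\,\sigma + (1+\sqrt{\eta})\,\lambda_{\min}(H) \right)}{L (1+\sqrt{\eta})}.$$

This leads to (13), when we introduce a factor $\beta$ to account for possible undershoot of the backtracking procedure.

Lemma 3 suggests that we can still obtain a certain amount of objective decrease as long as $\lambda_{\min}(H)$ is not too negative in comparison to the strong convexity parameter of $Q$. However, our main interest is in the case in which $H$ is positive semidefinite. When the strong convexity of $Q$ is completely due to the Hessian of the quadratic part (that is, $\sigma = \lambda_{\min}(H)$), we obtain the following simpler version of Lemma 3.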

###### Corollary 1

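If $\lambda_{\min}(H) = \sigma > 0$, and (2) is solved at least $(1-\eta)$-approximately for some $\eta \in [0,1)$, then we have

$$\Delta \le -\frac{\sigma}{1+\sqrt{\eta}} \|d\|^2. \tag{16}$$

Moreover, the backtracking line search procedure in Algorithm 1 terminates in finite steps and produces a step size $\alpha$ that satisfies the following lower bound:

$$\alpha \ge \bar{\alpha} := \min\left\{ 1, \; \frac{2 \beta (1-\gamma)\, \sigma}{L (1+\sqrt{\eta})} \right\}. \tag{17}$$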
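As a quick sanity check on how the two step-size bounds relate, the sketch below evaluates the right-hand sides of (13) and (17). The function names and the sample parameter values are hypothetical. When $\lambda_{\min}(H) = \sigma$ the two bounds coincide, and when the strong convexity comes entirely from $g$ (so $\lambda_{\min}(H) = 0$) the guaranteed step is smaller.

```python
import math


def step_lower_bound_13(beta, gamma, sigma, lam_min, L, eta):
    """Right-hand side of (13); requires the numerator to be positive."""
    r = math.sqrt(eta)
    numer = beta * (1.0 - gamma) * ((1.0 - r) * sigma + (1.0 + r) * lam_min)
    if numer <= 0.0:
        raise ValueError("bound (13) requires (1-sqrt(eta))*sigma + (1+sqrt(eta))*lam_min > 0")
    return min(1.0, numer / (L * (1.0 + r)))


def step_lower_bound_17(beta, gamma, sigma, L, eta):
    """Right-hand side of (17), the bar-alpha of Corollary 1."""
    return min(1.0, 2.0 * beta * (1.0 - gamma) * sigma / (L * (1.0 + math.sqrt(eta))))


# Hypothetical parameter values.
beta, gamma, sigma, L, eta = 0.5, 0.1, 1.0, 10.0, 0.25
alpha_bar = step_lower_bound_17(beta, gamma, sigma, L, eta)
alpha_same = step_lower_bound_13(beta, gamma, sigma, sigma, L, eta)
alpha_psd_only = step_lower_bound_13(beta, gamma, sigma, 0.0, L, eta)
```

With these values, `alpha_bar` and `alpha_same` are both $0.06$, while `alpha_psd_only` is $0.015$.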

One can move the strong convexity between the quadratic term and easily by adding a multiple of to one term and subtracting the same value from the other term. By doing so judiciously, we could always ensure that , so that Corollary 1 can be applied. This convexity shifting will not change the solution of (2), but it will change the value of . We keep both results Lemma 3 and Corollary 1, since convexity-shifting will increase the largest eigenvalue of and might degrade the overall iteration complexity that we consider in the next section.

Next we consider Algorithms 2 and 3.

###### Lemma 4

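If we have $Q_k(d^k) \le 0$ at each iteration of Algorithms 2 and 3, then the inner loops in these algorithms terminate finitely. Moreover, if the initial $H_k^0$ satisfies

$$m_0 I \preceq H_k^0 \preceq M_0 I$$

for some $M_0 \ge m_0 \ge 0$ (with $m_0 > 0$ for Algorithm 2), then the final $H_k$ satisfies

$$\|H_k\| \le \tilde{M}_1 := \max\left\{ M_0, \; M_0 L / (\beta \gamma m_0) \right\} \tag{18}$$

for Algorithm 2 and

$$\|H_k\| \le \tilde{M}_2 := M_0 + \max\left\{ 1, \; (L/\gamma - m_0)/\beta \right\} \tag{19}$$

for Algorithm 3.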

###### Proof

From the Lipschitz continuity of $\nabla f$, we have that

$$
\begin{aligned}
F(x) - F(x+d) + \gamma Q^x_H(d) &= f(x) - f(x+d) + \gamma \nabla f(x)^T d + \frac{\gamma}{2} d^T H d + (1-\gamma) \left( g(x) - g(x+d) \right) \\
&\ge (1-\gamma) \left( -\nabla f(x)^T d + g(x) - g(x+d) \right) + \frac{\gamma}{2} d^T H d - \frac{L}{2} \|d\|^2.
\end{aligned}
$$

Since $Q^x_H(d) \le 0$ and $H \succeq 0$, we have

$$-\nabla f(x)^T d + g(x) - g(x+d) \ge 0.$$

Therefore, as long as $\gamma\, d^T H d \ge L \|d\|^2$, which is implied by $H \succeq (L/\gamma) I$, the sufficient decrease condition (6) is satisfied, so the line search procedure terminates finitely. This observation suggests that in Algorithm 2 the smallest eigenvalue of the final $H$ is no larger than $L/(\beta\gamma)$ whenever any scaling takes place, and since the ratio between the largest and the smallest eigenvalues of $H$ remains unchanged after scaling the whole matrix, we obtain (18).

In Algorithm 3, to satisfy $H \succeq (L/\gamma) I$, the coefficient of $I$ must be at least $L/\gamma - m_0$. Considering the overshoot, and that the difference between the largest and the smallest eigenvalues is fixed after adding a multiple of the identity, we obtain the condition (19).

For Algorithms 2 and 3, the precision of the subproblem solution does not affect the line search procedure provided the solution is better than the initial point. That is, $\eta$ does not appear in Lemma 4. The precision only affects the final convergence rate. However, such is not the case for Algorithm 1, as we see from the appearance of $\eta$ in Lemma 3 and Corollary 1.
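The bounds (18) and (19) are easy to evaluate, and the worst case behind (18) can be simulated by tracking only the extreme eigenvalues of $H$. The sketch below assumes, as in the proof, that the inner loop of Algorithm 2 stops no later than the first time $\lambda_{\min}(H) \ge L/\gamma$. The function names and sample values are hypothetical.

```python
def m_tilde_1(M0, m0, L, beta, gamma):
    """Bound (18) on ||H_k|| for Algorithm 2 (requires m0 > 0)."""
    return max(M0, M0 * L / (beta * gamma * m0))


def m_tilde_2(M0, m0, L, beta, gamma):
    """Bound (19) on ||H_k|| for Algorithm 3."""
    return M0 + max(1.0, (L / gamma - m0) / beta)


def scale_until_sufficient(m0, M0, L, beta, gamma):
    """Worst-case multiplicative scaling: divide by beta until lambda_min >= L/gamma.

    Returns the final (lambda_min, lambda_max, number of scalings).
    """
    m, M, steps = m0, M0, 0
    while m < L / gamma:
        m, M, steps = m / beta, M / beta, steps + 1
    return m, M, steps


# Hypothetical values: eigenvalues of the initial H lie in [1, 5], L = 10.
m_final, M_final, n_scalings = scale_until_sufficient(1.0, 5.0, 10.0, 0.5, 0.5)
```

In this instance five scalings are needed, the largest eigenvalue ends at $160$, and the bound (18) evaluates to $200$.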

### 3.2 Iteration Complexity

We now establish the iteration complexity of our algorithms, considering three different assumptions on $F$: optimal set strong convexity, convexity, and the general (possibly nonconvex) case.

The following lemma is modified from some intermediate results in GhaS16a, where R-linear convergence of Algorithm 3 is shown for strongly convex objectives when the inexactness is measured by an additive criterion. A proof can be found in the appendix.

###### Lemma 5

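Let $F^*$ be the optimum of $F$. If $f$ is convex and $F$ is $\mu$-optimal-set-strongly convex as defined in (10) for some $\mu \ge 0$, then for any given $x$ and $H$, we have

$$Q^* \le \lambda \left( F^* - F(x) \right) - \frac{\mu \lambda (1-\lambda)}{2} \left\| x - P_\Omega(x) \right\|^2 + \frac{\|H\| \lambda^2}{2} \left\| x - P_\Omega(x) \right\|^2, \quad \forall \lambda \in [0,1], \tag{20}$$

where $Q^*$ is the optimal objective value of (2). In particular, by setting $\lambda = \mu / (\mu + \|H\|)$ (as in SchT16a), we have

$$Q^* \le \frac{\mu}{\mu + \|H\|} \left( F^* - F(x) \right). \tag{21}$$

Note that we allow $\mu = 0$ in Lemma 5.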
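The step from (20) to (21) relies on the two quadratic terms cancelling at $\lambda = \mu/(\mu + \|H\|)$. The following sketch checks that cancellation numerically for arbitrary sample values. The function names are hypothetical, `gap` stands for $F(x) - F^*$, and `dist_sq` stands for $\|x - P_\Omega(x)\|^2$.

```python
def bound_20(lam, gap, mu, h_norm, dist_sq):
    """Right-hand side of (20) as a function of lambda."""
    return (-lam * gap
            - 0.5 * mu * lam * (1.0 - lam) * dist_sq
            + 0.5 * h_norm * lam * lam * dist_sq)


def bound_21(gap, mu, h_norm):
    """Right-hand side of (21)."""
    return -mu / (mu + h_norm) * gap


def lambda_for_21(mu, h_norm):
    """The choice lambda = mu / (mu + ||H||) used to pass from (20) to (21)."""
    return mu / (mu + h_norm)
```

With $\mu = 2$ and $\|H\| = 6$, for example, $\lambda = 0.25$ and both quadratic coefficients equal $0.1875$, so the value of (20) does not depend on `dist_sq` at this $\lambda$.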

#### Linear Convergence for Optimal Set Strongly Convex Functions

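We start with the case that the problem is $\mu$-optimal-set-strongly convex with some $\mu > 0$, and show that our algorithms converge linearly.

###### Theorem 3.1

Assume that $f$ is convex, $F$ is $\mu$-optimal-set-strongly convex for some $\mu > 0$, at every iteration of Algorithm 1, the approximate solution of (2) is at least $(1-\eta)$-approximate for some $\eta \in [0,1)$, and $\|H_k\| \le M$ for some $M > 0$ and all $k$. Then we have

$$F(x^{k+1}) - F^* \le \left( 1 - \alpha_k \gamma (1-\eta) \frac{\mu}{\mu + M} \right) \left( F(x^k) - F^* \right), \quad k = 0, 1, 2, \dots. \tag{22}$$

We therefore have the following.

1. If $H_k \succeq \sigma I$ for some $\sigma > 0$ for all $k$, then the iteration complexity to obtain an $\epsilon$-accurate solution is

$$O\left( \max\left\{ \frac{1}{1-\eta}, \; \frac{L}{2 (1-\sqrt{\eta})\, \beta (1-\gamma)\, \sigma} \right\} \frac{\mu + M}{\mu \gamma} \log \frac{1}{\epsilon} \right). \tag{23}$$

2. If the choice of $H_k$ makes $Q_k$ at least $\sigma$-strongly convex for some $\sigma > 0$ and $H_k$ is positive semidefinite for all $k$, then the iteration complexity to obtain an $\epsilon$-accurate solution is

$$O\left( \max\left\{ \frac{1}{1-\eta}, \; \frac{L}{(1-\sqrt{\eta})^2\, \beta (1-\gamma)\, \sigma} \right\} \frac{\mu + M}{\mu \gamma} \log \frac{1}{\epsilon} \right). \tag{24}$$

###### Proof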

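Given any iterate $x^k$, the sufficient decrease condition (4) and positive semidefiniteness of $H_k$ imply that

$$F(x^{k+1}) - F(x^k) \le \alpha_k \gamma \Delta_k = \alpha_k \gamma \left( Q_k(d^k) - \frac{1}{2} (d^k)^T H_k d^k \right) \le \alpha_k \gamma Q_k(d^k). \tag{25}$$

By rearranging (25) and subtracting $F^*$ from both sides, we have

$$
\begin{aligned}
F(x^{k+1}) - F^* &\le F(x^k) - F^* + \alpha_k \gamma Q_k(d^k) \\
&\le F(x^k) - F^* + \alpha_k \gamma (1-\eta) Q_k^* &\quad (26) \\
&\le \left( 1 - \alpha_k \gamma (1-\eta) \frac{\mu}{\mu + M} \right) \left( F(x^k) - F^* \right), &\quad (27)
\end{aligned}
$$

where in (26) we used the inexactness condition (3) and in (27) we used (21). Using the results in Lemma 3 and Corollary 1, we obtain the lower bound for $\alpha_k$ in the two scenarios, and the results (23) and (24).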

We have a similar result for Algorithms 2 and 3.

###### Theorem 3.2

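Assume that $f$ is convex, $F$ is $\mu$-optimal-set-strongly convex for some $\mu > 0$, at every iteration of Algorithms 2 and 3, the approximate solution of (2) is at least $(1-\eta)$-approximate for some $\eta \in [0,1)$, and the conditions in Lemma 4 are satisfied for all $k$. Then we have

$$F(x^{k+1}) - F^* \le \left( 1 - \frac{\gamma \mu}{\mu + \|H_k\|} (1-\eta) \right) \left( F(x^k) - F^* \right), \quad k = 0, 1, 2, \dots, \tag{28}$$

and thus the iteration complexities for obtaining an $\epsilon$-accurate solution are

$$O\left( \frac{\mu + \tilde{M}_1}{\mu \gamma (1-\eta)} \log \frac{1}{\epsilon} \right) \quad \text{and} \quad O\left( \frac{\mu + \tilde{M}_2}{\mu \gamma (1-\eta)} \log \frac{1}{\epsilon} \right), \tag{29}$$

respectively, for Algorithms 2 and 3, where $\tilde{M}_1$ and $\tilde{M}_2$ are defined in Lemma 4.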

###### Proof

From (21) and (6), we have

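$$
\begin{aligned}
F(x^{k+1}) - F^* &= F(x^{k+1}) - F(x^k) + F(x^k) - F^* \\
&\le \gamma Q_k(d^k) + F(x^k) - F^* \\
&\le \gamma (1-\eta) Q_k^* + F(x^k) - F^* \\
&\le \left( 1 - \frac{\gamma \mu}{\mu + \|H_k\|} (1-\eta) \right) \left( F(x^k) - F^* \right),
\end{aligned}
$$

which is exactly (28). Now from Lemma 4, we ensure that $\|H_k\|$ is upper-bounded by some $\tilde{M}$ over iterations. This bound thus shows the desired Q-linear convergence, and thus the iteration complexity is

$$O\left( \frac{\mu + \tilde{M}}{\mu \gamma (1-\eta)} \log \frac{1}{\epsilon} \right).$$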

By further substituting from (18) and (19), we obtain (29).
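To see how the accuracy parameter enters the iteration count, the following sketch converts the contraction factor in (28) into the number of outer iterations needed to shrink the objective gap below a tolerance. It assumes a uniform bound `h_bound` on $\|H_k\|$ (for instance $\tilde{M}_1$ or $\tilde{M}_2$). The function names and sample values are hypothetical.

```python
import math


def linear_rate_28(mu, h_bound, gamma, eta):
    """Contraction factor 1 - gamma*(1 - eta)*mu/(mu + h_bound) from (28)."""
    return 1.0 - gamma * (1.0 - eta) * mu / (mu + h_bound)


def iterations_to_accuracy(rate, initial_gap, eps):
    """Smallest k with rate**k * initial_gap <= eps, for 0 < rate < 1."""
    if initial_gap <= eps:
        return 0
    k = math.ceil(math.log(eps / initial_gap) / math.log(rate))
    # Guard against floating-point rounding in the logarithms.
    while rate ** k * initial_gap > eps:
        k += 1
    return k


# Hypothetical constants: mu = 1, ||H_k|| <= 9, gamma = 0.5.
k_inexact = iterations_to_accuracy(linear_rate_28(1.0, 9.0, 0.5, 0.5), 1.0, 1e-3)
k_exact = iterations_to_accuracy(linear_rate_28(1.0, 9.0, 0.5, 0.0), 1.0, 1e-3)
```

In this example, moving from exact subproblem solutions ($\eta = 0$) to $\eta = 0.5$ roughly doubles the outer iteration count, in line with the $1/(1-\eta)$ factor in (29).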

The step size guarantees in Lemma 3 and Corollary 1 are just lower bounds. By selecting $H_k$ suitably, the line search will generally terminate with a value of $\alpha_k$ not far from $1$, yielding complexity much better than the worst case suggested by Theorem 3.1. A similar argument applies to (28). When $H_k$ is chosen properly, the convergence rates in Theorem 3.2 can be less dependent on the condition number. Newton and quasi-Newton approximations for $H_k$ often have these desirable properties.

Next, to compare (29), (23), and (24), we note that for any $\eta \in (0,1)$,

$$\frac{1}{2 (1-\sqrt{\eta})} < \frac{1}{1-\eta} \le \frac{1}{1-\sqrt{\eta}} < \frac{1}{(1-\sqrt{\eta})^2}.$$

Therefore, Algorithm 1 with positive definite $H_k$ has the best dependency on $\eta$, Algorithms 2 and 3 are next, and Algorithm 1 with strongly convex $Q_k$ and merely positive semidefinite $H_k$ has the worst bound in terms of dependency on $\eta$. On the other hand, if we move the strong convexity parameter from $g$ to $H_k$, the value of $M$ is only affected by an additive factor. Therefore, it is clear that using a positive definite $H_k$ in Algorithm 1 can improve convergence bounds significantly.

#### Sublinear Convergence for General Convex Problems

We now consider the general convex case. We assume that the level sets of $F$ are bounded, and define

$$R_0 := \sup_{x : F(x) \le F(x^0)} \left\| x - P_\Omega(x) \right\|.$$

Note that boundedness of the level set guarantees that $R_0$ is finite. Using this definition, when $\mu = 0$, (20) implies

$$Q^* \le \lambda \left( F^* - F(x) \right) + \frac{\|H\| \lambda^2}{2} R_0^2, \quad \text{for all } \lambda \in [0,1],$$

for any $x$ with $F(x) \le F(x^0)$. If $\|H_k\| \le M$ for some $M > 0$ and all $k$, then we have

$$Q^* \le \lambda \left( F^* - F(x) \right) + \frac{M \lambda^2}{2} R_0^2, \quad \text{for all } \lambda \in [0,1]. \tag{30}$$

The following lemma is inspired by (Bac15a, Lemma 4.4) but contains nontrivial modifications, and will be needed in proving the convergence rate for general convex problems. Its proof can be found in the appendix.

###### Lemma 6

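Assume we have two non-negative sequences $\{\delta_t\}$ and $\{\lambda_t\}$, and constants $c \in (0,1]$ and $A > 0$ such that

$$\forall t \ge 0, \quad \lambda_t \in [0,1], \quad \delta_{t+1} \le \delta_t + c \left( -\lambda_t \delta_t + \frac{A}{2} \lambda_t^2 \right). \tag{31}$$

If

$$\lambda_t = \arg\min_{\lambda \in [0,1]} \; -\lambda \delta_t + \frac{A}{2} \lambda^2, \tag{32}$$

then whenever $\delta_t \ge A$ we have

$$\delta_{t+1} \le \left( 1 - \frac{c}{2} \right) \delta_t \le \delta_t - \frac{cA}{2}, \tag{33}$$

and moreover

$$\delta_t \le \frac{2 (A + \delta_0)}{c\, t}, \quad \text{for all } t \ge 0. \tag{34}$$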
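The two regimes in Lemma 6 can be observed by iterating the recursion directly. The sketch below takes (31) with equality, which is the slowest decrease the lemma allows, and uses the closed form $\lambda_t = \min\{1, \delta_t / A\}$ for the minimizer in (32). The function names and the constants in the example are hypothetical.

```python
def best_lambda(delta, A):
    """Minimizer over [0, 1] of -lam*delta + (A/2)*lam**2, as in (32)."""
    return min(1.0, delta / A)


def simulate_recursion(delta0, c, A, steps):
    """Iterate (31) with equality and return [delta_0, ..., delta_steps]."""
    deltas = [delta0]
    for _ in range(steps):
        d = deltas[-1]
        lam = best_lambda(d, A)
        deltas.append(d + c * (-lam * d + 0.5 * A * lam * lam))
    return deltas


def sublinear_bound(delta0, c, A, t):
    """Right-hand side of (34) for t >= 1."""
    return 2.0 * (A + delta0) / (c * t)


# Hypothetical constants: start well above A so the linear phase is visible.
deltas = simulate_recursion(delta0=10.0, c=0.5, A=1.0, steps=200)
```

With these constants the sequence starts $10, 5.25, 2.875, \dots$, contracting by at least the factor $1 - c/2$ while $\delta_t \ge A$, and then slows to the $O(1/t)$ regime once $\delta_t$ drops below $A$.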

By Lemma 6 together with (30), we can show that the algorithms converge at a global sublinear rate (with a linear rate in the early stages) for convex problems. We start with Algorithm 1.

###### Theorem 3.3

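Assume that $f$ is convex, at every iteration of Algorithm 1, $H_k$ is chosen such that

$$M I \succeq H_k \succeq \sigma I, \quad \text{for some } M \ge \sigma > 0, \tag{35}$$

and the inexact solution of (2) is at least $(1-\eta)$-approximate for some $\eta \in [0,1)$. Let $\bar{\alpha}$ be defined as in Corollary 1. Then the following claims are true.

1. When $F(x^k) - F^* \ge M R_0^2$, the convergence rate is Q-linear, that is,

$$
\begin{aligned}
F(x^{k+1}) - F^* &\le \left( 1 - \frac{(1-\eta) \gamma \bar{\alpha}}{2} \right) \left( F(x^k) - F^* \right) \\
&\le \max\left\{ 1 - \frac{(1-\eta) \gamma}{2}, \; 1 - \frac{(1-\sqrt{\eta})\, \beta (1-\gamma)\, \sigma \gamma}{L} \right\} \left( F(x^k) - F^* \right).
\end{aligned} \tag{36}
$$

2. For any $k \ge 1$, the objective follows a sublinear convergence rate:

$$
\begin{aligned}
\frac{F(x^k) - F^*}{2} &\le \frac{M R_0^2 + F(x^0) - F^*}{\gamma k (1-\eta) \bar{\alpha}} \\
&\le \frac{M R_0^2 + F(x^0) - F^*}{\gamma k} \max\left\{ \frac{1}{1-\eta}, \; \frac{L}{2 \beta (1-\sqrt{\eta}) (1-\gamma)\, \sigma} \right\}.
\end{aligned}
$$

###### Proof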

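From Corollary 1, we know that the step size $\alpha_k$ is lower-bounded by the $\bar{\alpha}$ defined by (17), for all $k$. From (3), (30), and (25) we obtain that

$$F(x^{k+1}) - F(x^k) \le \bar{\alpha} \gamma (1-\eta) \left( \lambda \left( F^* - F(x^k) \right) + \frac{M R_0^2}{2} \lambda^2 \right), \quad \forall \lambda \in [0,1], \tag{37}$$

for all $k$. Defining $\delta_k := F(x^k) - F^*$, we note that (37) satisfies (31) with

$$c = \bar{\alpha} \gamma (1-\eta), \qquad A = M R_0^2.$$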

The results now follow directly from Lemma 6 and Corollary 1.

We have a similar result for Algorithms 2 and 3.

###### Theorem 3.4

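Assume that $f$ is convex, at every iteration of Algorithms 2 and 3, the initial $H_k^0$ satisfies the conditions in Lemma 4, and the approximate solution of (2) is at least $(1-\eta)$-approximate for some $\eta \in [0,1)$. Then the following claims hold.

1. When $F(x^k) - F^* \ge \tilde{M}_1 R_0^2$ for Algorithm 2 and when $F(x^k) - F^* \ge \tilde{M}_2 R_0^2$ for Algorithm 3 (where $\tilde{M}_1$ and $\tilde{M}_2$ are defined in Lemma 4), the convergence rate is Q-linear:

$$F(x^{k+1}) - F^* \le \left( 1 - \frac{(1-\eta) \gamma}{2} \right) \left( F(x^k) - F^* \right). \tag{38}$$

2. For any $k \ge 1$, the objective follows a sublinear convergence rate

$$\frac{F(x^k) - F^*}{2} \le \frac{\tilde{M}_1 R_0^2 + F(x^0) - F^*}{(1-\eta) \gamma k}, \tag{39}$$

for Algorithm 2. The same bound holds for Algorithm 3, with $\tilde{M}_1$ replaced by $\tilde{M}_2$.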

###### Proof

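From Lemma 4, we know that the final $\|H_k\|$ for all $k$ is upper-bounded by $\tilde{M}_1$ (for Algorithm 2) and $\tilde{M}_2$ (for Algorithm 3); we write $\tilde{M}$ for the relevant bound. From (3), (30), and the sufficient decrease condition (6), we get that for any $k$,

$$
\begin{aligned}
F(x^{k+1}) - F(x^k) &\le \gamma (1-\eta) Q_k^* \\
&\le \gamma (1-\eta) \left( \lambda \left( F^* - F(x^k) \right) + \frac{\tilde{M} R_0^2}{2} \lambda^2 \right), \quad \forall \lambda \in [0,1].
\end{aligned} \tag{40}
$$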

Defining