Inexact Successive Quadratic Approximation for Regularized Optimization
Successive quadratic approximations, or second-order proximal methods, are useful for minimizing functions that are a sum of a smooth part and a convex, possibly nonsmooth part that promotes a regularized solution. Most analyses of iteration complexity focus on the special case of the proximal gradient method, or accelerated variants thereof. There have been only a few studies of methods that use a second-order approximation to the smooth part, due in part to the difficulty of obtaining closed-form solutions to the subproblems at each iteration. In practice, iterative algorithms need to be used to find inexact solutions to the subproblems. In this work, we present a global analysis of the iteration complexity of inexact successive quadratic approximation methods, showing that it is sufficient to obtain an inexact solution of the subproblem to fixed multiplicative precision in order to guarantee the same order of convergence rate as the exact version, with complexity related proportionally to the degree of inexactness. Our result allows flexible choices of the second-order terms, including Newton and quasi-Newton choices, and does not necessarily require more time to be spent on the subproblem solves on later iterations. For problems exhibiting a property related to strong convexity, the algorithm converges at a global linear rate. For general convex problems, the convergence rate is linear in early stages, while the overall rate is $O(1/k)$. For nonconvex problems, a first-order optimality criterion converges to zero at a rate of $O(1/\sqrt{k})$.
Keywords: Convex optimization, Nonconvex optimization, Regularized optimization, Composite optimization, Variable metric, Proximal method, Second-order approximation, Inexact method
We consider the following regularized optimization problem:
$$\min_{x} \; F(x) := f(x) + \psi(x), \tag{1}$$
where $f$ is $L$-Lipschitz-continuously differentiable, and $\psi$ is convex, extended-valued, proper, and closed, but might be nondifferentiable. Moreover, we assume that $F$ is lower-bounded and the solution set of (1) is non-empty. Unlike many other works on this topic, we focus on the case in which $\psi$ does not necessarily have a simple structure, such as (block) separability, which allows a prox-operator to be calculated in closed form (or at least economically). Rather, we assume that subproblems involving $\psi$ are solved inexactly, by an iterative process.
Problems of the form (1) arise in many contexts. The function $\psi$ could be an indicator function for a trust region or a convex feasible set. It could be a multiple of an $\ell_1$ norm or a sum-of-$\ell_2$ norms. It could be the nuclear norm for a matrix variable, or the sum of absolute values of the elements of a matrix. It could be a smooth convex function, such as $\|\cdot\|_2^2$ or the squared Frobenius norm of a matrix. Finally, it could be a combination of several of these elements, as happens when different types of structure are present in the solution. In several of these situations, the prox-operator involving $\psi$ is expensive to calculate exactly.
We propose algorithms that generate a sequence $\{x^k\}$ from some starting point $x^0$, and solve the following subproblem at each iteration, for some symmetric matrix $H_k$:
$$\arg\min_{d} \; Q_{H_k}^{x^k}(d) := \nabla f(x^k)^\top d + \tfrac{1}{2} d^\top H_k d + \psi(x^k + d) - \psi(x^k). \tag{2}$$
We abbreviate the objective in (2) as $Q_k(\cdot)$, or as $Q(\cdot)$ when we focus on the inner workings of iteration $k$. In some results, we allow $H_k$ to have zero or negative eigenvalues, provided that $Q_k$ itself is convex. (In some cases, strong convexity in $\psi$ may overcome lack of strong convexity in the quadratic part of (2).)
In the special case of the proximal-gradient algorithm (ComW05a; WriNF08a), where $H_k$ is a positive multiple of the identity, the subproblem (2) can often be solved cheaply, particularly when $\psi$ is (block) separable, by means of a prox-operator involving $\psi$. For more general choices of $H_k$, or for more complicated regularization functions $\psi$, it may make sense to solve (2) by an iterative process, such as accelerated proximal gradient or coordinate descent. Since it may be too expensive to run this iterative process to obtain a high-accuracy solution of (2), we consider the possibility of an inexact solution. In this paper, we assume that the inexact solution $d$ satisfies the following condition, for some constant $\eta \in [0,1)$:
$$Q(d) - Q^* \le \eta \left( Q(0) - Q^* \right), \tag{3}$$
where $Q^* := \inf_d Q(d)$. The choice $\eta = 0$ corresponds to exact solution of (2). Other choices of $\eta \in (0,1)$ ensure inexact solutions to within a multiplicative constant.
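To illustrate the cheap special case mentioned above, here is a minimal sketch under stated assumptions: the subproblem objective is taken to be Q(d) = g·d + (c/2)‖d‖² + ψ(x+d) − ψ(x), where g stands for the gradient of the smooth part at x, the matrix is c times the identity, and ψ = lam·‖·‖₁. The choice of regularizer and the names `prox_l1`, `prox_gradient_direction`, `subproblem_value`, and `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np

def prox_l1(v, t):
    """Prox-operator of t * ||.||_1, i.e. componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def subproblem_value(d, x, g, c, lam):
    """Assumed subproblem objective with the matrix equal to c * I and psi = lam * ||.||_1."""
    return g @ d + 0.5 * c * (d @ d) + lam * (np.abs(x + d).sum() - np.abs(x).sum())

def prox_gradient_direction(x, g, c, lam):
    """Closed-form minimizer of subproblem_value over d.

    Substituting y = x + d turns the subproblem into
    min_y (c/2) * ||y - (x - g/c)||^2 + lam * ||y||_1 (up to a constant),
    whose solution is a single soft-thresholding step.
    """
    return prox_l1(x - g / c, lam / c) - x
```

Because the minimizer is exact here, this direction corresponds to the exact end of the inexactness range; the cost is one vector operation per iteration, which is why this special case is so widely used.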
The condition (3) is studied in (BonLPP16a, Section 4.1), which applies a primal-dual approach to (2) to satisfy it. In this vein, if we have access to a lower bound $Q_{\mathrm{LB}} \le Q^*$ (obtained by finding a feasible point for the dual of (2), or other means), then any inexact solution $d$ satisfying $Q(d) - Q_{\mathrm{LB}} \le \eta \left( Q(0) - Q_{\mathrm{LB}} \right)$ also satisfies (3).
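The lower-bound test described above can be sketched as follows. This is an illustrative check, not the authors' implementation: it assumes the multiplicative criterion has the form Q(d) − Q* ≤ η (Q(0) − Q*) with η in [0, 1), and that a valid lower bound on Q* is supplied by the caller. The function and argument names are hypothetical.

```python
def passes_lower_bound_test(q_d, q_lower, eta, q_zero=0.0):
    """Sufficient test for the multiplicative inexactness criterion.

    q_d     : subproblem objective at the candidate inexact solution
    q_lower : any lower bound on the optimal subproblem value (assumed valid)
    eta     : inexactness parameter in [0, 1)
    q_zero  : subproblem objective at the zero step (0 for the usual definition)

    If q_lower <= Q*, then q_d - q_lower <= eta * (q_zero - q_lower) implies
    q_d <= (1 - eta) * q_lower + eta * q_zero <= (1 - eta) * Q* + eta * q_zero,
    which is the criterion written with the true optimal value Q*.
    """
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    if q_lower > q_zero:
        raise ValueError("q_lower cannot exceed the objective at the zero step")
    return q_d - q_lower <= eta * (q_zero - q_lower)
```

The test is only sufficient: a loose lower bound can reject a step that does satisfy the criterion, so a tighter dual bound means fewer wasted inner iterations.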
In practical situations, we can often be sure that our approximate solution satisfies (3) for some $\eta \in [0,1)$ even though we might not know the value of $\eta$. For instance, if $Q$ is strongly convex and we apply an iterative solver that converges at a global linear rate $(1-\tau)$, then the “inner” iteration sequence $\{d^{(t)}\}$ (starting with $d^{(0)} = 0$) for solving (2) satisfies
$$Q(d^{(t)}) - Q^* \le (1-\tau)^t \left( Q(0) - Q^* \right), \quad t = 0, 1, 2, \dots.$$
If we fix the number of inner iterations at $T$ (say), then $d^{(T)}$ satisfies (3) with $\eta = (1-\tau)^T$. On the other hand, if we wish to attain a certain target accuracy $\eta$ and have an estimate of the convergence rate $\tau$, we can choose the number of iterations $T$ large enough that $(1-\tau)^T \le \eta$. Note that $\tau$ depends on the extreme eigenvalues of $H_k$ in many algorithms; we can therefore choose $H_k$ to ensure that $\tau$ is restricted to a certain range for all $k$.
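The relation between a fixed inner-iteration budget and the resulting inexactness level can be made concrete with a short sketch. It assumes the inner solver contracts the subproblem optimality gap by a known factor `rho` in (0, 1) per iteration; the symbol `rho` and both function names are hypothetical, introduced only for this illustration.

```python
import math

def achieved_eta(num_inner, rho):
    """Inexactness level guaranteed after num_inner iterations of a solver
    whose optimality gap shrinks by the (assumed) factor rho each iteration."""
    return rho ** num_inner

def inner_iterations_for_target(eta, rho):
    """Smallest positive integer T such that rho**T <= eta."""
    if not (0.0 < eta < 1.0 and 0.0 < rho < 1.0):
        raise ValueError("eta and rho must both lie in (0, 1)")
    T = max(1, math.ceil(math.log(eta) / math.log(rho)))
    # Guard against floating-point rounding in the logarithm ratio.
    while rho ** T > eta:
        T += 1
    while T > 1 and rho ** (T - 1) <= eta:
        T -= 1
    return T
```

For example, under these assumptions a solver with `rho = 0.5` needs four inner iterations to guarantee `eta = 0.1`, and this count does not change as the outer iteration index grows.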
Empirically, we observe that Q-linear methods for solving (2) often have rapid convergence in their early stages, with slower convergence later. (We find theoretical support for this observation in Theorems 3.3 and 3.4.) This observation suggests that a moderate value of $\eta$ may be preferable to a smaller value, because moderate accuracy is attainable in disproportionately fewer iterations than high accuracy.
In this paper, we describe algorithms based on inexact solutions of the subproblem (2) with accuracy (3) for a fixed choice of $\eta$. We examine in particular the number of outer iterations (measured by the index $k$) required to solve (1) to a given accuracy $\epsilon$. We show that the effect of inexact solutions on the iteration complexity is benign, that is, the number of iterations increases by a modest factor (which depends of course on $\eta$) over approaches that require exact solution of (2) for each $k$.
1.1 The Algorithms
To build complete algorithms around the subproblem (2), we either do a backtracking line search along the inexact solution $d$, or adjust $H_k$ and recompute $d$, seeking in both cases to satisfy a familiar “sufficient decrease” criterion.
We present three such algorithms. The first uses a backtracking line search approach with a modified Armijo rule, an approach presented in TseY09a. Given the current point $x$, the update direction $d$, and parameters $\beta, \gamma \in (0,1)$, this procedure finds the smallest nonnegative integer $i$ such that the step size $\alpha = \beta^i$ satisfies
$$F(x + \alpha d) \le F(x) + \alpha \gamma \Delta, \tag{4}$$
where
$$\Delta := \nabla f(x)^\top d + \psi(x + d) - \psi(x). \tag{5}$$
This version appears as Algorithm 1. The exact version of this algorithm can be considered as a special case of the block-coordinate descent algorithm of TseY09a.
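A backtracking search of the kind used in the first algorithm can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: it assumes the modified Armijo test has the form F(x + αd) ≤ F(x) + αγΔ with Δ = ∇f(x)·d + ψ(x+d) − ψ(x), and step sizes α = βⁱ for i = 0, 1, 2, .... The function name, default parameter values, and the cap `max_backtracks` are hypothetical.

```python
import numpy as np

def armijo_backtracking(F, x, d, delta, beta=0.5, gamma=0.1, max_backtracks=60):
    """Return the largest alpha = beta**i (i = 0, 1, ...) satisfying
    F(x + alpha * d) <= F(x) + alpha * gamma * delta.

    F     : callable returning the full objective (smooth part plus regularizer)
    delta : assumed to equal grad_f(x) @ d + psi(x + d) - psi(x); must be negative
    """
    if delta >= 0:
        raise ValueError("delta must be negative for d to be a descent direction")
    F_x = F(x)
    alpha = 1.0
    for _ in range(max_backtracks):
        if F(x + alpha * d) <= F_x + alpha * gamma * delta:
            return alpha
        alpha *= beta  # shrink the step and try again
    raise RuntimeError("no acceptable step size found within max_backtracks")
```

Note that the subproblem is solved only once per outer iteration in this scheme; each backtracking step costs one evaluation of the objective.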
The second and third algorithms use the following sufficient decrease criterion:
$$F(x + d) \le F(x) + \gamma\, Q_H^{x}(d), \tag{6}$$
for given parameter $\gamma$. If this criterion is not satisfied, the second algorithm multiplies $H$ by a constant $\beta^{-1}$ (where again $\beta \in (0,1)$ is a given parameter) and recomputes $d$. We assume in this algorithm that the initial choice of $H$ is positive definite, so that all eigenvalues are positive and grow successively by a factor of $\beta^{-1}$ until sufficient decrease is achieved. The third algorithm uses a similar strategy, except that $H$ is modified by adding a successively larger multiple of the identity to it whenever the criterion (6) fails to hold. (This algorithm requires only positive semidefiniteness of the initial estimate of $H$.) These two approaches are defined as Algorithms 2 and 3, respectively.
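One outer iteration of the matrix-scaling strategy can be sketched as below. This is an illustrative reading rather than the paper's Algorithm 2: it assumes the sufficient decrease test is F(x + d) ≤ F(x) + γ Q(d) with Q(d) = g·d + ½ d·Hd + ψ(x+d) − ψ(x), takes ψ = lam·‖·‖₁ for concreteness, and uses a fixed number of proximal-gradient steps as the inexact inner solver. All function names and default values are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def q_value(d, x, g, H, lam):
    """Assumed subproblem objective with psi = lam * ||.||_1."""
    return g @ d + 0.5 * d @ (H @ d) + lam * (np.abs(x + d).sum() - np.abs(x).sum())

def inexact_subproblem_solve(x, g, H, lam, n_inner=10):
    """A fixed number of proximal-gradient steps on the subproblem, started at d = 0.
    With step 1/||H||_2 and positive definite H the iterates never increase Q,
    so the returned d is at least as good as the zero step."""
    step = 1.0 / np.linalg.norm(H, 2)
    d = np.zeros_like(x)
    for _ in range(n_inner):
        d = soft_threshold(x + d - step * (g + H @ d), step * lam) - x
    return d

def outer_step_with_scaling(x, f, grad_f, H, lam, gamma=0.5, beta=0.5, max_tries=60):
    """Scale H by 1/beta and re-solve until the sufficient decrease test holds."""
    F = lambda z: f(z) + lam * np.abs(z).sum()
    g, F_x = grad_f(x), F(x)
    for _ in range(max_tries):
        d = inexact_subproblem_solve(x, g, H, lam)
        if F(x + d) <= F_x + gamma * q_value(d, x, g, H, lam):
            return x + d, H
        H = H / beta  # enlarge every eigenvalue of H by the factor 1/beta
    raise RuntimeError("sufficient decrease not reached within max_tries")
```

Replacing the update `H = H / beta` by the addition of a growing multiple of the identity to the initial matrix would give a sketch of the third strategy. In both cases each rejected trial costs a fresh subproblem solve, which is the trade-off against the line-search variant.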
Algorithm 3 is similar to the method proposed in SchT16a and GhaS16a, and can be seen as interpolating between the step from the original $H$ and the proximal gradient step. Rather than our multiplicative criterion (3), the works SchT16a and GhaS16a use an additive criterion to measure inexactness of the solution. This tolerance must then be reduced to zero at a certain rate as the algorithm progresses, so that in the analysis of SchT16a and GhaS16a the number of inner iterations per outer iteration grows as the algorithms progress. By contrast, we attain satisfactory performance (both in theory and practice) for a fixed value of $\eta$ in (3).
Algorithms 1 and 2 are direct extensions of backtracking line search in the smooth case, while Algorithm 3 is related to the trust-region approach. Which of these three algorithms is “best”? The answer depends on the circumstances. When (2) is expensive to solve, Algorithm 1 may make sense, as it requires such a solution just once on each outer iteration.
Variants and special cases of the algorithms above have been discussed extensively in the literature. Proximal gradient algorithms have $H_k = \xi I$ for some $\xi > 0$ (ComW05a; WriNF08a); proximal-Newton uses $H_k = \nabla^2 f(x^k)$ (LeeSS14a; RodK16a; LiAV17a); proximal-quasi-Newton and variable metric methods use quasi-Newton approximations for $H_k$ (SchT16a; GhaS16a; ChoPR14a; BonLPP16a; BonLPPR17a). The term “successive quadratic approximation” is also used by ByrNO16a. Our methods can even be viewed as a special case of block-coordinate descent (TseY09a) with a single block. The key difference in this work is the use of the inexactness criterion (3), while existing works either assume exact solution of (2), or use a different criterion that requires increasing accuracy as the number of outer iterations grows. Some of these works provide only an asymptotic convergence guarantee and a local convergence rate, with a lack of clarity about when the fast local convergence rate will take effect. An exception is BonLPP16a, in which the criterion (3) is called $\eta$-approximation (where their $\eta$ is equivalent to our $1-\eta$). However, this paper gives convergence rate analysis only for convex $f$ and requires existence of a scalar and a sequence such that
This condition may preclude such useful and practical choices of $H_k$ as the Hessian and quasi-Newton approximations. We believe that our setting may be more general, practical, and straightforward in some situations.
This paper shows that, when the initial value of $H_k$ at all outer iterations $k$ is chosen appropriately, and (3) is satisfied for all iterations, then the objectives of the three algorithms converge at a global Q-linear rate under an “optimal set strong convexity” condition defined in (10), and at a sublinear rate for general convex functions. When $f$ is nonconvex, we show sublinear convergence of the first-order optimality condition. Moreover, to discuss the relation between the subproblem solution precision and the convergence rate, we show that the iteration complexity is proportional to for Algorithms 2 and 3, and proportional to or for
In comparison to existing works, our major contributions are as follows.
We provide a global convergence rate result to a first-order optimality condition for the case of nonconvex $f$ in (1) for general choices of $H_k$, without additional assumptions other than the Lipschitz continuity of $\nabla f$.
The global R-linear convergence result for a similar algorithm in GhaS16a, which applies when $F$ is strongly convex, is improved to a global Q-linear convergence result for a broader class of problems.
For general convex problems, in addition to the known sublinear ($O(1/k)$) convergence rate, we show linear convergence with a rate independent of the conditioning of the problem in the early stages of the algorithm.
Faster linear convergence in the early iterations also applies to problems with global Q-linear convergence, explaining the well-known observation that many methods converge rapidly in their early stages before settling down to a slower rate later.
The proximal-gradient method is a special case of our approach, and it reduces to steepest descent on smooth functions when $\psi$ is not present. We show that our analysis matches known convergence results in these settings. These results are improved, in that “early linear convergence” for the special cases has not been discussed (to our knowledge), and the convergence rate we obtain for the nonconvex case is sharper than existing results for proximal-gradient.
1.3 Related Work
Our general framework and approach, and special cases thereof, have been widely studied in the literature. We discuss some of these works and their connections to our paper.
When $\psi$ is the indicator function of a convex constraint set, our approach includes an inexact variant of a constrained Newton or quasi-Newton method. There are a number of papers on this approach, but their convergence results generally have a different flavor from ours. They typically show only asymptotic convergence rates, together with global convergence results without rates, under weaker smoothness and convexity assumptions on $f$ than we make here. For example, when $\psi$ is the indicator function of a “box” defined by bound constraints, ConGT88a applies a trust-region framework to solve (2) approximately, and shows asymptotic convergence. The paper ByrLNZ95a uses a line-search approach, with $H_k$ defined by an L-BFGS update, and omits convergence results. For constraint sets defined by linear inequalities, or general convex constraints, BurMT90a shows global convergence of a trust-region method using the Cauchy point. A similar approach using the exact Hessian as $H_k$ is considered in LinM99a, proving local superlinear or quadratic convergence in the case of linear constraints.
Turning to our formulation (1) in its full generality, Algorithm 1 is analyzed in BonLPP16a, who refer to the condition (3) as “$\eta$-approximation.” (Their $\eta$ is equivalent to $1-\eta$ in our notation.) This paper shows asymptotic convergence of to zero without requiring convexity of , Lipschitz continuity of , or a fixed value of $\eta$. The only assumptions are that for all and the objective converges to a point (which always happens when is bounded below). Under the additional assumptions that $\nabla f$ is Lipschitz continuous, $f$ is convex, (7), and (3), they showed an $O(1/k)$ convergence rate for the objective value. The same authors considered convergence for nonconvex functions satisfying a Kurdyka-Łojasiewicz condition in BonLPPR17a, but the exact rates are not given. Our results differ in not requiring the assumption (7), and we are more explicit about the dependence of the rates on $\eta$. Moreover, we show detailed convergence rates for several different classes of problems.
A version of Algorithm 2 without line search but directly assuming
is considered in ChoPR14a (). They showed asymptotic convergence, but no rates were given.
Convergence of an inexact proximal-gradient method (for which $H_k = L I$ for all $k$) is discussed in SchRB11a. They also discuss its accelerated version for convex and strongly convex problems. Because of this choice of $H_k$, (8) always holds. Instead of our multiplicative inexactness criterion, they assume an additive error of the form
Their analysis allows for error in the gradient term in (2). They show that for general convex problems, the objective value converges at an $O(1/k)$ rate under the assumption that and converge. For strongly convex problems, they proved R-linear convergence of , provided that both and decrease linearly to zero. Our analysis, when specialized to proximal gradient and the strongly convex case, shows a Q-linear rate (rather than R-linear) and applies to the convergence of the objective value rather than the iterate.
Algorithm 3 is proposed in SchT16a and GhaS16a for convex and strongly convex objectives, with inexactness defined additively as in (9). For convex $F$, SchT16a showed that if and converge then an $O(1/k)$ convergence rate is achievable. The same rate can be achieved if for any . When $F$ is $\mu$-strongly convex, GhaS16a showed that if is finite (where , is the upper bound for , and $\gamma$ is as defined in (6)), then a global R-linear convergence rate is attained. In both cases, the conditions require that the tolerance decreases at least at a certain speed and, according to their analysis, this tolerance can be achieved by performing more and more inner iterations as $k$ increases. As we have noted, our multiplicative criterion can be attained with a fixed number of inner iterations. Moreover, we attain a Q-linear rather than an R-linear result.
Algorithm 1 is also considered in LeeSS14a, with $H$ set either to $\nabla^2 f(x^k)$ or a BFGS approximation. Global convergence and a local convergence rate are shown for the exact case. For inexact subproblem solutions, local results are proved under the assumption that the unit step size is always taken (which may not hold true for inexact steps without further precautions being taken). A variant of Algorithm 1 with a different step size criterion is discussed in ByrNO16a, for the special case of $\psi(\cdot) = \|\cdot\|_1$. Inexactness of the subproblem solution is measured by the norm of a proximal-gradient step for $Q$. By utilizing specific properties of the $\ell_1$ norm, they showed a global convergence rate on the norm of the proximal gradient step on $F$ to zero, without requiring convexity of $f$. Thus, their result is similar to ours for the nonconvex case. However, their result cannot be extended directly to the case of general $\psi$, and our inexactness condition does not require the additional cost of computing the proximal gradient step on $Q$. When $H$ is the Hessian or the BFGS approximation, they obtain for the inexact version local convergence results similar to the exact case proved in LeeSS14a.
For the case in which is convex, thrice continuously differentiable, and self-concordant, and is the indicator function of a closed convex set, TraKC14a () analyzed global and local convergence rates of inexact damped proximal Newton with a fixed step size. LiAV17a () further extends the convergence analysis to general convex . However, it does not seem possible to generalize these results to general and non-self-concordant .
The remainder of this paper is organized as follows. In Section 2 we introduce notation and preliminaries for further analysis. Convergence analysis appears in Section 3 for Algorithms 1, 2, and 3, covering both convex and nonconvex problems. Some interesting and practical choices of $H_k$ are discussed in Section 4. We provide preliminary numerical results in Section 5. Some final comments appear in Section 6.
2 Notations and Preliminaries
The norm $\|\cdot\|$, when applied to vectors, denotes the Euclidean norm, whereas when applied to a symmetric matrix $A$, it denotes the operator norm, which is equivalent to the spectral radius of $A$. For any symmetric matrix $A$, $\lambda_{\min}(A)$ denotes its smallest eigenvalue. For any two symmetric matrices $A$ and $B$, $A \succeq B$ (respectively $A \succ B$) denotes that $A - B$ is positive semidefinite (respectively positive definite). For any non-smooth function $F$, $\partial F$ denotes the set of its directional derivatives, and when $F$ is convex, this coincides with the subdifferential. When the minimum $F^*$ of $F$ is attainable, we denote the solution set by
$$\Omega := \{ x \mid F(x) = F^* \},$$
and define $P_\Omega(x)$ as the (Euclidean-norm) projection of $x$ onto $\Omega$.
In some results, we use a particular strong convexity assumption to obtain a faster rate. We say that $F$ satisfies the optimal set strong convexity condition if there exists $\mu > 0$ such that for any $x \in \mathrm{dom}(F)$ and any $\lambda \in [0,1]$, we have
$$F\left( \lambda x + (1-\lambda) P_\Omega(x) \right) \le \lambda F(x) + (1-\lambda) F^* - \frac{\mu \lambda (1-\lambda)}{2} \left\| x - P_\Omega(x) \right\|^2. \tag{10}$$
This is a weaker condition than $\mu$-strong convexity. We do not require the strong convexity to hold globally, but only between the current point and its projection onto the solution set. Some examples of functions that are not strongly convex but satisfy (10) include:
where is strongly convex, and is any matrix;
, where is a polyhedron;
Squared-hinge loss: .
is a descent direction for at if .
From the convexity of , the Lipschitz continuity of , and the mean value theorem that for all , we have
Therefore, if , when is small enough, will be larger than zero, indicating is a descent direction.
The following lemma motivates our algorithms.
If is convex, then any that makes is a descent direction for at .
Note that . Therefore, if is convex, we have
for all . It follows that for all sufficiently small . Therefore, from Lemma 1, is a descent direction, and since and only differ in their lengths, so is .
To ensure the convexity of $Q$, we need only positive semidefiniteness of $H$. However, Lemma 2 can be applied even when $H$ has negative eigenvalues, as $\psi$ may have a strong convexity property that ensures convexity of $Q$. Lemma 2 then suggests that no matter how coarse the approximate solution of (2) is, as long as it is better than $d = 0$ for a convex $Q$, it results in a descent direction. This fact implies finite termination of the backtracking line search on the step size in Algorithm 1.
3 Convergence Analysis
We start our analysis for all the three algorithms by showing finite termination of the line search procedures. We then discuss separately three classes of problems involving different assumptions on , namely, that satisfies optimal set strong convexity (10), that is convex, and that is nonconvex. Different iteration complexities are proved in each case.
3.1 Line Search Iteration Bound
We show that the line search procedures have finite termination. The following lemma for the backtracking line search in Algorithm 1 does not require to be positive definite, though it does require strong convexity in defined in (2).
From (3) and strong convexity of , we have that for any ,
Since , we obtain by substituting from the definition of that
Since , we have
This bound holds for any , so we make the following specific choice of in this range:
For this value of , we have
Therefore, (4) is satisfied if
We thus get that (4) holds whenever
This leads to (13), when we introduce a factor to account for possible undershoot of the backtracking procedure.
Lemma 3 suggests that we can still obtain a certain amount of objective decrease as long as is not too negative in comparison to the strong convexity parameter of . However, our main interest is in the case in which is positive semidefinite. When the strong convexity of is completely due to the Hessian of the quadratic part, (that is, ), we obtain the following simpler version of Lemma 3.
One can move the strong convexity between the quadratic term and easily by adding a multiple of to one term and subtracting the same value from the other term. By doing so judiciously, we could always ensure that , so that Corollary 1 can be applied. This convexity shifting will not change the solution of (2), but it will change the value of . We keep both results Lemma 3 and Corollary 1, since convexity-shifting will increase the largest eigenvalue of and might degrade the overall iteration complexity that we consider in the next section.
From the Lipschitz continuity of , we have that
Since and , we have
Therefore, as long as , which is implied by , the sufficient decrease condition (6) is satisfied, so the line search procedure terminates finitely. This observation suggests that in Algorithm 2 the smallest eigenvalue of the final is no larger than , and since the proportion between the largest and the smallest eigenvalues of remains unchanged after scaling the whole matrix, we obtain (18).
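The argument above can be written out as a short worked derivation. This is a sketch under stated assumptions rather than a restatement of the paper's lemma: it takes the subproblem objective to be $Q(d) = \nabla f(x)^\top d + \tfrac12 d^\top H d + \psi(x+d) - \psi(x)$, the sufficient decrease test to be $F(x+d) \le F(x) + \gamma Q(d)$ with $\gamma \in (0,1]$, and the inexact solution to satisfy $Q(d) \le Q(0) = 0$.

```latex
% Assumptions (for illustration): \nabla f is L-Lipschitz, Q(d) \le 0, \gamma \in (0,1].
\begin{align*}
F(x+d) &\le F(x) + \nabla f(x)^\top d + \tfrac{L}{2}\|d\|^2 + \psi(x+d) - \psi(x) \\
       &= F(x) + Q(d) - \tfrac{1}{2} d^\top H d + \tfrac{L}{2}\|d\|^2 \\
       &\le F(x) + Q(d) - \tfrac{1}{2}\bigl(\lambda_{\min}(H) - L\bigr)\|d\|^2 .
\end{align*}
% Since Q(d) \le 0 and \gamma \in (0,1], we have Q(d) \le \gamma Q(d).
% Hence \lambda_{\min}(H) \ge L yields F(x+d) \le F(x) + \gamma Q(d).
```

Under these assumptions the inexactness parameter plays no role in the acceptance test, which matches the remark that the solution precision affects only the final convergence rate for the second and third algorithms.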
For Algorithms 2 and 3, the precision of the subproblem solution does not affect the line search procedure provided the solution is better than the initial point. That is, does not appear in Lemma 4. The precision only affects the final convergence rate. However, such is not the case for Algorithm 1 as we see from the appearance of in Lemma 3 and Corollary 1.
3.2 Iteration Complexity
We now turn to the iteration complexity of our algorithms, considering three different assumptions on $F$: optimal set strong convexity, convexity, and the general (possibly nonconvex) case.
The following lemma is modified from some intermediate results in GhaS16a (), where they show R-linear convergence of Algorithm 3 for strongly convex objective when the inexactness is measured by an additive criterion. A proof can be found in the appendix.
Note that we allow in Lemma 5.
Linear Convergence for Optimal Set Strongly Convex Functions
We start with the case that the problem is -optimal-set-strongly convex with some , and show that our algorithms converge linearly.
Assume that is convex, is -optimal-set-strongly convex for some , at every iteration of Algorithm 1, the approximate solution of (2) is at least -approximate for some , and for some and all . Then we have
We therefore have the following.
If for some for all , then the iteration complexity to obtain an -accurate solution is
If the choice of makes at least -strongly convex for some and is positive semidefinite for all , then the iteration complexity to obtain an -accurate solution is
Given any iterate , the sufficient decrease condition (4) and positive semidefiniteness of imply that
where in (26) we used the inexactness condition (3) and in (27) we used (21). Using the results in Lemma 3 and Corollary 1, we obtain the lower bound for in the two scenarios, and the results (23) and (24).
Assume that is convex, is -optimal-set-strongly convex for some , at every iteration of Algorithms 2 and 3, the approximate solution of (2) is at least -approximate for some , and the conditions in Lemma 4 are satisfied for all . Then we have
and thus the iteration complexities for obtaining an -accurate solution are
The step size guarantees in Lemma 3 and Corollary 1 are just lower bounds. By selecting suitably, the line search will generally terminate with a value of not far from to , yielding complexity much better than the worst case suggested by Theorem 3.1. A similar argument applies to (28). When is chosen properly, the convergence rates in Theorem 3.2 can be less dependent on the condition number. Newton and quasi-Newton approximations for $H_k$ often have these desirable properties.
Therefore, Algorithm 1 with positive definite has the best dependency on , Algorithms 2 and 3 are next, and Algorithm 1 with strongly convex and has the worst bound in terms of dependency on . On the other hand, if we move the strong convexity parameter from to , the value of is only affected by an additive factor. Therefore, it is clear that using a positive definite in Algorithm 1 can improve convergence bounds significantly.
Sublinear Convergence for General Convex Problems
We now consider the general convex case. We assume that the level sets of are bounded, and define
Note that boundedness of the level set guarantees that is finite. Using this definition, when , (20) can be rewritten as
for any with . If for some and all , then we have
The following lemma is inspired by (Bac15a, , Lemma 4.4) but contains nontrivial modifications, and will be needed in proving the convergence rate for general convex problems. Its proof can be found in the appendix.
Assume we have two non-negative sequences and , and constants such that
then we have if ,
while in addition
Assume that is convex, at every iteration of Algorithm 1, is chosen such that
When , the convergence rate is Q-linear, that is,
For any , the objective follows a sublinear convergence rate:
Assume that is convex, at every iteration of Algorithms 2 and 3, the initial satisfies the conditions in Lemma 4, and the approximate solution of (2) is at least -approximate for some . Then the following claims hold.