
# A Line-Search Algorithm Inspired by the Adaptive Cubic Regularization Framework and Complexity Analysis

E. Bergou MaIAGE, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France (elhoucine.bergou@inra.fr).    Y. Diouane Institut Supérieur de l’Aéronautique et de l’Espace (ISAE-SUPAERO), Université de Toulouse, 31055 Toulouse Cedex 4, France (youssef.diouane@isae.fr).    S. Gratton INP-ENSEEIHT, Université de Toulouse, 31071 Toulouse Cedex 7, France (serge.gratton@enseeiht.fr).
###### Abstract

The adaptive regularized framework using cubics has emerged as an alternative to line-search and trust-region algorithms for smooth nonconvex optimization, with an optimal complexity amongst second-order methods. In this paper, we propose and analyze the use of an iteration-dependent scaled norm in the adaptive regularized framework using cubics. With such a scaled norm, the obtained method behaves as a line-search algorithm along the quasi-Newton direction with a special backtracking strategy. Under appropriate assumptions, the new algorithm enjoys the same convergence and complexity properties as the adaptive regularized algorithm using cubics. The complexity for finding an approximate first-order stationary point can be improved to be optimal whenever a second-order version of the proposed algorithm is considered. In a similar way, using the same scaled norm to define the trust-region neighborhood, we show that the trust-region algorithm behaves as a line-search algorithm. The good potential of the obtained algorithms is shown on a set of large-scale optimization problems.

Keywords: Nonlinear optimization, unconstrained optimization, line-search methods, adaptive regularized framework using cubics, trust-region methods, worst-case complexity.

## 1 Introduction

An unconstrained nonlinear optimization problem considers the minimization of a scalar function known as the objective function. Classical iterative methods for solving this problem are trust-region (TR) [8, 20], line-search (LS) [10] and algorithms using cubic regularization. The latter class of algorithms was first investigated by Griewank [15] and then by Nesterov and Polyak [18]. Recently, Cartis et al. [5] proposed a generalization to an adaptive regularized framework using cubics (ARC).

The worst-case evaluation complexity of finding an $\epsilon$-approximate first-order critical point using TR or LS methods is shown to be at most $\mathcal{O}(\epsilon^{-2})$ objective function or gradient evaluations, where $\epsilon$ is a user-defined accuracy threshold on the gradient norm [17, 14, 7]. Under appropriate assumptions, ARC takes at most $\mathcal{O}(\epsilon^{-3/2})$ objective function or gradient evaluations to reduce the norm of the gradient of the objective function below $\epsilon$, and thus it improves substantially the worst-case complexity over the classical TR/LS methods [4]. Such a complexity bound can be improved using higher-order regularized models; we refer the reader for instance to the references [2, 6].

More recently, a non-standard TR method [9] has been proposed with the same worst-case complexity bound as ARC. It has also been proved that the same worst-case complexity can be achieved by means of a specific variable norm in a TR method [16] or using quadratic regularization [3]. All the previous approaches use a cubic sufficient-descent condition instead of the more usual predicted-reduction based descent. Generally, they need to solve more than one linear system in sequence at each outer iteration (by outer iteration, we mean the sequence of the iterates generated by the algorithm), which makes the computational cost per iteration expensive.

In [1], it has been shown how to use the so-called energy norm in the ARC/TR framework when a symmetric positive definite (SPD) approximation of the objective function Hessian is available. With the energy norm, ARC/TR methods behave as LS algorithms along the Newton direction, with a special backtracking strategy and an acceptability condition in the spirit of ARC/TR methods. As long as the model of the objective function is convex, the LS algorithm proposed in [1] and derived from ARC enjoys the same convergence and complexity properties as ARC, in particular the first-order complexity bound of $\mathcal{O}(\epsilon^{-3/2})$. The complexity analysis of the ARC method [4] requires the Hessian approximation to approximate the true Hessian accurately enough [4, Assumption AM.4]; obtaining such a convex approximation may be out of reach when handling nonconvex optimization. This paper generalizes the methodology proposed in [1] to handle nonconvex models. We propose to use, in the regularization term of the ARC cubic model, an iteration-dependent scaled norm. In this case, ARC behaves as an LS algorithm with a worst-case evaluation complexity of $\mathcal{O}(\epsilon^{-2})$ function or gradient evaluations for finding an $\epsilon$-approximate first-order critical point. Moreover, under appropriate assumptions, a second-order version of the obtained LS algorithm is shown to have a worst-case complexity of $\mathcal{O}(\epsilon^{-3/2})$.

The use of a scaled norm was first introduced in [8, Section 7.7.1] for TR methods, where it was suggested to use the absolute value of the Hessian matrix in the scaled norm; such a choice was described as "the ideal trust region" reflecting the proper scaling of the underlying problem. For a large-scale indefinite Hessian matrix, computing its absolute value is certainly a computationally expensive task, as it requires a spectral decomposition. This means that for large-scale optimization problems the use of the absolute-value based norm can be seen as out of reach. Our approach in this paper is different, as it allows the use of subspace methods.

In fact, as long as the quasi-Newton direction is not orthogonal to the gradient of the objective function at the current iterate, the specific choice of the scaled norm renders the ARC subproblem solution collinear with the quasi-Newton direction. Using subspace methods, we also consider the large-scale setting where matrix factorizations are not affordable, implying that only iterative methods can be used to compute a trial step. Compared to the classical ARC with the Euclidean norm, the dominant computational cost of the resulting algorithm, apart from the function evaluation cost, is the cost of solving a linear system at each successful iteration. Moreover, the cost of the subproblem solution for unsuccessful iterations becomes inexpensive and requires only the update of a scalar. Hence, ARC behaves as an LS algorithm along the quasi-Newton direction, with a special backtracking strategy and an acceptance criterion in the spirit of the ARC algorithm.

In this context, the obtained LS algorithm is globally convergent and requires a number of iterations of order $\mathcal{O}(\epsilon^{-2})$ to produce an $\epsilon$-approximate first-order critical point. A second-order version of the algorithm is also proposed, making use of the exact Hessian or at least of a good approximation of it, to ensure an optimal worst-case complexity bound of order $\mathcal{O}(\epsilon^{-3/2})$. In this case, we investigate how the complexity bound depends on the quality of the chosen quasi-Newton direction in terms of being a sufficient descent direction. In fact, the obtained complexity bound deteriorates whenever the quasi-Newton direction is approximately orthogonal to the gradient of the objective function. Similarly to ARC, we show that the TR method also behaves as an LS algorithm when the same scaled norm as in ARC is used. Numerical illustrations over a test set of large-scale optimization problems are given in order to assess the efficiency of the obtained LS algorithms.

The proposed analysis in this paper assumes that the quasi-Newton direction is not orthogonal to the gradient of the objective function during the minimization process. When this assumption is violated, one can either modify the Hessian approximation using regularization techniques or, when a second-order version of the LS algorithm is considered, switch to the classical ARC algorithm using the Euclidean norm until the assumption holds again. In the latter scenario, we propose to check first whether there exists an approximate quasi-Newton direction, among all the iterates generated using a subspace method, which is not orthogonal to the gradient and satisfies the desired properties. If not, one minimizes the model using the Euclidean norm until a new successful outer iteration is found.

We organize this paper as follows. In Section 2, we introduce the ARC method using a general scaled norm and derive the LS algorithm obtained from ARC when a specific scaled norm is used. Section 3 analyzes the minimization of the cubic model and discusses the choice of the scaled norm that simplifies solving the ARC subproblem. Section 4 first discusses how the iteration-dependent $M_k$-norm can be chosen uniformly equivalent to the Euclidean norm, and then proposes a second-order LS algorithm that enjoys the optimal complexity bound. The section ends with a detailed complexity analysis of the obtained algorithm. Similarly to ARC and using the same scaled norm, an LS algorithm in the spirit of the TR algorithm is proposed in Section 5. Numerical tests are illustrated and discussed in Section 6. Conclusions and future improvements are given in Section 7.

## 2 ARC Framework Using a Specific $M_k$-Norm

### 2.1 ARC Framework

We consider a problem of unconstrained minimization of the form

$$\min_{x\in\mathbb{R}^n} f(x), \qquad (1)$$

where the objective function $f:\mathbb{R}^n \to \mathbb{R}$ is assumed to be continuously differentiable. The ARC framework [5] can be described as follows: at a given iterate $x_k$, we define $m_k^Q$ as an approximate second-order Taylor approximation of the objective function around $x_k$, i.e.,

$$m_k^Q(s) = f(x_k) + s^\top g_k + \tfrac{1}{2}\, s^\top B_k s, \qquad (2)$$

where $g_k = \nabla f(x_k)$ is the gradient of $f$ at the current iterate $x_k$, and $B_k$ is a symmetric local approximation (uniformly bounded from above) of the Hessian of $f$ at $x_k$. The trial step $s_k$ approximates the global minimizer of the cubic model $m_k(s) = m_k^Q(s) + \tfrac{1}{3}\sigma_k \|s\|_{M_k}^3$, i.e.,

$$s_k \approx \arg\min_{s\in\mathbb{R}^n} m_k(s), \qquad (3)$$

where $\|\cdot\|_{M_k}$ denotes an iteration-dependent scaled norm of the form $\|x\|_{M_k}^2 = x^\top M_k x$ for all $x \in \mathbb{R}^n$, where $M_k$ is a given SPD matrix, and $\sigma_k$ is a dynamic positive parameter that might be regarded as the reciprocal of the TR radius in TR algorithms (see [5]). The parameter $\sigma_k$ takes into account the agreement between the objective function $f$ and the model $m_k$.
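To make the models concrete, the following is a minimal pure-Python sketch of the quadratic model (2) and of the cubic model used in (3); the function names are ours, not the paper's, and the $M_k$-norm is passed as an explicit matrix.

```python
def quad_model(f0, g, B, s):
    """m_k^Q(s) = f(x_k) + s^T g_k + 0.5 * s^T B_k s  (vectors/matrices as lists)."""
    n = len(s)
    sg = sum(s[i] * g[i] for i in range(n))
    sBs = sum(s[i] * B[i][j] * s[j] for i in range(n) for j in range(n))
    return f0 + sg + 0.5 * sBs

def cubic_model(f0, g, B, M, sigma, s):
    """m_k(s) = m_k^Q(s) + (sigma/3) * ||s||_M^3, with ||s||_M^2 = s^T M s."""
    n = len(s)
    sMs = sum(s[i] * M[i][j] * s[j] for i in range(n) for j in range(n))
    return quad_model(f0, g, B, s) + sigma / 3.0 * sMs ** 1.5
```

With $M_k = I$ this reduces to the Euclidean-norm cubic model of classical ARC.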

To decide whether the trial step is acceptable or not, a ratio between the actual reduction and the predicted reduction is computed as follows:

$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{f(x_k) - m_k^Q(s_k)}. \qquad (4)$$

For a given scalar $0 < \eta < 1$, the outer iteration is said to be successful if $\rho_k \ge \eta$, and unsuccessful otherwise. For all successful iterations we set $x_{k+1} = x_k + s_k$; otherwise the current iterate is kept unchanged, $x_{k+1} = x_k$. We note that, unlike the original ARC [5, 4] where the cubic model $m_k$ is used to evaluate the denominator in (4), recent works related to ARC use only the quadratic approximation $m_k^Q$ in the comparison with the actual value of $f$, without the regularization parameter $\sigma_k$ (see [2] for instance). Algorithm 1 gives a detailed description of ARC.
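The acceptance test and the regularization update can be sketched as below; this is an illustrative sketch, and the expansion/shrink factors are placeholders, not the constants of Algorithm 1 (which is not reproduced in this excerpt).

```python
def arc_ratio(f0, f_trial, mq_trial):
    """rho_k of (4): actual reduction over predicted reduction,
    with the quadratic model m_k^Q in the denominator."""
    return (f0 - f_trial) / (f0 - mq_trial)

def arc_update(x, s, sigma, rho, eta, nu=2.0):
    """Accept the step when rho >= eta; otherwise keep x and enlarge sigma.
    The factor nu and the shrink rule are illustrative choices."""
    if rho >= eta:
        x = [xi + si for xi, si in zip(x, s)]
        sigma = max(sigma / nu, 1e-16)   # successful: relax regularization
    else:
        sigma = sigma * nu               # unsuccessful: regularize more
    return x, sigma
```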

The Cauchy step $s_k^C$, defined in Algorithm 1, is computationally inexpensive compared with the global minimizer of $m_k$. The condition (5) on the trial step is sufficient for ensuring global convergence of ARC to first-order critical points.

From now on, we will assume that first-order stationarity has not been reached yet, meaning that the gradient of the objective function is nonzero at the current iterate (i.e., $g_k \neq 0$). Also, $\|\cdot\|$ will denote the vector or matrix $\ell_2$-norm, $\mathrm{sgn}(\alpha)$ the sign of a real $\alpha$, and $I$ the identity matrix of size $n \times n$.

### 2.2 An LS Algorithm Inspired by the ARC Framework

Using a specific $M_k$-norm in the definition of the cubic model $m_k$, we will show that the ARC framework (Algorithm 1) behaves as an LS algorithm along the quasi-Newton direction. In a previous work [1], when the matrix $B_k$ is assumed to be positive definite, we showed that the minimizer $s_k$ of the cubic model defined in (3) becomes collinear with the quasi-Newton direction when the matrix $M_k$ is set equal to $B_k$. In this section we generalize the proposed approach to cover the case where the linear system $B_k s = -g_k$ admits an approximate solution and $B_k$ is not necessarily SPD.

Let $s_k^Q$ be an approximate solution of the linear system $B_k s = -g_k$ and assume that such a step is not orthogonal to the gradient of the objective function at $x_k$, i.e., there exists an $\epsilon_d > 0$ such that $|g_k^\top s_k^Q| \ge \epsilon_d \|g_k\| \|s_k^Q\|$. Suppose that there exists an SPD matrix $M_k$ such that $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$, where $\theta_k = \beta_k \|s_k^Q\|^2$ and $\beta_k \in\, ]\beta_{\min}, \beta_{\max}[$; in Theorem 4.1 we will show that such a matrix exists. By using the associated $M_k$-norm in the definition of the cubic model $m_k$, one can show (see Theorem 3.3) that an approximate stationary point of the subproblem (3) is of the form

$$s_k = \delta_k\, s_k^Q, \quad\text{where}\quad \delta_k = \frac{2}{1 - \mathrm{sgn}(g_k^\top s_k^Q)\sqrt{1 + \dfrac{4\,\sigma_k \|s_k^Q\|_{M_k}^3}{|g_k^\top s_k^Q|}}}. \qquad (6)$$
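Note that the step length in (6) involves $M_k$ only through the scalar $\|s_k^Q\|_{M_k}$; a direct transcription (the helper name is ours) is:

```python
import math

def arc_step_length(gTs, sQ_norm_M, sigma):
    """delta_k of (6): 2 / (1 - sgn(g^T s^Q) * sqrt(1 + 4*sigma*||s^Q||_M^3 / |g^T s^Q|)).
    gTs = g_k^T s_k^Q (must be nonzero), sQ_norm_M = ||s_k^Q||_{M_k}, sigma > 0."""
    sgn = 1.0 if gTs > 0 else -1.0
    root = math.sqrt(1.0 + 4.0 * sigma * sQ_norm_M ** 3 / abs(gTs))
    return 2.0 / (1.0 - sgn * root)
```

For a descent-like direction ($g_k^\top s_k^Q < 0$) the step length is positive and shrinks as $\sigma_k$ grows; for $g_k^\top s_k^Q > 0$ it is negative, so the step reverses the direction.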

For unsuccessful iterations in Algorithm 1, since the step direction $s_k^Q$ does not change, the approximate solution of the subproblem, given by (6), can be obtained simply by updating the step size $\delta_k$. This means that the subproblem computational cost of unsuccessful iterations becomes trivial compared to solving the subproblem as required by ARC when the Euclidean norm is used (see e.g., [5]). As a consequence, the use of the proposed $M_k$-norm in Algorithm 1 leads to a new formulation of the ARC algorithm where the dominant computational cost, apart from the objective function evaluations, is the cost of solving a linear system at each successful iteration. In other words, with the proposed $M_k$-norm, the ARC algorithm behaves as an LS method with a specific backtracking strategy and an acceptance criterion in the spirit of the ARC algorithm, i.e., the step is of the form $s_k = \delta_k s_k^Q$ where the step length $\delta_k$ is chosen such that

$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{f(x_k) - m_k^Q(s_k)} \ge \eta \quad\text{and}\quad m_k(s_k) \le m_k(-\delta_k^C g_k). \qquad (7)$$

The step lengths $\delta_k$ and $\delta_k^C$ are computed respectively as follows:

$$\delta_k = \frac{2}{1 - \mathrm{sgn}(g_k^\top s_k^Q)\sqrt{1 + \dfrac{4\,\sigma_k \beta_k^{3/2} \|s_k^Q\|^3}{|g_k^\top s_k^Q|}}}, \qquad (8)$$

and

 δck = (9)

where $\beta_k = \theta_k / \|s_k^Q\|^2$. The $M_k$-norms of the vectors $s_k^Q$ and $g_k$ in the computation of $\delta_k$ and $\delta_k^C$ have been substituted using the expressions given in Theorem 4.1. The value of $\sigma_k$ is set equal to the current value of the regularization parameter, as in the original ARC algorithm. The decrease condition (7) may not be satisfied; in this case, the value of $\sigma_k$ is enlarged using an expansion factor $\nu > 1$. Iteratively, the value of $\delta_k$ is updated and the acceptance condition (7) is checked again, until it is satisfied.
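This backtracking strategy can be sketched as follows. The sketch is simplified: only the ratio condition of (7) is tested (the comparison against the Cauchy point is omitted), and all helper names and default constants are ours.

```python
import math

def ls_arc_inner(f, x, fx, g, sQ, mq, sQ_norm_M, sigma, eta=0.1, nu=2.0, max_tries=60):
    """One outer iteration of the LS-ARC backtracking strategy (sketch).
    f: objective, mq: quadratic model s -> m_k^Q(s), sQ: quasi-Newton direction.
    Unsuccessful trials only update the scalar sigma -- no new linear solve."""
    gTs = sum(gi * si for gi, si in zip(g, sQ))
    sgn = 1.0 if gTs > 0 else -1.0
    for _ in range(max_tries):
        delta = 2.0 / (1.0 - sgn * math.sqrt(1.0 + 4.0 * sigma * sQ_norm_M ** 3 / abs(gTs)))
        s = [delta * si for si in sQ]
        x_trial = [xi + si for xi, si in zip(x, s)]
        pred = fx - mq(s)
        if pred > 0.0 and (fx - f(x_trial)) / pred >= eta:
            return x_trial, sigma          # successful iteration
        sigma *= nu                        # unsuccessful: expand sigma and retry
    return x, sigma
```

On a convex quadratic the ratio equals one, so the very first trial is accepted.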

We refer to the ARC algorithm with the proposed scaled $M_k$-norm as LS-ARC, as it behaves as an LS-type method in this case. Algorithm 2 details the final algorithm. We recall again that this algorithm is nothing but the ARC algorithm using a specific $M_k$-norm.

Note that in Algorithm 2 the step $s_k^Q$ may not exist or may be approximately orthogonal to the gradient $g_k$. A possible way to overcome this issue is to modify the matrix $B_k$ using regularization techniques. In fact, as long as the Hessian approximation $B_k$ remains uniformly bounded from above, global convergence still holds, as well as a complexity bound of order $\mathcal{O}(\epsilon^{-2})$ to drive the norm of the gradient below $\epsilon$ (see [5] for instance).

The complexity bound can be improved to be of order $\mathcal{O}(\epsilon^{-3/2})$ if a second-order version of the algorithm LS-ARC is used, by setting $B_k$ equal to the exact Hessian, or at least to a good approximation of it (as in Assumption 4.2 of Section 4). In this case, modifying the matrix $B_k$ using regularization techniques so that the step $s_k^Q$ approximates a solution of the linear system $B_k s = -g_k$ is no longer trivial. This second-order version of the algorithm LS-ARC will be discussed in detail in Section 4, where convergence and complexity analysis, when the proposed $M_k$-norm is used, will be outlined.

## 3 On the Cubic Model Minimization

In this section, we assume that the linear system $B_k s = -g_k$ has a solution. We will mostly focus on the solution of the subproblem (3) for a given outer iteration $k$. In particular, we will make explicit the condition to impose on the matrix $M_k$ in order to get the solution of the ARC subproblem collinear with the step $s_k^Q$. Hence, in such a case, one can get the solution of the ARC subproblem at a modest computational cost.

The step $s_k^Q$ can be obtained exactly using a direct method if the matrix $B_k$ is not too large. Typically, one can use a matrix factorization to solve this linear system. For large-scale optimization problems, computing $s_k^Q$ exactly can be prohibitively expensive. We will show that it is possible to relax this requirement by letting the step $s_k^Q$ be only an approximation of the exact solution, computed using subspace methods.

In fact, when an approximate solution is used, as far as the global convergence of Algorithm 1 is concerned, all that is needed is that the solution of the subproblem (3) yields a decrease in the cubic model which is as good as the Cauchy decrease (as emphasized in condition (5)). In practice, a version of Algorithm 2 solely based on the Cauchy step would suffer from the same drawbacks as the steepest-descent algorithm on ill-conditioned problems, and faster convergence can be expected if the matrix $B_k$ also influences the minimization direction. The main idea consists in achieving a further decrease of the cubic model, better than the Cauchy decrease, by projection onto a sequence of embedded Krylov subspaces. We now show how to use a similar idea to compute a solution of the subproblem that is computationally cheap and yields the global convergence of Algorithm 2.

A classical way to approximate the exact solution $s_k^Q$ is by using subspace methods, typically a Krylov subspace method. For that, let $\mathcal{L}_k$ be a subspace of $\mathbb{R}^n$ and $l$ its dimension. Let $Q_k$ denote an $n \times l$ matrix whose columns form a basis of $\mathcal{L}_k$. Thus for all $s \in \mathcal{L}_k$, we have $s = Q_k z$ for some $z \in \mathbb{R}^l$. In this case, $s_k^Q$ denotes the exact stationary point of the model function $m_k^Q$ over the subspace $\mathcal{L}_k$, when it exists.

In both cases, exact and inexact, we will assume that the step $s_k^Q$ is not orthogonal to the gradient $g_k$. In what comes next, we state our assumption on $s_k^Q$ formally as follows:

###### Assumption 3.1

The model $m_k^Q$ admits a stationary point $s_k^Q$ such that $|g_k^\top s_k^Q| \ge \epsilon_d \|g_k\| \|s_k^Q\|$, where $\epsilon_d$ is a pre-defined positive constant.

We also define a Newton-like step $s_k^N$ associated with the minimization of the cubic model $m_k$ over the subspace $\mathcal{L}_k$ in the following way: when $s_k^Q$ corresponds to the exact solution of $B_k s = -g_k$, we set

$$s_k^N = \delta_k^N s_k^Q, \quad\text{where}\quad \delta_k^N = \arg\min_{\delta \in \mathcal{I}_k} m_k(\delta s_k^Q), \qquad (10)$$

where $\mathcal{I}_k = \mathbb{R}_+$ if $g_k^\top s_k^Q < 0$ and $\mathcal{I}_k = \mathbb{R}_-$ otherwise. If $s_k^Q$ is computed using an iterative subspace method, then $s_k^N = Q_k z_k^N$, where $z_k^N$ is the Newton-like step, as in (10), associated to the following reduced subproblem:

$$\min_{z\in\mathbb{R}^l} f(x_k) + z^\top Q_k^\top g_k + \tfrac{1}{2}\, z^\top Q_k^\top B_k Q_k z + \tfrac{1}{3}\sigma_k \|z\|^3_{Q_k^\top M_k Q_k}. \qquad (11)$$
###### Theorem 3.1

Let Assumption 3.1 hold. The Newton-like step (10) is of the form

$$s_k^N = \delta_k^N s_k^Q, \quad\text{where}\quad \delta_k^N = \frac{2}{1 - \mathrm{sgn}(g_k^\top s_k^Q)\sqrt{1 + \dfrac{4\,\sigma_k \|s_k^Q\|^3_{M_k}}{|g_k^\top s_k^Q|}}}. \qquad (12)$$

Proof. Consider first the case where the step $s_k^Q$ is computed exactly (i.e., $B_k s_k^Q = -g_k$). In this case, for all $\delta$, one has

$$m_k(\delta s_k^Q) - m_k(0) = \delta\, g_k^\top s_k^Q + \frac{\delta^2}{2}\, [s_k^Q]^\top B_k s_k^Q + \frac{\sigma_k |\delta|^3}{3}\, \|s_k^Q\|^3_{M_k} = (g_k^\top s_k^Q)\,\delta - (g_k^\top s_k^Q)\,\frac{\delta^2}{2} + \left(\sigma_k \|s_k^Q\|^3_{M_k}\right)\frac{|\delta|^3}{3}. \qquad (13)$$

If $g_k^\top s_k^Q < 0$ (hence, $\delta_k^N > 0$), we compute the value of the parameter $\delta$ at which the unique minimizer of the above function is attained. Taking the derivative of (13) with respect to $\delta$ and equating the result to zero, one gets

$$0 = g_k^\top s_k^Q - (g_k^\top s_k^Q)\,\delta_k^N + \sigma_k \|s_k^Q\|^3_{M_k} (\delta_k^N)^2, \qquad (14)$$

and thus, since $\delta_k^N > 0$,

$$\delta_k^N = \frac{g_k^\top s_k^Q + \sqrt{(g_k^\top s_k^Q)^2 - 4\sigma_k (g_k^\top s_k^Q)\, \|s_k^Q\|^3_{M_k}}}{2\sigma_k \|s_k^Q\|^3_{M_k}} = \frac{2}{1 + \sqrt{1 - \dfrac{4\sigma_k \|s_k^Q\|^3_{M_k}}{g_k^\top s_k^Q}}}.$$

If $g_k^\top s_k^Q > 0$ (hence, $\delta_k^N < 0$), again by taking the derivative of (13) with respect to $\delta$ and equating the result to zero, one gets

$$0 = g_k^\top s_k^Q - (g_k^\top s_k^Q)\,\delta_k^N - \sigma_k \|s_k^Q\|^3_{M_k} (\delta_k^N)^2, \qquad (15)$$

and thus, since $\delta_k^N < 0$ in this case,

$$\delta_k^N = \frac{-g_k^\top s_k^Q - \sqrt{(g_k^\top s_k^Q)^2 + 4\sigma_k (g_k^\top s_k^Q)\, \|s_k^Q\|^3_{M_k}}}{2\sigma_k \|s_k^Q\|^3_{M_k}} = \frac{2}{1 - \sqrt{1 + \dfrac{4\sigma_k \|s_k^Q\|^3_{M_k}}{g_k^\top s_k^Q}}}.$$

From both cases, one deduces that $\delta_k^N = \dfrac{2}{1 - \mathrm{sgn}(g_k^\top s_k^Q)\sqrt{1 + 4\sigma_k \|s_k^Q\|^3_{M_k} / |g_k^\top s_k^Q|}}$.

Consider now the case where $s_k^Q$ is computed using an iterative subspace method. In this case, one has $s_k^N = Q_k z_k^N$, where $z_k^N$ is the Newton-like step associated to the reduced subproblem (11). Hence, by applying the first part of the proof (the exact case) to the reduced subproblem (11), it follows that

$$z_k^N = \bar\delta_k^N z_k^Q, \quad\text{where}\quad \bar\delta_k^N = \frac{2}{1 - \mathrm{sgn}\!\left((Q_k^\top g_k)^\top z_k^Q\right)\sqrt{1 + \dfrac{4\sigma_k \|z_k^Q\|^3_{Q_k^\top M_k Q_k}}{|(Q_k^\top g_k)^\top z_k^Q|}}},$$

where $z_k^Q$ is a stationary point of the quadratic part of the model minimized in (11). Thus, by substituting $z_k^N$ in the formula $s_k^N = Q_k z_k^N$, one gets

$$s_k^N = Q_k\left(\frac{2}{1 - \mathrm{sgn}\!\left((Q_k^\top g_k)^\top z_k^Q\right)\sqrt{1 + \dfrac{4\sigma_k \|z_k^Q\|^3_{Q_k^\top M_k Q_k}}{|(Q_k^\top g_k)^\top z_k^Q|}}}\; z_k^Q\right) = \frac{2}{1 - \mathrm{sgn}(g_k^\top Q_k z_k^Q)\sqrt{1 + \dfrac{4\sigma_k \|Q_k z_k^Q\|^3_{M_k}}{|g_k^\top Q_k z_k^Q|}}}\; Q_k z_k^Q = \frac{2}{1 - \mathrm{sgn}(g_k^\top s_k^Q)\sqrt{1 + \dfrac{4\sigma_k \|s_k^Q\|^3_{M_k}}{|g_k^\top s_k^Q|}}}\; s_k^Q.$$
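As a numerical sanity check on the closed form (12), one can verify it against the one-dimensional model (13): the formula satisfies the stationarity equations (14)-(15) and is a local minimizer over the admissible half-line (pure-Python sketch, names ours).

```python
import math

def delta_newton(a, nrm, sigma):
    """Closed form (12) for delta_k^N; a = g_k^T s_k^Q, nrm = ||s_k^Q||_{M_k}."""
    sgn = 1.0 if a > 0 else -1.0
    return 2.0 / (1.0 - sgn * math.sqrt(1.0 + 4.0 * sigma * nrm ** 3 / abs(a)))

def phi(delta, a, nrm, sigma):
    """Reduced model (13): m_k(delta s_k^Q) - m_k(0), using B_k s_k^Q = -g_k."""
    return a * delta - a * delta ** 2 / 2.0 + sigma * nrm ** 3 * abs(delta) ** 3 / 3.0
```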

In general, for the ARC algorithm, the matrix $M_k$ can be any arbitrary SPD matrix. Our goal in this section is to determine how one can choose the matrix $M_k$ so that the Newton-like step $s_k^N$ becomes a stationary point of the subproblem (3). The following theorem gives explicitly the necessary and sufficient condition on the matrix $M_k$ to reach this aim.

###### Theorem 3.2

Let Assumption 3.1 hold. The step $s_k^N$ is a stationary point of the subproblem (3) if and only if there exists $\theta_k > 0$ such that $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$. Note that $\theta_k = \|s_k^Q\|^2_{M_k}$.

Proof. Indeed, in the exact case, if we suppose that the step $s_k^N$ is a stationary point of the subproblem (3), this means that

$$\nabla m_k(s_k^N) = g_k + B_k s_k^N + \sigma_k \|s_k^N\|_{M_k} M_k s_k^N = 0. \qquad (16)$$

On the other hand, $B_k s_k^N = -\delta_k^N g_k$, where $\delta_k^N$ is a solution of $g_k^\top s_k^Q - (g_k^\top s_k^Q)\,\delta + \sigma_k \|s_k^Q\|^3_{M_k} |\delta|\,\delta = 0$ (such an equation can be deduced from (14) and (15)). Hence, we obtain that

$$0 = \nabla m_k(s_k^N) = g_k - \delta_k^N g_k + \sigma_k |\delta_k^N|\,\delta_k^N \|s_k^Q\|_{M_k} M_k s_k^Q = (1 - \delta_k^N)\, g_k + \left(\frac{\sigma_k \|s_k^Q\|^3_{M_k}}{g_k^\top s_k^Q}\, |\delta_k^N|\,\delta_k^N\right)\left(\frac{g_k^\top s_k^Q}{\|s_k^Q\|^2_{M_k}}\, M_k s_k^Q\right) = (1 - \delta_k^N)\left(g_k - \frac{g_k^\top s_k^Q}{\|s_k^Q\|^2_{M_k}}\, M_k s_k^Q\right).$$

Equivalently, since $\delta_k^N \neq 1$, we conclude that $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$, where $\theta_k = \|s_k^Q\|^2_{M_k}$. A similar proof applies when a subspace method is used to compute $s_k^Q$.

The key condition to ensure that the ARC subproblem stationary point is equal to the Newton-like step $s_k^N$ is the choice of a matrix $M_k$ which satisfies the secant-like equation $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$ for a given $\theta_k > 0$. The existence of such a matrix is not problematic as long as Assumption 3.1 holds. In fact, Theorem 4.1 will make explicit a range of $\theta_k$ for which the matrix $M_k$ exists. Note that in the formula (12) of $\delta_k^N$, the matrix $M_k$ is used only through the computation of the $M_k$-norm of $s_k^Q$, and $\|s_k^Q\|^2_{M_k} = \theta_k$. Therefore an explicit formula for the matrix $M_k$ is not needed; only the value of $\theta_k$ suffices for the computations.

When the matrix $M_k$ satisfies the desired properties (as in Theorem 3.2), one is ensured that $s_k^N$ is a stationary point of the model $m_k$. However, the ARC algorithm imposes on the approximate step to satisfy the Cauchy decrease given by (5), and such a condition is not guaranteed by $s_k^N$, as the model $m_k$ may be nonconvex. In the next theorem, we show that for a sufficiently large $\sigma_k$, $s_k^N$ becomes the global minimizer of $m_k$, so that satisfying the Cauchy decrease is no longer an issue.

###### Theorem 3.3

Let Assumption 3.1 hold. Let $M_k$ be an SPD matrix which satisfies $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$ for a fixed $\theta_k > 0$. If the matrix $B_k + \lambda_k^N M_k$, with $\lambda_k^N = \sigma_k \|s_k^N\|_{M_k}$, is positive definite, then the step $s_k^N$ is the unique minimizer of the subproblem (3) over the subspace $\mathcal{L}_k$.

Proof. Indeed, when $s_k^Q$ is computed exactly (i.e., $\mathcal{L}_k = \mathbb{R}^n$), then using [8, Theorem 3.1] one has that a vector $s_k^*$ is a global minimizer of $m_k$ if and only if it satisfies

$$(B_k + \lambda_k^* M_k)\, s_k^* = -g_k,$$

where $B_k + \lambda_k^* M_k$ is a positive semidefinite matrix and $\lambda_k^* = \sigma_k \|s_k^*\|_{M_k}$. Moreover, if $B_k + \lambda_k^* M_k$ is positive definite, $s_k^*$ is unique.

Since $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$, by applying Theorem 3.2 we see that

$$(B_k + \lambda_k^N M_k)\, s_k^N = -g_k,$$

with $\lambda_k^N = \sigma_k \|s_k^N\|_{M_k}$. Thus, if we assume that $B_k + \lambda_k^N M_k$ is a positive definite matrix, then $s_k^N$ is the unique global minimizer of the subproblem (3).

Consider now the case where $s_k^Q$ is computed using a subspace method; since $s_k^N \in \mathcal{L}_k$, one has $s_k^N = Q_k z_k^N$. Hence, if we suppose that the matrix $Q_k^\top (B_k + \lambda_k^N M_k) Q_k$ is positive definite, by applying the proof of the exact case to the reduced subproblem (11), we see that the step $z_k^N$ is the unique global minimizer of the subproblem (11). We conclude that $s_k^N$ is the global minimizer of the subproblem (3) over the subspace $\mathcal{L}_k$.

Theorem 3.3 states that the step $s_k^N$ is the global minimizer of the cubic model $m_k$ over the subspace $\mathcal{L}_k$ as long as the matrix $B_k + \lambda_k^N M_k$ is positive definite, where $\lambda_k^N = \sigma_k \|s_k^N\|_{M_k}$. Note that

$$\lambda_k^N = \sigma_k \|s_k^N\|_{M_k} = \frac{2\sigma_k \|s_k^Q\|_{M_k}}{\left|1 - \mathrm{sgn}(g_k^\top s_k^Q)\sqrt{1 + \dfrac{4\sigma_k \|s_k^Q\|^3_{M_k}}{|g_k^\top s_k^Q|}}\right|} \;\to\; +\infty \quad\text{as } \sigma_k \to \infty.$$

Thus, since $M_k$ is an SPD matrix and the regularization parameter $\sigma_k$ is increased at unsuccessful iterations in Algorithm 1, the positive definiteness of the matrix $B_k + \lambda_k^N M_k$ is guaranteed after finitely many unsuccessful iterations. In other words, one is ensured that $s_k^N$ will satisfy the Cauchy decrease after a certain number of unsuccessful iterations.
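This mechanism can be illustrated numerically: with an indefinite $B_k$ and, purely for illustration, $M_k = I$, the shift $\lambda_k^N$ grows with $\sigma_k$ and $B_k + \lambda_k^N M_k$ eventually becomes positive definite (a small 2x2 sketch; helper names and the example matrix are ours).

```python
import math

def lam_newton(a, nrm, sigma):
    """lambda_k^N = sigma * ||s_k^N||_M
       = 2*sigma*||s^Q||_M / |1 - sgn(a)*sqrt(1 + 4*sigma*||s^Q||_M^3/|a|)|,
    with a = g_k^T s_k^Q and nrm = ||s_k^Q||_{M_k}."""
    sgn = 1.0 if a > 0 else -1.0
    return 2.0 * sigma * nrm / abs(1.0 - sgn * math.sqrt(1.0 + 4.0 * sigma * nrm ** 3 / abs(a)))

def spd_2x2(A):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return A[0][0] > 0.0 and A[0][0] * A[1][1] - A[0][1] * A[1][0] > 0.0

def shifted(B, lam):
    """B + lam * I  (M_k = I in this illustration)."""
    return [[B[0][0] + lam, B[0][1]], [B[1][0], B[1][1] + lam]]

B = [[1.0, 0.0], [0.0, -3.0]]   # an indefinite Hessian approximation
```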

## 4 Complexity Analysis of the LS-ARC Algorithm

For the well-definedness of the algorithm LS-ARC, one first needs to show that the proposed $M_k$-norm is uniformly equivalent to the Euclidean norm. The next theorem gives a range of choices for the parameter $\theta_k$ that ensures the existence of an SPD matrix $M_k$ such that $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$ and the $M_k$-norm is uniformly equivalent to the $\ell_2$-norm.

###### Theorem 4.1

Let Assumption 3.1 hold. If

$$\theta_k = \beta_k \|s_k^Q\|^2, \quad\text{where}\quad \beta_k \in\, ]\beta_{\min}, \beta_{\max}[ \quad\text{and}\quad \beta_{\max} > \beta_{\min} > 0, \qquad (17)$$

then there exists an SPD matrix $M_k$ such that

1. $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$,

2. the $M_k$-norm is uniformly equivalent to the $\ell_2$-norm on $\mathbb{R}^n$, and for all $x \in \mathbb{R}^n$ one has

$$\frac{\sqrt{\beta_{\min}}}{\sqrt{2}}\, \|x\| \;\le\; \|x\|_{M_k} \;\le\; \frac{\sqrt{2\beta_{\max}}}{\epsilon_d}\, \|x\|. \qquad (18)$$
3. Moreover, one has $\|s_k^Q\|_{M_k} = \sqrt{\beta_k}\, \|s_k^Q\|$.

Proof. Let $\bar s_k^Q = s_k^Q / \|s_k^Q\|$ and let $\bar g_k$ be a unit vector orthogonal to $\bar s_k^Q$ (i.e., $\|\bar g_k\| = 1$ and $\bar g_k^\top \bar s_k^Q = 0$) such that

$$\frac{g_k}{\|g_k\|} = \cos(\varpi_k)\, \bar s_k^Q + \sin(\varpi_k)\, \bar g_k. \qquad (19)$$

For a given $\theta_k > 0$, one would like to construct an SPD matrix $M_k$ such that $M_k s_k^Q = \theta_k \frac{g_k}{g_k^\top s_k^Q}$, hence

$$M_k \bar s_k^Q = \frac{\theta_k\, g_k}{g_k^\top s_k^Q\, \|s_k^Q\|} = \frac{\theta_k \|g_k\|}{g_k^\top s_k^Q\, \|s_k^Q\|}\left(\cos(\varpi_k)\, \bar s_k^Q + \sin(\varpi_k)\, \bar g_k\right) = \beta_k \bar s_k^Q + \beta_k \tan(\varpi_k)\, \bar g_k.$$

Using the symmetric structure of the matrix $M_k$, let $\gamma_k$ be a positive parameter such that

$$M_k = [\bar s_k^Q, \bar g_k]\, N_k\, [\bar s_k^Q, \bar g_k]^\top, \quad\text{where}\quad N_k = \begin{bmatrix} \beta_k & \beta_k \tan(\varpi_k) \\ \beta_k \tan(\varpi_k) & \gamma_k \end{bmatrix}.$$

The eigenvalues $\lambda_k^{\min}$ and $\lambda_k^{\max}$ of the matrix $N_k$ are the roots of

$$\lambda^2 - (\beta_k + \gamma_k)\,\lambda + \beta_k \gamma_k - \beta_k^2 \tan(\varpi_k)^2 = 0,$$

hence

$$\lambda_k^{\min} = \frac{(\beta_k + \gamma_k) - \sqrt{\vartheta_k}}{2} \quad\text{and}\quad \lambda_k^{\max} = \frac{(\beta_k + \gamma_k) + \sqrt{\vartheta_k}}{2},$$

where $\vartheta_k = (\beta_k + \gamma_k)^2 - 4\left(\beta_k \gamma_k - \beta_k^2 \tan(\varpi_k)^2\right)$. Note that both eigenvalues are monotonically increasing as functions of $\gamma_k$.
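The construction above can be checked numerically in a 2-D example: build $M_k = [\bar s_k^Q, \bar g_k]\, N_k\, [\bar s_k^Q, \bar g_k]^\top$ and verify the secant-like equation $M_k s_k^Q = \theta_k g_k / (g_k^\top s_k^Q)$ together with $\|s_k^Q\|^2_{M_k} = \theta_k$. This is a pure-Python sketch; the concrete vectors and the values of $\beta_k, \gamma_k$ are ours.

```python
import math

sQ = [3.0, 4.0]                      # quasi-Newton step, ||sQ|| = 5
g = [1.0, 2.0]                       # gradient, not orthogonal to sQ
nsQ = math.hypot(*sQ)
s_bar = [v / nsQ for v in sQ]
# Gram-Schmidt: unit vector g_bar orthogonal to s_bar, so that (19) holds
proj = sum(gi * si for gi, si in zip(g, s_bar))
g_orth = [g[i] - proj * s_bar[i] for i in range(2)]
g_bar = [v / math.hypot(*g_orth) for v in g_orth]

beta, gamma = 1.0, 2.0               # illustrative values of beta_k, gamma_k
theta = beta * nsQ ** 2              # theta_k = beta_k * ||sQ||^2, cf. (17)
tanw = math.hypot(*g_orth) / proj    # tan(varpi_k) = sin/cos from (19)
N = [[beta, beta * tanw], [beta * tanw, gamma]]

# M = [s_bar, g_bar] N [s_bar, g_bar]^T, restricted here to span{sQ, g}
basis = [s_bar, g_bar]
M = [[sum(basis[p][i] * N[p][q] * basis[q][j] for p in range(2) for q in range(2))
      for j in range(2)] for i in range(2)]

gTs = sum(gi * si for gi, si in zip(g, sQ))
MsQ = [sum(M[i][j] * sQ[j] for j in range(2)) for i in range(2)]
```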

One may choose $\gamma_k$ equal to $\beta_k\left(\tfrac{1}{2} + 2\tan(\varpi_k)^2\right)$, so that $\lambda_k^{\min}$ is uniformly bounded away from zero. In this case, from the expression of $\vartheta_k$, we deduce that $\lambda_k^{\min} = \tfrac{\beta_k}{2}$ and

$$\lambda_k^{\max} = \tfrac{3}{4}\beta_k + \beta_k \tan(\varpi_k)^2 + \sqrt{\tfrac{1}{16}\beta_k^2 + \tfrac{1}{2}\beta_k^2 \tan(\varpi_k)^2 + \beta_k^2 \tan(\varpi_k)^4}$$