On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition
Abstract
In this paper we consider the cubic regularization (CR) method for minimizing a twice continuously differentiable function. While the CR method is widely recognized as a globally convergent variant of Newton’s method with superior iteration complexity, existing results on its local quadratic convergence require a stringent nondegeneracy condition. We prove that under a local error bound (EB) condition, which is a much weaker requirement than the existing nondegeneracy condition, the sequence of iterates generated by the CR method converges at least Q-quadratically to a second-order critical point. This indicates that adding a cubic regularization not only equips Newton’s method with remarkable global convergence properties but also enables it to converge quadratically even in the presence of degenerate solutions. As a byproduct, we show that without assuming convexity, the proposed EB condition is equivalent to a quadratic growth condition, which could be of independent interest. To demonstrate the usefulness and relevance of our convergence analysis, we focus on two concrete nonconvex optimization problems that arise in phase retrieval and low-rank matrix recovery, respectively, and prove that with overwhelming probability, the sequence of iterates generated by the CR method for solving these two problems converges at least Q-quadratically to a global minimizer. We also present numerical results of the CR method when applied to these two problems to support and complement our theoretical development.
Keywords: cubic regularization, quadratic convergence, error bound, second-order critical points, non-isolated solutions, phase retrieval, low-rank matrix recovery
AMS subject classifications: 90C26, 90C30, 65K05, 49M15
1 Introduction
Consider the unconstrained minimization problem
(1) 
where is assumed to be twice continuously differentiable. Newton’s method is widely regarded as an efficient local method for solving problem (1). The cubic regularization (CR) method, which is short for cubic regularized Newton’s method, is a globally convergent variant of Newton’s method. Roughly speaking, given the current iterate , the CR method determines the next one by minimizing a cubic regularized quadratic model of at ; i.e.,
(2) 
where the regularization parameter is chosen such that . The idea of using cubic regularization first appeared in Griewank [15], where he proved that any accumulation point of generated by the CR method is a second-order critical point of ; i.e., an satisfying and . Later, Nesterov and Polyak [24] presented the remarkable result that the CR method has a better global iteration complexity bound than that for the steepest descent method. Elaborating on these results, Cartis et al. [10, 11] proposed an adaptive CR method for solving problem (1), where are determined dynamically and subproblems (2) are solved inexactly. They showed that the proposed method can still preserve the good global complexity bound established in [24]. Based on these pioneering works, the CR method has been attracting increasing attention over the past decade; see, e.g., [9, 29, 32] and references therein.
In addition to these global convergence properties, the CR method, as a modified Newton’s method, is also expected to attain a fast local convergence rate. It is known that if any accumulation point of the sequence generated by the CR method satisfies
(3) 
then the whole sequence converges at least Q-quadratically to ; see [15, Theorem 4.1] or [24, Theorem 3].
Noticing that for any and any orthogonal matrix , it is not hard to see that there is no isolated local minimizer of when , which implies that there is no such that and when . Similar degeneracy features can also be found in various nonconvex optimization formulations used in phase retrieval [28] and deep learning [34]. In view of this, it is natural to study the local convergence properties of the CR method for solving problems with non-isolated minimizers. Moreover, the nondegeneracy condition (3) seems too stringent for the purpose of ensuring quadratic convergence of the CR method. Indeed, one can observe from (2) that due to the cubic regularization, the CR method is well defined even when the Hessian at hand has nonpositive eigenvalues. In addition, the CR method belongs to the class of regularized Newton-type methods, many of which have been shown to attain a superlinear or quadratic convergence rate even in the presence of non-isolated solutions. For instance, Li et al. [18] considered a regularized Newton’s method for solving the convex case of problem (1). They proved that if satisfies a local error bound condition, which is a weaker requirement than (3), then the whole sequence converges superlinearly or quadratically to an optimal solution. Yue et al. [33] extended this result to a regularized proximal Newton’s method for solving a class of nonsmooth convex minimization problems. Other regularized Newton-type methods that have been shown to attain superlinear or quadratic convergence for problems with non-isolated solutions include, among others, the classic Levenberg-Marquardt (LM) method [31, 13] for nonlinear equations, Newton-type methods for complementarity problems [30], and regularized Gauss-Newton methods for nonlinear least-squares problems [2].
In this paper we establish the quadratic convergence of the CR method under the assumption of the following local error bound condition.
Definition 1 (EB Condition).
We say that satisfies the local error bound (EB) condition if there exist scalars such that
(4) 
where is the set of second-order critical points of and denotes the distance of to .
As we shall see in Section 3, the above local EB condition is a weaker requirement than the nondegeneracy condition (3). We prove that if satisfies the above local EB condition, then the whole sequence generated by the CR method converges at least Q-quadratically to a second-order critical point of . This, together with the pioneering works [15, 24, 10], indicates that adding a cubic regularization not only equips Newton’s method with superior global convergence properties but also enables it to converge quadratically even in the presence of degenerate solutions. We remark that our proof of quadratic convergence is not a direct extension of those from the aforementioned works on regularized Newton-type methods. In particular, a major difficulty in our proof is that the descent direction of the CR method is obtained by minimizing a nonconvex function, as one can see from (2). By contrast, the descent directions of the regularized Newton-type methods in [18, 33, 31, 13, 2] are all obtained by minimizing a strongly convex function. For instance, the LM method for solving the nonlinear equation computes its descent direction by solving the strongly convex optimization problem
(5) 
where is the Jacobian of and is the regularization parameter; see [17, 22]. Consequently, we cannot utilize the nice properties of strongly convex functions in our proof. Instead, we shall exploit the fact that any accumulation point of the sequence generated by the CR method is a second-order critical point of in our analysis. It is also worth noting that our convergence analysis unifies and sharpens those in [24] for the so-called globally nondegenerate star-convex functions and gradient-dominated functions (see Section 2 for the definitions). In particular, we show that when applied to these two classes of functions, the CR method converges quadratically, which improves upon the subquadratic convergence rates established in [24].
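To make the contrast concrete, note that the strongly convex LM subproblem (5) admits a closed-form solution: its unique minimizer solves the normal equations (J^T J + mu I) d = -J^T F(x). The following Python sketch illustrates one LM step; the toy system and the choice mu proportional to the residual norm are illustrative assumptions, not taken from [17, 22].

```python
import numpy as np

def lm_step(F, J, x, mu):
    """One Levenberg-Marquardt step for the nonlinear equation F(x) = 0.

    The direction d minimizes the strongly convex model
    0.5 * ||F(x) + J(x) d||^2 + 0.5 * mu * ||d||^2, whose unique minimizer
    solves the normal equations (J^T J + mu I) d = -J^T F(x).
    """
    Jx, Fx = J(x), F(x)
    d = np.linalg.solve(Jx.T @ Jx + mu * np.eye(x.size), -Jx.T @ Fx)
    return x + d

# Toy system whose solution set {x : x0 * x1 = 0} is non-isolated:
F = lambda x: np.array([x[0] * x[1], x[0] * x[1]])
J = lambda x: np.array([[x[1], x[0]], [x[1], x[0]]])

x = np.array([1.0, 0.5])
for _ in range(20):
    if np.linalg.norm(F(x)) < 1e-12:
        break
    x = lm_step(F, J, x, mu=np.linalg.norm(F(x)))  # mu ~ ||F(x)||, a common LM choice
```

In this toy run the residual contracts rapidly even though the Jacobian is singular on the entire solution set, which is the kind of behavior the error-bound-based analyses of [31, 13] explain.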
Besides our convergence analysis of the CR method, the proposed local EB condition could also be of independent interest. A notable feature of the EB condition (4) is that its target set is the set of second-order critical points of . This contrasts with other EB conditions in the literature, where is typically the set of first-order critical points (see, e.g., [21]) or the set of optimal solutions (see, e.g., [14, 35]). This feature makes our EB condition especially useful for analyzing the local convergence of iterative algorithms that are guaranteed to cluster at second-order critical points. Moreover, we prove that under some mild assumptions, our local EB condition is equivalent to a quadratic growth condition (see Theorem 1 (ii) for the definition). Prior to this work, the equivalence between these two regularity conditions was established when is convex [1] or when is nonconvex but satisfies a certain quadratic decrease condition [12]. Our result indicates that if the target set is the set of second-order critical points, then the equivalence of the two regularity conditions can be established without the need for the aforementioned quadratic decrease condition.
To demonstrate the usefulness and relevance of our convergence analysis, we apply it to study the local convergence behavior of the CR method when applied to minimize two concrete nonconvex functions that arise in phase retrieval and low-rank matrix recovery, respectively. A common feature of these nonconvex functions is that they do not have isolated local minimizers. Motivated by recent advances in probabilistic analysis of the global geometry of these nonconvex functions [28, 4], we show that with overwhelming probability, (i) the set of second-order critical points equals the set of global minimizers and (ii) the local EB condition (4) holds. As a result, our analysis implies that with overwhelming probability, the sequence of iterates generated by the CR method for solving these nonconvex problems converges at least Q-quadratically to a global minimizer. Numerical results of the CR method for solving these two nonconvex problems are also presented, which corroborate our theoretical findings.
The rest of this paper is organized as follows. In Section 2, we review existing results on the convergence properties of the CR method. In Section 3, we study the local EB condition (4) and prove its equivalence to a quadratic growth condition. In Section 4, we prove the quadratic convergence of the CR method under the local EB condition. In Section 5, we study the CR method for solving two concrete nonconvex minimization problems that arise in phase retrieval and low-rank matrix recovery, respectively. In Section 6, we present numerical results of the CR method for solving these two nonconvex problems. Finally, we close with some concluding remarks in Section 7.
1.1 Notations
We adopt the following notations throughout the paper. Let be the dimensional Euclidean space and be its standard inner product. For any vector , we denote by its Euclidean norm. Given any and , we denote by the Euclidean ball with center and radius ; i.e., For any matrix , we denote by and its operator norm and Frobenius norm, respectively. If in addition is symmetric, we write as the eigenvalues of in decreasing order. Moreover, we write if is positive semidefinite. We denote by the set of orthogonal matrices; i.e., for any , where is the identity matrix. For any complex vector , we denote by and its real and imaginary parts, respectively. Moreover, we let be the conjugate of , be the Hermitian transpose of , and be the norm of . For any closed subset , we denote by the distance of to . In addition, we use with some to denote the neighborhood of .
For any , we define . We say that is a second-order critical point of if it satisfies the second-order necessary condition for ; i.e., and . Unless otherwise stated, we use to denote the set of second-order critical points of and to denote the set of global minimizers of . It is clear that . Moreover, since is twice continuously differentiable, both and are closed subsets of . We assume throughout the paper that is nonempty.
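To illustrate the definition, second-order criticality can be checked numerically from the gradient norm and the smallest eigenvalue of the Hessian. The Python sketch below does so for the hypothetical function f(x) = (||x||^2 - 1)^2 / 4 (an illustrative example with non-isolated second-order critical points, not one of the applications studied later).

```python
import numpy as np

def is_second_order_critical(grad, hess, x, tol=1e-8):
    """Check the second-order necessary condition at x: the gradient
    vanishes and the Hessian is positive semidefinite (up to tol)."""
    lam_min = np.linalg.eigvalsh(hess(x))[0]  # eigenvalues in ascending order
    return np.linalg.norm(grad(x)) <= tol and lam_min >= -tol

# Hypothetical example: f(x) = (||x||^2 - 1)^2 / 4, whose second-order
# critical points form the unit sphere. The origin is a stationary point
# but fails the second-order condition, since the Hessian there is -I.
grad = lambda x: (x @ x - 1.0) * x
hess = lambda x: (x @ x - 1.0) * np.eye(x.size) + 2.0 * np.outer(x, x)

print(is_second_order_critical(grad, hess, np.array([0.6, 0.8])))  # True
print(is_second_order_critical(grad, hess, np.zeros(2)))           # False
```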
2 The Cubic Regularization Method
In this section, we review the cubic regularization (CR) method for solving problem (1) and some existing results on its convergence properties.
Given a vector , we define the cubic regularized quadratic approximation of at as
(6) 
where is the regularization parameter. In addition, we define
(7) 
In principle, starting with an initial point , the CR method generates a sequence of iterates by letting for some such that
(8) 
Notice that this requires the computation of , which is a global minimizer of . Although is in general nonconvex, it has been shown in [24] that can be computed by solving a one-dimensional convex optimization problem. Moreover, the optimality condition for the global minimizers of is very similar to that of a standard trust-region subproblem [10, Theorem 3.1]. This observation has led to the development of various efficient algorithms for finding in [10]. More recently, it has been shown in [8] that the gradient descent method can also be applied to find .
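As a rough illustration of this one-dimensional reduction, recall the characterization from [24]: the global minimizer s of the cubic model satisfies (H + (M/2) ||s|| I) s = -g with H + (M/2) ||s|| I positive semidefinite, so it suffices to locate the scalar r = ||s|| by root-finding. The sketch below uses simple bisection and ignores the degenerate "hard case"; it is an illustrative implementation, not the algorithms of [10] or [8].

```python
import numpy as np

def cubic_subproblem(g, H, M):
    """Global minimizer s of m(s) = g^T s + 0.5 s^T H s + (M/6) ||s||^3.

    Characterization (see [24]): s solves (H + (M/2) r I) s = -g with
    r = ||s|| and H + (M/2) r I positive semidefinite, so r lies in
    [max(0, -2 lambda_min(H) / M), infinity) and can be found by bisection
    on the scalar equation ||s(r)|| = r (the hard case is not handled here).
    """
    lam_min = np.linalg.eigvalsh(H)[0]
    lo = max(0.0, -2.0 * lam_min / M) + 1e-12
    hi = max(lo, 1.0)
    s_of = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(len(g)), -g)
    while np.linalg.norm(s_of(hi)) > hi:   # ||s(r)|| - r is decreasing in r
        hi *= 2.0
    for _ in range(200):                   # bisection on ||s(r)|| - r = 0
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(s_of(mid)) > mid else (lo, mid)
    return s_of(hi)

# Illustrative data with an indefinite Hessian:
g = np.array([1.0, 1.0])
H = np.diag([-1.0, 2.0])
s = cubic_subproblem(g, H, M=2.0)
r = np.linalg.norm(s)
# Optimality: (H + (M/2) r I) s = -g, and H + (M/2) r I >= 0 forces r >= 1 here.
```

Note that the model value at s is negative (s = 0 is feasible with value 0), so the returned step always yields a decrease of the cubic model.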
For the global convergence of the CR method, we need the following assumption.
Assumption 1.
The Hessian of the function is Lipschitz continuous on a closed convex set with ; i.e., there exists a constant such that
(9) 
A direct consequence of Assumption 1 is that for any , it holds that whenever (see [24, Lemma 4]). This further implies that for all , we can find a such that (8) holds. Indeed, if the Lipschitz constant is known, we can let . If not, by using a line search strategy that doubles after each trial [24, Section 5.2], we can find a such that (8) holds. We now state the details of the CR method as follows.
Algorithm 1 (The Cubic Regularization Method).

Input an initial point , a scalar , and set .

Find such that
(10) 
Set and , and go to Step 1.
End.
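In Python, Algorithm 1 may be sketched as follows, assuming exact subproblem solutions (via the one-dimensional reformulation of [24], ignoring the hard case) and the doubling line search on the regularization parameter described above. The function in the usage example is hypothetical, chosen because its minimizers form a non-isolated set (the unit sphere).

```python
import numpy as np

def solve_cubic_model(g, H, M):
    # Global minimizer of <g, s> + 0.5 s^T H s + (M/6) ||s||^3 via the
    # one-dimensional reformulation of [24] (hard case not handled):
    # find r = ||s|| with (H + (M/2) r I) s = -g and H + (M/2) r I >= 0.
    lo = max(0.0, -2.0 * np.linalg.eigvalsh(H)[0] / M) + 1e-12
    hi = lo + 1.0
    s = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(len(g)), -g)
    while np.linalg.norm(s(hi)) > hi:      # expand until the sign changes
        hi *= 2.0
    for _ in range(100):                   # bisection on ||s(r)|| - r
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(s(mid)) > mid else (lo, mid)
    return s(hi)

def cubic_regularization(f, grad, hess, x0, M0=1.0, max_iter=50, tol=1e-10):
    """The CR method with a doubling line search on M (cf. Algorithm 1)."""
    x, M = np.asarray(x0, dtype=float), M0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        while True:
            s = solve_cubic_model(g, H, M)
            model_decrease = g @ s + 0.5 * s @ H @ s \
                + (M / 6.0) * np.linalg.norm(s) ** 3
            if f(x + s) <= f(x) + model_decrease:  # acceptance test, cf. (8)
                break
            M *= 2.0                               # double M and retry
        x = x + s
    return x

# Usage on a hypothetical function whose minimizers form the unit sphere:
f = lambda x: (x @ x - 1.0) ** 2 / 4.0
grad = lambda x: (x @ x - 1.0) * x
hess = lambda x: (x @ x - 1.0) * np.eye(x.size) + 2.0 * np.outer(x, x)
x = cubic_regularization(f, grad, hess, x0=[2.0, 1.0])
```

For this example the iterates approach the sphere of minimizers within a handful of outer iterations, consistent with the fast local rate analyzed later in the paper; no single minimizer is isolated, so the classical condition (3) is unavailable.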
The following result, which can be found in [15, Theorem 4.1] and [24, Theorem 2], shows that any accumulation point of the sequence generated by the CR method is a second-order critical point of .
Fact 1.
Suppose that Assumption 1 holds. Let be the sequence of iterates generated by the CR method. If is bounded for some , then the following statements hold.

exists.

.

The sequence has at least one accumulation point. Moreover, every accumulation point of satisfies
We next review some existing results on the local convergence rate of the CR method. We start with the following result, which can be found in [15, Theorem 4.1].
Fact 2.
As discussed in the Introduction, the nondegeneracy condition (11) implies that is an isolated local minimizer of , which does not hold in many applications. In an attempt to overcome this limitation, Nesterov and Polyak [24] considered two classes of functions for which there can be non-isolated second-order critical points and showed that Algorithm 1 converges locally at a superlinear rate when applied to these functions. The first class is the so-called globally nondegenerate star-convex functions.
Definition 2.
We say that is star-convex if for any ,
(12) 
Definition 3.
We say that the optimal solution set of is globally nondegenerate if there exists a scalar such that
(13) 
Fact 3 ([24, Theorem 5]).
Suppose that Assumption 1 holds, is starconvex, and is globally nondegenerate. Then, there exist a scalar and an integer such that
The second class of functions studied in [24] is the so-called gradient-dominated functions.
Definition 4.
We say that is gradient-dominated of degree if there exists a scalar such that
(14) 
It is worth mentioning that the inequality (14) is an instance of the Łojasiewicz inequality, which has featured prominently in the convergence analysis of iterative methods; see, e.g., [19] and the references therein. Indeed, recall that is said to satisfy the Łojasiewicz inequality with exponent at if there exist a scalar and a neighborhood of such that
Hence, the inequality (14) is simply the Łojasiewicz inequality at any global minimizer of with and .
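As a numerical illustration of degree-2 gradient domination (on a hypothetical example, not one from [24]), one can bound the ratio (f(x) - f*) / ||grad f(x)||^2 over sample points. The example below satisfies such an inequality only on a region bounded away from the origin, where the gradient vanishes even though f does not attain its minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: f(x) = (||x||^2 - 1)^2 / 4 with f* = 0, attained on
# the unit sphere. Here f(x) - f* = ||grad f(x)||^2 / (4 ||x||^2), so a
# degree-2 gradient domination inequality holds with tau = 1 on the region
# ||x|| >= 1/2 (but fails near the origin, where grad f vanishes yet f > f*).
f = lambda x: (x @ x - 1.0) ** 2 / 4.0
grad = lambda x: (x @ x - 1.0) * x

ratios = []
for _ in range(1000):
    x = rng.uniform(-2.0, 2.0, size=3)
    g2 = grad(x) @ grad(x)
    if np.linalg.norm(x) >= 0.5 and g2 > 1e-12:
        ratios.append(f(x) / g2)
print(max(ratios))  # bounded by tau = 1 on this region
```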
Fact 4 ([24, Theorem 7]).
Suppose that Assumption 1 holds and is gradient-dominated of degree . Then, there exist a scalar and an integer such that
From the definitions, it is not hard to see that both globally nondegenerate star-convex functions and gradient-dominated functions can be nonconvex and can have non-isolated second-order critical points. Nevertheless, the convergence rates obtained in Facts 3 and 4 are weaker than that in Fact 2 in the following two aspects: (i) only superlinear rates of order and are established for these two classes respectively, while a quadratic rate is achieved in Fact 2; (ii) only the convergence rate of the objective values is proved for these two classes, which is weaker than the convergence rate of the iterates in Fact 2. As we shall see in Section 4, using our analysis approach, the superlinear convergence rates of in Facts 3 and 4 can be improved to the quadratic convergence rate of .
3 Error Bound for the Set of SecondOrder Critical Points
Recall that is the set of second-order critical points of , which is a closed subset of and assumed to be nonempty. In this section, we are interested in the local error bound (EB) condition (4) for , which we repeat here for the reader’s convenience.
Assumption 2 (EB Condition).
There exist scalars such that
(15) 
Assumption 2 is much weaker than the nondegeneracy assumption (11). Indeed, if satisfies condition (11), then it is routine to show that is an isolated second-order critical point and there exist scalars such that whenever . On the other hand, the EB condition (15) can still be satisfied when has no isolated second-order critical points. For instance, it is not hard to verify that , whose set of second-order critical points is , satisfies the EB condition (15). Furthermore, at the end of this section we shall show that both the globally nondegenerate star-convex functions and the gradient-dominated functions considered in Facts 3 and 4 satisfy Assumption 2. In Section 5 we shall show that certain nonconvex functions that arise in phase retrieval and low-rank matrix recovery satisfy Assumption 2 with overwhelming probability.
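As a concrete (hypothetical) illustration of how an EB condition of the form (15) can be verified, consider f(x) = (||x||^2 - 1)^2 / 4, whose set X of second-order critical points is the unit sphere. Then dist(x, X) = | ||x|| - 1 | and ||grad f(x)|| = | ||x||^2 - 1 | ||x||, so the bound dist(x, X) <= 2 ||grad f(x)|| holds near the sphere; the following Python check confirms this on random samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: f(x) = (||x||^2 - 1)^2 / 4. Its set X of second-order
# critical points is the unit sphere, so dist(x, X) = | ||x|| - 1 |, while
# ||grad f(x)|| = | ||x||^2 - 1 | * ||x|| = dist(x, X) * (||x|| + 1) * ||x||.
# Hence dist(x, X) <= kappa * ||grad f(x)|| with kappa = 2 whenever
# ||x|| >= 1/2, i.e., the EB condition holds in a neighborhood of X.
dist_to_X = lambda x: abs(np.linalg.norm(x) - 1.0)
grad_norm = lambda x: abs(x @ x - 1.0) * np.linalg.norm(x)

for _ in range(1000):
    x = rng.normal(size=3)
    x *= rng.uniform(0.5, 1.5) / np.linalg.norm(x)  # place ||x|| in [0.5, 1.5]
    assert dist_to_X(x) <= 2.0 * grad_norm(x) + 1e-12
print("EB condition verified on all samples")
```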
In what follows, we prove that under some mild assumptions, the EB condition (15) is equivalent to a quadratic growth condition. For any , we denote by a projection of onto ; i.e., .
Theorem 1.
Suppose that is uniformly continuous on for some . Also, suppose that satisfies the following separation property: there exists an such that for any with . Then, the following statements are equivalent.

There exist scalars such that
(16) 
There exist scalars such that
(17)
Before presenting the proof, some remarks on the assumptions in Theorem 1 are in order. First, the uniform continuity of on for some holds if is a compact set. Second, the separation property in Theorem 1 has appeared in [21], in which it was referred to as proper separation of isocost surfaces, and has long played a role in the study of error bounds. It holds for many nonconvex functions in applications and holds trivially if is convex.
Proof of Theorem 1.
We first prove . Suppose that (16) holds with some . Since is uniformly continuous on , there exists a scalar such that
(18) 
Let , be arbitrarily chosen, and for . Thus, for any . By (18), we have
This, together with the inequality for any real symmetric matrices and (see, e.g., [3, Corollary III.2.6]), yields
(19) 
where the last inequality is due to . By the integral form of Taylor’s series, we have
This, together with (19), , and , yields
(20) 
Our next goal is to prove that there exists a scalar such that
(21) 
This would then imply that statement (ii) holds. Suppose that (21) does not hold for any . Then, there exist a sequence and a sequence of positive scalars such that and
(22) 
Without loss of generality, we assume that for all . By (22), we have for all . Let . Hence, and for all . Given any , consider the problem
(23)  
Since is feasible for (23) and , we have . Let be an arbitrary feasible point of (23). Then, it follows from (20) that . In addition, since , we have . This, together with the fact that and our assumption in Theorem 1, implies that . Hence, every feasible point of (23) satisfies , which implies that . Thus, we can conclude that . Combining this with (22), we obtain
(24) 
where . Since , there exists a such that is feasible for (23) for any . By this, (23), (24), and Ekeland’s variational principle (see, e.g., [23, Theorem 2.26]), there exists a sequence such that for all , and
(25)  
s.t. 
Since , we have . In addition, noticing that
we obtain . Hence, there exists a such that is in the interior of the feasible set of (25) for all . Consequently, by the generalized Fermat’s rule (see, e.g., [26, Theorem 10.1]), we have
(26) 
Since is continuously differentiable, we obtain from [23, Corollary 1.82] that . In addition, we have
where the equality follows from [23, Corollary 1.111(i)] and the inclusion is due to [26, Example 8.53]. Also, we have . These, together with (26), yield
(27)  
(28)  
(29) 
where (27) and (28) are due to [26, Exercise 10.10]. By (29), we have
(30) 
Moreover, we have for all from (25). This, together with and (16), yields for all . By this, , and (30), we have
which results in for all . This further leads to
By the definitions of and , the above yields
which is a contradiction since and for all . Therefore, there exists a scalar such that (21) holds, which implies that statement (ii) holds.
We next prove . Suppose that (17) holds with some . Since is uniformly continuous on , there exists a scalar such that
(31) 
Let , be arbitrarily chosen, and for . Using the same arguments as those for (19), one has
(32) 
By (17), (32), and the integral form of Taylor’s series, we obtain
Applying the CauchySchwarz inequality and using , the above yields
Therefore, statement (i) holds as well.∎
Remark. When is convex, reduces to the set of optimal solutions to and it is known that the EB condition (16) is equivalent to the quadratic growth condition (17); see, e.g., [1]. When is nonconvex, Drusvyatskiy et al. [12] studied these two regularity conditions for the set of first-order critical points (replacing in both (16) and (17) by the set of first-order critical points) and proved that they are equivalent under an additional quadratic decrease condition; see [12, Theorem 3.1]. Our Theorem 1 is motivated by [12, Theorem 3.1] and shows that for the set of second-order critical points of a twice continuously differentiable function, the EB condition (16) and the quadratic growth condition (17) are equivalent without requiring this additional condition.
Corollary 1.
Proof.
Let be an arbitrary second-order critical point of . By Theorem 1 and Assumption 2, the quadratic growth condition (17) holds for some . Let and be an arbitrary point in . It then follows from (17) that . Moreover, it holds that . By this and the separation property in Theorem 1, we have . Hence, we obtain for all , which implies that is a local minimizer of . ∎
For the rest of this section, we show that the classes of functions considered in Facts 3 and 4 satisfy Assumption 2.
Proposition 1.
Suppose that is star-convex, is globally nondegenerate, and is uniformly continuous on for some . Then, satisfies Assumption 2.
Proof.
We first show that for star-convex functions, the set of second-order critical points equals the set of optimal solutions; i.e., . Since it is clear that , it suffices to show that . Suppose on the contrary that for some . Hence, and for any . By this and (12), we have that for any ,
which contradicts . Hence, we obtain . This, together with our assumption in Proposition 1, implies that is uniformly continuous on for some . Also, since , we have for any , which implies that the separation property in Theorem 1 holds. Moreover, by and the assumption that is globally nondegenerate, statement (ii) of Theorem 1 holds. Hence, statement (i) of Theorem 1 holds as well, which implies that satisfies Assumption 2. ∎
Proposition 2.
Suppose that is gradient-dominated of degree and is uniformly continuous on for some . Then, satisfies Assumption 2.
Proof.
Due to (14), one can see that for any , we have , which immediately implies that . This, together with , yields . It then follows from the same arguments as those in the proof of Proposition 1 that the premise of Theorem 1 holds. Our next goal is to prove
(33) 
Notice that (33) holds trivially for . Let be arbitrarily chosen. Consider the differential equation
(34) 
Since is continuously differentiable on , it is Lipschitz continuous on any compact subset of . It then follows from the Picard-Lindelöf Theorem (see, e.g., [16, Theorem II.1.1]) that there exists a such that (34) has a unique solution