On the Quadratic Convergence of the Cubic Regularization Method under a Local Error Bound Condition

Abstract

In this paper we consider the cubic regularization (CR) method for minimizing a twice continuously differentiable function. While the CR method is widely recognized as a globally convergent variant of Newton’s method with superior iteration complexity, existing results on its local quadratic convergence require a stringent non-degeneracy condition. We prove that under a local error bound (EB) condition, which is a much weaker requirement than the existing non-degeneracy condition, the sequence of iterates generated by the CR method converges at least Q-quadratically to a second-order critical point. This indicates that adding a cubic regularization not only equips Newton’s method with remarkable global convergence properties but also enables it to converge quadratically even in the presence of degenerate solutions. As a byproduct, we show that without assuming convexity, the proposed EB condition is equivalent to a quadratic growth condition, which could be of independent interest. To demonstrate the usefulness and relevance of our convergence analysis, we focus on two concrete nonconvex optimization problems that arise in phase retrieval and low-rank matrix recovery, respectively, and prove that with overwhelming probability, the sequence of iterates generated by the CR method for solving these two problems converges at least Q-quadratically to a global minimizer. We also present numerical results of the CR method when applied to these two problems to support and complement our theoretical development.


Keywords: cubic regularization, quadratic convergence, error bound, second-order critical points, non-isolated solutions, phase retrieval, low-rank matrix recovery


AMS subject classifications: 90C26, 90C30, 65K05, 49M15

1 Introduction

Consider the unconstrained minimization problem

(1)  $\min_{x \in \mathbb{R}^n} f(x),$

where $f: \mathbb{R}^n \to \mathbb{R}$ is assumed to be twice continuously differentiable. Newton’s method is widely regarded as an efficient local method for solving problem (1). The cubic regularization (CR) method, which is short for cubic regularized Newton’s method, is a globally convergent variant of Newton’s method. Roughly speaking, given the current iterate $x_k$, the CR method determines the next one by minimizing a cubic regularized quadratic model of $f$ at $x_k$; i.e.,

(2)  $x_{k+1} \in \operatorname*{arg\,min}_{y \in \mathbb{R}^n} \left\{ f(x_k) + \nabla f(x_k)^T (y - x_k) + \tfrac{1}{2}\, (y - x_k)^T \nabla^2 f(x_k)\, (y - x_k) + \tfrac{\sigma_k}{6}\, \|y - x_k\|^3 \right\},$

where the regularization parameter $\sigma_k > 0$ is chosen such that $f(x_{k+1})$ does not exceed the value of the model in (2) at $x_{k+1}$. The idea of using cubic regularization first appeared in Griewank [15], where he proved that any accumulation point of the sequence $\{x_k\}$ generated by the CR method is a second-order critical point of $f$; i.e., a point $\bar{x}$ satisfying $\nabla f(\bar{x}) = 0$ and $\nabla^2 f(\bar{x}) \succeq 0$. Later, Nesterov and Polyak [24] presented the remarkable result that the CR method has a better global iteration complexity bound than that of the steepest descent method. Elaborating on these results, Cartis et al. [10, 11] proposed an adaptive CR method for solving problem (1), in which the regularization parameters $\{\sigma_k\}$ are determined dynamically and the subproblems (2) are solved inexactly. They showed that the proposed method still preserves the good global complexity bound established in [24]. Building on these pioneering works, the CR method has been attracting increasing attention over the past decade; see, e.g., [9, 29, 32] and the references therein.

In addition to these global convergence properties, the CR method, as a modified Newton’s method, is also expected to attain a fast local convergence rate. It is known that if an accumulation point $\bar{x}$ of the sequence $\{x_k\}$ generated by the CR method satisfies

(3)  $\nabla f(\bar{x}) = 0 \quad \text{and} \quad \nabla^2 f(\bar{x}) \succ 0,$

then the whole sequence converges at least Q-quadratically to $\bar{x}$; see [15, Theorem 4.1] or [24, Theorem 3]. Nevertheless, the non-degeneracy condition (3) implies that $\bar{x}$ is an isolated local minimizer of $f$ and hence fails to hold for many nonconvex functions in real-world applications. For example, consider the problem of recovering a positive semidefinite matrix $M^\star \in \mathbb{R}^{n \times n}$ of rank $r \ll n$, given a linear operator $\mathcal{A}: \mathbb{R}^{n \times n} \to \mathbb{R}^m$ and a measurement vector $b = \mathcal{A}(M^\star) \in \mathbb{R}^m$. A practically efficient approach for recovering $M^\star$ is to solve the following nonconvex minimization problem (see, e.g., [4]):

$\min_{U \in \mathbb{R}^{n \times r}} \; f(U) := \tfrac{1}{2} \left\| \mathcal{A}(UU^T) - b \right\|^2.$

Noticing that $f(UQ) = f(U)$ for any $U \in \mathbb{R}^{n \times r}$ and any orthogonal matrix $Q \in \mathcal{O}(r)$, it is not hard to see that $f$ has no isolated local minimizer when $r \ge 2$, which implies that there is no $\bar{U}$ such that $\nabla f(\bar{U}) = 0$ and $\nabla^2 f(\bar{U}) \succ 0$ when $r \ge 2$. Similar degeneracy features can also be found in various nonconvex optimization formulations used in phase retrieval [28] and deep learning [34]. In view of this, it is natural to study the local convergence properties of the CR method for solving problems with non-isolated minimizers. Moreover, the non-degeneracy condition (3) seems too stringent for the purpose of ensuring quadratic convergence of the CR method. Indeed, one can observe from (2) that due to the cubic regularization, the CR method is well defined even when the Hessian at hand has non-positive eigenvalues. In addition, the CR method belongs to the class of regularized Newton-type methods, many of which have been shown to attain a superlinear or quadratic convergence rate even in the presence of non-isolated solutions. For instance, Li et al. [18] considered a regularized Newton’s method for solving the convex case of problem (1). They proved that if $f$ satisfies a local error bound condition, which is a weaker requirement than (3), then the whole sequence of iterates converges superlinearly or quadratically to an optimal solution. Yue et al. [33] extended this result to a regularized proximal Newton’s method for solving a class of nonsmooth convex minimization problems. Other regularized Newton-type methods that have been shown to attain superlinear or quadratic convergence for problems with non-isolated solutions include, among others, the classic Levenberg-Marquardt (LM) method [31, 13] for nonlinear equations, Newton-type methods for complementarity problems [30], and regularized Gauss-Newton methods for nonlinear least squares [2].
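To make this degeneracy concrete, the following Python sketch (with hypothetical problem sizes and a random Gaussian measurement operator; it illustrates the orbit invariance only and is not the experimental setup of Section 6) verifies numerically that $f$ is constant along the orbit $\{UQ : Q \in \mathcal{O}(r)\}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 20, 3, 100                      # hypothetical problem sizes

# Random Gaussian measurement operator A(X) = (<A_1, X>, ..., <A_m, X>)
A = rng.standard_normal((m, n, n))
U_star = rng.standard_normal((n, r))      # ground-truth factor
b = np.einsum('kij,ij->k', A, U_star @ U_star.T)

def f(U):
    """Recovery objective f(U) = 0.5 * ||A(U U^T) - b||^2."""
    residual = np.einsum('kij,ij->k', A, U @ U.T) - b
    return 0.5 * residual @ residual

# f is invariant along the orbit {U Q : Q orthogonal}, since (UQ)(UQ)^T = UU^T.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
U = rng.standard_normal((n, r))
print(abs(f(U) - f(U @ Q)))               # ~0: no isolated minimizers for r >= 2
```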

In this paper we establish the quadratic convergence of the CR method under the assumption of the following local error bound condition.

Definition 1 (EB Condition).

We say that $f$ satisfies the local error bound (EB) condition if there exist scalars $\kappa, \rho > 0$ such that

(4)  $\operatorname{dist}(x, \mathcal{X}) \le \kappa\, \|\nabla f(x)\| \quad \text{whenever } \operatorname{dist}(x, \mathcal{X}) \le \rho,$

where $\mathcal{X}$ is the set of second-order critical points of $f$ and $\operatorname{dist}(x, \mathcal{X})$ denotes the distance of $x$ to $\mathcal{X}$.

As we shall see in Section 3, the above local EB condition is a weaker requirement than the non-degeneracy condition (3). We prove that if $f$ satisfies the above local EB condition, then the whole sequence $\{x_k\}$ generated by the CR method converges at least Q-quadratically to a second-order critical point of $f$. This, together with the pioneering works [15, 24, 10], indicates that adding a cubic regularization not only equips Newton’s method with superior global convergence properties but also enables it to converge quadratically even in the presence of degenerate solutions. We remark that our proof of quadratic convergence is not a direct extension of those in the aforementioned works on regularized Newton-type methods. In particular, a major difficulty in our proof is that the descent direction of the CR method is obtained by minimizing a nonconvex function, as one can see from (2). By contrast, the descent directions of the regularized Newton-type methods in [18, 33, 31, 13, 2] are all obtained by minimizing a strongly convex function. For instance, the LM method for solving the nonlinear equation $F(x) = 0$ computes its descent direction by solving the strongly convex optimization problem

(5)  $\min_{d \in \mathbb{R}^n} \; \|F(x_k) + J(x_k)\, d\|^2 + \mu_k \|d\|^2,$

where $J$ is the Jacobian of $F$ and $\mu_k > 0$ is the regularization parameter; see [17, 22]. Consequently, we cannot utilize the nice properties of strongly convex functions in our proof. Instead, we shall exploit in our analysis the fact that any accumulation point of the sequence generated by the CR method is a second-order critical point of $f$. It is also worth noting that our convergence analysis unifies and sharpens those in [24] for the so-called globally non-degenerate star-convex functions and gradient-dominated functions (see Section 2 for the definitions). In particular, we show that when applied to these two classes of functions, the CR method converges quadratically, which improves upon the sub-quadratic convergence rates established in [24].
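For illustration, the following minimal sketch carries out one LM iteration by solving the normal equations of (5). The function name, the toy system, and the rule $\mu_k = \|F(x_k)\|$ are our assumptions for the sketch (the latter is one common choice in the LM literature), not necessarily those of [31, 13]:

```python
import numpy as np

def lm_step(F, J, x, mu):
    """One Levenberg-Marquardt step for F(x) = 0: the subproblem (5) is
    strongly convex for mu > 0, so its unique minimizer solves the normal
    equations (J^T J + mu I) d = -J^T F."""
    Jx, Fx = J(x), F(x)
    d = np.linalg.solve(Jx.T @ Jx + mu * np.eye(x.size), -Jx.T @ Fx)
    return x + d

# Toy system: x1^2 + x2^2 = 1 and x1 = x2 (solution: both equal to 1/sqrt(2)).
F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
J = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

x = np.array([2.0, 0.5])
for _ in range(10):
    x = lm_step(F, J, x, mu=np.linalg.norm(F(x)))   # mu_k = ||F(x_k)||
print(x)                                            # ~ [0.7071, 0.7071]
```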

Besides our convergence analysis of the CR method, the proposed local EB condition could also be of independent interest. A notable feature of the EB condition (4) is that its target set $\mathcal{X}$ is the set of second-order critical points of $f$. This contrasts with other EB conditions in the literature, where the target set is typically the set of first-order critical points (see, e.g., [21]) or the set of optimal solutions (see, e.g., [14, 35]). This feature makes our EB condition especially useful for analyzing the local convergence of iterative algorithms that are guaranteed to cluster at second-order critical points. Moreover, we prove that under some mild assumptions, our local EB condition is equivalent to a quadratic growth condition (see Theorem 1 (ii) for the definition). Prior to this work, the equivalence between these two regularity conditions was established when $f$ is convex [1] or when $f$ is nonconvex but satisfies a certain quadratic decrease condition [12]. Our result indicates that if the target set is the set of second-order critical points, then the equivalence of the two regularity conditions can be established without the need for the aforementioned quadratic decrease condition.

To demonstrate the usefulness and relevance of our convergence analysis, we use it to study the local convergence behavior of the CR method when applied to minimize two concrete nonconvex functions that arise in phase retrieval and low-rank matrix recovery, respectively. A common feature of these nonconvex functions is that they do not have isolated local minimizers. Motivated by recent advances in the probabilistic analysis of the global geometry of these nonconvex functions [28, 4], we show that with overwhelming probability, (i) the set of second-order critical points equals the set of global minimizers and (ii) the local EB condition (4) holds. As a result, our analysis implies that with overwhelming probability, the sequence of iterates generated by the CR method for solving these nonconvex problems converges at least Q-quadratically to a global minimizer. Numerical results of the CR method for solving these two nonconvex problems are also presented, which corroborate our theoretical findings.

The rest of this paper is organized as follows. In Section 2, we review existing results on the convergence properties of the CR method. In Section 3, we study the local EB condition (4) and prove its equivalence to a quadratic growth condition. In Section 4, we prove the quadratic convergence of the CR method under the local EB condition. In Section 5, we study the CR method for solving two concrete nonconvex minimization problems that arise in phase retrieval and low-rank matrix recovery, respectively. In Section 6, we present numerical results of the CR method for solving these two nonconvex problems. Finally, we close with some concluding remarks in Section 7.

1.1 Notations

We adopt the following notations throughout the paper. Let $\mathbb{R}^n$ be the $n$-dimensional Euclidean space and $\langle \cdot, \cdot \rangle$ be its standard inner product. For any vector $x \in \mathbb{R}^n$, we denote by $\|x\|$ its Euclidean norm. Given any $x \in \mathbb{R}^n$ and $\rho > 0$, we denote by $B(x, \rho)$ the Euclidean ball with center $x$ and radius $\rho$; i.e., $B(x, \rho) = \{ y \in \mathbb{R}^n : \|y - x\| \le \rho \}$. For any matrix $A$, we denote by $\|A\|$ and $\|A\|_F$ its operator norm and Frobenius norm, respectively. If in addition $A$ is symmetric, we write $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A)$ as the eigenvalues of $A$ in decreasing order. Moreover, we write $A \succeq 0$ if $A$ is positive semidefinite. We denote by $\mathcal{O}(r)$ the set of $r \times r$ orthogonal matrices; i.e., $Q^T Q = I_r$ for any $Q \in \mathcal{O}(r)$, where $I_r$ is the $r \times r$ identity matrix. For any complex vector $z \in \mathbb{C}^n$, we denote by $\Re(z)$ and $\Im(z)$ its real and imaginary parts, respectively. Moreover, we let $\bar{z}$ be the conjugate of $z$, $z^*$ be the Hermitian transpose of $z$, and $\|z\|$ be the norm of $z$. For any closed subset $\mathcal{S} \subseteq \mathbb{R}^n$, we denote by $\operatorname{dist}(x, \mathcal{S})$ the distance of $x$ to $\mathcal{S}$. In addition, we use $N(\mathcal{S}, \delta) := \{ x \in \mathbb{R}^n : \operatorname{dist}(x, \mathcal{S}) \le \delta \}$ with some $\delta > 0$ to denote the $\delta$-neighborhood of $\mathcal{S}$.

We say that $x \in \mathbb{R}^n$ is a second-order critical point of $f$ if it satisfies the second-order necessary condition for problem (1); i.e., $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$. Unless otherwise stated, we use $\mathcal{X}$ to denote the set of second-order critical points of $f$ and $\mathcal{X}^*$ to denote the set of global minimizers of $f$. It is clear that $\mathcal{X}^* \subseteq \mathcal{X}$. Moreover, since $f$ is twice continuously differentiable, both $\mathcal{X}$ and $\mathcal{X}^*$ are closed subsets of $\mathbb{R}^n$. We assume throughout the paper that $\mathcal{X}$ is non-empty.

2 The Cubic Regularization Method

In this section, we review the cubic regularization (CR) method for solving problem (1) and some existing results on its convergence properties.

Given a vector $x \in \mathbb{R}^n$, we define the cubic regularized quadratic approximation of $f$ at $x$ as

(6)  $\tilde{f}(y; x, \sigma) := f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}\, (y - x)^T \nabla^2 f(x)\, (y - x) + \tfrac{\sigma}{6}\, \|y - x\|^3,$

where $\sigma > 0$ is the regularization parameter. In addition, we define

(7)  $x^+(x, \sigma) \in \operatorname*{arg\,min}_{y \in \mathbb{R}^n} \tilde{f}(y; x, \sigma).$

In principle, starting with an initial point $x_0 \in \mathbb{R}^n$, the CR method generates a sequence of iterates $\{x_k\}$ by letting $x_{k+1} = x^+(x_k, \sigma_k)$ for some $\sigma_k > 0$ such that

(8)  $f(x^+(x_k, \sigma_k)) \le \tilde{f}(x^+(x_k, \sigma_k); x_k, \sigma_k).$

Notice that this requires the computation of $x^+(x, \sigma)$, which is a global minimizer of $\tilde{f}(\cdot\,; x, \sigma)$. Although $\tilde{f}(\cdot\,; x, \sigma)$ is in general nonconvex, it has been shown in [24] that $x^+(x, \sigma)$ can be computed by solving a one-dimensional convex optimization problem. Moreover, the optimality condition for the global minimizers of $\tilde{f}(\cdot\,; x, \sigma)$ is very similar to that of a standard trust-region subproblem [10, Theorem 3.1]. This observation has led to the development of various efficient algorithms for finding $x^+(x, \sigma)$ in [10]. More recently, it was shown in [8] that the gradient descent method can also be applied to find $x^+(x, \sigma)$.
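To make the computation of $x^+(x, \sigma)$ concrete, the following Python sketch implements the reduction to a one-dimensional problem just described: any global minimizer $d = y - x$ satisfies $(\nabla^2 f(x) + \tfrac{\sigma}{2}\|d\|\, I)\, d = -\nabla f(x)$ with $\nabla^2 f(x) + \tfrac{\sigma}{2}\|d\|\, I \succeq 0$, so one can bisect on the scalar $r = \|d\|$. This is a minimal dense-matrix sketch (the function name and tolerances are ours, and the degenerate "hard case" is ignored), not one of the algorithms of [10] or the gradient descent scheme of [8]:

```python
import numpy as np

def cr_subproblem(g, H, sigma, tol=1e-12):
    """Globally minimize g^T d + 0.5 d^T H d + (sigma/6) ||d||^3 over d.

    A global minimizer d satisfies (H + (sigma/2) r I) d = -g with r = ||d||
    and H + (sigma/2) r I positive semidefinite, so it suffices to bisect on
    the scalar r; ||d(r)|| is decreasing in r. The degenerate "hard case"
    (g orthogonal to the eigenspace of lambda_min(H)) is ignored here."""
    d_of = lambda r: np.linalg.solve(H + 0.5 * sigma * r * np.eye(len(g)), -g)
    r_lo = max(0.0, -2.0 * np.linalg.eigvalsh(H)[0] / sigma)  # keep shift PSD
    r_hi = r_lo + 1.0
    while np.linalg.norm(d_of(r_hi)) > r_hi:                  # bracket the root
        r_hi *= 2.0
    while r_hi - r_lo > tol * max(1.0, r_hi):
        r = 0.5 * (r_lo + r_hi)
        if np.linalg.norm(d_of(r)) > r:
            r_lo = r
        else:
            r_hi = r
    return d_of(r_hi)
```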

For the global convergence of the CR method, we need the following assumption.

Assumption 1.

The Hessian of the function $f$ is Lipschitz continuous on a closed convex set $S$ with $\{ x \in \mathbb{R}^n : f(x) \le f(x_0) \} \subseteq S$; i.e., there exists a constant $L > 0$ such that

(9)  $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L\, \|x - y\| \quad \forall\, x, y \in S.$

A direct consequence of Assumption 1 is that for any $x, y \in S$, it holds that $f(y) \le \tilde{f}(y; x, \sigma)$ whenever $\sigma \ge L$ (see [24, Lemma 4]). This further implies that for all $k \ge 0$, we can find a $\sigma_k$ such that (8) holds. Indeed, if the Lipschitz constant $L$ is known, we can let $\sigma_k = L$. If not, by using a line search strategy that doubles $\sigma_k$ after each failed trial [24, Section 5.2], we can still find a $\sigma_k$ such that (8) holds. We now state the details of the CR method as follows.

Algorithm 1 (The Cubic Regularization Method).

  • Step 0. Input an initial point $x_0 \in \mathbb{R}^n$, a scalar $\sigma_0 > 0$, and set $k = 0$.

  • Step 1. Find a $\sigma_k > 0$ such that

    (10)  $f(x^+(x_k, \sigma_k)) \le \tilde{f}(x^+(x_k, \sigma_k); x_k, \sigma_k).$

  • Step 2. Set $x_{k+1} = x^+(x_k, \sigma_k)$ and $k \leftarrow k + 1$, and go to Step 1.

End.
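For concreteness, here is a minimal Python sketch of Algorithm 1 that builds on the cr_subproblem sketch above. The stopping rule and the initial $\sigma_0$ are our choices for the sketch; the doubling update follows the line search strategy of [24, Section 5.2]:

```python
import numpy as np

def cubic_regularization(f, grad, hess, x0, sigma0=1.0, tol=1e-8, max_iter=100):
    """Sketch of Algorithm 1: accept the model minimizer once (10) holds,
    doubling sigma_k after each failed trial as in [24, Section 5.2]."""
    x, sigma = np.asarray(x0, dtype=float), sigma0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        # Stop at an approximate second-order critical point.
        if np.linalg.norm(g) <= tol and np.linalg.eigvalsh(H)[0] >= -tol:
            break
        while True:
            d = cr_subproblem(g, H, sigma)
            model = f(x) + g @ d + 0.5 * d @ (H @ d) \
                    + sigma / 6.0 * np.linalg.norm(d) ** 3
            if f(x + d) <= model:      # condition (10): the model majorizes f
                break
            sigma *= 2.0               # line search on the regularization parameter
        x = x + d
    return x
```

Once $\sigma_k \ge L$, condition (10) is guaranteed by [24, Lemma 4] (cf. the discussion before Algorithm 1), so the inner loop terminates after finitely many doublings.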

The following result, which can be found in [15, Theorem 4.1] and [24, Theorem 2], shows that any accumulation point of the sequence generated by the CR method is a second-order critical point of .

Fact 1.

Suppose that Assumption 1 holds. Let $\{x_k\}$ be the sequence of iterates generated by the CR method. If the level set $\{ x \in \mathbb{R}^n : f(x) \le f(x_{k_0}) \}$ is bounded for some $k_0 \ge 0$, then the following statements hold.

  1. $\lim_{k \to \infty} f(x_k)$ exists.

  2. $\lim_{k \to \infty} \|x_{k+1} - x_k\| = 0$.

  3. The sequence $\{x_k\}$ has at least one accumulation point. Moreover, every accumulation point $\bar{x}$ of $\{x_k\}$ satisfies $\nabla f(\bar{x}) = 0$ and $\nabla^2 f(\bar{x}) \succeq 0$.

We next review some existing results on the local convergence rate of the CR method. We start with the following result, which can be found in [15, Theorem 4.1].

Fact 2.

Suppose that Assumption 1 holds. Let $\{x_k\}$ be the sequence generated by Algorithm 1 for solving problem (1). If an accumulation point $\bar{x}$ of $\{x_k\}$ satisfies

(11)  $\nabla f(\bar{x}) = 0 \quad \text{and} \quad \nabla^2 f(\bar{x}) \succ 0,$

then the whole sequence converges at least Q-quadratically to $\bar{x}$.

As discussed in the Introduction, the non-degeneracy condition (11) implies that $\bar{x}$ is an isolated local minimizer of $f$, which fails to hold in many applications. In an attempt to overcome this limitation, Nesterov and Polyak [24] considered two classes of functions that can have non-isolated second-order critical points and showed that Algorithm 1 converges locally superlinearly when applied to these functions. The first class is the so-called globally non-degenerate star-convex functions.

Definition 2.

We say that $f$ is star-convex if for any $x^* \in \mathcal{X}^*$, any $x \in \mathbb{R}^n$, and any $\alpha \in [0, 1]$,

(12)  $f(\alpha x^* + (1 - \alpha) x) \le \alpha f(x^*) + (1 - \alpha) f(x).$

Definition 3.

We say that the optimal solution set $\mathcal{X}^*$ of $f$ is globally non-degenerate if there exists a scalar $\gamma > 0$ such that

(13)  $f(x) - f^* \ge \tfrac{\gamma}{2}\, \operatorname{dist}^2(x, \mathcal{X}^*) \quad \forall\, x \in \mathbb{R}^n,$

where $f^*$ denotes the optimal value of problem (1).
Fact 3 ([24, Theorem 5]).

Suppose that Assumption 1 holds, $f$ is star-convex, and $\mathcal{X}^*$ is globally non-degenerate. Then, there exist a scalar $c > 0$ and an integer $k_0 \ge 0$ such that $f(x_{k+1}) - f^* \le c\, (f(x_k) - f^*)^{3/2}$ for all $k \ge k_0$.

The second class of functions studied in [24] is the so-called gradient-dominated functions.

Definition 4.

We say that $f$ is gradient-dominated of degree $2$ if there exists a scalar $\tau > 0$ such that

(14)  $f(x) - f^* \le \tau\, \|\nabla f(x)\|^2 \quad \forall\, x \in \mathbb{R}^n.$

It is worth mentioning that the inequality (14) is an instance of the Łojasiewicz inequality, which has featured prominently in the convergence analysis of iterative methods; see, e.g., [19] and the references therein. Indeed, recall that $f$ is said to satisfy the Łojasiewicz inequality with exponent $\theta \in (0, 1)$ at $\bar{x}$ if there exist a scalar $c > 0$ and a neighborhood $U$ of $\bar{x}$ such that

$|f(x) - f(\bar{x})|^{\theta} \le c\, \|\nabla f(x)\| \quad \forall\, x \in U.$

Hence, rearranging (14) as $(f(x) - f^*)^{1/2} \le \sqrt{\tau}\, \|\nabla f(x)\|$ shows that it is simply the Łojasiewicz inequality at any global minimizer of $f$ with $\theta = 1/2$ and $U = \mathbb{R}^n$.

Fact 4 ([24, Theorem 7]).

Suppose that Assumption 1 holds and $f$ is gradient-dominated of degree $2$. Then, there exist a scalar $c > 0$ and an integer $k_0 \ge 0$ such that $f(x_{k+1}) - f^* \le c\, (f(x_k) - f^*)^{4/3}$ for all $k \ge k_0$.

From the definitions, it is not hard to see that both globally non-degenerate star-convex functions and gradient-dominated functions can be nonconvex and can have non-isolated second-order critical points. Nevertheless, the convergence rates obtained in Facts 3 and 4 are weaker than that in Fact 2 in the following two aspects: (i) only superlinear rates of order $3/2$ and $4/3$ are established for these two classes, respectively, while a quadratic rate is achieved in Fact 2; (ii) only the convergence rate of the objective values $\{f(x_k)\}$ is proved for these two classes, which is weaker than the convergence rate of the iterates $\{x_k\}$ in Fact 2. As we shall see in Section 4, using our analysis approach, the superlinear convergence rates of $\{f(x_k)\}$ in Facts 3 and 4 can be improved to the quadratic convergence rate of $\{x_k\}$.
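To illustrate this behavior numerically, consider the function $f(x) = \tfrac{1}{4}(\|x\|^2 - 1)^2$, whose second-order critical points form the unit sphere and hence are non-isolated, so condition (11) fails everywhere (this example is revisited in Section 3). The following sketch reuses cr_subproblem from Section 2 with an arbitrary starting point and a fixed $\sigma$ exceeding the Lipschitz constant of the Hessian on the region of interest; the distance $\operatorname{dist}(x_k, \mathcal{X})$ is observed to decay quadratically, consistent with the analysis in Section 4:

```python
import numpy as np

# Degenerate example: f(x) = 0.25 * (||x||^2 - 1)^2, whose second-order
# critical points form the unit sphere, so condition (11) fails everywhere.
grad = lambda x: (x @ x - 1.0) * x
hess = lambda x: (x @ x - 1.0) * np.eye(x.size) + 2.0 * np.outer(x, x)

x = np.array([1.7, -0.6, 0.4])
for k in range(6):
    print(k, abs(np.linalg.norm(x) - 1.0))   # dist(x_k, X) shrinks ~quadratically
    x = x + cr_subproblem(grad(x), hess(x), sigma=12.0)
```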

3 Error Bound for the Set of Second-Order Critical Points

Recall that $\mathcal{X}$ is the set of second-order critical points of $f$, which is a closed subset of $\mathbb{R}^n$ and assumed to be non-empty. In this section, we are interested in the local error bound (EB) condition (4) for $\mathcal{X}$, which we restate here for the convenience of the reader.

Assumption 2 (EB Condition).

There exist scalars $\kappa, \rho > 0$ such that

(15)  $\operatorname{dist}(x, \mathcal{X}) \le \kappa\, \|\nabla f(x)\| \quad \text{whenever } \operatorname{dist}(x, \mathcal{X}) \le \rho.$

Assumption 2 is much weaker than the non-degeneracy assumption (11). Indeed, if $\bar{x}$ satisfies condition (11), then it is routine to show that $\bar{x}$ is an isolated second-order critical point and that there exist scalars $\kappa, \rho > 0$ such that $\operatorname{dist}(x, \mathcal{X}) \le \kappa\, \|\nabla f(x)\|$ whenever $x \in B(\bar{x}, \rho)$. On the other hand, the EB condition (15) can still be satisfied when $f$ has no isolated second-order critical points. For instance, it is not hard to verify that the function $f(x) = \tfrac{1}{4}(\|x\|^2 - 1)^2$, whose set of second-order critical points is the unit sphere $\{ x \in \mathbb{R}^n : \|x\| = 1 \}$, satisfies the EB condition (15). Furthermore, at the end of this section we shall show that both the globally non-degenerate star-convex functions and the gradient-dominated functions considered in Facts 3 and 4 satisfy Assumption 2. In Section 5 we shall show that certain nonconvex functions that arise in phase retrieval and low-rank matrix recovery satisfy Assumption 2 with overwhelming probability.
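As a quick numerical sanity check of this example (a sketch; the sampling scheme and thresholds are arbitrary choices of ours), one can estimate the ratio $\operatorname{dist}(x, \mathcal{X}) / \|\nabla f(x)\|$ over points near the unit sphere and observe that it stays bounded:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = lambda x: (x @ x - 1.0) * x        # gradient of 0.25 * (||x||^2 - 1)^2

ratios = []
for _ in range(10000):
    x = rng.standard_normal(5)
    x *= rng.uniform(0.9, 1.1) / np.linalg.norm(x)   # dist(x, X) <= 0.1
    g = np.linalg.norm(grad(x))
    if g > 0.0:
        ratios.append(abs(np.linalg.norm(x) - 1.0) / g)
print(max(ratios))                        # ~0.58, so e.g. kappa = 0.6, rho = 0.1
```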

In what follows, we prove that under some mild assumptions, the EB condition (15) is equivalent to a quadratic growth condition. For any $x \in \mathbb{R}^n$, we denote by $\Pi(x)$ a projection of $x$ onto $\mathcal{X}$; i.e., $\Pi(x) \in \mathcal{X}$ and $\|x - \Pi(x)\| = \operatorname{dist}(x, \mathcal{X})$.

Theorem 1.

Suppose that $\nabla^2 f$ is uniformly continuous on $N(\mathcal{X}, \delta)$ for some $\delta > 0$. Also, suppose that $f$ satisfies the following separation property: there exists a scalar $\epsilon > 0$ such that $f(x) = f(y)$ for any $x, y \in \mathcal{X}$ with $\|x - y\| \le \epsilon$. Then, the following statements are equivalent.

  1. There exist scalars $\kappa, \rho > 0$ such that

    (16)  $\operatorname{dist}(x, \mathcal{X}) \le \kappa\, \|\nabla f(x)\| \quad \text{whenever } \operatorname{dist}(x, \mathcal{X}) \le \rho.$

  2. There exist scalars $\alpha, \beta > 0$ such that

    (17)  $f(x) - f(\Pi(x)) \ge \alpha\, \operatorname{dist}^2(x, \mathcal{X}) \quad \text{whenever } \operatorname{dist}(x, \mathcal{X}) \le \beta.$

Before presenting the proof, some remarks on the assumptions in Theorem 1 are in order. First, the uniform continuity of $\nabla^2 f$ on $N(\mathcal{X}, \delta)$ for some $\delta > 0$ holds if $\mathcal{X}$ is a compact set. Second, the separation property in Theorem 1 has appeared in [21], in which it was referred to as proper separation of isocost surfaces, and has long played a role in the study of error bounds. It holds for many nonconvex functions in applications and holds trivially if $f$ is convex.

Proof of Theorem 1.

We first prove that statement (i) implies statement (ii). Suppose that (16) holds with some $\kappa, \rho > 0$. Since $\nabla^2 f$ is uniformly continuous on $N(\mathcal{X}, \delta)$, there exists a scalar $\delta_1 > 0$ such that

(18)

Let , be arbitrarily chosen, and for . Thus, for any . By (18), we have

This, together with Weyl’s inequality $|\lambda_i(A) - \lambda_i(B)| \le \|A - B\|$ for any real symmetric matrices $A$ and $B$ (see, e.g., [3, Corollary III.2.6]), yields

(19)

where the last inequality is due to . By the integral form of Taylor’s series, we have

This, together with (19), , and , yields

(20)

Our next goal is to prove that there exists a scalar such that

(21)

This would then imply that statement (ii) holds. Suppose that (21) does not hold for any . Then, there exist a sequence and a sequence of positive scalars such that and

(22)

Without loss of generality, we assume that for all . By (22), we have for all . Let . Hence, and for all . Given any , consider the problem

(23)

Since is feasible for (23) and , we have . Let be an arbitrary feasible point of (23). Then, it follows from (20) that . In addition, since , we have . This, together with the fact that and our assumption in Theorem 1, implies that . Hence, every feasible point of (23) satisfies , which implies that . Thus, we can conclude that . Combining this with (22), we obtain

(24)

where . Since , there exists a such that is feasible for (23) for any . By this, (23), (24), and Ekeland’s variational principle (see, e.g., [23, Theorem 2.26]), there exists a sequence such that for all , and

(25)
s.t.

Since , we have . In addition, noticing that

we obtain . Hence, there exists a such that is in the interior of the feasible set of (25) for all . Consequently, by the generalized Fermat’s rule (see, e.g., [26, Theorem 10.1]), we have

(26)

Since is continuously differentiable, we obtain from [23, Corollary 1.82] that . In addition, we have

where the equality follows from [23, Corollary 1.111(i)] and the inclusion is due to [26, Example 8.53]. Also, we have . These, together with (26), yield

(27)
(28)
(29)

where (27) and (28) are due to [26, Exercise 10.10]. By (29), we have

(30)

Moreover, we have for all from (25). This, together with and (16), yields for all . By this, , and (30), we have

which results in for all . This further leads to

By the definitions of and , the above yields

which is a contradiction since and for all . Therefore, there exists a scalar such that (21) holds, which implies that statement (ii) holds.

We next prove that statement (ii) implies statement (i). Suppose that (17) holds with some $\alpha, \beta > 0$. Since $\nabla^2 f$ is uniformly continuous on $N(\mathcal{X}, \delta)$, there exists a scalar $\delta_2 > 0$ such that

(31)

Let , be arbitrarily chosen, and for . Using the same arguments as those for (19), one has

(32)

By (17), (32), and the integral form of Taylor’s theorem, we obtain

Applying the Cauchy-Schwarz inequality and using , the above yields

Therefore, statement (i) holds as well.∎

Remark. When $f$ is convex, $\mathcal{X}$ reduces to the set of optimal solutions to (1), and it is known that the EB condition (16) is equivalent to the quadratic growth condition (17); see, e.g., [1]. When $f$ is nonconvex, Drusvyatskiy et al. [12] studied these two regularity conditions for the set of first-order critical points (i.e., with $\mathcal{X}$ in both (16) and (17) replaced by the set of first-order critical points) and proved that they are equivalent under an additional quadratic decrease condition; see [12, Theorem 3.1]. Our Theorem 1 is motivated by [12, Theorem 3.1] and shows that for the set of second-order critical points of a twice continuously differentiable function, the EB condition (16) and the quadratic growth condition (17) are equivalent without requiring the said additional condition.

Corollary 1.

Suppose that Assumption 2 and the premise of Theorem 1 hold. Then, any second-order critical point of $f$ is a local minimizer.

Proof.

Let $\bar{x}$ be an arbitrary second-order critical point of $f$. By Theorem 1 and Assumption 2, the quadratic growth condition (17) holds with some $\alpha, \beta > 0$. Let $\rho' = \min\{\beta, \epsilon/2\}$ and let $x$ be an arbitrary point in $B(\bar{x}, \rho')$. It then follows from (17) that $f(x) \ge f(\Pi(x))$. Moreover, it holds that $\|\Pi(x) - \bar{x}\| \le \|\Pi(x) - x\| + \|x - \bar{x}\| \le 2\|x - \bar{x}\| \le \epsilon$. By this and the separation property in Theorem 1, we have $f(\Pi(x)) = f(\bar{x})$. Hence, we obtain $f(x) \ge f(\bar{x})$ for all $x \in B(\bar{x}, \rho')$, which implies that $\bar{x}$ is a local minimizer of $f$. ∎

For the rest of this section, we show that the classes of functions considered in Facts 3 and 4 satisfy Assumption 2.

Proposition 1.

Suppose that $f$ is star-convex, $\mathcal{X}^*$ is globally non-degenerate, and $\nabla^2 f$ is uniformly continuous on $N(\mathcal{X}^*, \delta)$ for some $\delta > 0$. Then, $f$ satisfies Assumption 2.

Proof.

We first show that for star-convex functions, the set of second-order critical points equals the set of optimal solutions; i.e., $\mathcal{X} = \mathcal{X}^*$. Since it is clear that $\mathcal{X}^* \subseteq \mathcal{X}$, it suffices to show that $\mathcal{X} \subseteq \mathcal{X}^*$. Suppose on the contrary that $\bar{x} \in \mathcal{X} \setminus \mathcal{X}^*$. Hence, $\nabla f(\bar{x}) = 0$, $\nabla^2 f(\bar{x}) \succeq 0$, and $f(\bar{x}) > f(x^*)$ for any $x^* \in \mathcal{X}^*$. By this and (12), we have that for any $\alpha \in (0, 1]$,

$f(\bar{x} + \alpha (x^* - \bar{x})) - f(\bar{x}) \le \alpha\, (f(x^*) - f(\bar{x})) < 0,$

which contradicts $\nabla f(\bar{x}) = 0$ and $\nabla^2 f(\bar{x}) \succeq 0$, since these imply that the left-hand side is at least of order $o(\alpha^2)$ as $\alpha \downarrow 0$. Hence, we obtain $\mathcal{X} = \mathcal{X}^*$. This, together with our assumption in Proposition 1, implies that $\nabla^2 f$ is uniformly continuous on $N(\mathcal{X}, \delta)$ for some $\delta > 0$. Also, since $\mathcal{X} = \mathcal{X}^*$, we have $f(x) = f(y)$ for any $x, y \in \mathcal{X}$, which implies that the separation property in Theorem 1 holds. Moreover, by $\mathcal{X} = \mathcal{X}^*$ and the assumption that $\mathcal{X}^*$ is globally non-degenerate, statement (ii) of Theorem 1 holds. Hence, statement (i) of Theorem 1 holds as well, which implies that $f$ satisfies Assumption 2. ∎

Proposition 2.

Suppose that $f$ is gradient-dominated of degree $2$ and $\nabla^2 f$ is uniformly continuous on $N(\mathcal{X}^*, \delta)$ for some $\delta > 0$. Then, $f$ satisfies Assumption 2.

Proof.

Due to (14), one can see that for any $x \in \mathbb{R}^n$ with $\nabla f(x) = 0$, we have $f(x) = f^*$, which immediately implies that $x \in \mathcal{X}^*$. This, together with $\mathcal{X}^* \subseteq \mathcal{X}$, yields $\mathcal{X} = \mathcal{X}^*$. It then follows from the same arguments as those in the proof of Proposition 1 that the premise of Theorem 1 holds. Our next goal is to prove

(33)  $f(x) - f^* \ge \tfrac{1}{4\tau}\, \operatorname{dist}^2(x, \mathcal{X}^*) \quad \forall\, x \in \mathbb{R}^n.$

Notice that (33) holds trivially for $x \in \mathcal{X}^*$. Let $\bar{x} \in \mathbb{R}^n \setminus \mathcal{X}^*$ be arbitrarily chosen. Consider the differential equation

(34)  $\dot{x}(t) = -\nabla f(x(t)), \quad x(0) = \bar{x}.$

Since $\nabla f$ is continuously differentiable on $\mathbb{R}^n$, it is Lipschitz continuous on any compact subset of $\mathbb{R}^n$. It then follows from the Picard-Lindelöf Theorem (see, e.g., [16, Theorem II.1.1]) that there exists a $T > 0$ such that (34) has a unique solution