The Nonconvex Geometry of Low-Rank Matrix Optimizations with General Objective Functions

Qiuwei Li and Gongguo Tang
Department of Electrical Engineering and Computer Science,
Colorado School of Mines, Golden, CO 80401
Q. Li and G. Tang were supported by the NSF Grant CCF-1464205.
September 13, 2019
Abstract

This work considers the minimization of a general convex function f(X) over the cone of positive semi-definite matrices whose optimal solution X* is of low rank. Standard first-order convex solvers require performing an eigenvalue decomposition in each iteration, severely limiting their scalability. A natural nonconvex reformulation of the problem factors the variable X into the product of a rectangular matrix with fewer columns and its transpose. For a special class of matrix sensing and completion problems with quadratic objective functions, local search algorithms applied to the factored problem have been shown to be much more efficient and, in spite of being nonconvex, to converge to the global optimum. The purpose of this work is to extend this line of study to general convex objective functions f(X) and to investigate the geometry of the resulting factored formulations. Specifically, we prove that when f(X) satisfies restricted strong convexity and smoothness, each critical point of the factored problem either corresponds to the optimal solution X* or is a strict saddle point where the Hessian matrix has a negative eigenvalue. Such a geometric structure of the factored formulation ensures that many local search algorithms can converge to the global optimum with random initializations.

1 Introduction

Consider a general semi-definite program (SDP) where a convex objective function f(X) is minimized over the cone of positive semi-definite (PSD) matrices:

minimize_{X ∈ ℝ^{n×n}} f(X)  subject to  X ⪰ 0.    (1.1)

For this problem, even fast first-order methods such as the projected gradient descent algorithm [18, 25] require performing an expensive eigenvalue decomposition in each iteration. These expensive operations form the major computational bottleneck of the algorithms and prevent them from scaling to scenarios with millions of variables, a typical situation in a diverse range of applications, including phase retrieval [20, 64], quantum state tomography [1, 33], user preference prediction [29, 53, 57, 65], and pairwise distance estimation in sensor localization [13, 14].

When the SDP (1.1) admits a low-rank solution X*, in their pioneering work [19], Burer and Monteiro proposed to factorize the variable as X = UU^T, where U ∈ ℝ^{n×r} with r ≪ n, and to solve the factored nonconvex problem

minimize_{U ∈ ℝ^{n×r}} g(U) := f(UU^T).    (1.2)

For standard SDPs with a linear objective function and several linear constraints, they also argued that when the factorization is overparameterized, i.e., when the factorization rank r is sufficiently large, any local minimum U of (1.2) corresponds to the solution X* = UU^T, provided some regularity conditions are satisfied. Unfortunately, these regularity conditions are generally hard to verify for specific SDPs arising in applications. Recent work [16] removed these regularity conditions and showed that the factored objective function almost never has any spurious local optima for general linear objective functions. Our work differs in that the convex objective function f is generally not linear and there are no additional linear constraints. In addition to showing the nonexistence of spurious local minima, we also demonstrate that critical points that do not correspond to global optima are strict saddle points, ensuring the global convergence of simple gradient descent algorithms.
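As an illustration of the factored approach (not part of the formal development), the following sketch applies plain gradient descent to g(U) = f(UU^T) for a hypothetical quadratic sensing objective f(X) = (1/2) Σ_i (⟨A_i, X⟩ − b_i)²; the problem sizes, sensing matrices, step size, and iteration count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, num_meas = 30, 3, 500          # illustrative sizes, not from the paper

# Ground-truth low-rank PSD matrix X_star = U_star U_star^T
U_star = rng.standard_normal((n, r))
X_star = U_star @ U_star.T

# Hypothetical quadratic objective f(X) = 0.5 * sum_i (<A_i, X> - b_i)^2
A = rng.standard_normal((num_meas, n, n)) / np.sqrt(num_meas)
A = (A + A.transpose(0, 2, 1)) / 2   # symmetrize each sensing matrix A_i
b = np.einsum('kij,ij->k', A, X_star)

def grad_f(X):
    """Gradient of f at X: sum_i (<A_i, X> - b_i) A_i."""
    residual = np.einsum('kij,ij->k', A, X) - b
    return np.einsum('k,kij->ij', residual, A)

# Gradient descent on g(U) = f(U U^T); since grad_f is symmetric,
# the chain rule gives grad g(U) = 2 * grad_f(U U^T) @ U.
U = rng.standard_normal((n, r))      # random initialization
step = 1e-3
for _ in range(5000):
    U -= step * 2 * grad_f(U @ U.T) @ U

print("relative error:",
      np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))
```

With enough generic measurements this particular f behaves well empirically and the random initialization recovers X*; the formal conditions under which such behavior is guaranteed are the subject of the rest of the paper.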

A special class of optimizations that admit low-rank solutions stems from regularizing matrix inverse problems using the trace or nuclear norm [52], which has found numerous applications in machine learning [34], signal processing [17], and control [45]. The statistical performance of such optimizations in recovering a low-rank matrix has been studied extensively in the literature using convex analysis techniques [23]. For example, trace norm regularization has information-theoretically optimal sampling complexity [24], achieves the minimax denoising rate [21], and satisfies tight oracle inequalities [22]. In spite of its optimal statistical performance, trace norm regularization cannot be scaled to solve the practical problems that originally motivated its development, even with specialized first-order algorithms. This was realized soon after the advent of the field, and the low-rank factorization method was proposed as an alternative to convex solvers [19]. When coupled with stochastic gradient descent, low-rank factorization leads to state-of-the-art performance in practical matrix completion problems [32, 68].

The past two years have seen renewed interest in the Burer-Monteiro factorization for solving trace norm regularized inverse problems. With technical innovations in analyzing the nonconvex landscape of the factored objective function, several recent works have shown that with exact parameterization (i.e., r = rank(X*)) the factored objective function in trace norm regularized matrix inverse problems has no spurious local minima or degenerate saddle points [11, 12, 32, 39, 62]. An important implication is that local search algorithms such as gradient descent and its variants are able to converge to the global optimum even with random initialization [12].

We generalize this line of work by assuming a general objective function f in the optimization (1.1), not necessarily coming from a matrix inverse problem. The generality allows us to view the factored problem (1.2) as a way to solve the convex optimization (1.1) to global optimality, rather than as a new modeling method. This perspective, also taken by Burer and Monteiro in their original work, frees us from rederiving the statistical performance of the factored optimization (1.2). Instead, its performance is inherited from that of the convex optimization (1.1), which can be analyzed using the suite of powerful convex analysis techniques accumulated from several decades of research. As a specific example, the optimal sampling complexity and minimax denoising rate of trace norm regularization need not be rederived once one knows the equivalence between the convex and the factored formulations. In addition, our general analysis technique also sheds light on the connection between the geometries of the convex program (1.1) and its nonconvex counterpart (1.2), as discussed in Section 3.

Our governing assumption on the objective function f is (2r)-restricted m-strong convexity and M-smoothness. More precisely, the Hessian of f satisfies

m · I_{n²} ⪯ ∇²f(X) ⪯ M · I_{n²}    (1.3)

for some positive numbers m and M and any PSD matrix X ∈ ℝ^{n×n} with rank(X) ≤ 2r. Here I_{n²} is an identity matrix of the appropriate size. This assumption is standard in matrix inverse problems [4, 47]. We show that under this assumption, each critical point of the factored objective function g(U) = f(UU^T) either corresponds to the low-rank global solution of the original convex program (1.1) or is a strict saddle point where the Hessian ∇²g(U) has a strictly negative eigenvalue. These results are summarized in the following theorem:

Theorem 1.

Suppose the function f in (1.1) is twice continuously differentiable and satisfies the restricted m-strong convexity and M-smoothness condition (1.3) with positive numbers m and M satisfying

(1.4)

Assume X* is an optimal solution of the minimization (1.1) with rank(X*) = r*. Set r ≥ r* in (1.2). Let U ∈ ℝ^{n×r} be any critical point of g(U) = f(UU^T), i.e., a point satisfying ∇g(U) = 0. Then either U corresponds to a square-root factor of X*, i.e.,

UU^T = X*,    (1.5)

or U is a strict saddle point of the factored problem (1.2). More precisely, let U* ∈ ℝ^{n×r} be such that U*(U*)^T = X* and set Δ = U − U*R with R = argmin_{R' ∈ O_r} ‖U − U*R'‖_F; then the curvature of g along Δ is strictly negative:

(1.6)

with and . This further implies

(1.7)

Several remarks follow. First, the matrix Δ is the direction from the saddle point U to its closest globally optimal factor U*R of the same size as U. Second, we can simplify the expression in (1.7) depending on the relative values of r and r*:

In all these cases, the resulting quantity is strictly positive, implying that U is a strict saddle point. Note that our result covers both over-parameterization, where r > r*, and exact parameterization, where r = r*. Third, we can recover the rank-r* global minimizer X* of (1.1) by running local search algorithms on the factored function g, provided we know an upper bound r ≥ r* on the rank. The strict saddle property ensures that many iterative algorithms, for example, stochastic gradient descent [31], the trust-region method [58, 60], and gradient descent with sufficiently small stepsize [41], all converge to a square-root factor of X*, even with random initialization. Last but not least, our main result relies only on the restricted strong convexity and smoothness property. Therefore, in addition to low-rank matrix recovery problems [67, 22, 37] and phase retrieval [60, 55, 20, 50], it is also applicable to many other low-rank matrix optimization problems with non-quadratic objective functions, including 1-bit matrix completion, also known as logistic PCA [28, 40], robust PCA with complex noise [48, 66], Poisson PCA [54], and other low-rank models with generalized loss functions [63]. For SDPs with additional linear constraints, such as those studied in [19, 16], one can combine the original objective function with a least-squares term that penalizes the deviation from the linear constraints. As long as the penalization parameter is large enough, the solution is equivalent to that of the standard SDP and hence is also covered by our main theorem.
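As a minimal sketch of the last remark (the function names, penalty parameter, and constraint data below are illustrative assumptions, not prescribed by the paper), linear constraints ⟨A_i, X⟩ = b_i can be folded into the factored objective through a quadratic penalty:

```python
import numpy as np

def penalized_factored_grad(U, grad_f, A_list, b, rho):
    """Gradient of h(U) = f(UU^T) + (rho/2) * sum_i (<A_i, UU^T> - b_i)^2,
    assuming each A_i is symmetric and grad_f(X) returns a symmetric matrix."""
    X = U @ U.T
    G = grad_f(X).copy()
    for A_i, b_i in zip(A_list, b):
        G += rho * (np.sum(A_i * X) - b_i) * A_i
    return 2 * G @ U        # chain rule through the factorization X = U U^T
```

Gradient descent can then be run exactly as in the earlier sketch, using penalized_factored_grad in place of the unconstrained factored gradient.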

We end this section by introducing some notation used throughout the paper. Denote by [n] the collection of all positive integers up to n. The symbols I and 0 are reserved for the identity matrix and the zero matrix/vector, respectively, with a subscript indicating the size when it is not clear from context. We call a symmetric matrix X PSD, denoted by X ⪰ 0, if all its eigenvalues are nonnegative. The notation A ⪰ B means A − B ⪰ 0, i.e., A − B is PSD. The set of r × r orthogonal matrices is denoted by O_r. The spectral, nuclear, and Frobenius norms are denoted respectively by ‖·‖, ‖·‖_* and ‖·‖_F.

The gradient of a scalar function f(X) with a matrix variable X ∈ ℝ^{n×n} is an n × n matrix ∇f(X), whose (i, j)-th entry is ∂f(X)/∂X_{ij} for i, j ∈ [n]. Alternatively, we can view the gradient as a linear form [∇f(X)](G) = ⟨∇f(X), G⟩ for any G ∈ ℝ^{n×n}. The Hessian of f(X) can be viewed as a 4th order tensor of size n × n × n × n, whose (i, j, k, l)-th entry is ∂²f(X)/∂X_{ij}∂X_{kl} for i, j, k, l ∈ [n]. Similar to the linear form representation of the gradient, we can view the Hessian as a bilinear form defined via [∇²f(X)](G, H) = Σ_{i,j,k,l} (∂²f(X)/∂X_{ij}∂X_{kl}) G_{ij} H_{kl} for any G, H ∈ ℝ^{n×n}. Yet another way to represent the Hessian is as an n² × n² matrix whose (i, j)-th entry is ∂²f(X)/∂x_i∂x_j for i, j ∈ [n²], where x_i is the i-th entry of the vectorization of X. We will use these representations interchangeably whenever the specific form can be inferred from context. For example, in the strong convexity and smoothness condition (1.3), the Hessian is viewed as an n² × n² matrix and the identity is of size n² × n².
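As a concrete illustration of these interchangeable representations (using the hypothetical quadratic sensing objective from the earlier sketch), the bilinear-form view [∇²f(X)](G, G) = Σ_i ⟨A_i, G⟩² agrees with the n² × n² matrix view Σ_i vec(A_i) vec(A_i)^T applied to vec(G):

```python
import numpy as np

rng = np.random.default_rng(2)
n, num_meas = 8, 40
A = rng.standard_normal((num_meas, n, n))
A = (A + A.transpose(0, 2, 1)) / 2

# n^2 x n^2 matrix representation of the Hessian of
# f(X) = 0.5 * sum_i (<A_i, X> - b_i)^2, namely sum_i vec(A_i) vec(A_i)^T.
vecA = A.reshape(num_meas, n * n)
H = vecA.T @ vecA

G = rng.standard_normal((n, n))
bilinear = np.sum(np.einsum('kij,ij->k', A, G) ** 2)   # bilinear-form view
matrix_view = G.reshape(-1) @ H @ G.reshape(-1)        # n^2 x n^2 matrix view
print(np.isclose(bilinear, matrix_view))               # True
```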

For a matrix-valued function φ(U), it is notationally easier to represent its gradient (or Jacobian) and Hessian as multi-linear operators. For example, the gradient, as a linear operator from the domain of φ to its codomain, is defined via the directional derivative [∇φ(U)](D) = lim_{t→0} (φ(U + tD) − φ(U))/t for any direction D; the Hessian, as a bilinear operator, is defined via [∇²φ(U)](D, D') = ∂²φ(U + tD + sD')/∂t∂s evaluated at t = s = 0 for any directions D, D'. Using this notation, the Hessian of the scalar function f(X) of the previous paragraph, which is also the gradient of ∇f(X), can be viewed as a linear operator from ℝ^{n×n} to ℝ^{n×n}, denoted by ∇²f(X), satisfying ⟨[∇²f(X)](G), H⟩ = [∇²f(X)](G, H) for G, H ∈ ℝ^{n×n}.

2 Problem Formulations and Preliminaries

This paper considers the problem (1.1) of minimizing a convex function f over the PSD cone. Let X* be an optimal solution of (1.1) of rank r*. When the PSD variable X is reparameterized as

X = UU^T,

where U ∈ ℝ^{n×r} with r ≥ r* is a rectangular matrix square root of X, the convex program is transformed into the factored problem (1.2), whose objective function is g(U) = f(UU^T). Inspired by the lifting technique for constructing SDP relaxations, we refer to the variable X as the lifted variable and to the variable U as the factored variable. Similar naming conventions apply to the optimization problems, their domains, and objective functions.

The nonlinear parameterization makes g(U) a nonconvex function and introduces additional critical points (i.e., points with ∇g(U) = 0 that are not global optima of the factored optimization (1.2)). Our goal is to show that the critical points of g either correspond to X* or are strict saddle points where the Hessian ∇²g(U) has a strictly negative eigenvalue.

2.1 Metrics in the Lifted and Factored Spaces

Since UU^T = (UR)(UR)^T for any R ∈ O_r, where O_r denotes the set of r × r orthogonal matrices, the domain of the factored objective function g is stratified into equivalence classes and can be viewed as a quotient manifold [2]. The matrices in each of these equivalence classes differ by an orthogonal transformation (not necessarily unique when the rank of U is less than r). One implication is that, when working in the factored space, we should consider the set of all factorizations of X*, namely the optimal factor set {U*R : R ∈ O_r}, where U* is any fixed n × r matrix satisfying U*(U*)^T = X*.

A second implication is that when considering the distance between two points U and V in the factored space, one should use the distance between their corresponding equivalence classes:

dist(U, V) := min_{R1, R2 ∈ O_r} ‖UR1 − VR2‖_F = min_{R ∈ O_r} ‖U − VR‖_F,    (2.1)

where the second equality follows from the rotation invariance of the Frobenius norm. Under this notation, dist(U, U*) represents the distance between the class containing a critical point U and the optimal factor class {U*R : R ∈ O_r}. The second minimization problem in the definition (2.1) is known as the orthogonal Procrustes problem, whose optimal solution is characterized by the following lemma:

Lemma 1.

[35] An optimal solution of the orthogonal Procrustes problem

R = argmin_{R' ∈ O_r} ‖U − VR'‖_F is given by R = AB^T, where the orthogonal matrices A and B are defined via the singular value decomposition of V^T U = AΣB^T. Moreover, we have U^T(VR) = (VR)^T U ⪰ 0 and ⟨U, VR⟩ = ‖V^T U‖_*.
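As a quick numerical illustration of Lemma 1 (the variable names and sizes below are arbitrary choices of ours), the closed-form rotation obtained from the SVD of V^T U can be checked against the properties stated in the lemma:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 10, 3
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))

# Optimal rotation for min_R ||U - V R||_F via the SVD of V^T U (Lemma 1).
A, s, Bt = np.linalg.svd(V.T @ U)
R = A @ Bt
dist = np.linalg.norm(U - V @ R)

# Sanity checks: U^T V R is symmetric PSD, and <U, V R> equals the nuclear norm of V^T U.
S = U.T @ V @ R
print(np.allclose(S, S.T), np.all(np.linalg.eigvalsh(S) >= -1e-10))
print(np.isclose(np.sum(U * (V @ R)), s.sum()))
```

This is also how the alignment U*R used later in the paper can be computed in practice.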

For any two matrices U, V ∈ ℝ^{n×r}, the following lemma, proved in Section A, relates the distance ‖UU^T − VV^T‖_F in the lifted space to the distance dist(U, V) in the factored space.

Lemma 2.

Assume that U and V have ranks r1 and r2, respectively. Then

where σ_i(·) denotes the i-th largest singular value.

2.2 Consequences of Restricted Strong Convexity

The parameterization X = UU^T introduces nonconvexity into the factored problem (1.2). We are interested in studying how it transforms the landscape of the lifted objective function f(X). We assume that the landscape of f in the lifted space is bowl-shaped, at least along PSD matrices of rank at most 2r, as indicated by the restricted strong convexity and smoothness assumption (1.3). An immediate consequence of this assumption is that if (1.1) has an optimal solution X* with rank(X*) = r* ≤ r, then there is no other optimum of (1.1) with rank less than or equal to r:

Proposition 1.

Suppose the function f is twice continuously differentiable and satisfies the restricted m-strong convexity condition in (1.3). Assume X* is an optimum of the minimization (1.1) with rank(X*) = r* ≤ r. Then X* is the unique global optimum of (1.1) of rank at most r.

Proof.

We prove it by contradiction. Suppose there exists another optimum X′ of (1.1) with rank(X′) ≤ r and X′ ≠ X*. Then the second order Taylor expansion reads

f(X′) = f(X*) + ⟨∇f(X*), X′ − X*⟩ + (1/2)[∇²f(X_t)](X′ − X*, X′ − X*),

where X_t = X* + t(X′ − X*) for some t ∈ [0, 1] and [∇²f(X_t)](X′ − X*, X′ − X*) evaluates the Hessian bilinear form along the direction X′ − X*. The KKT conditions for the convex optimization (1.1) state that ∇f(X*) ⪰ 0 and ∇f(X*)X* = 0, implying that the second term in the above Taylor expansion satisfies ⟨∇f(X*), X′ − X*⟩ = ⟨∇f(X*), X′⟩ ≥ 0 since X′ ⪰ 0 is feasible. Further, since X_t is PSD and has rank(X_t) ≤ 2r, by the restricted m-strong convexity assumption (1.3), we have

(1/2)[∇²f(X_t)](X′ − X*, X′ − X*) ≥ (m/2)‖X′ − X*‖_F².

Combining all, we get

f(X′) ≥ f(X*) + (m/2)‖X′ − X*‖_F² > f(X*)

since X′ ≠ X*, which is a contradiction. ∎

The restricted strong convexity and smoothness assumption (1.3) reduces to the Restricted Isometry Property (RIP) when the objective function f is quadratic. This is apparent from the following equivalent form of the assumption:

(2.2)

that holds for any PSD matrix of rank at most 2r. This further implies a restricted orthogonality property:

()

that again holds for any PSD matrix of rank at most 2r. Similar to the standard RIP, this restricted orthogonality property claims that the Hessian operator ∇²f(X), when evaluated on a restricted set of low-rank matrices, preserves geometric structures.
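As an informal numerical illustration (not part of the formal development), the sketch below probes the Hessian quadratic form of the hypothetical quadratic sensing objective f(X) = (1/2) Σ_i (⟨A_i, X⟩ − b_i)², for which [∇²f(X)](G, G) = Σ_i ⟨A_i, G⟩² independently of X, on random low-rank PSD directions. The extreme Rayleigh quotients serve as empirical proxies for the constants m and M; the sampling scheme and sizes are arbitrary choices for the example, not the paper's definition of the restricted set.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, num_meas = 30, 3, 500
A = rng.standard_normal((num_meas, n, n)) / np.sqrt(num_meas)
A = (A + A.transpose(0, 2, 1)) / 2           # symmetric sensing matrices A_i

def hessian_quadratic_form(G):
    """[Hess f](G, G) = sum_i <A_i, G>^2 for f(X) = 0.5 * sum_i (<A_i, X> - b_i)^2."""
    return np.sum(np.einsum('kij,ij->k', A, G) ** 2)

# Empirical strong convexity / smoothness constants over random PSD
# directions of rank <= 2r (an illustrative proxy for the restricted set).
ratios = []
for _ in range(200):
    B = rng.standard_normal((n, 2 * r))
    G = B @ B.T                              # random PSD direction, rank <= 2r
    ratios.append(hessian_quadratic_form(G) / np.linalg.norm(G, 'fro') ** 2)

print("empirical m ~", min(ratios), " empirical M ~", max(ratios))
```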

3 Transforming the Convex Landscape

Our primary interest is to understand how the landscape of the lifted objective function f(X) is transformed by the factored parameterization X = UU^T, particularly how its global optimum is mapped to the factored space, how other types of critical points are introduced, and what their properties are. Since (1.1) is a constrained convex optimization, all of its critical points are global optima and are characterized by the necessary and sufficient KKT condition [18]:

∇f(X) ⪰ 0,  ∇f(X)X = 0,  X ⪰ 0.    (3.1)

Proposition 1 further shows that, as a consequence of the restricted strong convexity, such a global optimum is unique among all PSD matrices of rank at most r. The factored optimization (1.2) is unconstrained, and its critical points are specified by the zero-gradient condition:

∇g(U) = 2∇f(UU^T)U = 0.    (3.2)

To classify the critical points, we compute the Hessian bilinear form [∇²g(U)](D, D) for a direction D ∈ ℝ^{n×r} as:

[∇²g(U)](D, D) = 2⟨∇f(UU^T), DD^T⟩ + [∇²f(UU^T)](DU^T + UD^T, DU^T + UD^T).    (3.3)

Roughly speaking, the Hessian quadratic form (3.3) has two terms: the first term involves the gradient of f and the Hessian of the parameterization, while the second term involves the Hessian of f and the gradient of the parameterization. Indeed, since the parameterization is φ(U) = UU^T, its gradient is the linear operator [∇φ(U)](D) = DU^T + UD^T and its Hessian bilinear operator applies as [∇²φ(U)](D, D) = 2DD^T. Note that in (3.3) the second quadratic form is always nonnegative since ∇²f(X) ⪰ 0 due to the convexity of f.
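Formula (3.3) can be verified numerically. For the hypothetical quadratic sensing objective used in the earlier sketches, the two-term expression agrees with a second-order finite difference of g along a random direction; all names and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, num_meas = 12, 3, 80
A = rng.standard_normal((num_meas, n, n))
A = (A + A.transpose(0, 2, 1)) / 2
b = rng.standard_normal(num_meas)

f = lambda X: 0.5 * np.sum((np.einsum('kij,ij->k', A, X) - b) ** 2)
grad_f = lambda X: np.einsum('k,kij->ij', np.einsum('kij,ij->k', A, X) - b, A)
hess_f = lambda G: np.sum(np.einsum('kij,ij->k', A, G) ** 2)   # [Hess f](G, G)

g = lambda U: f(U @ U.T)
U, D = rng.standard_normal((n, r)), rng.standard_normal((n, r))

# Two-term formula (3.3) for the Hessian quadratic form of g along D.
formula = 2 * np.sum(grad_f(U @ U.T) * (D @ D.T)) + hess_f(D @ U.T + U @ D.T)

# Second-order central finite difference of t -> g(U + t D) at t = 0.
t = 1e-3
fd = (g(U + t * D) - 2 * g(U) + g(U - t * D)) / t ** 2

print(formula, fd)     # the two numbers agree to several digits
```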

For any critical point U of g, the corresponding lifted variable X = UU^T is PSD and satisfies ∇f(UU^T)U = 0. On one hand, if U further satisfies ∇f(UU^T) ⪰ 0, then in view of the KKT conditions (3.1) and noting ∇f(UU^T)UU^T = 0, we must have UU^T = X*, the global optimum of (1.1). On the other hand, if UU^T ≠ X*, implying that ∇f(UU^T) is not PSD due to the necessity of (3.1), then additional critical points can be introduced into the factored space. Fortunately, the fact that ∇f(UU^T) is not PSD also implies that the first quadratic form in (3.3) might be negative for a properly chosen direction D. To sum up, the critical points of g can be classified into two categories: the global optima in the optimal factor set, with UU^T = X*, and those with UU^T ≠ X*. For the latter case, by choosing a proper direction D, we will argue that the Hessian quadratic form (3.3) takes a strictly negative value, so the Hessian has a strictly negative eigenvalue; moving along D a short distance then decreases the value of g, implying that such critical points are strict saddle points and not local minima.

We argue that a good choice of D is the direction from the current critical point U to its closest point in the optimal factor set. Formally, D = Δ := U − U*R, where R is the optimal rotation for the orthogonal Procrustes problem between U and U*. Plugging D = Δ into the first term of (3.3), we simplify it as

2⟨∇f(UU^T), ΔΔ^T⟩ = 2⟨∇f(UU^T), X* − (U*R)U^T − U(U*R)^T + UU^T⟩ = 2⟨∇f(UU^T), X*⟩ = 2⟨∇f(UU^T), X* − UU^T⟩,    (3.4)

where in the second equality the last three terms, all involving ∇f(UU^T)U, were canceled, and in the last equality the term UU^T was reintroduced, both due to the critical point property ∇f(UU^T)U = 0. To build intuition on why (3.4) is negative while the second term in (3.3) remains small, we consider a simple example: the matrix Principal Component Analysis (PCA) problem.

Example 1.

Matrix PCA Problem. Consider the PCA problem for symmetric PSD matrices:

minimize_{X ⪰ 0} f(X) := (1/2)‖X − M‖_F²,    (3.5)

where M is a symmetric PSD matrix of rank r*. Apparently, the optimal solution is X* = M. Now consider the factored problem:

minimize_{U ∈ ℝ^{n×r}} g(U) = (1/2)‖UU^T − M‖_F²,

where r satisfies r ≥ r*. Our goal is to show that any critical point U such that UU^T ≠ M is a strict saddle point. Since ∇f(X) = X − M, by (3.4), the first term of [∇²g(U)](Δ, Δ) in (3.3) becomes

2⟨∇f(UU^T), M − UU^T⟩ = 2⟨UU^T − M, M − UU^T⟩ = −2‖UU^T − M‖_F²,    (3.6)

which is strictly negative whenever UU^T ≠ M.

The second term of [∇²g(U)](Δ, Δ) is ‖ΔU^T + UΔ^T‖_F² since ∇²f(X) is the identity. We next argue that ΔU^T + UΔ^T = 0. For this purpose, let M = QΛQ^T be the eigenvalue decomposition of M, where Q ∈ ℝ^{n×r*} has orthonormal columns and the diagonal matrix Λ is composed of positive entries. Similarly, let UU^T = Q_U Λ_U Q_U^T be the eigenvalue decomposition of UU^T, where Q_U has orthonormal columns and Λ_U has positive diagonal entries. The critical point U satisfies ∇g(U) = 2(UU^T − M)U = 0, implying (since U and Q_U have the same column space) that

M Q_U = UU^T Q_U = Q_U Λ_U.

This means each column of Q_U, paired with the corresponding diagonal entry of Λ_U, forms an eigenvalue–eigenvector pair of M. Consequently, the columns of Q_U can be taken from the columns of Q with Λ_U the corresponding eigenvalues, and we can write UU^T = Q(ΛS)Q^T. Here S is a diagonal matrix whose entries are equal to either 0 or 1, indicating which of the eigenvalue–eigenvector pairs of M appear in the decomposition of UU^T. Without loss of generality (all quantities of interest are invariant to rotating U from the right), we can choose U = [Q(ΛS)^{1/2}, 0] and U* = [QΛ^{1/2}, 0], where the zero blocks have r − r* columns. Then (U*)^T U = ΛS is diagonal and PSD, so by Lemma 1 we may take R = I, which gives Δ = U − U*R = [−QΛ^{1/2}(I − S), 0]. Plugging these into ΔU^T + UΔ^T and using S(I − S) = 0 gives ΔU^T + UΔ^T = 0.

Hence [∇²g(U)](Δ, Δ) is determined entirely by its first term:

[∇²g(U)](Δ, Δ) = −2‖UU^T − M‖_F²,

which, by Lemma 2, is bounded above by a strictly negative multiple of ‖Δ‖_F². This further implies that the Hessian ∇²g(U) has a strictly negative eigenvalue, so the critical point U is a strict saddle point.

This simple example is ideal in several ways: the gradient ∇f(X) = X − M directly establishes the negativity of the first term in (3.3), and by choosing D = Δ and using the structure of the critical points, the second term vanishes. Neither property remains true for general objective functions f. However, the example does suggest that the direction Δ is a good choice for exhibiting negative curvature, and we will continue to use it in Section 4 to prove Theorem 1.
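The following sketch reproduces this example numerically for a small synthetic M (an arbitrary construction of ours): it builds a critical point from a strict subset of the eigenvectors of M, verifies that the gradient of g vanishes there, and checks that the curvature of g along Δ = U − U*R is strictly negative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 8, 3

# Synthetic symmetric PSD matrix M of rank r with distinct eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
lam = np.array([3.0, 2.0, 1.0])
M = Q @ np.diag(lam) @ Q.T

g = lambda U: 0.5 * np.linalg.norm(U @ U.T - M, 'fro') ** 2
grad_g = lambda U: 2 * (U @ U.T - M) @ U

# A non-optimal critical point: keep only eigenvectors 1 and 3 (drop the 2nd).
S = np.diag([1.0, 0.0, 1.0])
U = Q @ np.diag(np.sqrt(lam)) @ S          # U U^T != M
U_star = Q @ np.diag(np.sqrt(lam))         # a factor of M of the same size
print("gradient norm at U:", np.linalg.norm(grad_g(U)))   # ~ 0

# Procrustes-optimal rotation (Lemma 1) and the direction Delta = U - U_star R.
A, _, Bt = np.linalg.svd(U_star.T @ U)
Delta = U - U_star @ (A @ Bt)

# Curvature of g along Delta via a second-order finite difference.
t = 1e-3
curv = (g(U + t * Delta) - 2 * g(U) + g(U - t * Delta)) / t ** 2
print("curvature along Delta:", curv)      # strictly negative
```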

4 Proof of Theorem 1

Proof Outline. We present a formal proof of Theorem 1 in this section. The main argument shows that each critical point U of g either corresponds to the optimal solution X*, in the sense that UU^T = X*, or is a strict saddle point of g. Being a strict saddle of g means the Hessian matrix ∇²g(U) has at least one strictly negative eigenvalue. Inspired by the discussion in Section 3, we will use the direction Δ = U − U*R and show that the Hessian has a strictly negative curvature along Δ, i.e., [∇²g(U)](Δ, Δ) < 0.

4.1 Supporting Lemmas

We first list two lemmas. Lemma 3 separates the quantity of interest into two parts, one of which involves the projection matrix onto a relevant column space; it is crucial that the first part come with a small coefficient. In Lemma 4, we further control the second part as a consequence of U being a critical point. The proof of Lemma 3 is given in Section B.

Lemma 3.

Let U and V be any two matrices in ℝ^{n×r} such that U^T V is PSD. Assume Q is an orthonormal matrix whose columns span the column space of U. Then

We remark that Lemma 3 is a strengthened version of [12, Lemma 4.4]. While the result there requires (i) U to be a critical point of the factored objective function g, and (ii) V to be an optimal factor that is closest to U, i.e., V = U*R with U*(U*)^T = X* and R the Procrustes-optimal rotation, Lemma 3 removes these assumptions and requires only that U^T V be PSD.

Next, we control the distance between UU^T and the global solution X* of (1.1) when U is a critical point of the factored objective function g, i.e., ∇g(U) = 0. The proof, given in Section C, relies on writing out the critical point equation ∇f(UU^T)U = 0 and applying the restricted strong convexity and smoothness of the Hessian matrix.

Lemma 4 (Upper bound on the distance between UU^T and X*).

Suppose the objective function f in (1.1) is twice continuously differentiable and satisfies the restricted m-strong convexity and M-smoothness condition (1.3). Further, let U be any critical point of (1.2) and Q be an orthonormal basis spanning the column space of U. Then

4.2 The Formal Proof

Now, we are ready to prove the main theorem.

Proof of Theorem 1.

By Section 3, it suffices to find a direction of strictly negative curvature at each critical point U not corresponding to X*. We choose Δ = U − U*R, where U*(U*)^T = X* and R = argmin_{R' ∈ O_r} ‖U − U*R'‖_F is the Procrustes-optimal rotation from Lemma 1. Then according to (3.3), we have

(4.1)

Here (i), as in (3.4), follows from ∇f(UU^T)U = 0 since U is a critical point of g. Next, we bound the remaining terms separately.

Bound the first term.

Here (i) follows from the integral form of the mean value theorem for vector-valued functions (see [49, Eq. (A.57)]), and (ii) follows from the restricted strong convexity assumption (1.3), since the PSD matrix involved has rank at most 2r.

Bound the second term.

This follows from the optimality condition for the convex optimization (1.1) (see, e.g., [18, Section 4.2.3]) and the fact that X* is optimal while UU^T is feasible.

Bound the third term.

This follows from the restricted smoothness assumption (1.3), since the PSD matrices involved have rank at most 2r. Recognizing that U^T(U*R) is PSD by Lemma 1, we invoke Lemma 3 to bound the remaining quantity as

Plugging these bounds into (4.1), we obtain that

where (i) follows from Lemma 4; (ii) holds for m and M satisfying (1.4); and (iii) follows from Lemma 2. As a consequence, we obtain

5 Prior Art and Inspirations

The past few years have seen a surge of interest in nonconvex reformulations of convex optimizations for efficiency and scalability reasons. Several convex optimizations of practical importance in machine learning, signal processing, and statistics can be naturally reformulated into nonconvex forms [39, 12, 11, 20, 37, 61, 44, 58]. Compared with the corresponding convex formulations, the nonconvex forms typically involve far fewer variables, enabling simple algorithms (e.g., gradient descent [41, 31], trust-region methods [60, 58], alternating optimization [6, 15]) to scale to large applications.

Although these nonconvex reformulations have been known to work surprisingly well in practice, fully understanding the theoretical underpinnings of this phenomenon, particularly the geometric structures of these nonconvex optimizations, remains an active research area. The objective functions of convex optimizations have simple landscapes in that local minimizers are always global ones. However, the landscape of a general nonconvex function can be arbitrarily complicated: even certifying the local optimality of a point for general functions is NP-hard [46]. The existence of spurious local minima that are not global optima is a common issue [56, 30]. In addition, degenerate saddle points, or saddle points surrounded by plateaus of small curvature, also prevent local search algorithms from converging quickly to local optima [27].

Fortunately, for a range of convex optimizations, particularly those involving low-rank matrices, the corresponding nonconvex reformulations have nice geometric structures that allow local search algorithms to converge to global optimality [59]. Examples include low-rank matrix factorization, completion, and sensing [39, 12, 11, 61, 67], tensor decomposition and completion [31, 5, 6, 38], structured element pursuit [51, 36], dictionary learning [7, 9, 8, 10, 58, 3], blind deconvolution [43, 42], phase retrieval [60, 20, 26], and many more. Based on whether smart initializations are needed, these previous works can be roughly classified into two categories. In one case, the algorithms require a problem-dependent initialization plus local refinement. A good initialization can lead to global convergence if the initial iterate lies in the attraction basin of the global optimum [39, 11, 61, 5, 6, 38, 9, 8, 10, 3, 20, 26]. For low-rank matrix recovery problems, such initializations can be obtained using spectral methods [11, 39, 61, 67]; for other problems, it is more difficult to find an initial point located in the attraction basin [7, 9, 5]. The second category of works attempts to understand the empirical success of simple algorithms such as gradient descent, which converge to global optimality even with random initialization [41, 12, 31, 60, 58]. This is achieved by analyzing the objective function's landscape and showing that it has no spurious local minima and no degenerate saddle points. Most of the works in the second category concern specific matrix sensing problems with quadratic objective functions. Our work expands this line of geometry-based convergence analysis by considering low-rank matrix optimization problems with general objective functions.

This research draws inspiration from several previous works. In [11], the authors also considered low-rank matrix optimizations with general objective functions. They characterized the local landscape around the global optima, and hence their algorithms require good initializations for global convergence. We instead characterize the global landscape by categorizing all critical points into global optima and strict saddle points. This guarantees that several local search algorithms with random initialization converge to the global optima. Another closely related work is low-rank PSD matrix recovery from linear observations by minimizing the factored quadratic objective function [12]. As discussed in Section 3, low-rank PSD matrix recovery from linear measurements is a special case of our general objective function framework. Furthermore, by relating the first-order optimality condition of the factored problem to the global optimality of the original convex program, our work provides a more transparent relationship between the geometries of these two problems and greatly simplifies the theoretical argument. More recently, the authors of [16] showed that for general SDPs with linear objective functions and linear constraints, the factored problems have no spurious local minimizers. However, they did not characterize the saddle points and did not allow nonlinear objective functions.

6 Conclusion

This work investigates the minimization of a convex function over the cone of PSD matrices. To improve computational efficiency, we focus on a natural factored formulation of the original convex problem which explicitly encodes the PSD constraint. We prove that the factored problem, in spite of being nonconvex, has the following benign landscape: each critical point is either a factor of the global optimal solution to the original convex program, or a strict saddle where the Hessian matrix has a strictly negative eigenvalue. The geometric characterization of the factored objective function guarantees that many local search algorithms applied to the factored objective function converge to a global minimizer with random initializations.

References