Dropping Convexity for Faster Semi-definite Optimization
We study the minimization of a convex function $f(X)$ over the set of $n \times n$ positive semi-definite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \le n$. We study the performance of gradient descent on $g$, which we refer to as Factored Gradient Descent (Fgd), under standard assumptions on the original function $f$.
We provide a rule for selecting the step size and, with this choice, show that the local convergence rate of Fgd mirrors that of standard gradient descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. In addition, we provide a procedure to initialize Fgd for (restricted) strongly convex objectives and when one only has access to $f$ via a first-order oracle; for several problem instances, such proper initialization leads to global convergence guarantees.
Fgd and similar procedures are widely used in practice for problems that can be posed as matrix factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.
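Consider the following standard convex semi-definite optimization problem:

$$\min_{X \in \mathbb{R}^{n \times n}} f(X) \quad \text{subject to} \quad X \succeq 0, \tag{1}$$

where $f : \mathbb{S}_+^n \to \mathbb{R}$ is a convex and differentiable function, and $\mathbb{S}_+^n$ denotes the convex set of positive semi-definite matrices in $\mathbb{R}^{n \times n}$. Let $X^\star$ be an optimum of (1) with $\text{rank}(X^\star) = r^\star \le n$. This problem can be remodeled as a non-convex problem by writing $X = UU^\top$, where $U$ is an $n \times r$ matrix. Specifically, define $g(U) := f(UU^\top)$ and consider direct optimization of the transformed problem, i.e.,

$$\min_{U \in \mathbb{R}^{n \times r}} f(UU^\top), \quad \text{where } r \le n. \tag{2}$$

[Footnote 1: While $g$ is a non-convex function, we note that it is a very specific kind of non-convexity, arising “only” due to the recasting of an originally convex function.]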
Problems (1) and (2) will have the same optimum when $r = r^\star$. However, the recast problem is unconstrained and leads to computational gains in practice: e.g., iterative update schemes, like gradient descent, do not need to perform eigen-decompositions to satisfy semi-definite constraints at every iteration.
In this paper, we also consider the case of $r < r^\star$, which often occurs in applications. The reasons for such a choice could be three-fold: (i) it might better model an underlying task (e.g., $f$ may have arisen from a relaxation of a rank constraint in the first place), (ii) it leads to computational gains, since a smaller $r$ means fewer variables to maintain and optimize, and (iii) it leads to statistical “gains”, as it might prevent over-fitting in machine learning or inference problems.
Such recasting of matrix optimization problems is empirically widely popular, especially as the size of problem instances increases. Some applications in modern machine learning include matrix completion [21, 42, 49, 22], affine rank minimization [65, 41, 8], covariance / inverse covariance selection [38, 50], phase retrieval [64, 18, 77, 68], Euclidean distance matrix completion, finding the square root of a PSD matrix, and sparse PCA, just to name a few. Typically, one can solve (2) via simple, first-order methods on $U$ like gradient descent. Unfortunately, such procedures have no guarantees on convergence to the optima of the original $f$, or on the rate thereof. Our goal in this paper is to provide such analytical guarantees by using, simply and transparently, standard convexity properties of the original $f$.
Overview of our results. In this paper, we prove that updating $U$ via gradient descent in (2) converges (fast) to optimal (or near-optimal) solutions. While there are some recent and very interesting works that consider using such non-convex parametrization [42, 64, 71, 82, 69, 81], their results only apply to specific examples. To the best of our knowledge, this is the first paper that solves the re-parametrized problem with attractive convergence rate guarantees for general convex functions $f$ and under common convex assumptions. Moreover, we achieve the above by assuming the first order oracle model: for any matrix $X$, we can only obtain the value $f(X)$ and the gradient $\nabla f(X)$.
To achieve the desiderata, we study how gradient descent over $U$ performs in solving (2). This leads to the factored gradient descent (Fgd) algorithm, which applies the simple update rule

$$U^{+} = U - \eta \nabla f(UU^\top) \cdot U.$$
We provide a set of sufficient conditions to guarantee convergence. We show that, given a suitable initialization point, Fgd converges to a solution close to the optimal point at a sublinear or linear rate, depending on the nature of $f$.
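As a minimal sketch of this update rule (not the authors' implementation), the following Python snippet runs Fgd on a hypothetical least-squares objective $f(X) = \frac{1}{2}\|X - Y\|_F^2$. The names `fgd` and `grad_f`, the target `Y`, the starting point and the iteration count are illustrative assumptions. The step size uses the form $1/(16(M\|X^0\|_2 + \|\nabla f(X^0)\|_2))$, which is how we read the rule of Section 3, with $M = 1$ for this particular $f$.

```python
import numpy as np

def fgd(grad_f, U0, eta, num_iters):
    """Factored gradient descent: repeat U <- U - eta * grad_f(U U^T) @ U."""
    U = U0.copy()
    for _ in range(num_iters):
        U = U - eta * grad_f(U @ U.T) @ U
    return U

# Hypothetical test problem: f(X) = 0.5 * ||X - Y||_F^2 with a rank-2 PSD target Y.
rng = np.random.default_rng(0)
n, r = 6, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal columns
U_star = Q * np.array([2.0, 1.0])                  # Y has eigenvalues 4 and 1
Y = U_star @ U_star.T

def grad_f(X):
    return X - Y                                   # gradient of f with respect to X

# Start near the optimum, as the local analysis assumes.
U0 = U_star + 0.01 * rng.standard_normal((n, r))
X0 = U0 @ U0.T
M = 1.0                                            # smoothness constant of this f
eta = 1.0 / (16.0 * (M * np.linalg.norm(X0, 2) + np.linalg.norm(grad_f(X0), 2)))

U_hat = fgd(grad_f, U0, eta, num_iters=2000)
err_start = np.linalg.norm(X0 - Y)
err_end = np.linalg.norm(U_hat @ U_hat.T - Y)
print(f"error in X: {err_start:.3e} -> {err_end:.3e}")
```

Note that no eigen-decomposition or projection appears inside the loop; each iteration costs one gradient evaluation and a few matrix products.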
Our contributions in this work can be summarized as follows:
New step size rule and Fgd. Our main algorithmic contribution is a special choice of the step size $\eta$. Our analysis shows that $\eta$ needs to depend not only on the convexity parameters of $f$ (as is the case in standard convex optimization) but also on the top singular value of the unknown optimum. Section 3 describes the precise step size rule, and also the intuition behind it. Of course, the optimum is not known a priori. As a solution in practice, we show that choosing $\eta$ based on a point that is at constant relative distance from the optimum also provably works.
Convergence of Fgd under common convex assumptions. We consider two cases: when $f$ is just an $M$-smooth convex function, and when $f$ also satisfies restricted strong convexity (RSC), i.e., $f$ satisfies strong-convexity-like conditions, but only over low rank matrices; see the next section for definitions. Both cases are based on now-standard notions, common in the analysis of convex optimization algorithms. Given a good initial point, we show that, when $f$ is $M$-smooth, Fgd converges sublinearly to an optimal point $X^\star$. For the case where $f$ has RSC, Fgd converges linearly to the unique $X^\star$, matching analogous results for classic gradient descent schemes under smoothness and strong convexity assumptions.
Furthermore, for the case of smooth and strongly convex $f$, our analysis extends to the case $r < r^\star$, where Fgd converges to a point close to the best rank-$r$ approximation of $X^\star$. [Footnote 2: In this case, we require $\|X^\star - X^\star_r\|_F$ to be small enough, such that the rank-constrained optimum is close to the best rank-$r$ approximation of $X^\star$. This assumption naturally applies in applications where, e.g., $X^\star$ is a superposition of a low rank latent matrix plus a small perturbation term [43, 78]. In Section I, we show how this assumption can be dropped by using a different step size $\eta$, where spectral norm computation of two $n \times r$ matrices is required per iteration.]
Both results hold when Fgd is initialized at a point with constant relative distance from the optimum. Interestingly, the linear convergence rate factor depends not only on the convexity parameters of $f$, but also on the spectral characteristics of the optimum; a phenomenon borne out in our experiments. Section 4 formally states these results.
Initialization: For specific problem settings, various initialization schemes are possible (see [42, 64, 23]). In this paper, we extend such results to the case where we only have access to $f$ via the first-order oracle: specifically, we initialize based on the gradient at zero, i.e., $\nabla f(0)$. We show that, for certain condition numbers of $f$, this yields a constant relative error initialization (Section 5). Moreover, Section 5 lists alternative procedures that lead to good initialization points and comply with our theory.
The rest of the paper is organized as follows. Section 2 contains basic notation and standard convex definitions. Section 3 presents the Fgd algorithm and the step size used, along with some intuition for its selection. Section 4 contains the convergence guarantees of Fgd; the main supporting lemmas and proofs of the main theorems are provided in Section 6. In Section 5, we discuss some initialization procedures that guarantee a “decent” starting point for Fgd. This paper concludes with discussion on related work (Section 7).
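Notation. For matrices $X, Y \in \mathbb{R}^{n \times n}$, their inner product is $\langle X, Y \rangle = \text{Tr}(X^\top Y)$. Also, $X \succeq 0$ denotes that $X$ is a positive semi-definite (PSD) matrix, while the convex set of PSD matrices is denoted $\mathbb{S}_+^n$. We use $\|X\|_F$ and $\|X\|_2$ for the Frobenius and spectral norms of a matrix, respectively. Given a matrix $X$, we use $\sigma_{\min}(X)$ and $\sigma_{\max}(X)$ to denote the smallest and largest strictly positive singular values of $X$ and define $\tau(X) = \sigma_{\max}(X) / \sigma_{\min}(X)$; with a slight abuse of notation, we also use $\sigma_1(X) \equiv \sigma_{\max}(X) \equiv \|X\|_2$. $X_r$ denotes the rank-$r$ approximation of $X$ via its truncated singular value decomposition. Let $\tau(X^\star_r) = \sigma_1(X^\star) / \sigma_r(X^\star)$ denote the condition number of $X^\star_r$; again, observe $\sigma_r(X^\star_r) \equiv \sigma_{\min}(X^\star_r)$. $Q_A$ denotes the basis of the column space of matrix $A$. $\text{srank}(X)$ represents the stable rank of matrix $X$. We use $e_i \in \mathbb{R}^n$ to denote the standard basis vector with 1 at the $i$-th position and zeros elsewhere.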
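Without loss of generality, $f$ is a symmetric convex function, i.e., $f(X) = f(X^\top)$. Let $\nabla f(X)$ denote the gradient matrix, i.e., its $(i, j)$-th element is $[\nabla f(X)]_{ij} = \partial f(X) / \partial x_{ij}$. For $X = UU^\top$, the gradient of $f$ with respect to $U$ is $(\nabla f(UU^\top) + \nabla f(UU^\top)^\top) U = 2 \nabla f(X) \cdot U$, due to symmetry of $f$. Finally, let $X^\star$ be the optimum of $f(X)$ over $\mathbb{S}_+^n$ with factorization $X^\star = U^\star (U^\star)^\top$.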
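For any general symmetric matrix $X$, let the matrix $\mathcal{P}_+(X)$ be its projection onto the set of PSD matrices. This can be done by finding all the strictly positive eigenvalues and corresponding eigenvectors $(\lambda_i, v_i : \lambda_i > 0)$ and then forming $\mathcal{P}_+(X) = \sum_{i : \lambda_i > 0} \lambda_i v_i v_i^\top$.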
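The projection just described can be sketched in a few lines of NumPy. The function name `project_psd` and the test matrix are illustrative assumptions; the input is symmetrized first as a safeguard against round-off.

```python
import numpy as np

def project_psd(X):
    """Project a symmetric matrix onto the PSD cone by keeping positive eigenpairs."""
    X = 0.5 * (X + X.T)                      # guard against small asymmetries
    eigvals, eigvecs = np.linalg.eigh(X)     # eigen-decomposition of a symmetric matrix
    keep = eigvals > 0                       # strictly positive eigenvalues only
    return (eigvecs[:, keep] * eigvals[keep]) @ eigvecs[:, keep].T

A = np.array([[2.0, 0.0],
              [0.0, -3.0]])
P = project_psd(A)
print(P)
```

This is exactly the per-iteration eigen-decomposition that projected gradient descent needs and that Fgd avoids.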
In algorithmic descriptions, $U$ and $U^+$ denote the putative solution of the current and next iteration, respectively. An important issue in optimizing $f$ over the $U$ space is the existence of non-unique possible factorizations $UU^\top$ for any feasible point $X$. To see this, given a factorization $X = UU^\top$ where $U \in \mathbb{R}^{n \times r}$, one can define a class of equivalent factorizations $U R^\top R U^\top = UU^\top$, where $R$ belongs to the set $\{R \in \mathbb{R}^{r \times r} : R^\top R = I\}$ of rotational matrices. So we use a rotation invariant distance metric in the factored space that is equivalent to distance in the matrix $X$ space, which is defined below.
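Let matrices $U, V \in \mathbb{R}^{n \times r}$. Define:

$$\text{Dist}(U, V) := \min_{R \in \mathcal{O}} \|U - VR\|_F.$$

$\mathcal{O}$ is the set of $r \times r$ orthonormal matrices $R$, such that $R^\top R = I_{r \times r}$. The optimal $R$ satisfies $R = PQ^\top$, where $P \Sigma Q^\top$ is the singular value decomposition of $V^\top U$.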
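A short sketch of this rotation-invariant distance, using the SVD-based minimizer (the classical orthogonal Procrustes solution). The function name `dist` and the test matrices are illustrative assumptions.

```python
import numpy as np

def dist(U, V):
    """Return (min_R ||U - V R||_F, R) over orthonormal R, via the SVD of V^T U."""
    P, _, Qt = np.linalg.svd(V.T @ U)
    R = P @ Qt                                # optimal rotation
    return np.linalg.norm(U - V @ R), R

rng = np.random.default_rng(1)
V = rng.standard_normal((5, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
U = V @ rot                                   # same X = U U^T = V V^T, different factor

d, R = dist(U, V)
print(d)                                      # numerically zero
```

Here `U` and `V` differ in Frobenius norm but represent the same matrix $X$, so the distance between them vanishes.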
Assumptions. We will investigate the performance of non-convex gradient descent for functions $f$ that satisfy standard smoothness conditions only, as well as the case where $f$ is further (restricted) strongly convex. We state these standard definitions below.
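Let $f : \mathbb{S}_+^n \to \mathbb{R}$ be convex and differentiable. Then, $f$ is $m$-strongly convex if:

$$f(Y) \ge f(X) + \langle \nabla f(X), Y - X \rangle + \tfrac{m}{2} \|Y - X\|_F^2, \quad \forall X, Y \in \mathbb{S}_+^n.$$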
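Let $f : \mathbb{S}_+^n \to \mathbb{R}$ be a convex differentiable function. Then, $f$ is $M$-smooth if:

$$\|\nabla f(X) - \nabla f(Y)\|_F \le M \|X - Y\|_F, \quad \forall X, Y \in \mathbb{S}_+^n.$$

This further implies the following upper bound:

$$f(Y) \le f(X) + \langle \nabla f(X), Y - X \rangle + \tfrac{M}{2} \|Y - X\|_F^2.$$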
Given the above definitions, we define $\kappa = M / m$ as the condition number of function $f$.
Finally, in high dimensional settings, the loss function $f$ often does not satisfy strong convexity globally, but only on a restricted set of directions; see [59, 2] and Section F for a more detailed discussion.
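A convex function $f$ is $(m, r)$-restricted strongly convex if:

$$f(Y) \ge f(X) + \langle \nabla f(X), Y - X \rangle + \tfrac{m}{2} \|Y - X\|_F^2, \quad \text{for any rank-}r \text{ matrices } X, Y \in \mathbb{S}_+^n.$$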
3 Factored gradient descent
We solve the non-convex problem (2) via Factored Gradient Descent (Fgd) with update rule:

$$U^{+} = U - \eta \nabla f(UU^\top) \cdot U.$$

[Footnote 3: The true gradient of $f$ with respect to $U$ is $2 \nabla f(UU^\top) \cdot U$. However, for simplicity and clarity of exposition, in our algorithm and its theoretical guarantees, we absorb the 2-factor in the step size $\eta$.]
Fgd is plain gradient descent on $U$, but with two key innovations: a careful initialization and a special step size $\eta$. The discussion on the initialization is deferred until Section 5.
Step size $\eta$.
Even though $f$ is a convex function over $X \succeq 0$, the fact that we operate with the non-convex $UU^\top$ parametrization means that we need to be careful about the step size $\eta$; e.g., our constant $\eta$ selection should be such that, when we are close to $X^\star$, we do not “overshoot” the optimum $X^\star$.
In this work, we pick the step size parameter according to the following closed form:

$$\eta = \frac{1}{16 \left( M \|X^0\|_2 + \|\nabla f(X^0)\|_2 \right)}. \tag{8}$$

[Footnote 4: The constant 16 in expression (8) appears due to our analysis, where we do not optimize over the constants. One can use another constant in order to be more aggressive; nevertheless, we observed that our setting works well in practice.]
Recall that, if we were just doing standard gradient descent on $f$, we would choose a step size of $1/M$, where $M$ is a uniform upper bound on the largest eigenvalue of the Hessian $\nabla^2 f(\cdot)$.
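To motivate our step size selection, let us consider a simple setting where $U \in \mathbb{R}^{n \times r}$ with $r = 1$; i.e., $U$ is a vector. For clarity, denote it as $u$. Let $f$ be a separable function with $f(X) = \sum_{ij} f_{ij}(X_{ij})$. Furthermore, define the function $g : \mathbb{R}^n \to \mathbb{R}$ such that $f(uu^\top) \equiv g(u)$. It is easy to compute (see Lemma E.1):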
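where $\text{mat}$, $\text{vec}$ and $\text{diag}$ are the matricization, vectorization and diagonalization operations, respectively; for the last case, $\text{diag}$ generates a diagonal matrix from the input, discarding its off-diagonal elements. We remind that $\nabla f(\cdot) \in \mathbb{R}^{n \times n}$ and $\nabla^2 f(\cdot) \in \mathbb{R}^{n^2 \times n^2}$. Note also that $\nabla^2 f(X)$ is diagonal for separable $f$.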
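Standard convex optimization suggests that $\eta$ should be chosen such that $\eta < 1 / \|\nabla^2 g(\cdot)\|_2$. The above suggests the following step size selection rule for $M$-smooth $f$:

$$\eta < \frac{1}{\|\nabla^2 g(\cdot)\|_2} \propto \frac{1}{M \|X\|_2 + \|\nabla f(X)\|_2}.$$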
In stark contrast with classic convex optimization where $\eta \propto 1/M$, the step size selection further depends on the spectral information of the current iterate and the gradient. Since computing $\|X\|_2$ and $\|\nabla f(X)\|_2$ per iteration could be computationally inefficient, we use the spectral norm of $X^0$ and its gradient $\nabla f(X^0)$ as a surrogate, where $X^0$ is the initialization point. [Footnote 5: However, as we show in Section I, one could compute $\|X\|_2$ and $\|\nabla f(X)\|_2$ per iteration in order to relax some of the requirements of our approach.]
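To illustrate how such a rule behaves, the sketch below evaluates a step size of the form $1/(c\,(M\|X^0\|_2 + \|\nabla f(X^0)\|_2))$ at initial points of growing spectral norm. The helper name `fgd_step_size`, the default constant `c = 16`, and the objective $f(X) = \frac{1}{2}\|X - I\|_F^2$ (for which $M = 1$) are assumptions made for illustration.

```python
import numpy as np

def fgd_step_size(X0, grad_X0, M, c=16.0):
    """eta = 1 / (c * (M * ||X0||_2 + ||grad f(X0)||_2)); c = 16 is an assumed constant."""
    return 1.0 / (c * (M * np.linalg.norm(X0, 2) + np.linalg.norm(grad_X0, 2)))

# Hypothetical objective f(X) = 0.5 * ||X - Y||_F^2 with Y = I, so grad f(X) = X - Y.
Y = np.eye(3)
M = 1.0
etas = []
for scale in (1.0, 10.0, 100.0):
    X0 = scale * np.eye(3)
    eta = fgd_step_size(X0, X0 - Y, M)
    etas.append(eta)
    print(f"||X0||_2 = {scale:6.1f}  ->  eta = {eta:.3e}   (compare 1/M = {1.0 / M:.1f})")
```

Unlike the constant $1/M$ of standard gradient descent, the factored step size shrinks as the scale of the initial point grows.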
To clarify the $\eta$ selection further, we next describe a toy example that illustrates the necessity of such a scaling of the step size. Consider the following minimization problem.
where —and thus, , i.e., we are interested in rank-1 solutions—and is a given rank-2 matrix such that , for and orthonormal vectors. Observe that is a strongly convex function with rank-1 minimizer ; let . It is easy to verify that , , and , where .
Consider the case where is the current estimate. Then, the gradient of at is evaluated as:
Hence, according to the update rule , the next iterate satisfies:
Observe that coefficients of both in include and quantities.
The quality of clearly depends on how is chosen. In the case , such a step size can result in divergence or “overshooting”, as and can be arbitrarily large (independent of ). Therefore, it could be the case that .
In contrast, consider the step size . Then, with appropriate scaling , we observe that lessens the effect of and terms in and terms, that lead to overshooting for the case . This most likely results in . [Footnote 6: For illustration purposes, we consider a step size that depends on the unknown $X^\star$; in practice, our step size selection is a surrogate of this choice and our results automatically carry over, with appropriate scaling.]
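The overshooting effect can be reproduced numerically on a small instance in the spirit of this toy example. Everything specific below is an assumption chosen for illustration rather than taken from the text: the objective $f(X) = \|X - Y\|_F^2$ with $M = 2$, the target $Y = \text{diag}(4, 1)$, the iterate $u = (10, 10)$, and the use of the current iterate (instead of $X^\star$ or $X^0$) inside the scaled step size.

```python
import numpy as np

alpha, beta = 2.0, 1.0
Y = np.diag([alpha**2, beta**2])          # rank-2 target with v1 = e1, v2 = e2
u_star = np.array([alpha, 0.0])           # best rank-1 factor: X* = u* u*^T
M = 2.0                                   # smoothness of f(X) = ||X - Y||_F^2

def grad_f(X):
    return 2.0 * (X - Y)

def dist(u):
    # Sign-invariant distance to u*, since u and -u give the same X = u u^T.
    return min(np.linalg.norm(u - u_star), np.linalg.norm(u + u_star))

def fgd_step(u, eta):
    return u - eta * grad_f(np.outer(u, u)) @ u

u = np.array([10.0, 10.0])                # iterate with large coefficients
X = np.outer(u, u)

eta_naive = 1.0 / M
eta_scaled = 1.0 / (16.0 * (M * np.linalg.norm(X, 2) + np.linalg.norm(grad_f(X), 2)))

d_before = dist(u)
d_naive = dist(fgd_step(u, eta_naive))    # one step with the classic 1/M choice
d_scaled = dist(fgd_step(u, eta_scaled))  # one step with the scaled choice
print(f"before: {d_before:.2f}  naive: {d_naive:.2f}  scaled: {d_scaled:.2f}")
```

With the $1/M$ step the cubic terms in the coefficients of $u$ dominate and the iterate is thrown far from $u^\star$, whereas the scaled step moves it slightly closer.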
The per iteration complexity of Fgd is dominated by the gradient computation. This computation is required in any first order algorithm and the complexity of this operation depends on the function $f$. Apart from $\nabla f(X)$, the additional computation required in Fgd is matrix-matrix additions and multiplications, with time complexity upper bounded by $\text{nnz}(\nabla f(\cdot)) \cdot r$, where $\text{nnz}(\nabla f(\cdot))$ denotes the number of non zeros in the gradient at the current point. [Footnote 7: It could also occur that the gradient $\nabla f(X)$ is low-rank, or low-rank + sparse, depending on the problem at hand; it could also happen that the structure of $\nabla f(X)$ leads to “cheap” matrix-vector calculations, when applied to vectors. Here, we state a more generic, and maybe pessimistic, scenario where $\nabla f(X)$ is unstructured.] Hence, the per iteration complexity of Fgd is much lower than that of traditional convex methods like projected gradient descent or classic interior point methods [62, 63], as they often require a full eigenvalue decomposition per step.
Note that, for $r = O(n)$, Fgd and projected gradient descent have the same per iteration complexity of $O(n^3)$. However, Fgd performs only a single matrix-matrix multiplication operation, which is much “cheaper” than an SVD calculation. Moreover, matrix multiplication is an easier-to-parallelize operation, as opposed to the eigen-decomposition operation, which is inherently sequential. We notice this behavior in practice; see Sections F, G and H for applications in matrix sensing and quantum state tomography.
4 Local convergence of Fgd
In this section, we present our main theoretical results on the performance of Fgd. We present convergence rates for the settings where (i) $f$ is an $M$-smooth convex function, and (ii) $f$ is an $M$-smooth and $(m, r)$-restricted strongly convex function. These assumptions are now standard in convex optimization. Note that, since the $UU^\top$ factorization makes the problem non-convex, it is hard to guarantee convergence of gradient descent schemes in general, without any additional assumptions.
We now state the main assumptions required by Fgd for convergence:
Initialization: We assume that Fgd is initialized with a “good” starting point $X^0 = U^0 (U^0)^\top$ that has constant relative error to $X^\star_r = U^\star_r (U^\star_r)^\top$. [Footnote 8: If $r = r^\star$, then one can drop the subscript. For completeness and in order to accommodate the approximate rank-$r$ case, described below, we will keep the subscript in our discussion.] In particular, we assume
for (Smooth )
for (Strongly convex ),
for the smooth and restricted strongly convex setting, respectively. This assumption helps in avoiding saddle points, introduced by the parametrization. [Footnote 9: To illustrate this consider the following example, Now it is easy to see that and is a stationary point of the function considered . We need the initial error to be further smaller than by a factor of condition number of .]
In many applications, an initial point with this type of guarantees is easy to obtain, often with just one eigenvalue decomposition; we refer the reader to the works [42, 64, 23, 82, 71] for specific initialization procedures for different problem settings. See also Section 5 for a more detailed discussion. Note that the problem is still non-trivial after the initialization, as this only gives a constant error approximation.
Approximate rank-$r$ optimum: In many learning applications, such as localization and multilabel learning, the true $X^\star$ emerges as the superposition of a low rank latent matrix plus a small perturbation term, such that $\|X^\star - X^\star_r\|_F$ is small. While, in practice, it might be the case that $\text{rank}(X^\star) = n$ (due to the presence of noise), often we are more interested in revealing the latent low-rank part. As already mentioned, we might as well set $r < \text{rank}(X^\star)$ for computational or statistical reasons. In all these cases, further assumptions w.r.t. the quality of approximation have to be made. In particular, let $X^\star$ be the optimum of (1) and let $f$ be $M$-smooth and $(m, r)$-strongly convex. In our analysis, we assume:
(Strongly convex ),
This assumption intuitively requires the noise magnitude to be smaller than the optimum and constrains the rank-constrained optimum to be closer to $X^\star_r$. [Footnote 10: Note that the assumption can be dropped by using a different step size $\eta$ (see Theorem I.4 in Section I). However, this requires two additional spectral norm computations per iteration.]
We note that, in the results presented below, we have not attempted to optimize over the constants appearing in the assumptions and any intermediate steps of our analysis. Finding such tight constants could strengthen our arguments for fast convergence; however, it does not change our claims for sublinear or linear convergence rates. Moreover, we consider the case $r \le r^\star$; we believe the analysis can be extended to the setting $r > r^\star$ and leave it for future work. [Footnote 11: Experimental results on synthetic matrix sensing settings have shown that, if we overshoot $r$, i.e., $r > r^\star$, Fgd still performs well, finding an $\varepsilon$-accurate solution with linear rate.]
4.1 $1/k$ convergence rate for smooth $f$
Next, we state our first main result under the smoothness condition, as in Definition 2.3. In particular, we prove that Fgd makes progress per iteration with sublinear rate. Here, we assume only the case where $r = r^\star$; for consistency reasons, we denote $X^\star = X^\star_r$. Key lemmas and their proofs for this case are provided in Section C.
Theorem 4.1 (Convergence performance for smooth $f$).
Let denote an optimum of -smooth over the PSD cone. Let . Then, under assumption , after iterations, the FGD algorithm finds solution such that
The theorem states that, provided we choose the step size $\eta$ based on a starting point that has constant relative distance to $U^\star_r$, and we start from such a point, gradient descent on $U$ will converge sublinearly to a point $X^\star_r$. In other words, Theorem 4.1 shows that Fgd computes a sequence of estimates in the $U$-factor space such that the function values decrease with $O(1/k)$ rate, towards a global minimum of the function $f$. Recall that, even in the standard convex setting, classic gradient descent schemes over $X$ achieve the same $O(1/k)$ convergence rate for smooth convex functions. Hence, Fgd matches the rate of convex gradient descent, under the assumptions of Theorem 4.1. The above are abstractly illustrated in Figure 1.
4.2 Linear convergence rate under strong convexity assumption
Here, we show that, with the additional assumption that $f$ satisfies the $(m, r)$-restricted strong convexity over $\mathbb{S}_+^n$, Fgd achieves linear convergence rate. The proof is provided in Section B.
Theorem 4.2 (Convergence rate for restricted strongly convex $f$).
Let the current iterate be and . Assume and let the step size be . Then under assumptions , the new estimate satisfies
where and . Furthermore, satisfies
The theorem states that, provided we choose the step size $\eta$ based on a point that has constant relative distance to $U^\star_r$, and we start from such a point, gradient descent on $U$ will converge linearly to a neighborhood of $U^\star_r$. The above theorem immediately implies linear convergence rate for the setting where $f$ is standard strongly convex, with parameter $m$. This follows by observing that standard strong convexity implies restricted strong convexity for all values of rank $r$.
Last, we present results for the special case where $r = r^\star$; in this case, Fgd finds an optimal point $U^\star$ with linear rate, within the equivalent class of orthonormal matrices in $\mathcal{O}$.
Corollary 4.3 (Exact recovery of ).
Further, for $r = n$ we recover the exact case of semi-definite optimization. In plain words, the above corollary suggests that, given an accuracy parameter $\varepsilon$, Fgd requires $O(\log(1/\varepsilon))$ iterations in order to reach that accuracy; recall the analogous result for classic gradient schemes for $M$-smooth and strongly convex functions $f$, where similar rates can be achieved in the $X$ space. The above are abstractly illustrated in Figure 2.
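The logarithmic dependence on the accuracy can be made concrete with a small calculation. The sketch below assumes a generic per-iteration contraction factor `alpha` in $(0, 1)$ applied to an error measure and ignores any additive neighborhood term; the function name and the numerical values are hypothetical and are not the constants of Theorem 4.2.

```python
import math

def iterations_for_accuracy(alpha, initial_error, eps):
    """Smallest K with alpha**K * initial_error <= eps, for 0 < alpha < 1."""
    if initial_error <= eps:
        return 0
    return math.ceil(math.log(initial_error / eps) / math.log(1.0 / alpha))

for eps in (1e-2, 1e-4, 1e-8):
    K = iterations_for_accuracy(alpha=0.9, initial_error=1.0, eps=eps)
    print(f"eps = {eps:.0e}  ->  K = {K}")
```

Squaring the target accuracy only doubles the iteration count, which is the practical meaning of a linear rate.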
By the results above, one can easily observe that the convergence rate factor $\alpha$, in contrast to standard convex gradient descent results, depends both on the condition number of $X^\star_r$ and on $\|\nabla f(X^\star)\|_2$, in addition to $\kappa$. This dependence is a result of the step size selection, which is different from standard step sizes, i.e., $1/M$ for standard gradient descent schemes. We also refer the reader to Section E for some discussion.
As a ramification of the above, notice that $\alpha$ depends only on the condition number of $X^\star_r$ and not that of $X^\star$. This suggests that, in settings where the optimum $X^\star$ has a bad condition number (and thus leads to slower convergence), it is indeed beneficial to restrict $U$ to be an $n \times r$ matrix and only search for a rank-$r$ approximation of the optimal solution, which leads to a faster convergence rate in practice; see Figure 8 in our experimental findings at the end of Section F.3.
In the setting where the optimum is 0, directly applying the above theorems requires an initialization that is exactly at the optimum 0. On the contrary, this is actually an easy setting and Fgd converges from any initial point to the optimum.
5 Initialization
In the previous section, we showed that gradient descent over $U$ achieves sublinear/linear convergence, once the iterates are close to $U^\star_r$. Since the overall problem is non-convex, intuition suggests that we need to start from a “decent” initial point, in order to get provable convergence to $U^\star_r$.
One way to satisfy this condition for general convex is to use one of the standard convex algorithms and obtain within constant error to (or ); then, switch to Fgd to get the high precision solution. See  for a specific implementation of this idea on matrix sensing. Such initialization procedure comes with the following guarantees; the proof can be found in Section D:
Let be a -smooth and -restricted strongly convex function over PSD matrices and let be the minimum of with . Let be the projected gradient descent update. Then, implies,
Next, we present a generic initialization scheme for general smooth and strongly convex $f$. We use only the first-order oracle: we only have access to, at most, gradient information of $f$. Our initialization comes with theoretical guarantees w.r.t. distance from the optimum. Nevertheless, in order to show small relative distance in the form of $\text{Dist}(U^0, U^\star_r) \le \rho \, \sigma_r(U^\star_r)$, one requires certain condition numbers of $f$ and further assumptions on the spectrum of the optimal solution $X^\star$ and rank $r$. However, empirical findings in Section F.3 show that our initialization performs well in practice.
Let . Since the initial point should be in the PSD cone, we further consider the projection . By strong convexity and smoothness of , one can observe that the point is a good initialization point, within some radius from the vicinity of ; i.e.,
see also Theorem 5.2. Thus, a scaling of $\mathcal{P}_+(-\nabla f(0))$ by $1/M$ could serve as a decent initialization. In many recent works [42, 64, 19, 82, 23] this initialization has been used for specific applications. Here, we note that the point $\frac{1}{M} \mathcal{P}_+(-\nabla f(0))$ can be used as an initialization point for generic smooth and strongly convex $f$. [Footnote 12: To see this, consider the case of the least-squares objective $f(X) := \frac{1}{2} \|\mathcal{A}(X) - y\|_2^2$, where $y$ denotes the set of observations and $\mathcal{A}$ is a properly designed sensing mechanism, depending on the problem at hand. For example, in the affine rank minimization case [82, 23], $\mathcal{A}$ represents the linear system mechanism where $(\mathcal{A}(X))_i = \text{Tr}(A_i X)$. Under this setting, computing the gradient at the zero point, we have $-\nabla f(0) = \mathcal{A}^*(y)$, where $\mathcal{A}^*$ is the adjoint operator of $\mathcal{A}$. Then, it is obvious that the operation $\mathcal{P}_+(-\nabla f(0))$ is very similar to the spectral methods proposed for initialization in the references above.]
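The smoothness parameter $M$ is not always easy to compute exactly; in such cases, one can use the surrogate $\|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F$, which lies between $m$ and $M$. Finally, our initial point $U^0 \in \mathbb{R}^{n \times r}$ is a rank-$r$ matrix such that $X^0_r = U^0 (U^0)^\top$.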
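The initialization described above can be sketched with two gradient evaluations and one eigen-decomposition. The helper names `project_psd` and `fgd_init` are hypothetical, and the test objective $f(X) = \|X - X^\star\|_F^2$ (condition number 1, so the initial point should coincide with the optimum) is an assumption chosen as a sanity check.

```python
import numpy as np

def project_psd(X):
    """Projection onto the PSD cone: keep strictly positive eigenpairs."""
    X = 0.5 * (X + X.T)
    w, V = np.linalg.eigh(X)
    keep = w > 0
    return (V[:, keep] * w[keep]) @ V[:, keep].T

def fgd_init(grad_f, n, r):
    """Return (X0, U0) using only a first-order oracle for f."""
    G0 = grad_f(np.zeros((n, n)))                      # gradient at zero
    E11 = np.zeros((n, n))
    E11[0, 0] = 1.0                                    # e_1 e_1^T
    M_hat = np.linalg.norm(G0 - grad_f(E11), "fro")    # surrogate for M
    X0 = project_psd(-G0) / M_hat
    w, V = np.linalg.eigh(X0)                          # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:r]                      # indices of the r largest
    U0 = V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))
    return X0, U0

rng = np.random.default_rng(2)
n, r = 5, 2
B = rng.standard_normal((n, r))
X_star = B @ B.T                                       # rank-2 PSD optimum

def grad_f(X):
    return 2.0 * (X - X_star)                          # gradient of ||X - X_star||_F^2

X0, U0 = fgd_init(grad_f, n, r)
print(np.linalg.norm(X0 - X_star), np.linalg.norm(U0 @ U0.T - X_star))
```

For objectives with a larger condition number the initial point is no longer exact, and its distance to the optimum is what Theorem 5.2 quantifies.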
We now present guarantees for the initialization discussed. The proof is provided in Section D.2.
Theorem 5.2 (Initialization).
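Let $f$ be an $M$-smooth and $m$-strongly convex function, with condition number $\kappa = M/m$, and let $X^\star$ be its minimum over PSD matrices. Let $X^0$ be defined as:

$$X^0 := \frac{1}{\|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F} \, \mathcal{P}_+\left(-\nabla f(0)\right),$$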
and is its rank- approximation. Let for some . Then, , where and .
To understand this result, notice that in the extreme case, when $f$ is the $\ell_2$ loss function $\|X - X^\star\|_F^2$, which has condition number $\kappa = 1$ and $\text{rank}(X^\star) = r$, $X^0$ indeed is the optimum. More generally, as the condition number $\kappa$ increases, the optimum moves away from $X^0$ and the above theorem characterizes this error as a function of the condition number of the function. See also Figure 3.
Now for the setting when the optimum is exactly rank- we get the following result.
Corollary 5.3 (Initialization, exact).
Let be rank- for some . Then, under the conditions of Theorem 5.2, we get
Finally, for the setting when the function satisfies $(m, r)$-restricted strong convexity, the above corollary still holds as the optimum is a rank-$r$ matrix.
6 Convergence proofs for the Fgd algorithm
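In this section, we first present the key techniques required for analyzing the convergence of Fgd. Later, we present proofs for both Theorems 4.1 and 4.2. Throughout the proofs we use the following notation. $X^\star$ is the optimum of problem (1) and $X^\star_r = U^\star_r (U^\star_r)^\top$ is the rank-$r$ approximation; for the just smooth case, $X^\star = X^\star_r$, as we consider only the rank-$r^\star$ case and $r = r^\star$. Let $R^\star_U = \arg\min_{R \in \mathcal{O}} \|U - U^\star_r R\|_F$ and $\Delta = U - U^\star_r R^\star_U$.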
A key property that assists classic gradient descent to converge to the optimum $X^\star$ is the fact that $\langle \nabla f(X), X - X^\star \rangle \ge 0$ for a smooth convex function $f$; in the case of strongly convex $f$, the inner product is further lower bounded by $\frac{m}{2} \|X - X^\star\|_F^2$ (see Theorem 2.2.7 of ). Classical proofs mainly use such lower bounds to show convergence (see Theorems 2.1.13 and 2.2.8 of ).
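To see how a lower bound on this inner product translates into a convergence rate, recall the textbook one-step argument for unconstrained gradient descent. The setting below, with $\nabla f(x^\star) = 0$, update $x^+ = x - \eta \nabla f(x)$, and the standard bound $\langle \nabla f(x), x - x^\star \rangle \ge \frac{mM}{m+M} \|x - x^\star\|_2^2 + \frac{1}{m+M} \|\nabla f(x)\|_2^2$ for $M$-smooth and $m$-strongly convex $f$, is supplied here purely as an illustration and is not the analysis of this paper.

```latex
\begin{align*}
\|x^+ - x^\star\|_2^2
  &= \|x - x^\star\|_2^2
     - 2\eta \,\langle \nabla f(x),\, x - x^\star \rangle
     + \eta^2 \|\nabla f(x)\|_2^2 \\
  &\le \Big(1 - \tfrac{2 \eta m M}{m + M}\Big) \|x - x^\star\|_2^2
     + \eta \Big(\eta - \tfrac{2}{m + M}\Big) \|\nabla f(x)\|_2^2 \\
  &\le \Big(1 - \tfrac{2 \eta m M}{m + M}\Big) \|x - x^\star\|_2^2,
     \qquad \text{for } 0 < \eta \le \tfrac{2}{m + M}.
\end{align*}
```

The analysis of Fgd follows the same template, with the inner product taken in the factored $U$ space and the distance replaced by its rotation-invariant counterpart.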
We follow broadly similar steps in order to show convergence of Fgd. In particular,
6.1 Rudiments of our analysis
Next, we present the main descent lemma that is used for both sublinear and linear convergence rate guarantees of Fgd.
Lemma 6.1 (Descent lemma).
For being a -smooth and -strongly convex function and, under assumptions and , the following inequality holds true:
Further, when is just -smooth convex function and, under the assumptions and , we have:
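First, we rewrite the inner product as shown below. With $\Delta = U - U^\star_r R^\star_U$, where $R^\star_U$ is the rotation attaining $\text{Dist}(U, U^\star_r)$,

$$\langle \nabla f(X) U,\; U - U^\star_r R^\star_U \rangle = \tfrac{1}{2} \langle \nabla f(X),\; X - X^\star_r \rangle + \tfrac{1}{2} \langle \nabla f(X),\; \Delta \Delta^\top \rangle, \tag{13}$$

which follows by adding and subtracting $\tfrac{1}{2} X^\star_r$.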
Strongly convex setting. For this case, the next 3 steps apply.
Step I: Bounding $\langle \nabla f(X), X - X^\star_r \rangle$. The first term in the above expression can be lower bounded using smoothness and strong convexity of $f$, and involves the construction of a feasible point. We construct such a feasible point by modifying the current update to one with a bigger step size.
Let be a -smooth and -restricted strongly convex function with optimum point . Moreover, let be the best rank- approximation of . Let . Then,
where , by Lemma A.5.
Proof of this lemma is provided in Section B.1.
Step II: Bounding $\langle \nabla f(X), \Delta \Delta^\top \rangle$. The second term in equation (13) can actually be negative. Hence, we lower bound it using our initialization assumptions. Intuitively, the second term is smaller than the first one as it scales as $\text{Dist}(U, U^\star_r)^2$, while the first term scales as $\text{Dist}(U, U^\star_r)$.
Let be -smooth and -restricted strongly convex. Then, under assumptions and , the following bound holds true:
Proof of this lemma can be found in Section B.2.
Smooth setting. For this case, the next 3 steps apply.
Step I: Bounding $\langle \nabla f(X), X - X^\star_r \rangle$. Similar to the strongly convex case, one can obtain a lower bound on $\langle \nabla f(X), X - X^\star_r \rangle$, according to the following lemma:
Let be a -smooth convex function with optimum point . Then, under the assumption that , the following holds:
The proof of this lemma can be found in Appendix C.
Step II: Bounding $\langle \nabla f(X), \Delta \Delta^\top \rangle$. Here, we follow a different path in providing a lower bound for $\langle \nabla f(X), \Delta \Delta^\top \rangle$. The following lemma provides such a lower bound.
Let and define . Let . Then, for , where