Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot


Prateek Jain (Microsoft Research, India; prajain@microsoft.com)    Chi Jin (University of California, Berkeley; chijin@cs.berkeley.edu)    Sham M. Kakade (University of Washington; sham@cs.washington.edu)    Praneeth Netrapalli (Microsoft Research, India; praneeth@microsoft.com)
Abstract

While there has been a significant amount of work studying gradient descent techniques for non-convex optimization problems over the last few years, for many problems of interest all existing results establish either local convergence with good rates or global convergence with highly suboptimal rates. In this paper, we take the first step in getting the best of both worlds: establishing global convergence and obtaining a good rate of convergence for the problem of computing the squareroot of a positive definite (PD) matrix, a widely studied problem in numerical linear algebra with applications in machine learning and statistics, among others.

Given a PD matrix $\mathbf{M}$ and a PD starting point $\mathbf{U}_0$, we show that gradient descent with an appropriately chosen step-size finds an $\epsilon$-accurate squareroot of $\mathbf{M}$ in $O\left(\alpha \log\frac{1}{\epsilon}\right)$ iterations, where $\alpha$ is a rate parameter depending on the conditioning of $\mathbf{M}$ and $\mathbf{U}_0$ (see Eq.(3)). Our result is the first to establish global convergence for this problem, and also shows that it is robust to errors in each iteration. A key contribution of our work is the general proof technique, which we believe should further excite research in understanding deterministic and stochastic variants of simple non-convex gradient descent algorithms with good global convergence rates for other problems in machine learning and numerical linear algebra.

1 Introduction

A large number of problems and frameworks in machine learning are non-convex optimization problems; examples include non-negative matrix factorization (Lee and Seung, 2001), sparse coding (Aharon et al., 2006), matrix sensing (Recht et al., 2010), matrix completion (Koren et al., 2009) and phase retrieval (Netrapalli et al., 2015). Consequently, in the last few years there has been an increased interest in designing efficient non-convex optimization algorithms. Several recent works establish local convergence to the global optimum for problems such as matrix sensing (Jain et al., 2013; Tu et al., 2015), matrix completion (Jain and Netrapalli, 2014; Sun and Luo, 2015), phase retrieval (Candes et al., 2015), sparse coding (Agarwal et al., 2013) and so on (and hence require careful initialization). However, despite strong empirical evidence, none of these results has been able to establish global convergence. On the other hand, some other recent works (Nesterov and Polyak, 2006; Ge et al., 2015; Lee et al., 2016; Sun et al., 2015) establish the global convergence of gradient descent methods to local minima for a large class of non-convex problems, but the rates they obtain are quite suboptimal compared to the local convergence results mentioned above. In other words, results that have very good rates are only local, and results that are global do not have very good rates.

Therefore, a natural and important question is whether gradient descent actually has a good global convergence rate when applied to specific and important functions that are of interest in machine learning. Apart from its theoretical implications, such a result is also important in practice since a) finding a good initialization might be difficult, and b) local convergence results are inherently difficult to extend to stochastic algorithms due to noise.

In this work, we answer the above question in the affirmative for the problem of computing the square root of a positive definite (PD) matrix $\mathbf{M}$: i.e., finding the PD matrix $\mathbf{U}$ such that $\mathbf{U}^2 = \mathbf{M}$. This problem is in itself a fundamental one and arises in several contexts such as computation of the matrix sign function (Higham, 2008), computation of data whitening matrices, signal processing applications (Kaminski et al., 1971; Carlson, 1990; Van Der Merwe and Wan, 2001; Tippett et al., 2003) and so on.

1.1 Related work

Given the importance of computing the matrix squareroot, there has been a tremendous amount of work in the numerical linear algebra community focused on this problem (Björck and Hammarling, 1983; Higham, 1986, 1987, 1997; Meini, 2004). For a detailed list of references, see Chapter 6 in Higham’s book (Higham, 2008).

The basic component of most of these algorithms is Newton's method for finding the square root of a positive number. Given a positive number $a$ and a positive starting point $x_0$, Newton's method gives rise to the iteration

$$x_{t+1} = \frac{1}{2}\left(x_t + \frac{a}{x_t}\right) \qquad (1)$$

It can be shown that the iterates converge to $\sqrt{a}$ at a quadratic rate (i.e., $\epsilon$-accuracy in $O(\log\log\frac{1}{\epsilon})$ iterations). The extension of this approach to the matrix case is not straightforward due to the non-commutativity of matrix multiplication. For instance, if $a$ and $x_t$ were matrices $\mathbf{A}$ and $\mathbf{X}_t$, it is not clear whether $\frac{a}{x_t}$ should be replaced by $\mathbf{X}_t^{-1}\mathbf{A}$, by $\mathbf{A}\mathbf{X}_t^{-1}$, or by something else. One approach to overcome this issue is to select $\mathbf{X}_0$ carefully so as to ensure commutativity through all iterations (Higham, 1986, 1997; Meini, 2004), for example $\mathbf{X}_0 = \mathbf{A}$ or $\mathbf{X}_0 = \mathbf{I}$. However, commutativity is a brittle property, and small numerical errors in an iteration can themselves result in loss of commutativity. Although a lot of work since then has focused on designing stable iterations inspired by Eq.(1) (Higham, 1986, 1997; Meini, 2004), and has succeeded in making them robust in practice, no provable robustness guarantees are known in the presence of repeated errors. Similarly, another recent approach by Sra (2015) uses geometric optimization to solve the matrix squareroot problem, but their analysis also does not address stability or robustness to numerical or statistical errors (if we only see a noisy version of $\mathbf{M}$).
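To make the commutativity issue concrete, the following NumPy sketch implements the matrix analogue of Eq.(1) with the $\mathbf{X}_t^{-1}\mathbf{A}$ choice and the commuting start $\mathbf{X}_0 = \mathbf{A}$ (the variable names and parameter choices here are ours, for illustration only): the iteration converges to $\mathbf{A}^{1/2}$, but perturbing an iterate immediately destroys commutativity with $\mathbf{A}$.

```python
import numpy as np

def newton_matrix_sqrt(A, num_iters=20):
    """Matrix analogue of the scalar Newton iteration x_{t+1} = (x_t + a/x_t)/2,
    using the X_t^{-1} A variant and the commuting starting point X_0 = A."""
    X = A.copy()
    for _ in range(num_iters):
        X = 0.5 * (X + np.linalg.solve(X, A))   # X_{t+1} = (X_t + X_t^{-1} A) / 2
    return X

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5.0 * np.eye(5)                   # a well-conditioned PD matrix

X = newton_matrix_sqrt(A)
print(np.linalg.norm(X @ X - A))                # tiny: X is (numerically) A^{1/2}

# Commutativity is brittle: a slightly perturbed iterate no longer commutes with A.
X_pert = X + 1e-3 * rng.standard_normal((5, 5))
print(np.linalg.norm(X_pert @ A - A @ X_pert))  # no longer (numerically) zero
```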

Another approach to solving the matrix square-root problem is to use the eigenvalue decomposition (EVD) and then take square roots of the eigenvalues. To the best of our knowledge, the state-of-the-art computational complexity for computing the EVD of a matrix (in the real arithmetic model of computation) is due to Pan et al. (1998), for matrices with distinct eigenvalues. Though the result is close to optimal (in reducing the EVD to matrix multiplication), the algorithm and the analysis are quite complicated; for instance, the robustness of these methods to errors is not well understood. As mentioned above, however, our focus is to understand whether local search techniques like gradient descent (which are often applied to several non-convex optimization procedures) indeed avoid saddle points and local minima, and can guide the iterates to the global optimum.
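For reference, the EVD route described above takes only a few lines in practice; the sketch below is ours and simply mirrors the textbook recipe (its behavior under perturbations of $\mathbf{M}$ or of the computed eigenvectors is exactly the robustness question raised above).

```python
import numpy as np

def sqrt_via_evd(M):
    """Square root of a symmetric PD matrix via eigendecomposition:
    M = V diag(w) V^T  =>  M^{1/2} = V diag(sqrt(w)) V^T."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T    # scales column j of V by sqrt(w[j])

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
M = B @ B.T + np.eye(6)
S = sqrt_via_evd(M)
print(np.linalg.norm(S @ S - M))     # close to machine precision
```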

As mentioned earlier, Ge et al. (2015) and Lee et al. (2016) give recent results on global convergence for general non-convex problems, which can be applied to the matrix squareroot problem. While Lee et al. (2016) prove only asymptotic convergence of gradient descent without any rate, applying the result of Ge et al. (2015) gives a runtime that is highly suboptimal in its dependence on the dimension $n$ and the accuracy $\epsilon$: for an optimization problem of dimension $d$, Ge et al. (2015) prove convergence in a number of iterations polynomial in $d$ and $1/\epsilon$, with per-iteration computation that also scales with $d$; in the matrix squareroot problem $d = n^2$, which gives a large polynomial dependence on $n$.

Finally, we note that subsequent to this work, Jin et al. (2017) proved global convergence results with almost sharp dimension dependence for a much wider class of functions. While Jin et al. (2017) explicitly add perturbation to help escape saddle points, our framework does not require perturbation, and shows that for this problem, gradient descent naturally stays away from saddle points.

1.2 Our contribution

Method | Runtime | Global convergence | Provable robustness
Gradient descent (this paper) | $O(n^{\omega}\,\alpha\log\frac{1}{\epsilon})$ | Yes | Yes
Stochastic gradient descent (Ge et al., 2015) | | Yes | Yes
Newton variants (Higham, 2008) | | Local only | Not known
EVD (algebraic (Pan et al., 1998)) | | Not iterative | Not well understood
EVD (power method (Golub and Van Loan, 2012)) | | Not iterative | Not well understood
Table 1: Comparison of our result to existing ones. Here $\omega$ is the matrix multiplication exponent and $\alpha$ is our convergence rate parameter defined in Eq.(3). We show that our method enjoys global convergence and is also provably robust to arbitrary bounded errors in each iteration. In contrast, Newton variants have only local convergence guarantees, and their robustness to errors in multiple iterations is not known. The robustness of methods based on eigenvalue decomposition is also not well understood.

In this paper, we propose doing gradient descent on the following non-convex formulation:

(2)

We show that if the starting point $\mathbf{U}_0$ is chosen to be a positive definite matrix, our algorithm converges to the global optimum of Eq.(2) at a geometric rate. In order to state our runtime, we introduce the following notation:

(3)

where $\sigma_{\min}(\mathbf{U}_0)$ and $\|\mathbf{U}_0\|$ are the minimum singular value and operator norm, respectively, of the starting point $\mathbf{U}_0$, and $\sigma_{\min}(\mathbf{M})$ and $\|\mathbf{M}\|$ are those of $\mathbf{M}$. Our result says that gradient descent converges to within $\epsilon$ of the optimum of Eq.(2) in $O\left(\alpha\log\frac{1}{\epsilon}\right)$ iterations. Each iteration involves doing only three matrix multiplications, and no inversions or least-squares solves. So the total runtime of our algorithm is $O\left(n^{\omega}\alpha\log\frac{1}{\epsilon}\right)$, where $\omega$ is the matrix multiplication exponent (Williams, 2012). As a byproduct of our global convergence guarantee, we obtain the robustness of our algorithm to numerical errors for free. In particular, we show that our algorithm is robust to errors in multiple steps, in the sense that if each step incurs a small bounded error, then our algorithm achieves a limiting accuracy proportional to that per-step error. Another nice feature of our algorithm is that it is based purely on matrix multiplications, whereas most existing methods require matrix inversion or solving a system of linear equations. An unsatisfactory part of our result, however, is the dependence on the condition number $\kappa$ of $\mathbf{M}$. We prove a lower bound on the number of iterations required by our method which tells us that the dependence on problem parameters in our result is not merely a weakness in our analysis.

Outline: In Section 2, we will briefly set up the notation we will use in this paper. In Section 3, we will present our algorithm, approach and main results. We will present the proof of our main result in Section 4 and conclude in Section 5. The proofs of remaining results can be found in the Appendix.

2 Notation

Let us briefly introduce the notation we will use in this paper. We use boldface lowercase letters ($\mathbf{u}, \mathbf{v}$) to denote vectors and boldface uppercase letters ($\mathbf{U}, \mathbf{M}$) to denote matrices. $\mathbf{M}$ denotes the input matrix whose squareroot we wish to compute. $\sigma_i(\mathbf{M})$ denotes the $i^{\textrm{th}}$ singular value of $\mathbf{M}$. $\sigma_{\min}(\mathbf{M})$ denotes the smallest singular value of $\mathbf{M}$. $\kappa$ denotes the condition number of $\mathbf{M}$, i.e., $\kappa = \|\mathbf{M}\|/\sigma_{\min}(\mathbf{M})$. $\sigma_i$ without an argument denotes $\sigma_i(\mathbf{M})$. $\lambda_{\max}(\mathbf{M})$ denotes the largest eigenvalue of $\mathbf{M}$ and $\lambda_{\min}(\mathbf{M})$ denotes the smallest eigenvalue of $\mathbf{M}$.

3 Our Results

In this section, we present our guarantees and the high-level approach for the analysis of Algorithm 1, which is simply gradient descent on the non-convex optimization problem:

(4)
Input: PD matrix $\mathbf{M}$, PD starting point $\mathbf{U}_0$, step size $\eta$
Output: $\mathbf{U}_T$, an approximation to $\mathbf{M}^{1/2}$
  for $t = 0, \ldots, T-1$ do
     $\mathbf{U}_{t+1} \leftarrow \mathbf{U}_t - \eta\,\nabla f(\mathbf{U}_t)$
  end for
  Return $\mathbf{U}_T$.
Algorithm 1 Gradient descent for matrix square root
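As a concrete illustration, here is a minimal NumPy sketch of Algorithm 1. The objective is assumed to take the natural form $f(\mathbf{U}) = \frac{1}{2}\|\mathbf{U}\mathbf{U} - \mathbf{M}\|_F^2$, whose gradient $(\mathbf{U}\mathbf{U} - \mathbf{M})\mathbf{U}^{\top} + \mathbf{U}^{\top}(\mathbf{U}\mathbf{U} - \mathbf{M})$ indeed costs three matrix multiplications per step; this normalization and the heuristic step size below are our choices, not the ones prescribed by the theory.

```python
import numpy as np

def matrix_sqrt_gd(M, U0, eta, num_iters):
    """Gradient descent on the assumed objective f(U) = 0.5 * ||U U - M||_F^2.
    Each iteration uses three matrix multiplications and no inversions."""
    U = U0.copy()
    for _ in range(num_iters):
        R = U @ U - M                    # residual
        grad = R @ U.T + U.T @ R         # gradient of the assumed objective
        U = U - eta * grad
    return U

rng = np.random.default_rng(2)
B = rng.standard_normal((10, 10))
M = B @ B.T + np.eye(10)                              # PD target

U0 = np.sqrt(np.linalg.norm(M, 2)) * np.eye(10)       # PD start at roughly the right scale
eta = 0.1 / np.linalg.norm(M, 2)                      # heuristic small step size
U = matrix_sqrt_gd(M, U0, eta, num_iters=2000)
print(np.linalg.norm(U @ U - M, 'fro'))               # should be small
```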

We first present a warmup analysis, where we assume that all the iterates of Algorithm 1 commute with $\mathbf{M}$. Later, in Section 3.2, we present our approach to analyzing Algorithm 1 for a general starting point $\mathbf{U}_0$. We provide formal guarantees in Section 3.3.

3.1 Warmup – Analysis with commutativity

In this section, we give a short proof of convergence for Algorithm 1 in the case where all iterates commute with $\mathbf{M}$.

Lemma 3.1.

There exists a constant $c$ such that if $\mathbf{U}_0$ has the same eigenvectors as $\mathbf{M}$ (and is appropriately scaled), and the step size $\eta$ is chosen small enough, then the iterate $\mathbf{U}_t$ in Algorithm 1 satisfies:

Proof.

Since $\mathbf{U}_0$ has the same eigenvectors as $\mathbf{M}$, it can be seen by induction that $\mathbf{U}_t$ has the same eigenvectors as $\mathbf{M}$ for every $t$. Every singular value of $\mathbf{U}_{t+1}$ can therefore be written in terms of the corresponding singular values of $\mathbf{U}_t$ and $\mathbf{M}$ as

(5)

Firstly, this tells us that the singular values of $\mathbf{U}_t$ remain in the desired range for every $t$. Verifying this is easy using induction: the statement holds for $t = 0$ by hypothesis, and assuming it holds for some $t$, the induction step follows by considering two cases separately and using the assumption on the step size. A similar induction argument also yields a bound in the other direction. Eq.(5) can now be used to yield the following convergence equation:

where we used the hypotheses of the lemma in the last two steps. Using induction gives us

This can now be used to prove the lemma:

Note that the above proof crucially used the fact that the eigenvectors of $\mathbf{U}_t$ and $\mathbf{M}$ are aligned in order to reduce the matrix iteration to iterations over the individual singular values.
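The reduction to scalar iterations can be checked numerically: starting from $\mathbf{U}_0 = c\,\mathbf{I}$, which trivially commutes with $\mathbf{M}$, every iterate of the sketch below (same assumed objective and heuristic step size as in the earlier snippet) remains a polynomial in $\mathbf{M}$, so the commutator with $\mathbf{M}$ stays at the level of round-off while the residual decreases.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
M = B @ B.T + np.eye(6)                                   # PD target
U = 0.5 * np.sqrt(np.linalg.norm(M, 2)) * np.eye(6)       # commutes with M
eta = 0.1 / np.linalg.norm(M, 2)

for t in range(1001):
    R = U @ U - M
    U = U - eta * (R @ U.T + U.T @ R)                      # same assumed gradient as before
    if t % 250 == 0:
        # The commutator stays (numerically) zero, so U_t and M share eigenvectors,
        # and each singular value of U_t follows its own scalar recursion.
        print(t, np.linalg.norm(U @ M - M @ U), np.linalg.norm(U @ U - M))
```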

3.2 Approach

As we begin to investigate the global convergence properties of Eq.(4), the above argument breaks down due to lack of alignment between the singular vectors of $\mathbf{M}$ and those of the iterates $\mathbf{U}_t$. Let us now take a step back and consider non-convex optimization in general. There are two broad reasons why local search approaches fail for these problems: the presence of local minima and the presence of saddle points. Each of these presents a different challenge: with local minima, local search approaches have no way of certifying whether the convergence point is a local minimum or the global minimum; with saddle points, if the iterates get close to a saddle point, the local neighborhood looks essentially flat, and escaping the saddle point may take exponential time.

The starting point of our work is the realization that the non-convex formulation of the matrix squareroot problem does not have any spurious local minima. This can be argued using the continuity of the matrix squareroot function, and the statement is indeed true for many matrix factorization problems. The only issue to be contended with is the presence of saddle points. In order to overcome this issue, it suffices to show that the iterates of the algorithm never get too close to a saddle point. More concretely, while optimizing a function $f$ with iterates $\mathbf{U}_t$, it suffices to show that the iterates always stay in some region $\mathcal{G}$ far from saddle points such that, for all $\mathbf{U}, \mathbf{V} \in \mathcal{G}$:

$$\|\nabla f(\mathbf{U}) - \nabla f(\mathbf{V})\|_F \le \beta\,\|\mathbf{U} - \mathbf{V}\|_F, \qquad (6)$$
$$\|\nabla f(\mathbf{U})\|_F^2 \ge \alpha\,\bigl(f(\mathbf{U}) - f^{*}\bigr), \qquad (7)$$

where $f^{*}$ is the global minimum value of $f$, and $\beta$ and $\alpha$ are some positive constants. If we flatten the matrix $\mathbf{U}$ into an $n^2$-dimensional vector, then Eq.(6) is the standard smoothness assumption in optimization, and Eq.(7) is known as the gradient dominance property (Polyak, 1963; Nesterov and Polyak, 2006). If Eq.(6) and Eq.(7) hold, it follows from standard analysis that gradient descent with a step size of $1/\beta$ achieves geometric convergence, with $f(\mathbf{U}_{t+1}) - f^{*} \le \left(1 - \frac{\alpha}{2\beta}\right)\left(f(\mathbf{U}_t) - f^{*}\right)$.

Since the gradient in our setting is a product involving $\mathbf{U}$ and the residual, in order to establish Eq.(7) it suffices to lower bound $\sigma_{\min}(\mathbf{U}_t)$, and in order to establish Eq.(6) it suffices to upper bound $\|\mathbf{U}_t\|$. Of course, we cannot hope to converge if we start from a saddle point; in particular, Eq.(7) will not hold for any $\mathbf{U}$ with $\sigma_{\min}(\mathbf{U}) = 0$. The core of our argument consists of Lemmas 4.3 and 4.2, which essentially establish Eq.(6) and Eq.(7) respectively for the matrix squareroot problem Eq.(4), with the resulting parameters $\beta$ and $\alpha$ dependent on the starting point $\mathbf{U}_0$. Lemmas 4.3 and 4.2 accomplish this by proving upper and lower bounds respectively on $\|\mathbf{U}_t\|$ and $\sigma_{\min}(\mathbf{U}_t)$. The proofs of these lemmas use only elementary linear algebra, and we believe such results should be possible for many more matrix factorization problems.
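For completeness, the standard one-step calculation behind this claim goes as follows (stated for a generic function $g$ satisfying Eq.(6) and Eq.(7) on $\mathcal{G}$, assuming the iterates and the segments between them remain in $\mathcal{G}$): with the step $\mathbf{X}_{t+1} = \mathbf{X}_t - \frac{1}{\beta}\nabla g(\mathbf{X}_t)$,
$$g(\mathbf{X}_{t+1}) \le g(\mathbf{X}_t) + \langle \nabla g(\mathbf{X}_t), \mathbf{X}_{t+1} - \mathbf{X}_t\rangle + \frac{\beta}{2}\|\mathbf{X}_{t+1} - \mathbf{X}_t\|_F^2 = g(\mathbf{X}_t) - \frac{1}{2\beta}\|\nabla g(\mathbf{X}_t)\|_F^2 \le g(\mathbf{X}_t) - \frac{\alpha}{2\beta}\bigl(g(\mathbf{X}_t) - g^{*}\bigr),$$
where the first inequality is the descent lemma implied by Eq.(6) and the last uses Eq.(7). Subtracting $g^{*}$ from both sides gives $g(\mathbf{X}_{t+1}) - g^{*} \le \left(1 - \frac{\alpha}{2\beta}\right)\bigl(g(\mathbf{X}_t) - g^{*}\bigr)$, i.e., geometric convergence.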

3.3 Guarantees

In this section, we present our main results, establishing that gradient descent on (4) converges to the matrix square root at a geometric rate and that it is robust to errors in each iteration.

3.3.1 Noiseless setting

The following theorem establishes geometric convergence of Algorithm 1 from a full rank initial point.

Theorem 3.2.

There exist universal numerical constants $c_1$ and $c_2$ such that if $\mathbf{U}_0$ is a PD matrix and the step size $\eta$ is chosen small enough (as a function of $\mathbf{U}_0$ and $\mathbf{M}$), then for every $t \ge 0$, the iterate $\mathbf{U}_t$ is a PD matrix with

where the quantities appearing above are defined as

Remarks:

  • This result implies global geometric convergence. Choosing the step size as in the theorem, in order to obtain an accuracy of $\epsilon$, the number of iterations required is $O\left(\alpha\log\frac{1}{\epsilon}\right)$, where $\alpha$ is the rate parameter of Eq.(3).

  • Note that saddle points of (4) must be rank-degenerate matrices ($\sigma_{\min}(\mathbf{U}) = 0$), and starting Algorithm 1 from a point close to the rank-degenerate surface means it takes a long time to move away from the saddle surface. Hence, as $\mathbf{U}_0$ gets close to being rank degenerate, the convergence rate guaranteed by Theorem 3.2 degrades (as $\sigma_{\min}(\mathbf{U}_0) \to 0$). It is possible to obtain a smoother degradation with a finer analysis, but in the current paper we trade off optimal results for a simple analysis.

  • The convergence rate guaranteed by Theorem 3.2 also depends on the relative scales of $\mathbf{U}_0$ and $\mathbf{M}$, and is best when the two are of comparable scale.

  • We believe that it is possible to extend our analysis to the case where $\mathbf{M}$ is low rank (PSD). In this case, suppose $\operatorname{rank}(\mathbf{M}) = k$, and let $\mathcal{S}$ be the $k$-dimensional subspace in which $\mathbf{M}$ resides. Then, saddle points should satisfy the analogous rank-degeneracy condition restricted to $\mathcal{S}$.

A simple corollary of this result arises when we choose $\mathbf{U}_0 = c\,\mathbf{I}$, where $c$ is a constant-factor approximation to $\sqrt{\|\mathbf{M}\|}$ (such an approximation can be computed efficiently; Musco and Musco (2015)).

Corollary 3.3.

Suppose we choose $\mathbf{U}_0 = c\,\mathbf{I}$ as above. Then the guarantee of Theorem 3.2 holds with a rate parameter that depends only on the condition number $\kappa$ of $\mathbf{M}$.
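One simple way to realize such an initialization in practice is sketched below; the use of plain power iteration (rather than the method of Musco and Musco (2015)) and the exact scaling are our choices for illustration.

```python
import numpy as np

def estimate_spectral_norm(M, num_iters=50, seed=0):
    """Power iteration: a constant-factor estimate of ||M||_2 for a symmetric PSD M."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return float(v @ (M @ v))        # Rayleigh quotient at the approximate top eigenvector

def initial_point(M):
    """U_0 = c * I with c close to sqrt(||M||): a PD starting point at the right scale."""
    return np.sqrt(estimate_spectral_norm(M)) * np.eye(M.shape[0])

rng = np.random.default_rng(4)
B = rng.standard_normal((8, 8))
M = B @ B.T + np.eye(8)
U0 = initial_point(M)                # PD, commutes with everything, right scale
```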

3.3.2 Noise Stability

Theorem 3.2 assumes that the gradient descent updates are performed without any error. This is not realistic: for instance, any implementation of Algorithm 1 will incur rounding errors. Our next result addresses this issue by showing that Algorithm 1 is stable in the presence of small, arbitrary errors in each iteration; this establishes, for instance, the stability of our algorithm in the presence of round-off errors. Formally, we consider the setting where every gradient step incurs an additive error $\boldsymbol{\Delta}_t$.

The following theorem shows that as long as the errors are small enough, Algorithm 1 recovers the true squareroot up to an accuracy determined by the error floor. The proof of the theorem follows fairly easily from that of Theorem 3.2.

Theorem 3.4.

There exist universal numerical constants $c_1$ and $c_2$ such that the following holds: Suppose $\mathbf{U}_0$ is a PD matrix and the step size is chosen as in Theorem 3.2, where the relevant quantities are defined as before:

Suppose further that the per-step errors $\boldsymbol{\Delta}_t$ are small enough. Then, for every $t \ge 0$, the iterate $\mathbf{U}_t$ is a PD matrix with

Remarks:

  • Since the errors above are multiplied by a decreasing sequence, they can be bounded to obtain a limiting accuracy proportional to the per-step error bound (a numerical sketch of this error floor follows these remarks).

  • If there is an error in only the first iteration, i.e., $\boldsymbol{\Delta}_t = \mathbf{0}$ for $t \ge 1$, then the initial error is attenuated with every iteration.

    That is, our dependence on the initial error decays exponentially with respect to the time $t$. On the contrary, the best known results only guarantee that the dependence on the error will not increase significantly with respect to time (Higham, 2008).
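The sketch below (same assumed objective and heuristic step size as earlier; the error model, a random perturbation of fixed Frobenius norm added to every update, is ours) illustrates the error floor: the achievable accuracy scales with the size of the per-step error.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((8, 8))
M = B @ B.T + np.eye(8)
eta = 0.1 / np.linalg.norm(M, 2)
U0 = np.sqrt(np.linalg.norm(M, 2)) * np.eye(8)

for noise in [0.0, 1e-6, 1e-3]:
    U = U0.copy()
    for _ in range(3000):
        R = U @ U - M
        E = rng.standard_normal((8, 8))
        E *= noise / np.linalg.norm(E, 'fro')       # per-step error of fixed Frobenius norm
        U = U - eta * (R @ U.T + U.T @ R) + E
    print(noise, np.linalg.norm(U @ U - M, 'fro'))  # limiting accuracy tracks the noise level
```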

3.3.3 Lower Bound

We also prove the following lower bound, showing that gradient descent with a fixed step size requires a number of iterations that grows with the condition number $\kappa$ in order to achieve a small error.

Theorem 3.5.

For any value of the condition number $\kappa$, we can find a matrix $\mathbf{M}$ such that, for any fixed step size $\eta$, there exists an initialization $\mathbf{U}_0$ that has the same eigenvectors as $\mathbf{M}$, with norm and minimum singular value comparable to those of the solution, such that the error remains large for a number of iterations that grows with $\kappa$.

This theorem shows that the convergence rate of gradient descent fundamentally depends on the condition number $\kappa$, even if we start with a matrix that has the same eigenvectors as $\mathbf{M}$ and a scale comparable to that of the solution. Note that the lower bound of Theorem 3.5 is off from the upper bound of Theorem 3.2 by a factor polynomial in $\kappa$. Though we do not elaborate in this paper, it is possible to formally show that the dependence on $\kappa$ in Theorem 3.2 is the best bound possible using our argument (i.e., one along the lines of Section 3.2).
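A heuristic way to see where the $\kappa$ dependence comes from is the commuting reduction of Section 3.1 under the assumed objective $f(\mathbf{U}) = \frac{1}{2}\|\mathbf{U}\mathbf{U} - \mathbf{M}\|_F^2$ (this is an illustration in our notation, not the paper's proof). In the shared eigenbasis, the singular value corresponding to an eigenvalue $\lambda$ of $\mathbf{M}$ evolves as
$$\sigma \;\leftarrow\; \sigma - 2\eta\,\sigma\,(\sigma^2 - \lambda),$$
and near its fixed point $\sigma = \sqrt{\lambda}$ the error contracts by a factor $1 - 4\eta\lambda$ per step. Keeping the iteration stable for the largest mode ($\lambda = \|\mathbf{M}\|$, $\sigma \approx \sqrt{\|\mathbf{M}\|}$) forces a fixed step size $\eta = O(1/\|\mathbf{M}\|)$, so the slowest mode ($\lambda = \sigma_{\min}(\mathbf{M})$) contracts by only $1 - O(1/\kappa)$ per step, and $\Omega\bigl(\kappa\log\frac{1}{\epsilon}\bigr)$ iterations are needed to drive its error below $\epsilon$.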

4 Proof Sketch for Theorem 3.2

In this section, we present the proof of Theorem 3.2. To make our strategy concrete and transparent, we defer the full proofs of some technical lemmas to Appendix A.

At a high level, our framework consists of the following three steps:

  1. Show that all bad stationary points lie in a measure zero set of the form $\{\mathbf{U}: h(\mathbf{U}) = 0\}$ for some constructed potential function $h$. In this paper, for the matrix squareroot problem, we choose the potential function to be the smallest singular value function $\sigma_{\min}(\cdot)$.

  2. Prove that if the initial point has its potential bounded away from zero and the stepsize is chosen appropriately, then all iterates keep their potential bounded away from zero as well. That is, the updates always keep away from bad stationary points.

  3. Inside such regions, show that the objective function satisfies good properties such as smoothness and gradient dominance, which establishes convergence to a global minimum at a good rate.

Since we can make the potential threshold arbitrarily small, and since the bad set is a measure zero set, this essentially establishes convergence from a (Lebesgue) measure one set of starting points, proving global convergence.

We note that step 2 above implies that no stationary point found in the bad set is a local minimum: it must be either a saddle point or a local maximum. This is because gradient descent started at any point outside the set does not converge to it. Therefore, our framework is best suited to non-convex problems with saddle points but no spurious local minima.

Before we proceed with the full proof, we first illustrate the three steps above in a simple, special case where $n = 2$ and all relevant matrices are diagonal. Specifically, we choose the target matrix $\mathbf{M}$ to be a fixed $2 \times 2$ diagonal PD matrix and parameterize the iterate as a diagonal matrix $\mathbf{U} = \operatorname{diag}(x, y)$.

Here $x$ and $y$ are the unknown parameters, so the objective reduces to a function of $x$ and $y$ alone. The reason we restrict ourselves to diagonal matrices is that the parameter space is then two dimensional, letting us give a good visual representation of it. Figures 1 and 2 show the plots of the function value contours and the negative gradient flow, respectively, as functions of $x$ and $y$.

Figure 1: Contour of Objective Functions
Figure 2: Flow of Negative Gradient

We will use Figures 1 and 2 to qualitatively establish the three steps of our framework.

  1. From Figure 1, we note that the stationary point in the interior of the positive quadrant is the global minimum, the nonzero stationary points on the coordinate axes are saddle points, and the origin is a local maximum. We notice that all the stationary points which are not global minima lie on the surface $\{xy = 0\}$, that is, the union of the x-axis and the y-axis.

  2. By defining a box that keeps both coordinates bounded away from zero and bounded above (corresponding to the red box in Figure 2), we see that the negative gradient flow points inside the box, which means that for any point in the box, performing gradient descent with a small enough stepsize will ensure that all iterates stay inside the box (and hence keep away from the saddle points).

  3. Inside the red box, Figure 2 shows that the negative gradient flow points towards the global optimum. Moreover, we can establish upper and lower bounds on the magnitude of the gradient within the red box; this corresponds to establishing smoothness and gradient dominance, respectively.

Together, all the above observations along with standard results in optimization tell us that gradient descent has geometric convergence for this problem.
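The qualitative picture behind Figures 1 and 2 is easy to reproduce; the snippet below evaluates the objective and its gradient for the diagonal parameterization, taking $\mathbf{M} = \operatorname{diag}(1, 1)$ and the assumed objective $f(\mathbf{U}) = \frac{1}{2}\|\mathbf{U}\mathbf{U} - \mathbf{M}\|_F^2$ purely for illustration (the paper's exact choice of $\mathbf{M}$ and normalization may differ, but the structure of the stationary points is the same).

```python
import numpy as np

def f(x, y, m1=1.0, m2=1.0):
    """Objective restricted to U = diag(x, y), M = diag(m1, m2)."""
    return 0.5 * ((x**2 - m1)**2 + (y**2 - m2)**2)

def grad(x, y, m1=1.0, m2=1.0):
    """Gradient in (x, y): vanishes iff x in {0, +-sqrt(m1)} and y in {0, +-sqrt(m2)}."""
    return np.array([2.0 * x * (x**2 - m1), 2.0 * y * (y**2 - m2)])

# (+-1, +-1) are global minima, (0, 0) is a local maximum, and the four remaining
# stationary points on the axes are saddles; all bad stationary points lie on {xy = 0}.
for p in [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
    print(p, f(*p), grad(*p))
```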

We now present a formal proof of our result.

4.1 Location of Saddle Points

We first give a characterization of locations of all the stationary points which are not global minima.

Lemma 4.1.

Within the symmetric PSD cone, all stationary points of $f$ which are not global minima must satisfy $\sigma_{\min}(\mathbf{U}) = 0$.

Proof.

For any stationary point $\mathbf{U}$ of $f$ which is not on the boundary of the PSD cone, a linear algebra calculation on the first-order optimality condition gives:

Therefore, since $\mathbf{U}$ is not on the boundary of the PSD cone, it is invertible, which forces the residual to vanish; thus $\mathbf{U}$ is a global minimum. ∎
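For concreteness, here is the calculation under the assumed objective $f(\mathbf{U}) = \frac{1}{2}\|\mathbf{U}\mathbf{U} - \mathbf{M}\|_F^2$ (the paper's exact normalization may differ, which does not affect the conclusion). Writing $\mathbf{R} := \mathbf{U}\mathbf{U} - \mathbf{M}$, for symmetric $\mathbf{U}$ the gradient is
$$\nabla f(\mathbf{U}) \;=\; \mathbf{U}\mathbf{R} + \mathbf{R}\mathbf{U}.$$
If $\mathbf{U}$ is a stationary point in the interior of the PSD cone, then $\mathbf{U} \succ 0$, and $\mathbf{U}\mathbf{R} + \mathbf{R}\mathbf{U} = \mathbf{0}$ is a Sylvester equation in $\mathbf{R}$ whose coefficient matrices $\mathbf{U}$ and $-\mathbf{U}$ have disjoint spectra, so its only solution is $\mathbf{R} = \mathbf{0}$, i.e., $\mathbf{U}\mathbf{U} = \mathbf{M}$, and $\mathbf{U}$ is a global minimum.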

As mentioned before, note that all the bad stationary points are contained in the set $\{\mathbf{U}: \sigma_{\min}(\mathbf{U}) = 0\}$, which is a (Lebesgue) measure zero set.

4.2 Stay Away from Saddle Surface

Since the gradient at stationary points is zero, gradient descent can never converge to a global minimum if started from a suboptimal stationary point. Fortunately, in our case, the gradient updates keep away from bad stationary points. In the next lemma, we show that as long as we choose a suitably small learning rate, $\sigma_{\min}(\mathbf{U}_t)$ will never be too small.

Lemma 4.2.

Suppose the learning rate $\eta$ is chosen small enough (as a function of $\mathbf{U}_0$ and $\mathbf{M}$), with a small enough universal constant. Then, for every $t \ge 0$, the iterate $\mathbf{U}_t$ in Algorithm 1 is a PD matrix with

It turns out that the gradient updates not only keep $\sigma_{\min}(\mathbf{U}_t)$ from being too small, but also keep $\|\mathbf{U}_t\|$ from being too large.

Lemma 4.3.

Suppose the learning rate $\eta$ is chosen small enough. Then for every $t \ge 0$, the iterate $\mathbf{U}_t$ in Algorithm 1 satisfies:

Although $\|\mathbf{U}_t\|$ is not directly related to the surface containing the bad stationary points, the upper bound on $\|\mathbf{U}_t\|$ is crucial for the smoothness of the function $f$, which gives the good convergence rate in Section 4.3.

4.3 Convergence in Saddle-Free Region

So far, we have established both upper and lower bounds on the singular values of all iterates, given a suitably small learning rate. Next, we show that when the spectral norm of $\mathbf{U}$ is bounded, the function $f$ is smooth, and when $\sigma_{\min}(\mathbf{U})$ is bounded below, the function $f$ is gradient dominated.

Lemma 4.4.

The function $f$ is $\beta$-smooth in the region where $\|\mathbf{U}\|$ is bounded. That is, for any $\mathbf{U}, \mathbf{V}$ in this region, we have:

Lemma 4.5.

The function $f$ is $\alpha$-gradient dominated in the region where $\sigma_{\min}(\mathbf{U})$ is bounded below. That is, for any $\mathbf{U}$ in this region, we have:

Lemmas 4.4 and 4.5 are the formal versions of Eq.(6) and Eq.(7) in Section 3.2, which are essential for establishing geometric convergence.

Putting all the pieces together, we are now ready to prove our main theorem.

Proof of Theorem 3.2.

Recall the definitions in Theorem 3.2:

By choosing the learning rate with a small enough constant, we can satisfy the preconditions of Lemma 4.2 and Lemma 4.3 at the same time. Therefore, all iterates fall in the region:

Then, applying Lemma 4.4 and Lemma 4.5, we know that in this region the function $f$ has smoothness parameter:

and gradient dominance parameter:

That is, in this region $f$ is both $\beta$-smooth and $\alpha$-gradient dominated.

Finally, by the Taylor expansion of a smooth function (the descent lemma), we have:

The second-to-last inequality again follows by setting the constant in the learning rate small enough, and the last inequality follows from the gradient dominance property. This finishes the proof. ∎

5 Conclusion

In this paper, we take a first step towards addressing the large gap between local convergence results with good convergence rates and global convergence results with highly suboptimal convergence rates. We consider the problem of computing the squareroot of a PD matrix, a widely studied problem in numerical linear algebra, and show that non-convex gradient descent achieves global geometric convergence with a good rate. In addition, our analysis establishes the stability of this method to numerical errors. We note that this is the first method with provable robustness to numerical errors for this problem, and our result illustrates that global convergence results are also useful in practice, since they can shed light on the stability of optimization methods.

Our result shows that even in the presence of a large saddle point surface, gradient descent can avoid it and converge to the global optimum at a linear rate. We believe that our framework and proof techniques should be applicable to several other non-convex problems (especially those based on matrix factorization) in machine learning and numerical linear algebra, and would lead to the analysis of gradient descent and stochastic gradient descent in a transparent way while also addressing key issues like robustness to noise or numerical errors.

References

  • Agarwal et al. [2013] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, and Praneeth Netrapalli. Learning sparsely used overcomplete dictionaries via alternating minimization. arXiv preprint, 2013.
  • Aharon et al. [2006] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Signal Processing, IEEE Transactions on, 54(11):4311–4322, 2006.
  • Björck and Hammarling [1983] Åke Björck and Sven Hammarling. A Schur method for the square root of a matrix. Linear algebra and its applications, 52:127–140, 1983.
  • Candes et al. [2015] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. Information Theory, IEEE Transactions on, 61(4):1985–2007, 2015.
  • Carlson [1990] Neal Carlson. Federated square root filter for decentralized parallel processors. Aerospace and Electronic Systems, IEEE Transactions on, 26(3):517–525, 1990.
  • Ge et al. [2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
  • Golub and Van Loan [2012] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
  • Higham [1986] Nicholas J Higham. Newton’s method for the matrix square root. Mathematics of Computation, 46(174):537–549, 1986.
  • Higham [1987] Nicholas J Higham. Computing real square roots of a real matrix. Linear Algebra and its applications, 88:405–430, 1987.
  • Higham [1997] Nicholas J Higham. Stable iterations for the matrix square root. Numerical Algorithms, 15(2):227–242, 1997.
  • Higham [2008] Nicholas J Higham. Functions of matrices: theory and computation. Society for Industrial and Applied Mathematics (SIAM), 2008.
  • Jain and Netrapalli [2014] Prateek Jain and Praneeth Netrapalli. Fast exact matrix completion with finite samples. arXiv preprint, 2014.
  • Jain et al. [2013] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674. ACM, 2013.
  • Jin et al. [2017] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
  • Kaminski et al. [1971] Paul G Kaminski, Arthur E Bryson Jr, and Stanley F Schmidt. Discrete square root filtering: A survey of current techniques. Automatic Control, IEEE Transactions on, 16(6):727–736, 1971.
  • Koren et al. [2009] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
  • Lee and Seung [2001] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pages 556–562, 2001.
  • Lee et al. [2016] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. University of California, Berkeley, 1050:16, 2016.
  • Meini [2004] Beatrice Meini. The matrix square root from a new functional perspective: theoretical results and computational issues. SIAM journal on matrix analysis and applications, 26(2):362–376, 2004.
  • Musco and Musco [2015] Cameron Musco and Christopher Musco. Stronger approximate singular value decomposition via the block lanczos and power methods. arXiv preprint, 2015.
  • Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • Netrapalli et al. [2015] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. IEEE Transactions on Signal Processing, 63(18):4814–4826, 2015.
  • Pan et al. [1998] Victor Y Pan, Zhao Q Chen, and Ailong Zheng. The complexity of the algebraic eigenproblem. Mathematical Sciences Research Institute, Berkeley, page 71, 1998.
  • Polyak [1963] BT Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.
  • Recht et al. [2010] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
  • Sra [2015] Suvrit Sra. On the matrix square root via geometric optimization. arXiv preprint, 2015.
  • Sun et al. [2015] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
  • Sun and Luo [2015] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270–289. IEEE, 2015.
  • Tippett et al. [2003] Michael K Tippett, Jeffrey L Anderson, Craig H Bishop, Thomas M Hamill, and Jeffrey S Whitaker. Ensemble square root filters*. Monthly Weather Review, 131(7):1485–1490, 2003.
  • Tu et al. [2015] Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
  • Van Der Merwe and Wan [2001] Ronell Van Der Merwe and Eric Wan. The square-root unscented kalman filter for state and parameter-estimation. In Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, volume 6, pages 3461–3464. IEEE, 2001.
  • Williams [2012] Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 887–898. ACM, 2012.

Appendix A Proofs of Lemmas in Section 4

In this section, we restate the Lemmas in Section 4 which were used to prove Theorem 3.2, and present their proofs.

First, we prove the lemmas which give a lower bound and an upper bound on the eigenvalues of the intermediate matrices $\mathbf{U}_t$ in Algorithm 1. This shows that the iterates always stay away from the surface where the unwanted stationary points are located.

Lemma A.1 (Restatement of Lemma 4.2).

Suppose the learning rate $\eta$ is chosen small enough (as a function of $\mathbf{U}_0$ and $\mathbf{M}$), with a small enough universal constant. Then, for every $t \ge 0$, the iterate $\mathbf{U}_t$ is a PD matrix with

Proof.

We will prove the lemma by induction. The base case $t = 0$ holds trivially. Suppose the lemma holds for some $t$. We will now prove that it holds for $t + 1$. We have

(8)

When , using Lemma 4.3 we can bound the first term as

(9)

To bound the second term, consider any vector $\mathbf{w}$ with $\|\mathbf{w}\| = 1$, and write $\mathbf{w} = \sum_i c_i \mathbf{v}_i$, where $\mathbf{v}_i$ is the $i^{\textrm{th}}$ eigenvector of $\mathbf{U}_t$ and $\sum_i c_i^2 = 1$. Then:

(10)

where is due to the fact that is a PD matrix, so , and is because since , we have .

Plugging Eq.(9) and Eq.(10) into Eq.(8), we have:

When , we obtain:

and when , we have:

This concludes the proof. ∎

Lemma A.2 (Restatement of Lemma 4.3).

Suppose the learning rate $\eta$ is chosen small enough. Then for every $t \ge 0$, we have:

Proof.

We will prove the lemma by induction. The base case $t = 0$ is trivially true. Supposing the statement is true for $t$, we will prove it for $t + 1$.

Using the update equation of Algorithm 1, we have: