Subspace Iteration Randomization and Singular Value Problems
A classical problem in matrix computations is the efficient and reliable approximation of a given matrix by a matrix of lower rank. The truncated singular value decomposition (SVD) is known to provide the best such approximation for any given fixed rank. However, the SVD is also known to be very costly to compute. Among the different approaches in the literature for computing low-rank approximations, randomized algorithms have attracted researchers’ recent attention due to their surprising reliability and computational efficiency in different application areas. Typically, such algorithms are shown to compute with very high probability low-rank approximations that are within a constant factor from optimal, and are known to perform even better in many practical situations. In this paper, we present a novel error analysis that considers randomized algorithms within the subspace iteration framework and show with very high probability that highly accurate low-rank approximations as well as singular values can indeed be computed quickly for matrices with rapidly decaying singular values. Such matrices appear frequently in diverse application areas such as data analysis, fast structured matrix computations and fast direct methods for large sparse linear systems of equations and are the driving motivation for randomized methods. Furthermore, we show that the low-rank approximations computed by these randomized algorithms are actually rank-revealing approximations, and the special case of a rank- approximation can also be used to correctly estimate matrix -norms with very high probability. Our numerical experiments are in full support of our conclusions.
key words: low-rank approximation, randomized algorithms, singular values, standard Gaussian matrix.
Randomized algorithms have established themselves as some of the most competitive methods for rapid low-rank matrix approximation, which is vital in many areas of scientific computing, including principal component analysis [47, 65] and face recognition [60, 78], large scale data compression [21, 22, 35, 56] and fast approximate algorithms for PDEs and integral equations [16, 33, 57, 71, 72, 83, 82]. In this paper, we consider randomized algorithms for low-rank approximations and singular value approximations within the subspace iteration framework, leading to results that simultaneously retain the reliability of randomized algorithms and the typical faster convergence of subspace iteration methods.
Given any matrix with , its singular value decomposition (SVD) is described by the equation
where is an column orthogonal matrix; is an orthogonal matrix; and with . Writing and in terms of their columns,
then and are the left and right singular vectors corresponding to , the -th largest singular value of . For any , we let
be the (rank-) truncated SVD of . The matrix is unique only if . The assumption that will be maintained throughout this paper for ease of exposition. Our results still hold for by applying all the algorithms on . Similarly, all our main results are derived under the assumption that . But they remain unchanged even if , and hence remain valid by a continuity argument. All our analysis is done without consideration of round-off errors, and thus need not hold exactly true in finite precision, especially when the user tolerances for the low-rank approximation are close to machine precision levels. Additionally, we assume throughout this paper that all matrices are real. In general, is an ideal rank- approximation to , due to the following celebrated property of the SVD:
While there are results similar to Theorem 1.1 for all unitarily invariant matrix norms, our work on low-rank matrix approximation bounds will only focus on the two most popular of such norms: the 2-norm and the Frobenius norm.
Theorem 1.1 states that the truncated SVD provides a rank- approximation to with the smallest possible 2-norm error and Frobenius-norm error. In the 2-norm, any rank- approximation will result in an error no less than , and in the Frobenius-norm, any rank- approximation will result in an error no less than . Additionally, the singular values of are exactly the first singular values of , and the singular vectors of are the corresponding singular vectors of . Note, however, that while the solution to problem (1.3) must be , solutions to problem (1.2) are not unique and include, for example, the rank- matrix defined below for any :
This subtle distinction between the 2-norm and Frobenius norm will later on become very important in our analysis of randomized algorithms (see Remark 3.4.) In Theorem 3.4 we prove an interesting result related to Theorem 1.1 for rank- approximations that only solve problems (1.2) and (1.3) approximately.
To compute a truncated SVD of a general matrix , one of the most straightforward techniques is to compute the full SVD and truncate it, with a standard linear algebra software package like the LAPACK . This procedure is stable and accurate, but it requires floating point operations, or flops. This is prohibitively expensive for applications such as data mining, where the matrices involved are typically sparse with huge dimensions. In other practical applications involving the truncated SVD, often the very objective of computing a rank- approximation is to avoid excessive computation on . Hence it is desirable to have schemes that can compute a rank- approximation more efficiently. Depending on the reliability requirements, a good rank- approximation can be a matrix that is accurate to within a constant factor from the optimal, such as a rank-revealing factorization (more below), or it can be a matrix that closely approximates the truncated SVD itself.
Many approaches have been taken in the literature for computing low-rank approximations, including rank-revealing decompositions based on the QR, LU, or two-sided orthogonal (aka UTV) factorizations [14, 25, 32, 42, 59, 63, 44]. Recently, there has been an explosion of randomized algorithms for computing low-rank approximations [16, 21, 22, 27, 28, 54, 53, 55, 61, 80, 70]. There is also software package available for computing interpolative decompositions, a form of low-rank approximation, and for computing the PCA, with randomized sampling . These algorithms are attractive for two main reasons: they have been shown to be surprisingly efficient computationally; and like subspace methods, the main operations involved in many randomized algorithms can be optimized for peak machine performance on modern architectures. For a detailed analysis of randomized algorithms and an extended reference list, see ; for a survey of randomized algorithms in data analysis, see .
The subspace iteration is a classical approach for computing singular values. There is extensive convergence analysis on subspace iteration methods [30, 19, 4, 3] and a large literature on accelerated subspace iteration methods . In general, it is well-suited for fast computations on modern computers because its main computations are in terms of matrix-matrix products and QR factorizations that have been highly optimized for maximum efficiency on modern serial and parallel architectures [19, 30]. There are two well-known weaknesses of subspace iteration, however, that limit its practical use. On one hand, subspace iteration typically requires very good separation between the wanted and unwanted singular values for good convergence. On the other hand, good convergence also often critically depends on the choice of a good start matrix [4, 3].
Another classical class of approximation methods for computing an approximate SVD are the Krylov subspace methods, such as the Lanczos algorithm (see, for example [10, 17, 49, 51, 69, 81].) The computational cost of these methods depends heavily on several factors, including the start vector, properties of the input matrix and the need to stabilize the algorithm. One of the most important part of the Krylov subspace methods, however, is the need to do a matrix-vector product at each iteration. In contrast to matrix-matrix products, matrix-vector products perform very poorly on modern architectures due to the limited data reuse involved in such operations, In fact, one focus of Krylov subspace research is on effective avoidance of matrix-vector operations in Krylov subspace methods (see, for example [31, 67].)
This work focuses on the surprisingly strong performance of randomized algorithms in delivering highly accurate low-rank approximations and singular values. To illustrate, we introduce Algorithm 1.1, one of the basic randomized algorithms (see .)
Basic Randomized Algorithm
|Input:||matrix with , integers and .|
|Output:||a rank- approximation.|
Draw a random test matrix .
Compute an orthogonal column basis for .
Compute , the rank- truncated SVD of .
Throughout this paper, a random matrix, such as in Algorithm 1.1, is a standard Gaussian matrix, i.e., its entries are independent standard normal variables of zero mean and standard deviation .
While other random matrices might work equally well, the choice of the Gaussian matrix provides two unique advantages: First, the distribution of a standard Gaussian matrix is rotationally invariant: If is an orthonormal matrix, then is itself a standard Gaussian matrix with the same statistical properties as . Second, our analysis is much simplified by the vast literature on the singular value probability density functions of the Gaussian matrix.
While Algorithm 1.1 looks deceptively simple, its analysis is long, arduous, and involves very strong doses of statistics . The following theorem establishes an error bound on the accuracy of as a low-rank approximation to . There are similar results in the Frobenius norm.
Comparing Theorem 1.2 with Theorem 1.1, it is clear that Algorithm 1.1 could provide a very good low rank approximation to with probability at least , despite its simple operations, provided that . While algorithms [16, 21, 22, 27, 28, 54, 53, 80] differ in their algorithm design, efficiency, and domain applicability, they typically share the same advantages of computational efficiency and approximation accuracy.
Algorithm 1.1 is the combination of Stages A and B of the Proto Algorithm in , where the truncated SVD is considered separately from low-rank approximation. In Section 2.3 we will discuss the pros and cons of SVD truncation vs. no truncation. Algorithm 1.1 is a special case of the randomized subspace iteration method (see Algorithm 2.2), for which Halko, Martinsson, Tropp  have developed similar results.
However, while the upper bound in Theorem 1.2 can be very satisfactory for many applications, there may be situations where singular value approximations are also desirable. In addition, it is well-known that in practical computations randomized algorithms often far outperform their error bounds [35, 58, 66], whereas the results in  do not suggest convergence of the computed rank- approximation to the truncated SVD in either Algorithm 1.1 or the more general randomized subspace iteration method.
Our entire work is based on novel analysis of the subspace iteration method, and we consider randomized algorithms within the subspace iteration framework. This allows us to take advantage of existing theories and technical machinery in both fields.
Current analysis on randomized algorithms focuses on the errors in the approximation of by a low rank matrix, whereas classical analysis on subspace iteration methods focuses on the accuracy in the approximate singular values. Our analysis allows us to obtain both kinds of results for both of these methods, leading to the stronger rank-revealing approximations. In terms of randomized algorithms, our matrix approximation bounds are in general tighter and can be drastically better than existing ones; in terms of singular values, our relative convergence lower bounds can be interpreted as simultaneously convergence error bounds and rank-revealing lower bounds.
The leading singular values computed by randomized algorithms are at least a good fraction of the true ones, regardless of how the singular values are distributed, and they converge quickly to the true singular values in case of rapid singular value decay. In particular, this result implies that randomized algorithms can also be used as efficient and reliable condition number estimators.
The above results, together with the fact that randomized algorithms compute low-rank approximations up to a dimension dependent constant factor from optimal, mean that these low-rank approximations are in fact rank-revealing factorizations. In addition, for rapidly decaying singular values, these approximations can be as accurate as a truncated SVD.
The subspace iteration method in general and the power method in particular is still slowly convergent without over-sampling in the start matrix. We present an alternative choice of the start matrix based on our analysis, and demonstrate its competitiveness.
The rest of this paper is organized as follows: In Section 2 we discuss subspace iteration methods and their randomized versions in more detail; in Section 3 we list a number of preliminary as well as key results needed for later analysis; in Section 4 we derive deterministic lower bounds on singular values and upper bounds on low-rank approximations; in Section 5 we provide both average case and large deviation bounds on singular values and low-rank approximations; in Section 6 we compare these approximations with other rank-revealing factorizations; in Section 7 we discuss how randomized algorithms can be used as efficient and reliable condition number estimators; in Section 8 we present supporting numerical experimental results; and in Section 9 we draw some conclusions and point out possible directions for future research.
Much of our analysis has its origin in the analysis of subspace iteration  and randomized algorithms . It relies both on linear algebra tools as well as statistical analysis to do some of the needed heavy lifting to reach our conclusions. To limit the length of this paper, we have put the more detailed parts of the analysis as well as some additional numerical experimental results in the Supplemental Material, which is accessible at SIAM’s on-line portal.
In this section, we present the main algorithms that are discussed in the rest of this paper. We also discuss subtle differences between our presentation of randomized algorithms and that in .
2.1 Basic Algorithms
We start with the classical subspace iteration method for computing the largest few singular values of a given matrix.
Basic Subspace Iteration
|Input:||matrix with , integers ,|
|and start matrix .|
|Output:||a rank- approximation.|
Compute an orthogonal column basis for .
Compute , the rank- truncated SVD of .
Given the availability of Lanczos-type algorithms for the singular value computations, the classical subspace iteration method is not widely used in practice except when . We present it here for later comparisons with its randomized version. We ignore the vast literature of accelerated subspace iteration methods (see, for example ) in this paper since our main goal here is to analyze the convergence behavior of subspace iteration method with and without randomized start matrix .
We have presented Algorithm 2.1 in an over-simplified form above to convey the basic ideas involved. In practice, the computation of would be prone to round-off errors. For better numerical accuracy, Algorithm A.1 in the Appendix should be used numerically to compute the matrix in Algorithm 2.1. In practical computations, however, Algorithm A.1 is often performed once every few iterations, to balance efficiency and numerical stability (see Saad .) In the rest of Section 2, any QR factorization of the matrix should be computed numerically through periodic use of Algorithm A.1.
While there is little direct analysis of subspace iteration methods for singular values (Algorithm 2.1) in the literature, one can generalize results of subspace iteration methods for symmetric matrices to the singular value case in a straightforward fashion. The symmetric matrix version of Theorem 2.1 can be found in .
(Bathe and Wilson) Assume that Algorithm 2.1 converges as . Then
Thus convergence is governed by the ratio . The per-iteration cost of Algorithm 2.1 depends linearly on . A choice can be economical if the more rapid convergence obtained through the ratio can more than offset the extra cost per iteration. Another important issue with Algorithm 2.1 is the constant hidden in the notation. This constant can be exceedingly large for the unfortunate choices of . In fact, an matrix that is almost orthogonal to any leading singular vectors will lead to large number of iterations. Both issues will be made clearer with our relative convergence theory for Algorithm 2.1 in Theorem 4.3.
A special case of Algorithm 2.1 is when . This is the classical power method for computing the 2-norm of a given matrix. This method, along with its randomized version, is included in Appendix A for later discussion in our numerical experiments (see Section 8.) The power method has the same convergence properties of Algorithm 2.1. More generally, the subspace iteration method is typically run with .
2.2 Randomized Algorithms
In order to enhance the convergence of Algorithm 2.1 in the absence of any useful information about the leading singular vectors, a sensible approach is to replace the deterministic start matrix with a random one, leading to
Randomized Subspace Iteration
|Input:||matrix with , integers ,|
|Output:||a rank- approximation.|
Draw a random start matrix .
Compute a rank- approximation with Algorithm 2.1.
The only difference between Algorithm 2.1 and Algorithm 2.2 is in the choice of , yet this difference will lead to drastically different convergence behavior. One of the main purposes of this paper is to show that the slow or non-convergence of Algorithm 2.1 due to bad choice of vanishes with near certainty in Algorithm 2.2. In particular, a single iteration ( in Algorithm 2.2) in the randomized subspace iteration method is often sufficient to return good enough singular values and low-rank approximations (Section 5).
Our analysis of deterministic and randomized subspace iteration method was in large part motivated by the analysis and discussion of randomized algorithms in . We have chosen to present the algorithms in Section 2 in forms that are not identical to those in  for ease of stating our results in Sections 4 through 8. Versions of Algorithm 2.2 have also appeared in  for solving large-scale discrete inverse problems.
2.3 To Truncate or not to Truncate
The randomized algorithms in Section 2 are presented in a slight different form than those in . One key difference is in the step of SVD truncation, which is considered an optional postprocessing step there. In this section, we discuss the pros and cons of SVD truncation. We start with the following simple lemma, versions of which appear in [7, 23, 35].
Given an matrix with orthonormal columns , with , then for any matrix ,
Lemma 2.2 makes it obvious that any SVD truncation of will only result in a less accurate approximation in the 2-norm and Frobenius norm. This is strong motivation for no SVD truncation. The SVD truncation of also involves the computation of the SVD of in some form, which also results in extra computation.
On the other hand, since singular values of approximate their corresponding singular values in at different rates, some singular values of may be poor approximations of those of , and need not be a good rank- approximation to , either. In contrast, for the right choices of , the rank- truncated SVD of can contain excellent approximate singular values and result in a good rank- approximation to as well. So the choice of whether to truncate the SVD of depends on practical considerations of computational efficiency and demands on quality of singular value and low-rank approximations. This paper focuses on a rank- approximations obtained from truncated SVD of .
In this section we build some of the technical machinery needed for our heavy analysis later on. We start by reciting two well-known results in matrix analysis, and then develop a number of theoretical tools that outline our approach in the low-rank approximation analysis. Some of these results may be of interest in their own right. For any matrix , we use to denote its -th largest singular value.
The Cauchy interlacing theorem shows the limitations of any approximation with an orthogonal projection.
(Golub and van Loan [30, p. 411]) Let be an matrix and be a matrix with orthonormal columns. Then for .
A direct consequence of Theorem 3.1 is that , where is any submatrix of .
Weyl’s monotonicity theorem relates singular values of matrices and to those of .
(Weyl’s monotonicity theorem [43, Thm. 3.3.16]) Let and be matrices with . Then
The Hoffman-Wielandt theorem bounds the errors in the differences between the singular values of and those of in terms of .
(Hoffman and Wielandt ) Let and be matrices with . Then
Below we develop a number of theoretical results that will form the basis for our later analysis on low-rank approximations. Theorem 3.4 below is of potentially broad independent interest. Let be a rank- approximation to . Theorem 3.4 below relates the approximation error in the Frobenius norm to that in the 2-norm as well as the approximation errors in the leading singular values. It will be called the Reverse Eckart and Young Theorem due to its complimentary nature with Theorem 1.1 in the Frobenius norm.
(Reverse Eckart and Young) Given any matrix , and let be a matrix with rank at most such that
for some . Then we must have
Equation (3.2) can be simplified to
when is larger than or close to . On the other hand, if , then equation (3.2) simplifies to
where the last ratio can be much smaller than , implying a much better rank- approximation in . Similar comments apply to equation (3.1). This interesting feature of Theorem 3.4 is one of the reasons why our eventual 2-norm and Frobenius norm upper bounds are much better than those in Theorem 1.2 in the event that . This also has made our proofs in Appendix B somewhat involved in places.
Equation (3.3) asserts that a small in equation (3.1) necessarily means good approximations to all the leading singular values of . In particular, means the leading singular values of and must be the same. However, our singular value analysis will not be based on Equation (3.3), as our approach in Section 4 provides us with much better results.
since is a rank- matrix. It follows that
Our next theorem is a generalization of Theorem 1.1.
Let be an matrix with orthonormal columns, let , and let be the rank- truncated SVD of . Then is an optimal solution to the following problem
In addition, we also have
Problem (3.5) in Theorem 3.5 is a type of restricted SVD problem. Oddly enough, this problem becomes much harder to solve for the 2-norm. In fact, might not even be the solution to the corresponding restricted SVD problem in 2-norm. Combining Theorems 3.4 and 3.5, we obtain
Our low-rank approximation analysis in the -norm will be based on equation (3.7). While this is sufficient, it also makes our -norm results perhaps weaker than they should be due to the mixture of the -norm and the Frobenius norm.
Thus we can expect to also be an excellent rank- approximation to as long as points to the principle singular vector directions.
4 Deterministic Analysis
In this section we perform deterministic convergence analysis on Algorithm 2.1. Theorem 4.3 is a relative convergence lower bound, and Theorem 4.4 is an upper bound on the matrix approximation error. Both appear to be new for subspace iteration. Our approach, while quite novel, was motivated in part by the analysis of subspace iteration methods by Saad  and randomized algorithms in . Since Algorithm 1.1 is a special case of Algorithm 2.2 with , which in turn is a special case of Algorithm 2.1 with an initial random matrix, our analysis applies to them as well and will form the basis for additional probabilistic analysis in Section 5.
4.1 A Special Orthonormal Basis
We begin by noticing that the output in Algorithm 2.1 is also the rank- truncated SVD of the matrix , due to the fact that is column orthonormal. In fact, columns of are nothing but an orthonormal basis for the column space of matrix . This is the reason why Algorithm 2.1 is called subspace iteration. Lemma 4.1 below shows how to obtain alternative orthonormal bases for the same column space. We omit the proof.
In the notation of Algorithm 2.1, assume that is a non-singular matrix and that has full column rank. Let be the QR factorization of the matrix , then
define and partition
where . The introduction of the additional parameter is to balance the need for oversampling for reliability (see Theorem 1.2) and oversampling for faster convergence (see Theorem 2.1). We also partition , where , , and are , , and . This partition allows us to further write
The matrix has at least as many columns as rows. Assume it is of full row rank so that its pseudo-inverse satisfies
Below we present a special choice of that will reveal the manner in which convergence to singular values and low-rank approximations takes place. Ideally, such an would orient the first columns of in the directions of the leading singular vectors in . We choose
where the matrix is chosen so that is non-singular and . Recalling equation (4.2), this choice of allows us to write
Notice that we have created a “gap” in : the largest singular value in is , which is potentially much smaller than , the smallest singular value in . We can expect to converge to rather quickly when , if and if the matrix is not too large in norm. Our convergence analysis of Algorithms 2.1 and 2.2 will mainly involve deriving upper bounds on various functions related to . Our gap disappears when we choose , in which case our results in Section 5 will be more in line with Theorem 1.2.
By equation (4.4), the QR factorization of can now be written in the following partition:
We will use this representation to derive convergence upper bounds for singular value and rank-k approximations. In particular, we will make use of the fact that the above QR factorization also embeds another one
We are now ready to derive a lower bound on .
It might seem more intuitive in equation (4.3) to choose where solves the following least squares problem
Our choice of seems as effective and allows simpler analysis.
On the other hand, we also have
Combining these two relations, and together with the fact that , we obtain (4.7). Q.E.D.
4.2 Deterministic Bounds
In this section we develop the analysis in Section 4.1 into deterministic lower bounds for singular values and upper bounds for rank-k approximations.
Since the interlacing theorem 3.1 asserts an upper bound , equation (4.7) provides a nice lower bound on . These bounds mean that is a good approximation to as long as is small. This consideration is formalized in the theorem below.
This, together with lower bound (4.7), gives the result in Theorem 4.3 for . To prove Theorem 4.3 for any , we observe that since , all that is needed is to repeat all previous arguments for a rank truncated SVD. Q.E.D.
Theorem 4.3 makes explicit the two key factors governing the convergence of Algorithm 2.1. On one hand, we can expect fast convergence for if . On the other hand, an unfortunate choice of could potentially make very large, leading to slow converge even if the singular values do decay quickly. The main effect of randomization in Algorithm 2.2 is to ensure a reasonably sized with near certainty. See Theorem 5.8 for a precise statement and more details.
Since , and since according to equation (4.6), the above right hand side becomes
where we have used the fact that (see (4.6))
We are now ready to prove
With the notation of Section 4, we have
The last expression is a symmetric positive semi-definite matrix. This allows us to write
By a continuity argument, the last relation remains valid even if does not have full column rank.