
Random perturbation of low rank matrices: Improving classical bounds

Sean O’Rourke Department of Mathematics, University of Colorado at Boulder, Boulder, CO 80309 sean.d.orourke@colorado.edu Van Vu Department of Mathematics, Yale University, PO Box 208283, New Haven, CT 06520-8283, USA van.vu@yale.edu  and  Ke Wang Institute for Mathematics and its Applications, University of Minnesota, Minneapolis, MN 55455, USA wangk@umn.edu
Abstract.

Matrix perturbation inequalities, such as Weyl’s theorem (concerning the singular values) and the Davis-Kahan theorem (concerning the singular vectors), play essential roles in quantitative science; in particular, these bounds have found application in data analysis as well as related areas of engineering and computer science.

In many situations, the perturbation is assumed to be random, and the original matrix has certain structural properties (such as having low rank). We show that, in this scenario, classical perturbation results, such as Weyl and Davis-Kahan, can be improved significantly. We believe many of our new bounds are close to optimal and also discuss some applications.

Key words and phrases:
Singular values, singular vectors, singular value decomposition, random perturbation, random matrix
2010 Mathematics Subject Classification:
65F15 and 15A42
S. O’Rourke is supported by AFOSR grant FA-9550-12-1-0083.
V. Vu is supported by research grants DMS-0901216 and AFOSR FA-9550-09-1-0167.

1. Introduction

The singular value decomposition of a real $m \times n$ matrix $M$ is a factorization of the form $M = U \Sigma V^{\mathrm{T}}$, where $U$ is an $m \times m$ orthogonal matrix, $\Sigma$ is an $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $V$ is an $n \times n$ orthogonal matrix. The diagonal entries of $\Sigma$ are known as the singular values of $M$. The columns of $U$ are the left-singular vectors of $M$, while the columns of $V$ are the right-singular vectors of $M$. If $M$ is symmetric, the singular values are given by the absolute values of the eigenvalues, and the singular vectors can be expressed in terms of the eigenvectors of $M$. Here, and in the sequel, whenever we write singular vectors, the reader is free to interpret this as left-singular vectors or right-singular vectors provided the same choice is made throughout the paper.
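The following short NumPy sketch (an illustration added here, not part of the original text; the test matrix is arbitrary) verifies the facts just stated: the factorization itself, and the relation between singular values and eigenvalues in the symmetric case.

```python
import numpy as np

# Illustrative example: compute the SVD of a small real matrix and
# check the basic identities described above.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U diag(s) V^T
print("singular values:", s)                       # non-negative, sorted decreasingly

# Columns of U are left-singular vectors; columns of Vt.T are right-singular vectors.
assert np.allclose(M, U @ np.diag(s) @ Vt)

# For a symmetric matrix, singular values are the absolute values of eigenvalues.
S = M.T @ M                                        # symmetric positive semidefinite
eigvals = np.linalg.eigvalsh(S)
assert np.allclose(np.sort(np.abs(eigvals)),
                   np.sort(np.linalg.svd(S, compute_uv=False)))
```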

An important problem in statistics and numerical analysis is to compute the first $k$ singular values and vectors of an $m \times n$ matrix $A$. In particular, the largest few singular values and corresponding singular vectors are typically the most important. Among others, this problem lies at the heart of Principal Component Analysis (PCA), which has a very wide range of applications (for many examples, see [27, 35] and the references therein), and of the closely related low rank approximation procedure often used in theoretical computer science and combinatorics. In applications, the dimensions $m$ and $n$ are typically large and $k$ is small, often a fixed constant.

1.1. The perturbation problem

A problem of fundamental importance in quantitative science (including pure and applied mathematics, statistics, engineering, and computer science) is to estimate how a small perturbation to the data affects the singular values and singular vectors. This problem has been discussed in virtually every textbook on quantitative linear algebra and numerical analysis (see, for instance, [8, 23, 24, 47]), and is the main focus of this paper.

We model the problem as follows. Consider a real (deterministic) $m \times n$ matrix $A$ with singular values

$$\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$$

and corresponding singular vectors $v_1, v_2, \ldots$. We will call $A$ the data matrix. In general, the vector $v_i$ is not unique. However, if $\sigma_i$ has multiplicity one, then $v_i$ is determined up to sign. Instead of $A$, one often needs to work with $A + E$, where $E$ represents the perturbation matrix. Let

$$\sigma_1' \ge \sigma_2' \ge \cdots \ge 0$$

denote the singular values of $A + E$ with corresponding singular vectors $v_1', v_2', \ldots$. In this paper, we address the following two questions.

Question 1.

When is $v_i'$ a good approximation of $v_i$?

Question 2.

When is $\sigma_i'$ a good approximation of $\sigma_i$?

These two questions are classically addressed by the Davis-Kahan-Wedin sine theorem and Weyl's inequality. Let us begin with the first question in the case when $i = 1$. A canonical way (coming from the numerical analysis literature; see for instance [22]) to measure the distance between two unit vectors $v$ and $v'$ is to look at $\sin \angle(v, v')$, where $\angle(v, v')$ is the angle between $v$ and $v'$ taken in $[0, \pi/2]$. It has been observed by numerical analysts (in the setting where $E$ is deterministic) for quite some time that the key parameter to consider in the bound is the gap (or separation) between the first and second singular values. The first result in this direction is the famous Davis-Kahan sine theorem [20] for Hermitian matrices. A version for the singular vectors was proved later by Wedin [57].
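As a concrete illustration (added here; the vectors are arbitrary and the helper function is a hypothetical name, not notation from the paper), the quantity $\sin \angle(v, v')$ can be computed as follows; taking the absolute value of the inner product keeps the angle in $[0, \pi/2]$ and makes the measure insensitive to the sign ambiguity of singular vectors.

```python
import numpy as np

def sin_angle(v, w):
    """sin of the angle between vectors v and w, taken in [0, pi/2]."""
    c = abs(np.dot(v, w)) / (np.linalg.norm(v) * np.linalg.norm(w))
    c = min(c, 1.0)                     # guard against rounding slightly above 1
    return np.sqrt(1.0 - c ** 2)

v = np.array([1.0, 0.0])
w = np.array([1.0, 1.0]) / np.sqrt(2)
print(sin_angle(v, w))                  # ~ sin(pi/4) = 0.7071...
```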

Throughout the paper, we use $\|M\|$ to denote the spectral norm of a matrix $M$. That is, $\|M\|$ is the largest singular value of $M$.

Theorem 3 (Davis-Kahan, Wedin; sine theorem; Theorem V.4.4 from [47]).
(1)

In certain cases, such as when $E$ is random, it is more natural to deal with the gap

(2) $\delta := \sigma_1 - \sigma_2$

between the first and second singular values of $A$ instead of $A + E$. In this case, Theorem 3 implies the following bound.

Theorem 4 (Modified sine theorem).
Remark 5.

Theorem 4 is trivially true when the right-hand side of the bound is at least one, since sine is always bounded above by one. In other words, even if the vector $v_1'$ is not uniquely determined, the bound is still true for any choice of $v_1'$. On the other hand, when the right-hand side is less than one, the proof of Theorem 4 reveals that the vector $v_1'$ is uniquely determined up to sign.

As the next example shows, the bound in Theorem 4 is sharp, up to the constant factor.

Example 6.

Let , and take

Then , with and . Hence, . In addition,

and a simple computation reveals that , but and . Thus,

since .

More generally, one can consider approximating the $j$-th singular vector $v_j$ or the space spanned by the first $j$ singular vectors $v_1, \ldots, v_j$. Naturally, in these cases, a version of Theorem 4 requires one to consider the gaps

$$\delta_j := \sigma_j - \sigma_{j+1};$$

see Theorems 19 and 21 below for details.

Question 2 is addressed by Weyl’s inequality. In particular, Weyl’s perturbation theorem [58] gives the following deterministic bound for the singular values (see [47, Theorem IV.4.11] for a more general perturbation bound due to Mirsky [40]).

Theorem 7 (Weyl’s bound).

For every $i$, $|\sigma_i' - \sigma_i| \le \|E\|$.

For more discussion concerning general perturbation bounds, we refer the reader to [10, 47] and references therein. We now pause for a moment to prove Theorem 4.

Proof of Theorem 4.

If , the theorem is trivially true since sine is always bounded above by one. Thus, assume . By Theorem 7, we have

and hence the singular vectors and are uniquely determined up to sign. By another application of Theorem 7, we obtain

Rearranging the inequality, we have

Therefore, by (1), we conclude that

and the proof is complete. ∎

1.2. The random setting

Let us now focus on the matrices $A$ and $E$. It has become common practice to assume that the perturbation matrix $E$ is random. Furthermore, researchers have observed that data matrices are usually not arbitrary. They often possess certain structural properties. Among these properties, one of the most frequently seen is having low rank (see, for instance, [14, 15, 16, 19, 51] and references therein).

The goal in this paper is to show that in this situation, one can significantly improve classical results like Theorems 4 and 7. To give a quick example, let us assume that $A$ and $E$ are $n \times n$ matrices and that $E$ is a random Bernoulli matrix, i.e., its entries are independent and identically distributed (iid) random variables that take values $\pm 1$ with probability $1/2$. It is well known that in this case $\|E\| = (2 + o(1))\sqrt{n}$ with high probability [7, Chapter 5]. (We use asymptotic notation under the assumption that $n \to \infty$; here $o(1)$ denotes a term which tends to zero as $n$ tends to infinity.) Thus, the above two theorems imply the following.

Corollary 8.

If $E$ is an $n \times n$ Bernoulli random matrix (more generally, Corollary 8 applies to a large class of random matrices with independent entries: the results in [7, Chapter 5], and hence Corollary 8, hold when $E$ is any random matrix whose entries are iid random variables with zero mean, unit variance (which is just a matter of normalization), and bounded fourth moment), then, for any , with probability ,

and

(3)

Among other things, this shows that we must have a gap $\delta$ significantly larger than $\sqrt{n}$ in order for the bound in (3) to be nontrivial. It turns out that the bounds in Corollary 8 are far from being sharp. Indeed, we present the results of a numerical simulation in which $A$ is a matrix of rank 2 and the noise $E$ is a random Bernoulli matrix. It is easy to see that for the parameters used in the simulation, Corollary 8 does not give a useful bound. However, Figure 1 shows that, with high probability, $\sin \angle(v_1, v_1')$ is small, which means $v_1'$ approximates $v_1$ with a relatively small error. Our main results attempt to address this inefficiency in the Davis-Kahan-Wedin and Weyl bounds and provide sharper bounds than those given in Corollary 8. As a concrete example, in the case when $E$ is a random Bernoulli matrix, our results imply the following bounds.

Figure 1. The cumulative distribution functions of $\sin \angle(v_1, v_1')$, where $A$ is a deterministic matrix with rank 2 (one choice of parameters for the figure on top and another for the one below) and the noise $E$ is a Bernoulli random matrix, evaluated from a number of independent samples in each case. In both figures, the largest singular value of $A$ is taken to be the same fixed value.
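A simulation in the spirit of Figure 1 can be reproduced with a few lines of NumPy. The sketch below is illustrative only: the dimension, rank-2 signal, signal strength, and number of trials are placeholder choices, not the values used for the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, sigma1, trials = 400, 2, 200.0, 200

# Rank-2 data matrix A with singular values sigma1 and sigma1 / 2.
U0, _ = np.linalg.qr(rng.standard_normal((n, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = sigma1 * np.outer(U0[:, 0], V0[:, 0]) + 0.5 * sigma1 * np.outer(U0[:, 1], V0[:, 1])

v1 = V0[:, 0]                                    # top right-singular vector of A
sines = []
for _ in range(trials):
    E = rng.choice([-1.0, 1.0], size=(n, n))     # Bernoulli +/-1 noise
    _, _, Vt = np.linalg.svd(A + E)
    c = min(abs(Vt[0] @ v1), 1.0)
    sines.append(np.sqrt(1 - c ** 2))

# Spectral norm of the last noise draw, roughly 2 sqrt(n).
print("||E|| / sqrt(n) ~", np.linalg.svd(E, compute_uv=False)[0] / np.sqrt(n))
print("median sin angle:", np.median(sines))     # empirical CDF can be plotted from `sines`
```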
Theorem 9.

Let $E$ be an $n \times n$ Bernoulli random matrix, and let $A$ be an $n \times n$ matrix with rank $r$. For every there exist constants (depending only on ) such that if and , then, with probability at least ,

Theorem 10.

Let $E$ be an $n \times n$ Bernoulli random matrix, and let $A$ be an $n \times n$ matrix with rank $r$ satisfying . For every , there exists a constant (depending only on ) such that, with probability at least ,

In particular, when the rank $r$ is significantly smaller than $n$, the bounds in Theorems 9 and 10 are significantly better than those appearing in Corollary 8. The intuition behind Theorems 9 and 10 comes from the following heuristic of the second author.

If $A$ has rank $r$, all actions of $A$ focus on an $r$-dimensional subspace; intuitively then, $E$ must act like an $r$-dimensional random matrix rather than an $n$-dimensional one.

This means that the real dimension of the problem is $r$, not $n$. While it is clear that one cannot automatically ignore the (rather wild) action of $E$ outside the range of $A$, this intuition, if true, explains the appearance of the factor $\sqrt{r}$ in the bounds of Theorems 9 and 10 instead of the factor $\sqrt{n}$ appearing in Corollary 8.

While Theorems 9 and 10 are stated only for Bernoulli random matrices $E$, our main results actually hold under very mild assumptions on $A$ and $E$. As a matter of fact, in the strongest results, we will not even need the entries of $E$ to be independent.

1.3. Preliminaries: Models of random noise

We now state the assumptions we require for the random matrix $E$. While there are many models of random matrices, we can capture almost all natural models by focusing on a common property.

Definition 11.

We say the random matrix $E$ is $(C_1, c_1, \gamma)$-concentrated if for all unit vectors $u$, $v$, and every $t > 0$,

(4) $\mathbb{P}\left( |u^{\mathrm{T}} E v| \ge t \right) \le C_1 \exp(-c_1 t^{\gamma}).$

The key parameter is $\gamma$. It is easy to verify the following fact, which asserts that the concentration property is closed under addition.

Fact 12.

If $E_1$ is $(C_1, c_1, \gamma)$-concentrated and $E_2$ is $(C_2, c_2, \gamma)$-concentrated, then $E_1 + E_2$ is $(C, c, \gamma)$-concentrated for some constants $C, c > 0$ depending on $C_1, c_1, C_2, c_2$.

Furthermore, the concentration property guarantees a bound on $\|E\|$. A standard net argument (see Lemma 28) shows

Fact 13.

If $E$ is $(C_1, c_1, \gamma)$-concentrated, then there are constants $C, c > 0$ such that $\|E\| \le C n^{1/\gamma}$ with high probability.

For readers not familiar with random matrix theory, let us point out why the concentration property is expected to hold for many natural models. If $E$ is random and $v$ is fixed, then the vector $Ev$ must look random. It is well known that in a high dimensional space, a random isotropic vector, with very high probability, is nearly orthogonal to any fixed vector. Thus, one expects that, very likely, the inner product of $Ev$ and a fixed unit vector $u$ is small. Definition 11 is a way to express this observation quantitatively.

It turns out that all random matrices with independent entries satisfying a mild condition have the concentration property. Indeed, if $e_{ij}$ denotes the $(i,j)$-entry of $E$ and the entries of $E$ are assumed to be independent, then the bilinear form

$$u^{\mathrm{T}} E v = \sum_{i,j} e_{ij} u_i v_j$$

is just a sum of independent random variables. If, in addition, the entries of $E$ have mean zero, then, by linearity, $u^{\mathrm{T}} E v$ also has mean zero. Hence, (4) can be viewed as a concentration inequality, which expresses how the sum of independent random variables deviates from its mean. With this interpretation in mind, many models of random matrices can be shown to satisfy (4). In particular, Lemma 34 shows that if $E$ is a Bernoulli random matrix, then $E$ is $(C_1, c_1, 2)$-concentrated for absolute constants $C_1, c_1 > 0$, and $\|E\| = O(\sqrt{n})$ with high probability [53, 54]. However, a convenient feature of the definition is that independence between the entries is not a requirement. For instance, it is easy to show that a random orthogonal matrix satisfies the concentration property. We continue the discussion of the $(C_1, c_1, \gamma)$-concentration property (Definition 11) in Section 6.
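The following Monte Carlo sketch (illustrative; the vectors, dimension, and trial count are arbitrary choices) compares the empirical tail of $u^{\mathrm{T}} E v$ for a Bernoulli matrix with the sub-Gaussian bound suggested by Hoeffding's inequality.

```python
import numpy as np

# For fixed unit vectors u, v and a Bernoulli +/-1 matrix E, the bilinear form
# u^T E v = sum_{i,j} e_ij u_i v_j is a sum of independent bounded terms, so
# Hoeffding's inequality gives P(|u^T E v| > t) <= 2 exp(-t^2 / 2),
# since sum_{i,j} (u_i v_j)^2 = 1.
rng = np.random.default_rng(2)
n, trials = 100, 5000

u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)

samples = np.array([u @ rng.choice([-1.0, 1.0], size=(n, n)) @ v for _ in range(trials)])

for t in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(samples) > t)
    hoeffding = 2 * np.exp(-t ** 2 / 2)
    print(f"t={t}: empirical tail {empirical:.4f} <= Hoeffding bound {hoeffding:.4f}")
```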

2. Main results

We now state our main results. We begin with an extension of Theorem 9.

Theorem 14.

Assume that $E$ is $(C_1, c_1, \gamma)$-concentrated for a trio of constants $C_1, c_1, \gamma > 0$, and suppose $A$ has rank $r$. Then, for any $t > 0$,

(5)

with probability at least

(6)
Remark 15.

Using Fact 13, one can replace $\|E\|$ on the right-hand side of (5) by the corresponding high-probability bound, which yields that

with probability at least

However, we prefer to state our theorems in the form of Theorem 14, as the bound on $\|E\|$ from Fact 13, in many cases, may not be optimal.

Because Theorem 14 is stated in such generality, the bounds can be difficult to interpret. For example, it is not completely obvious when the probability in (6) is close to one. Roughly speaking, the two error terms in the probability bound are controlled by the gap and the parameter $t$ (which can be taken to be any positive value). Specifically, the first term

(7)

goes to zero as the gap gets larger, and the second term

(8)

goes to zero as $t$ tends to infinity. As a consequence, we obtain the following immediate corollary of Theorem 14 (and Lemma 36) in the case when the entries of $E$ are independent.

Corollary 16.

Assume that $E$ is an $n \times n$ random matrix with independent entries which have mean zero and are bounded almost surely in magnitude by $K$ for some constant $K > 0$. Suppose $A$ has rank $r$. Then for every $\varepsilon > 0$, there exists a constant (depending only on $\varepsilon$ and $K$) such that if , then

(9)

with probability at least .

The first term on the right-hand side of (9) is precisely the conjectured optimal bound coming from the intuition discussed above. The second term is necessary: if $\sigma_1 \ll \|E\|$, then the intensity of the noise is much stronger than the strongest signal in the data matrix, so $E$ would corrupt $A$ completely. Thus, in order to retain crucial information about $A$, it seems necessary to assume that $\sigma_1$ is at least of the order of $\|E\|$. We are not absolutely sure about the necessity of the third term, but under the assumed lower bound on $\sigma_1$, this term is superior to the Davis-Kahan-Wedin bound appearing in Theorem 4.

While it remains an open question to determine whether the bounds in Theorem 14 are optimal, we do note that in certain situations the bounds are close to optimal. Indeed, in [9], the eigenvectors of perturbed random matrices are studied, and, under various technical assumptions on the matrices $A$ and $E$, the results in [9] give the exact asymptotic behavior of the dot product $v_1 \cdot v_1'$. Rewriting the dot product in terms of cosine (and further expressing the value in terms of sine), we find that the bounds in (5) match the exact asymptotic behavior obtained in [9], up to constant factors. Similar results in [43] also match the bound in (5), up to constant factors, in the case when $E$ is a Wigner random matrix and $A$ has rank one.

Corollary 16 provides a bound which holds with probability at least $1 - \varepsilon$. As another consequence of Theorem 14, we obtain the following bound which holds with probability converging to one.

Corollary 17.

Assume that $E$ is an $n \times n$ random matrix with independent entries which have mean zero and are bounded almost surely in magnitude by $K$ for some constant $K > 0$. Suppose $A$ has rank $r$. Then there exists a constant (depending only on $K$) such that if is any sequence of positive values converging to infinity and , then

with probability $1 - o(1)$. Here, the rate of convergence implicit in the $o(1)$ notation depends on the choice of sequence and on $K$.

Before continuing, we pause to make one final remark regarding Corollaries 16 and 17. In stating our main results below, we will always state them in the generality of Theorem 14. However, each of the results can be specialized in several different directions similar to what we have done in Corollaries 16 and 17. In the interest of space, we will not always state all such corollaries.

We are able to extend Theorem 14 in two different ways. First, we can bound the angle between $v_j$ and $v_j'$ for any index $j$. Second, and more importantly, we can bound the angle between the subspaces spanned by $\{v_1, \ldots, v_j\}$ and $\{v_1', \ldots, v_j'\}$, respectively. As the projection onto the subspaces spanned by the first few singular vectors (i.e., low rank approximation) plays an important role in a vast collection of problems, this result potentially has a large number of applications.

We begin by bounding the largest principal angle between the subspaces

(10) $V := \mathrm{span}\{v_1, \ldots, v_j\}$ and $V' := \mathrm{span}\{v_1', \ldots, v_j'\}$

for some integer $1 \le j \le r$, where $r$ is the rank of $A$. Let us recall that if $V$ and $W$ are two subspaces of the same dimension, then the (principal) angle between them is defined by

(11) $\sin \angle(V, W) := \|(I - P_W) P_V\|,$

where $P_W$ denotes the orthogonal projection onto the subspace $W$.
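Numerically, the sine of the largest principal angle can be computed from orthonormal bases of the two subspaces. The sketch below (added for illustration, with arbitrary test subspaces) uses the projection characterization above and cross-checks it against the singular values of the product of the basis matrices.

```python
import numpy as np

def largest_principal_sin(V, W):
    """sin of the largest principal angle between the column spans of V and W.

    V, W: matrices whose columns are orthonormal bases of two subspaces of
    equal dimension.  We use sin = || (I - P_W) P_V ||; equivalently, the
    smallest singular value of W^T V gives the cosine of the largest angle.
    """
    n = V.shape[0]
    P_V = V @ V.T
    P_W = W @ W.T
    return np.linalg.norm((np.eye(n) - P_W) @ P_V, 2)

rng = np.random.default_rng(3)
V, _ = np.linalg.qr(rng.standard_normal((6, 2)))
W, _ = np.linalg.qr(rng.standard_normal((6, 2)))

cosines = np.linalg.svd(W.T @ V, compute_uv=False)
assert np.isclose(largest_principal_sin(V, W), np.sqrt(1 - cosines.min() ** 2))
print(largest_principal_sin(V, W))
```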

Theorem 18.

Assume that $E$ is $(C_1, c_1, \gamma)$-concentrated for a trio of constants $C_1, c_1, \gamma > 0$. Suppose $A$ has rank $r$, and let $1 \le j \le r$ be an integer. Then, for any $t > 0$,

(12)

with probability at least

(13)

where $V$ and $V'$ are the $j$-dimensional subspaces defined in (10).

The error terms in (13) (as well as all other probability bounds appearing in our main results) can be controlled in a fashion similar to the error terms (7) and (8). Indeed, the first error term in (13) is controlled by the gap, and the second term is controlled by the parameter $t$.

We believe the factor of appearing in (12) is suboptimal and is simply an artifact of our proof. However, in many applications the rank $r$ is significantly smaller than the dimension of the matrices, making the contribution from this term negligible.

For comparison, we present an analogue of Theorem 4, which follows from the Davis-Kahan-Wedin sine theorem [47, Theorem V.4.4], using the same argument as in the proof of Theorem 4.

Theorem 19 (Modified Davis-Kahan-Wedin sine theorem: singular space).

Suppose $A$ has rank $r$, and let $1 \le j \le r$ be an integer. Then, for an arbitrary matrix $E$,

where $V$ and $V'$ are the $j$-dimensional subspaces defined in (10).

It remains an open question to give an optimal version of Theorem 18 for subspaces corresponding to an arbitrary set of singular values. However, we can use Theorem 18 repeatedly to obtain bounds for the case when one considers a few intervals of singular values. For instance, by applying Theorem 18 twice, we obtain the following result. Denote .

Corollary 20.

Assume that $E$ is $(C_1, c_1, \gamma)$-concentrated for a trio of constants $C_1, c_1, \gamma > 0$. Suppose $A$ has rank $r$, and let $j < k \le r$ be integers. Then, for any $t > 0$,

(14)

with probability at least

where

(15)
Proof.

Let

For any subspace $W$, let $P_W$ denote the orthogonal projection onto $W$. It follows that , where $I$ denotes the identity matrix. By definition of the subspaces, we have

Thus, by (11), we obtain

Theorem 18 can now be invoked to bound and , and the claim follows. ∎

Again, the factor of appearing in (14) follows from the analogous factor appearing in (12). Indeed, if this factor could be removed from (12), then the proof above shows that it would also be removed from (14).

For comparison, we present the following version of Theorem 4, which follows from Theorem 19 and the argument above. Again denote .

Theorem 21 (Modified Davis-Kahan-Wedin sine theorem: singular space).

Suppose $A$ has rank $r$, and let $j < k \le r$ be integers. Then, for an arbitrary matrix $E$,

where the relevant quantities are defined in (15).

We now consider the problem of approximating the $j$-th singular vector $v_j$ recursively in terms of the bounds for the preceding singular vectors $v_1, \ldots, v_{j-1}$.

Theorem 22.

Assume that $E$ is $(C_1, c_1, \gamma)$-concentrated for a trio of constants $C_1, c_1, \gamma > 0$. Suppose $A$ has rank $r$, and let $1 \le j \le r$ be an integer. Then, for any $t > 0$,

with probability at least

The bound in Theorem 22 depends inductively on the bounds for the first $j - 1$ singular vectors, and as such, we do not believe it to be sharp. The bound does, however, improve upon a similar recursive bound presented in [53].

Finally, let us present the general form of Theorem 10 for singular values. Readers can compare the result with the classical bound in Theorem 7.

Theorem 23.

Assume that $E$ is $(C_1, c_1, \gamma)$-concentrated for a trio of constants $C_1, c_1, \gamma > 0$. Suppose $A$ has rank $r$, and let $1 \le j \le r$ be an integer. Then, for any $t > 0$,

(16)

with probability at least

and

(17)

with probability at least

Remark 24.

Notice that the upper bound for given in (17) involves . In many situations, the lower bound in (16) can be used to provide an upper bound for .

We conjecture that the factors of and appearing in (17) are not needed and are simply an artifact of our proof. In applications, the rank $r$ is typically much smaller than the dimension, often making the contribution from these terms negligible. To illustrate this point, consider the following example. Let $A$ and $E$ be $n \times n$ symmetric matrices, and assume the entries on and above the diagonal of $E$ are independent random variables. Such a matrix is known as a Wigner matrix, and the eigenvalues of perturbed Wigner matrices have been well-studied in the random matrix theory literature; see, for instance, [31, 44] and references therein. In particular, the results in [31, 44] give the asymptotic location of the largest eigenvalues as well as their joint fluctuations. These exact asymptotic results imply that, in this setting, the bounds appearing in Theorem 23 are sharp, up to constant factors.

As the bounds in Theorem 23 are fairly general, let us state a corollary in the case when the entries of $E$ are independent random variables.

Corollary 25.

Assume that $E$ is an $n \times n$ random matrix with independent entries which have mean zero and are bounded almost surely in magnitude by $K$ for some constant $K > 0$. Suppose $A$ has rank $r$. Then, for every $\varepsilon > 0$, there exists a constant (depending only on $\varepsilon$ and $K$) such that, with probability at least ,

(18)

for all $j$.

Corollary 25 is an immediate consequence of Theorem 23, Lemma 36, and the union bound. In particular, the bound in (18) holds for all values of $j$ simultaneously with the probability stated above.
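The phenomenon behind Theorem 23 and Corollary 25 is easy to observe numerically. In the sketch below (illustrative parameters only, not from the paper), the leading singular values of a low-rank matrix move far less under Bernoulli noise than the worst-case Weyl bound $\|E\|$ allows.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 500, 3
U0, _ = np.linalg.qr(rng.standard_normal((n, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = U0 @ np.diag([300.0, 200.0, 100.0]) @ V0.T      # rank-3 signal

E = rng.choice([-1.0, 1.0], size=(n, n))            # Bernoulli +/-1 noise
s_A = np.linalg.svd(A, compute_uv=False)[:r]
s_AE = np.linalg.svd(A + E, compute_uv=False)[:r]

print("Weyl bound ||E||        :", np.linalg.svd(E, compute_uv=False)[0])   # ~ 2 sqrt(n)
print("observed |sigma'-sigma| :", np.abs(s_AE - s_A))                      # typically much smaller
```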

2.1. Related results

To conclude this section, let us mention a few related results. In [53], the second author managed to prove

under certain conditions. While the right-hand side is quite close to the optimal form in Theorem 9, the main problem here is that on the left-hand side one needs to square the sine function. The bound for $v_j$ with $j > 1$ was done by an inductive argument and was rather complicated. Finally, the problem of estimating the singular values was not addressed at all in [53].

Related results have also been obtained in the case where the random matrix $E$ contains Gaussian entries. In [56], R. Wang estimates the non-asymptotic distribution of the singular vectors when the entries of $E$ are iid standard normal random variables. Recently, Allez and Bouchaud have studied the eigenvector dynamics of $A + E$ when $A$ is a real symmetric matrix and $E$ is a symmetric Brownian motion (that is, $E$ is a diffusive matrix process constructed from a family of independent real Brownian motions) [2]. Our results also seem to have a close tie to the study of spiked covariance matrices, where a different kind of perturbation has been considered; see [12, 26, 41] for details. It would be interesting to find a common generalization for these problems.

3. Overview and outline

We now briefly give an overview of the paper and discuss some of the key ideas behind the proof of our main results. For simplicity, let us assume that $A$ and $E$ are real symmetric matrices. (In fact, we will symmetrize the problem in Section 4 below.) Let be the eigenvalues of $A$ with corresponding (orthonormal) eigenvectors . Let be the largest eigenvalue of $A + E$ with corresponding (unit) eigenvector .

Suppose we wish to bound (from Theorem 14). Since

it suffices to bound for . Let us consider the case when . In this case, we have

Since and , we obtain

Thus, the problem of bounding reduces to obtaining an upper bound for and a lower bound for the gap . We will obtain bounds for both of these terms by using the concentration property (Definition 11).

More generally, in Section 4, we will apply the concentration property to obtain lower bounds for the gaps when , which will hold with high probability. Let us illustrate this by now considering the gap . Indeed, we note that

Applying the concentration property (4), we see that with probability at least . As , we in fact observe that

Thus, if is sufficiently large, we have (say) with high probability.

In Section 5, we will again apply the concentration property to obtain upper bounds for terms of the form . At the end of Section 5, we combine these bounds to complete the proof of Theorems 14, 18, 22, and 23. In Section 6, we discuss the $(C_1, c_1, \gamma)$-concentration property (Definition 11). In particular, we generalize some previous results obtained by the second author in [53]. Finally, in Section 7, we present some applications of our main results.

Singular subspace perturbation bounds are applicable to a wide variety of problems. For instance, [13] discusses several applications of these bounds to high-dimensional statistics, including high-dimensional clustering, canonical correlation analysis (CCA), and matrix recovery. In Section 7, we show how our results can be applied to the matrix recovery problem. The general matrix recovery problem is the following: $A$ is a large matrix which is unknown to us. We can only observe its noisy perturbation $A + E$, or in some cases just a small portion of the perturbation. Our goal is to reconstruct $A$ or estimate an important parameter as accurately as possible from this observation. Furthermore, several problems from combinatorics and theoretical computer science can also be formulated in this setting. Special instances of the matrix recovery problem have been investigated by many researchers using spectral techniques and combinatorial arguments in ingenious ways [1, 3, 4, 5, 11, 14, 15, 16, 17, 18, 21, 28, 29, 32, 33, 34, 37, 39, 42, 45].

We propose the following simple analysis: if $A$ has rank $r$ and the noise $E$ is sufficiently small, then the projection of $A + E$ onto the subspace spanned by the first $r$ singular vectors of $A + E$ is close to the projection of $A + E$ onto the subspace spanned by the first $r$ singular vectors of $A$, as our new results show that these two subspaces are very close. Moreover, we can also show that the projection of $E$ onto the latter subspace is typically small. Thus, by projecting $A + E$ onto the span of its first $r$ singular vectors, we obtain a good approximation of the rank-$r$ approximation of $A$. In certain cases, we can repeat the above operation a few times to obtain sufficient information to recover $A$ completely or to estimate the required parameter with high accuracy and certainty.
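A minimal sketch of this recovery step, under the assumption of a known rank $r$ and with illustrative parameter choices (not the paper's algorithm verbatim), is the following: observe $A + E$, compute its SVD, and keep the rank-$r$ truncation.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 300, 2
U0, _ = np.linalg.qr(rng.standard_normal((n, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = U0 @ np.diag([150.0, 80.0]) @ V0.T            # unknown rank-2 matrix

E = rng.choice([-1.0, 1.0], size=(n, n))          # Bernoulli noise
U, s, Vt = np.linalg.svd(A + E)
A_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]        # rank-r truncation of A + E

print("relative error of noisy observation:", np.linalg.norm(E) / np.linalg.norm(A))
print("relative error after truncation    :",
      np.linalg.norm(A_hat - A) / np.linalg.norm(A))
```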

4. Preliminary tools

In this section, we present some of the preliminary tools we will need to prove Theorems 14, 18, 22, and 23.

To begin, we define the symmetric block matrices

(19) $\mathbf{A} := \begin{pmatrix} 0 & A \\ A^{\mathrm{T}} & 0 \end{pmatrix}$

and

$\mathbf{E} := \begin{pmatrix} 0 & E \\ E^{\mathrm{T}} & 0 \end{pmatrix}.$

We will work with the matrices $\mathbf{A}$ and $\mathbf{A} + \mathbf{E}$ instead of $A$ and $A + E$. If $A$ and $E$ are $m \times n$ matrices, then $\mathbf{A}$ and $\mathbf{E}$ are $(m + n) \times (m + n)$ symmetric matrices. In particular, the non-zero eigenvalues of $\mathbf{A}$ are $\pm \sigma_1, \ldots, \pm \sigma_r$, and the eigenvectors are formed from the left and right singular vectors of $A$. Similarly, the non-trivial eigenvalues of $\mathbf{A} + \mathbf{E}$ are $\pm \sigma_1', \pm \sigma_2', \ldots$ (some of which may be zero), and the eigenvectors are formed from the left and right singular vectors of $A + E$.
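The correspondence between the singular values of $A$ and the eigenvalues of the symmetric block matrix can be checked numerically; the small sketch below (added for illustration) verifies it for an arbitrary $4 \times 3$ matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 4, 3
A = rng.standard_normal((m, n))

# Symmetric dilation [[0, A], [A^T, 0]].
dilation = np.block([[np.zeros((m, m)), A],
                     [A.T, np.zeros((n, n))]])

eigs = np.sort(np.abs(np.linalg.eigvalsh(dilation)))[::-1]   # |eigenvalues|, descending
svals = np.linalg.svd(A, compute_uv=False)

# The m + n eigenvalues are +/- sigma_1, ..., +/- sigma_n plus |m - n| zeros.
assert np.allclose(eigs[:2 * n:2], svals)
assert np.allclose(eigs[2 * n:], 0.0)
print(eigs, svals)
```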

Along these lines, we introduce the following notation, which differs from the notation used above. The non-zero eigenvalues of will be denoted by with orthonormal eigenvectors , such that

Let be the orthonormal eigenvectors of corresponding to the -largest eigenvalues .

In order to prove Theorems 14, 18, 22, and 23, it suffices to work with the eigenvectors and eigenvalues of the matrices $\mathbf{A}$ and $\mathbf{A} + \mathbf{E}$. Indeed, Proposition 26 will bound the angle between the singular vectors of $A$ and $A + E$ by the angle between the corresponding eigenvectors of $\mathbf{A}$ and $\mathbf{A} + \mathbf{E}$.

Proposition 26.

Let and be unit vectors. Let be given by

Then

Proof.

Since , we have

Thus,

and the claim follows. ∎

We now introduce some useful lemmas. The first lemma below states that if $E$ is $(C_1, c_1, \gamma)$-concentrated, then $\mathbf{E}$ is $(C_1', c_1', \gamma)$-concentrated, for some new constants $C_1'$ and $c_1'$.

Lemma 27.

Assume that is -concentrated for a trio of constants . Let and . Then for all unit vectors , and every ,

(20)
Proof.

Let

be unit vectors in , where and . We note that

Thus, if any of the vectors are zero, (20) follows immediately from (4). Assume all the vectors are nonzero. Then

Thus, by (4), we have

and the proof of the lemma is complete. ∎

We will also consider the spectral norm of $\mathbf{E}$. Since $\mathbf{E}$ is a symmetric matrix whose eigenvalues in absolute value are given by the singular values of $E$, it follows that

(21) $\|\mathbf{E}\| = \|E\|.$

We introduce $\varepsilon$-nets as a convenient way to discretize a compact set. Let $\varepsilon > 0$. A set $\mathcal{N}$ is an $\varepsilon$-net of a set $S$ if for any $x \in S$, there exists $y \in \mathcal{N}$ such that $\|x - y\| \le \varepsilon$. The following estimate for the minimal size of an $\varepsilon$-net of a sphere is well-known (see for instance [52]).

Lemma 28.

A unit sphere in $n$ dimensions admits an $\varepsilon$-net of size at most

$$\left(1 + \frac{2}{\varepsilon}\right)^n.$$
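For intuition, the following sketch (illustrative; the greedy construction and discretization are not from the paper) builds an $\varepsilon$-net of the unit circle in $\mathbb{R}^2$ and compares its size with the bound of Lemma 28.

```python
import numpy as np

def greedy_net(points, eps):
    """Return a subset N of `points` such that every point is within eps of N."""
    net = []
    for x in points:
        if all(np.linalg.norm(x - y) > eps for y in net):
            net.append(x)
    return net

eps = 0.5
thetas = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
sphere = np.column_stack([np.cos(thetas), np.sin(thetas)])   # fine discretization of S^1

net = greedy_net(sphere, eps)
print("greedy net size :", len(net))                 # every circle point is within eps of the net
print("bound (1+2/eps)^n:", (1 + 2 / eps) ** 2)      # = 25 for n = 2, eps = 0.5
```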

Lemmas 29, 30, and 31 below are consequences of the concentration property (20).