Random perturbation and matrix sparsification and completion

Sean O’Rourke Department of Mathematics, University of Colorado at Boulder, Boulder, CO 80309 sean.d.orourke@colorado.edu Van Vu Department of Mathematics, Yale University, New Haven, CT 06520, USA van.vu@yale.edu and Ke Wang Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong kewang@ust.hk
Abstract.

We discuss general perturbation inequalities when the perturbation is random. As applications, we obtain several new results concerning two important problems: matrix sparsification and matrix completion.

S. O’Rourke is supported by NSF grant ECCS-1610003.
V. Vu is supported by research grants DMS-0901216 and AFOSR FA-9550-09-1-0167.
Ke Wang is supported by HKUST Initiation Grant IGN16SC05.

1. Introduction

Given a large (data) matrix, the objects of interest are often its essential spectral parameters, such as the spectral norm, the leading singular values and vectors, or the subspace spanned by the first few singular vectors (which forms the basis for low-rank approximation).

It has been realized that when the matrix is sparse, these computations can be done much faster. Of course, one needs to create the sparse version of a matrix in such a way that its essential parameters do not deviate too much from those of the original. So, given any sparsifying algorithm, the fundamental technical issue is to bound the difference between the key parameters of the original matrix (the input) and those of its sparsification (the output). There is a vast literature on this subject, including [2, 4, 5, 18, 22, 28, 38, 39, 37, 31, 32, 43, 48] and [41, Chapter 6], as well as the references therein. Since this subject has been intensively studied, the collection of references above, as well as our brief discussion below, is far from exhaustive.

In this paper, we present a new method to handle this technical issue through the consideration of a popular and natural randomized algorithm. In previous works, attention has been devoted to bounding the singular values. In certain cases, we will be able to significantly improve upon existing bounds. Furthermore, we obtain similar results for singular vectors and subspaces as well. Our argument works most effectively when the matrix has low rank. This is a popular (and realistic) assumption.

The key tools in matrix analysis are perturbation inequalities. These inequalities tell us how much a key parameter of the matrix changes under a perturbation. Our analysis is based on the observation that when the perturbation is random, we can obtain much better bounds compared to what is available in the classical literature. This idea, and the new tools we develop based on it, appear useful in many other problems. As an illustration, we will discuss another popular problem, the matrix completion problem, and prove a new result.

Remark 1.

We will assume that $A$ has rank $r$, which is relatively small compared to the dimensions of the matrix. Our analysis still holds if $A$ has full rank and all but the first $r$ singular values are negligible; one just adds an error term of the order of the neglected singular values to the relevant estimates.

The rest of the paper is organized as follows. In the next section, we present our new results for matrix sparsification. The proofs and the general perturbation bounds behind them are presented in Sections 3 and 4. In Section 5, we discuss the matrix recovery problem.

1.1. Notation

We assume that $A$ is an $m \times n$ matrix having rank $r$. We denote by $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ the non-zero singular values of $A$; $\|A\| = \sigma_1$ is the spectral norm of $A$. Let $a_{ij}$ be the entries of $A$, let $\|A\|_F$ denote the Frobenius norm, and let $\|A\|_\infty := \max_{i,j} |a_{ij}|$. We also make use of the Cauchy–Schwarz constant of $A$. It is clear, from the Cauchy–Schwarz inequality, that this constant is at most 1. In many natural situations, this parameter can be much smaller than 1; for instance, it becomes small when $A$ has only a small proportion of non-zero entries. All the asymptotic notation is used under the assumption that the dimensions $m$ and $n$ tend to infinity.

2. New results

As pointed out in the introduction, our key technical question is

Question 2.

How much does the sparsification procedure alter the key spectral parameters of the matrix?

A popular way to sparsify a matrix is to use random sampling, keeping each entry with a probability proportional to its weight. Let us describe a key result concerning this method. Assign to each position $(i,j)$ a sampling probability $p_{ij}$ proportional to the weight of the entry $a_{ij}$. Choose one entry at random with respect to this distribution and let $A_1$ be the matrix with exactly one nonzero entry, namely $a_{ij}/p_{ij}$ at the selected position. Repeating the experiment $N$ times independently, we end up with matrices $A_1, \ldots, A_N$. Set $B := \frac{1}{N}\sum_{t=1}^{N} A_t$. It is trivial to see that $\mathbb{E} B = A$. Furthermore, if $N$ is relatively small compared to $mn$, then $B$ is sparse, as it has at most $N$ non-zero entries (some positions may get selected more than once).
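To make the sampling mechanism concrete, here is a minimal NumPy sketch. The particular weights (proportional to $a_{ij}^2$ here) and the helper name `sample_sparsifier` are illustrative assumptions; the result of [28] uses its own specific sampling distribution.

```python
import numpy as np

def sample_sparsifier(A, N, rng=None):
    """Sparsify A by sampling N positions i.i.d. from a weight-proportional
    distribution and averaging the resulting single-entry matrices.

    Illustrative sketch: the weights below are proportional to a_ij^2,
    which is one common choice, not necessarily the one used in [28]."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    weights = A.ravel() ** 2
    probs = weights / weights.sum()           # sampling distribution over positions
    idx = rng.choice(m * n, size=N, p=probs)  # N i.i.d. draws (repetition allowed)
    B = np.zeros_like(A, dtype=float)
    for t in idx:
        i, j = divmod(int(t), n)
        # each draw contributes a_ij / (N * p_ij), so that E[B] = A
        B[i, j] += A[i, j] / (N * probs[t])
    return B
```

For instance, `B = sample_sparsifier(A, N=10_000)` produces a matrix with at most `N` nonzero entries whose expectation is `A`.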

The following result is due to Kundu and Drineas (see [28] and [41, Section 6.3]).

Theorem 3.

For every , there exists such that if is an matrix and , then

with probability at least .

This result can be used to address Question 2, when combined with classical results from perturbation theory. For example, let us consider the computation of the spectral norm (largest singular value). The goal is to find $\|A\|$, but after the sparsification, one ends up computing $\|B\|$ instead. Using Theorem 3, we obtain

(1)

We will use short hand and set . In practice, it is often important to have a relative error term; (1) is equivalent to

(2)

Notice that since ,

(3)

It is worth discussing when is significantly smaller than 1. Clearly, we must choose , but that is not enough. The fact that is equivalent to , which means

This implies

In practice, we want to be significantly less than . For example, set , for some small positive constant . Then we have (ignoring the log factor) that

This requirement is naturally satisfied if $A$ has low rank. Otherwise, one needs to take a closer look at the distribution of the singular values of $A$.

In order to state our new results, let us consider a slight modification of the above algorithm. Choose each entry of $A$ independently, keeping the $(i,j)$ entry with probability $p_{ij}$, and define $B$ to be the matrix with entries $b_{ij} = a_{ij}\xi_{ij}/p_{ij}$, where the $\xi_{ij}$ are independent $\{0,1\}$-valued indicator random variables (with $\mathbb{P}(\xi_{ij} = 1) = p_{ij}$). It is clear that $\mathbb{E} B = A$, and in expectation $B$ has $\sum_{i,j} p_{ij}$ non-zero entries. (One can easily show, using standard large deviation tools, that with high probability the number of nonzero entries is of the same order.) The advantage of this process is that all entries can be selected at once, and there will be no repeated entries. From this point of view, this new process is a somewhat more natural sparsification procedure. On the other hand, taking independent copies of a single-entry matrix in the original process is convenient for the application of the matrix Chernoff bound. We are not going to use this tool in our proof. To guarantee that this procedure is well defined, we assume

Since we are going to compare our new result to Theorem 3, we assume that the matrix and the parameter satisfy , which guarantees . (Clearly, estimates with are not useful.)
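For concreteness, the modified (Bernoulli) procedure above can be sketched as follows; the array of keep-probabilities `P` is left generic, so this illustrates only the mechanism, not a particular choice of weights.

```python
import numpy as np

def bernoulli_sparsifier(A, P, rng=None):
    """Keep entry (i, j) independently with probability P[i, j] and rescale
    kept entries by 1 / P[i, j], so that E[B] = A.

    P is any array of keep-probabilities in (0, 1] with the same shape as A;
    the expected number of nonzero entries of B is P.sum()."""
    rng = np.random.default_rng(rng)
    xi = rng.random(A.shape) < P      # independent {0,1} indicators with E[xi] = P
    return np.where(xi, A / P, 0.0)   # unbiased: E[B_ij] = P_ij * (a_ij / P_ij) = a_ij
```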

Set . Our first theorem asserts that if then we can obtain a quadratic improvement, namely a bound of order . Recall that and , so is at least . In a typical application, we have for some small constant or even . In these cases .

Theorem 4.

For any and the following holds with sufficiently large . With probability at least ,

(4)

Here and later, the constant implicit in the asymptotic notation is absolute.

Denote by $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ the non-zero singular values of $A$, and let $u_i$ and $v_i$ be the left and right singular vectors corresponding to $\sigma_i$. For $1 \le i \le r$, we have $A v_i = \sigma_i u_i$ and $A^{\mathsf{T}} u_i = \sigma_i v_i$, respectively. For an integer $k$, set

(5)

Define . Similar to , this is the relative error for estimating using Theorem 3 and Weyl’s bound: . The next theorem gives us a quadratic improvement.

Theorem 5.

For any and the following holds with sufficiently large . For any , with probability at least ,

Next we bound the difference between the singular subspaces of $A$ and $B$. For a subspace $V$, denote by $P_V$ the orthogonal projection onto $V$. To measure the distance between subspaces, let us recall that if $V$ and $W$ are two subspaces of the same dimension, then the (principal) angle between them is defined via $\sin \angle(V, W) := \|P_V - P_W\|$.
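Numerically, this distance is easy to evaluate; the sketch below returns the sine of the largest principal angle between two column spans (which, for subspaces of equal dimension, equals $\|P_V - P_W\|$). It is an illustration only and plays no role in the proofs.

```python
import numpy as np

def sin_max_principal_angle(X, Y):
    """Sine of the largest principal angle between the column spans of X and Y
    (both assumed to have the same number of columns)."""
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # the singular values of Qx^T Qy are the cosines of the principal angles
    cosines = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    c = np.clip(cosines.min(), -1.0, 1.0)
    return float(np.sqrt(max(0.0, 1.0 - c * c)))

def leading_left_subspace(M, k):
    """Matrix whose columns are the first k left singular vectors of M."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]
```

For example, `sin_max_principal_angle(leading_left_subspace(A, k), leading_left_subspace(B, k))` measures how far the sparsification has moved the leading $k$-dimensional singular subspace.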

A new parameter, the $k$-th spectral gap $\delta_k := \sigma_k - \sigma_{k+1}$, plays a crucial role in the perturbation of singular subspaces. A classical theorem of Wedin [47] (which extends an earlier result of [17]) asserts that if one considers the subspaces spanned by the first $k$ singular vectors of $A$ and of $B$, then

This exact inequality can be found in [34] and differs only slightly from those in [47].

In our situation and . (Let us point out that Theorem 3 gives the bound for , but one can prove that a similar bound holds for ; see Lemma 9.) Thus, using Wedin’s theorem together with Theorem 3, we obtain

(6)

Recall that . If is normalized to have average entry of size 1 (namely ), then , as .

For low rank matrices, we have the following new result.

Theorem 6.

For any and the following holds with sufficiently large . For any , with probability at least ,

(7)

The same conclusion also holds for .

Since is often significantly larger than and is often significantly larger than , each term in the new estimate improves the bound in (6). More importantly, (6) shows that the approximation of by is reliable if , namely the -th singular value gap of has to be considerably large. Theorem 6 allows us to use the approximation under a much weaker assumption, namely and .

Now we briefly discuss the situation where but is still relatively small compared to . Define where $u_i$ and $v_i$ are the left and right (unit) singular vectors of $A$, respectively, and $\|\cdot\|_\infty$ denotes the sup-norm. By definition . If the $u_i$ and $v_i$ are delocalized vectors, this parameter is much smaller. If the $u_i$ and $v_i$ are balanced, one expects .
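A simple numerical proxy for such a delocalization parameter, assuming it is built from the sup-norms of the singular vectors as the discussion suggests (the exact definition in the text may differ), is the following.

```python
import numpy as np

def delocalization_parameter(A, r):
    """Largest sup-norm among the first r left and right singular vectors of A.
    A d-dimensional unit vector has sup-norm between 1/sqrt(d) and 1;
    delocalized vectors give values near the lower end."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return float(max(np.abs(U[:, :r]).max(), np.abs(Vt[:r, :]).max()))
```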

Theorem 7.

For any and the following holds with sufficiently large . With probability at least ,

(8)

where

Our new bound now is either or . If is smaller than , then again we get the quadratic improvement. Here are a few scenarios that guarantee that the terms of are at most .

  • . In this case, by the definition of . If we choose for some small constant , then it suffices to assume .

  • . Using and a simple calculation, we have .

We can prove similar statements for other singular values and subspaces spanned by other collections of singular vectors; details will appear in the full version of the paper.

3. Proofs of Theorem 4 and Theorem 7

3.1. The intuition

Let us recall that we get (1) by combining Theorem 3 with Weyl’s inequality. Both results are sharp. Our main observation is that it is their combination that is wasteful.

To see why, recall that the triangle inequality asserts that $\|A + E\| \le \|A\| + \|E\|$. Equality can occur. Indeed, if $A$ and $E$ share the same pair of leading singular vectors, then $\|A + E\| = \|A\| + \|E\|$.

In our situation, however, the perturbation $E$ is a random matrix. Intuitively, we expect its singular vectors to be random. Thus, the leading singular vector of $E$ should not be the same as the leading singular vector of $A$. In fact, we expect them to be more or less orthogonal. If this can be quantified, then one can show that the impact of $E$ on the largest singular value is negligible.

However, there is a danger that the second singular value can become very large after the perturbation. In terms of Weyl’s inequality, this means the sum $\|A\| + \|E\|$ can still be nearly attained if the second singular value of $E$ is close to the first and the corresponding singular vector is close to the leading singular vector of $A$. Again, when $E$ is random, we expect this vector to be far from the leading singular vector of $A$. In fact, we expect it to be more or less orthogonal to the plane spanned by the leading singular vectors of $A$ and $E$. Of course, next we need to pay attention to the third singular vector, and so on.

Apparently, this process cannot be carried out forever. There are two reasons. First, once the number of vectors involved is large enough, a random vector may be close to their span with a decent chance. Furthermore, if the leading singular values of $E$ are also close to each other (in fact they are for most models of random matrices), then we need to care about other singular vectors of $E$ as well. However, if $A$ has low rank, then hopefully the process will terminate before the trouble begins. This intuition (which also applies to other parameters) serves as our starting point and motivates the analysis that follows.
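This intuition is easy to check in simulation. The sketch below is an illustration only: Gaussian noise stands in for the sparsification error, and the constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 1000
r = 3

# low-rank signal A and a random perturbation E
U = np.linalg.qr(rng.standard_normal((m, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
A = U @ np.diag([30.0, 20.0, 10.0]) @ V.T
E = rng.standard_normal((m, n)) / np.sqrt(n)

u_E = np.linalg.svd(E)[0][:, 0]     # leading left singular vector of E
print(np.abs(U.T @ u_E).max())      # typically of order 1/sqrt(m): nearly orthogonal to U
print(np.linalg.norm(A + E, 2) - np.linalg.norm(A, 2))  # much smaller than ||E||
print(np.linalg.norm(E, 2))
```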

3.2. Proof of Theorem 4

As $\mathbb{E} B = A$, the matrix $E := B - A$ has mean zero. In particular, the entries $e_{ij}$ of $E$ are independent random variables with mean zero. By the definition of $B$, we have $e_{ij} = a_{ij}(\xi_{ij}/p_{ij} - 1)$. Using the definition of the sampling probabilities, an easy calculation shows the following.

Lemma 8.

, and with probability .

We have the following bound on the spectral norm of , which can be seen as the analogue of Theorem 3.

Lemma 9.

There exists an absolute constant such that for every and every , we have

(9)

In particular, there is an absolute constant such that

(10)

for every .

The proof is presented in Appendix A.

Remark 10.

Among other things, the bound in (10) implies that

(11)

with probability at least .

Denote the -dimensional singular subspaces

With a slight abuse of notation, we also use the same symbols to denote the matrices whose columns are the corresponding singular vectors.

Lemma 11.

For any and any fixed unit vectors , and

The first part of the lemma follows from Lemma 8 and Chebyshev’s inequality. The second part follows from a standard $\varepsilon$-net argument, with a suitable choice of $\varepsilon$. With these lemmata in hand, we are now prepared to prove Theorem 4; the proof is presented in Appendix B.

To prove Theorem 7 for general , we first need to replace Lemma 11. Set , where and are defined in Lemma 8.

Lemma 12.

There exist absolute constants such that for any and any fixed unit vectors and with , we have and

Proof.

Notice that for any fixed vectors , ,

where are random variables with mean zero. Since and are bounded by with probability one, is at most with probability 1. Furthermore, it is easy to show that , since for all and are unit vectors. The first statement of the lemma thus follows from Bernstein’s concentration inequality; see, for instance, [8]. The second follows from the first by a standard -net argument (with ). ∎
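For the reader’s convenience, one standard form of Bernstein’s inequality that suffices for the step above reads as follows (see, e.g., [8]); the precise constants play no role in the argument.

```latex
% Bernstein's inequality (one standard form):
% if X_1, ..., X_N are independent, mean-zero random variables with
% |X_i| <= M almost surely, then for every t >= 0,
\[
  \mathbb{P}\Bigl( \Bigl| \sum_{i=1}^{N} X_i \Bigr| \ge t \Bigr)
  \;\le\; 2 \exp\Bigl( - \frac{t^2/2}{\sum_{i=1}^{N} \mathbb{E} X_i^2 + M t / 3} \Bigr).
\]
```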

Proof of Theorem 7.

The proof follows roughly the same framework as the proof of Theorem 4, but uses Lemma 12 instead of Lemma 8 in deriving (24) and (25). Indeed, applying these changes, we can conclude that there exist constants such that

with probability at least and

with probability at least . For the bound of , we use (26). Putting everything together, we conclude that for sufficiently large , with probability at least

It is easy to check, using the fact that that the first two terms on the right-hand side are . Furthermore,

Notice that

On the other hand, by definition

Putting these together, we conclude that the last term is , where

as desired. ∎

4. General Perturbation inequalities

At the heart of the proofs above are general perturbation inequalities, which capture the intuition in Section 3.1. Extracting the main parts of the perturbation argument above, we immediately obtain the following inequalities, which we record here for future use. They are a more flexible version of earlier results from [33].

As before, we work with the matrix $B = A + E$, where $A$ is a deterministic matrix of size $m \times n$ and rank $r$, and $E$ is a random matrix of the same size. We keep all the notation as before.

The key property we need about $E$ is that its action does not favor vectors from the left and right singular subspaces of $A$. Quantitatively, assume that we can find a parameter such that, with probability at least , for any unit vectors and . Next, we need a bound on $\|E\|$. Assume that, with probability at least , . In practice, one can optimize these two parameters, using large deviation tools and special properties of $E$; see Appendix D for some general results in this direction. Once the parameters are found (naturally, they will depend on $E$), we have the following general theorem.

Theorem 13.

For any , we have, with probability at least

and

We strongly believe that there are many natural situations where one can make use of the general observation from Section 3.1 that classical perturbation bounds can be improved greatly when the perturbation is random. Theorem 13 is one specific way to quantify this improvement, but perhaps there are others.

The study of eigenvalue and eigenvector (alternatively, singular value and singular vector) perturbation has a long history. We refer the reader to the book [40] for many classical bounds. There are several recent papers on random perturbations and related random matrix models, including [1, 3, 6, 7, 14, 19, 21, 27, 33, 34, 44, 46, 10].

To conclude the note, in the next section, we discuss an application concerning the matrix completion problem, another important problem in data analysis.

5. Matrix recovery

Let $A$ be an unknown deterministic matrix of size $m \times n$ and of rank $r$. Assume only a few entries of $A$ are accessible; that is, we can observe a small (random) set of entries of $A$. Furthermore, the observed entries can be corrupted by random noise. The goal is to recover $A$, as accurately as possible, from this corrupted partial data.

Our favorite illustration is the Netflix problem. In this problem, $A$ is the movie score matrix. The rows are indexed by names of viewers (customers of Netflix), and the columns by names of movies. Entry $a_{ij}$ is the score that viewer $i$ gives to movie $j$, which is an integer between 1 and 5. A random part of $A$ was released to the public, and the task was to recover the original matrix.

Following [12, 13, 11], one can model the general matrix recovery problem as follows. We assume that each entry of $A$ is exposed independently with probability $p$, and an observed entry at position $(i,j)$ is corrupted by (independent) noise $z_{ij}$. Thus, the observed matrix $B$ is a random matrix of size $m \times n$ with entries $b_{ij} = (a_{ij} + z_{ij})\chi_{ij}$,

where $Z = (z_{ij})$ is the noise matrix with independent entries and the $\chi_{ij}$ are iid Bernoulli random variables with expectation $p$. We say a random variable is $\sigma$-subgaussian if, for any real $t$,

(12)
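The observation model is straightforward to simulate; the sketch below uses Gaussian noise as one example of a $\sigma$-subgaussian distribution, and the helper name `observe` is ours.

```python
import numpy as np

def observe(A, p, sigma, rng=None):
    """Simulate the matrix completion observation model: each entry of A is
    revealed independently with probability p, and every revealed entry is
    corrupted by independent noise (Gaussian with standard deviation sigma
    here, as one subgaussian example). Returns the observed matrix and the mask."""
    rng = np.random.default_rng(rng)
    chi = rng.random(A.shape) < p               # Bernoulli(p) mask
    Z = sigma * rng.standard_normal(A.shape)    # noise on the revealed entries
    B = np.where(chi, A + Z, 0.0)
    return B, chi
```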

We assume that the noise entries are independent, mean-zero random variables which are $\sigma$-subgaussian. There is a large body of literature on the matrix completion problem (see, for instance, [1, 9, 11, 12, 13, 15, 20, 23, 25, 24, 26, 30, 35, 16, 36] and references therein). For example, assuming the noise entries are iid, mean zero, and $\sigma$-subgaussian, [25] showed that if $m$ and $n$ are comparable, i.e., their ratio is a constant, and , one can construct (efficiently) an approximation which satisfies

(13)

with probability at least . Furthermore, if satisfies appropriate incoherence condition, the authors (see [25, Theorem 1.2 and Theorem 1.3]) proposed an estimator such that, with probability at least ,

(14)

The bound (14) improved a previous coarser bound obtained by [11].

Our contribution toward this problem has two different aspects. First, we observe that in many cases, the observed matrix $B$ already contains rather accurate information about the spectral parameters of $A$. Thus, if these parameters are what we want to know about $A$, we may just as well compute them from $B$.

Of course, there are situations when we really want to compute $A$ itself. For instance, if we would like to know the scores given for a specific movie or by a specific Netflix customer, then we do need to fully recover the Netflix matrix (or at least the relevant row and column). For this task, we discuss a simple algorithm whose performance is easy to analyze using our new tools. Furthermore, in certain cases, the error bound compares favorably to previous results. An additional feature is that our analysis allows us to obtain an error bound for a specific row and column. This is clearly relevant to the Netflix example being discussed. There are many other situations where the rows (or columns) of the data matrix are indexed by individuals (such as medical records). A new feature of our method is that we will be able to approximate any given row or column, which cannot be accomplished as accurately using error bounds involving the Frobenius or spectral norm.

Our first theorem shows that, under certain conditions, we can estimate the spectral norm of $A$ fairly accurately using the observed matrix $B$. Let .

Theorem 14.

There exists an absolute constant such that the following holds. For any , if , then

with probability at least , where

(15)

Notice that we can rewrite the entries of the rescaled matrix $p^{-1}B$ as $p^{-1}b_{ij} = a_{ij} + e_{ij}$, where $e_{ij} := p^{-1}(a_{ij} + z_{ij})\chi_{ij} - a_{ij}$. Thus, if $\mathbb{E} z_{ij} = 0$, then $\mathbb{E} e_{ij} = 0$. Therefore we can write $p^{-1}B = A + E$, where $E$ is a random matrix with mean zero, having independent entries $e_{ij}$.

To prove the above theorem, one can simply apply Theorem 13, with properly chosen parameters and . We present all of the details of the proof, along with some additional bounds for the other singular values and the singular vectors, in Appendix C.

We now discuss a recovery algorithm. Since $A$ has rank $r$, $A = P_U A$, where $U$ is the subspace spanned by the left singular vectors of $A$. In order to find a rank-$r$ matrix that approximates $A$, it is natural to consider the projection of the rescaled observed matrix $p^{-1}B$ onto the subspace spanned by its first $r$ left singular vectors. This projection can be computed very efficiently. Of course, this is not a new algorithm (see [25] for instance). Our way of analyzing it is, however, new, and via this analysis we can obtain a good approximation of any given row or column. (As far as error bounds in the Frobenius norm are concerned, the results in [25] are order optimal. However, using a Frobenius norm bound to deduce a bound for a specific row is very wasteful.) Our method begins by writing

(16)

As , by the triangle inequality, the right-hand side is at most

(17)

By Theorem 13 (again with properly chosen parameters and ), we have a good bound, say on . Thus, the error term is

We already estimated the first term (by ). The last task is to bound the remaining one. Apparently, this is at most , as the projection matrix has norm 1. However, this rough estimate is wasteful, and one can do significantly better by using the observation that the projection of a random unit vector onto a fixed subspace of dimension $r$ typically has length of order $\sqrt{r/n}$, where $n$ is the ambient dimension. This is obvious in the gaussian case; for a general statement, see [45].
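Putting the pieces together, the recovery procedure amounts to a rank-$r$ projection of the rescaled observed matrix. The sketch below assumes the estimator is $P_{\tilde U}(p^{-1}B)$, with $\tilde U$ the span of the first $r$ left singular vectors of $B$; this matches the description above, but the exact normalization used in the full version may differ.

```python
import numpy as np

def complete(B, p, r):
    """Rank-r approximation of the unknown matrix from the observed matrix B:
    rescale by 1/p (so that B/p is an unbiased estimate of A) and project
    onto the span of the first r left singular vectors of B."""
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    Ur = U[:, :r]                   # estimated r-dimensional singular subspace
    return Ur @ (Ur.T @ (B / p))    # P_{U_r} applied to the rescaled data
```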

The detailed calculation is a bit technical, but it follows an argument similar to the one detailed above, and we postpone it to the full version of the paper. Let us state a corollary that applies to a specific kind of data matrix. Motivated by the Netflix matrix, we assume that , , and is a small positive constant. We assume furthermore that , the variance of the noise, is also of order . (Since the average element of $A$ has order , it is reasonable to assume that the noise does not exceed this order.) In order to compare to previous results, we also assume that .

Theorem 15.

For any constants , there is a constant such that the following holds for all sufficiently large . With probability at least ,

(18)

This bound is comparable to (14). Notice that under our assumption, , thus, . If is also of order , then the bound becomes . As has rank at most , it implies , which is comparable to (13).

Apparently, (18) loses its strength if is significantly smaller than . In this case, however, it is more reasonable to approximate by , ignoring the insignificant last singular value/vector. Again this can be done using Theorem 13, which allows us to bound for any .

Notice that when we apply Theorem 13 for , the quantity plays a role (for as ). Thus, when we use as the estimator in the case where is too small, we face a new danger, namely that is small. The solution is simple: if and are both small, then is also small, and we can omit it and move to . Since , an average gap is of order , so sooner or later we will arrive at a sufficiently large gap. Under the same conditions on , we have

Theorem 16.

For any constants , there is a constant such that the following holds for all sufficiently large . For any , we have, with probability at least ,

(19)

Since , (19) implies

(20)

The minimum of the right-hand side (over ) gives the best approximation of using our method.
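The discussion before Theorem 16 suggests a simple heuristic for choosing the cut-off $k$: scan downward from $r$ and stop at the first index whose singular value gap in the observed matrix is comfortably large. The sketch below is only an illustration; the threshold `gap_factor` is an ad hoc choice, not a constant from the theorems.

```python
import numpy as np

def choose_cutoff(B, r, gap_factor=0.1):
    """Heuristic choice of k <= r: accept the largest k whose gap
    sigma_k(B) - sigma_{k+1}(B) exceeds gap_factor * sigma_k(B)."""
    s = np.linalg.svd(B, compute_uv=False)   # singular values, descending
    for k in range(r, 0, -1):
        gap = s[k - 1] - (s[k] if k < len(s) else 0.0)
        if gap >= gap_factor * s[k - 1]:
            return k
    return 1
```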

As mentioned earlier, a key new point here is that this method also allows us to approximate any given row or column of $A$. Let and be the first columns of $A$ and of its approximation, respectively. Following (16), we have

(21)

where is the first column of . As an analogue of (17), we can bound this by

(22)

where is the first column of . The first two terms can be bounded by , and can be bounded by the observation discussed following (17). We have, under the same assumptions on , the following result.
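As a quick sanity check of the column-wise claim, one can compare a single recovered column against the truth in simulation. This relies on the helper functions `observe` and `complete` sketched earlier and is an illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
m = n = 500
r = 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r ground truth

B, _ = observe(A, p=0.3, sigma=0.5, rng=rng)
A_hat = complete(B, p=0.3, r=r)

# error of the first column versus the (per-column normalized) Frobenius error
print(np.linalg.norm(A_hat[:, 0] - A[:, 0]))
print(np.linalg.norm(A_hat - A) / np.sqrt(n))
```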

Theorem 17.

For any constants , there is a constant such that the following holds for all sufficiently large . With probability at least ,

(23)

Appendix A Proof of Lemma 9

Proof of Lemma 9.

We begin by noting that

due to Lemma 8 and the assumption that . Thus, applying [29, Remark 4.12] (equivalently, one can apply [29, Theorem 4.9] with ), we find that

for every , every , and some absolute constant . Taking

we conclude that

for every and every . The bound in (9) now follows from the definitions of and given in Lemma 8. The bound in (10) follows from (9) by taking . ∎

Appendix B Proof of Theorem 4

Proof of Theorem 4.

For the lower bound, by definition,

By Lemma 11, with probability at least ,

(24)

In particular, by our assumption on , we conclude with probability at least . Conditioning on this event, we now consider the upper bound. Similarly, by definition, we first have

Since and are random and they are correlated with , we decompose them using the singular subspaces as follows

As mentioned earlier, we abuse the notation a little bit, using to denote both the matrix formed by the singular vectors and the subspace spanned by them. Thus is a matrix, and is the projection on the orthogonal complement of the subspace . By the triangle inequality, we have