Tighter Low-rank Approximation via Sampling the Leveraged Element

Tighter Low-rank Approximation via Sampling the Leveraged Element

The University of Texas at Austin
Prateek Jain
Microsoft Research, India
prajain@microsoft.com
Sujay Sanghavi
The University of Texas at Austin
sanghavi@mail.utexas.edu
Abstract

In this work, we propose a new randomized algorithm for computing a low-rank approximation to a given matrix. Taking an approach different from existing literature, our method first involves a specific biased sampling, with an element being chosen based on the leverage scores of its row and column, and then involves weighted alternating minimization over the factored form of the intended low-rank matrix, to minimize error only on these samples. Our method can leverage input sparsity, yet produce approximations in spectral (as opposed to the weaker Frobenius) norm; this combines the best aspects of otherwise disparate current results, but with a dependence on the condition number . In particular we require computations to generate a rank- approximation to in spectral norm. In contrast, the best existing method requires time to compute an approximation in Frobenius norm. Besides the tightness in spectral norm, we have a better dependence on the error . Our method is naturally and highly parallelizable.

Our new approach enables two extensions that are interesting on their own. The first is a new method to directly compute a low-rank approximation (in efficient factored form) to the product of two given matrices; it computes a small random set of entries of the product, and then executes weighted alternating minimization (as before) on these. The sampling strategy is different because now we cannot access leverage scores of the product matrix (but instead have to work with input matrices). The second extension is an improved algorithm with smaller communication complexity for the distributed PCA setting (where each server has small set of rows of the matrix, and want to compute low rank approximation with small amount of communication with other servers).

1 Introduction

Finding a low-rank approximation to a matrix is fundamental to a wide array of machine learning techniques. The large sizes of modern data matrices has driven much recent work into efficient (typically randomized) methods to find low-rank approximations that do not exactly minimize the residual, but run much faster / parallel, with fewer passes over the data. Existing approaches involve either intelligent sampling of a few rows / columns of the matrix, projections onto lower-dimensional spaces, or sampling of entries followed by a top- SVD of the resulting matrix (with unsampled entries set to 0).

We pursue a different approach: we first sample entries in a specific biased random way, and then minimize the error on these samples over a search space that is the factored form of the low-rank matrix we are trying to find. We note that this is different from approximating a 0-filled matrix; it is instead reminiscent of matrix completion in the sense that it only looks at errors on the sampled entries. Another crucial ingredient is how the sampling is done: we use a combination of sampling, and of a distribution where the probability of an element is proportional to the sum of the leverage scores of its row and its column.

Both the sampling and the subsequent alternating minimization are naturally fast, parallelizable, and able to utilize sparsity in the input matrix. Existing literature has either focused on running in input sparsity time but approximation in (the weaker) Frobenius norm, or running in time with approximation in spectral norm. Our method provides the best of both worlds: it runs in input sparsity time, with just two passes over the data matrix, and yields an approximation in spectral norm. It does however have a dependence on the ratio of the first to the singular value of the matrix.

Our alternative approach also yields new methods for two related problems: directly finding the low-rank approximation of the product of two given matrices, and distributed PCA.

Our contributions are thus three new methods in this space:

• Low-rank approximation of a general matrix: Our first (and main) contribution is a new method (LELA, Algorithm 1) for low-rank approximation of any given matrix: first draw a random subset of entries in a specific biased way, and then execute a weighted alternating minimization algorithm that minimizes the error on these samples over a factored form of the intended low-rank matrix. The sampling is done with only two passes over the matrix (each in input sparsity time), and both the sampling and the alternating minimization steps are highly parallelizable and compactly stored/manipulated.

For a matrix , let be the best rank- approximation (i.e. the matrix corresponding to top components of SVD). Our algorithm finds a rank- matrix in time , while providing approximation in spectral norm: , where is the condition number of . Existing methods either can run in input sparsity time, but provide approximations in (the weaker) Frobenius norm (i.e. with replaced by in the above expression), or run in time to approximate in spectral norm, but even then with leading constants larger than 1. Our method however does have a dependence on , which these do not. See Table 1 for a detailed comparison to existing results for low-rank approximation.

• Direct approximation of a matrix product: We provide a new method to directly find a low-rank approximation to the product of two matrices, without having to first compute the product itself. To do so, we first choose a small set of entries (in a biased random way) of the product that we will compute, and then again run weighted alternating minimization on these samples. The choice of the biased random distribution is now different from above, as the sampling step does not have access to the product matrix. However, again both the sampling and alternating minimization are highly parallelizable.

For , , and , our algorithm first chooses entries of the product that it needs to sample; each sample takes time individually, since it is a product of two length- vectors (though these can be parallelized). The weighted alternating minimization then runs in time (where ). This results in a rank- approximation of in spectral norm, as given above.

• Distributed PCA: Motivated by applications with really large matrices, recent work has looked at low-rank approximation in a distributed setting where there are servers – each have small set of rows of the matrix – each of which can communicate with a central processor charged with coordinating the algorithm. In this model, one is interested in find good approximations while minimizing both computations and the communication burden on the center.

We show that our LELA algorithm can be extended to the distributed setting while guaranteeing small communication complexity. In particular, our algorithm guarantees the same error bounds as that of our non-distributed version but guarantees communication complexity of real numbers for computing rank- approximation to . For and large , our analysis guarantees significantly lesser communication complexity than the state-of-the-art method [22], while providing tighter spectral norm bounds.

Notation: Capital letter typically denotes a matrix. denotes the -th row of , denotes the -th column of , and denotes the -th element of . Unless specified otherwise, and is the best rank- approximation of . Also, denotes the SVD of . denotes the condition number of , where is the -th singular value of . denotes the norm of vector . denotes the spectral or operator norm of . denotes the Frobenius norm of . Also, . denotes the principal angle based distance between subspaces spanned by and orthonormal matrices. Typically, denotes a global constant independent of problem parameters and can change from step to step.

Given a set , is given by: if and otherwise. denotes the Hadamard product of and . That is, if and otherwise. Similarly let if and otherwise.

2 Related results

Low rank approximation: Now we will briefly review some of the existing work on algorithms for low rank approximation. [14] introduced the problem of computing low rank approximation of a matrix with few passes over . They presented an algorithm that samples few rows and columns and does SVD to compute low rank approximation, and gave additive error guarantees. [9, 10] have extended these results. [2] considered a different approach based on entrywise sampling and quantization for low rank approximation and has given additive error bounds.

[18, 30, 11, 8] have given low rank approximation algorithms with relative error guarantees in Frobenius norm. [32, 28] have provided guarantees on error in spectral norm which are later improved in [17, 3]. The main techniques of these algorithms is to use a random Gaussian or Hadamard transform matrix for projecting the matrix onto a low dimensional subspace and compute the rank- subspace. [3] have given an algorithm based on random Hadamard transform that computes rank- approximation in time and gives spectral norm bound of .

One drawback of Hadamard transform is that it cannot take advantage of sparsity of the input matrix. Recently [7] gave an algorithm using sparse subspace embedding that runs in input sparsity time with relative Frobenius norm error guarantees.

We presented some results in this area as a comparison with our results in table 1. This is a heavily subsampled set of existing results on low rank approximations. There is a lot of interesting work on very related problems of computing column/row based(CUR) decompositions, matrix sketching, low rank approximation with streaming data. Look at [27, 17] for more detailed discussion and comparison.

Matrix sparsification: In the matrix sparsification problem, the goal is to create a sparse sketch of a given matrix by sampling and reweighing the entries of the matrix. Various techniques for sampling have been proposed and analyzed which guarantee approximation error in Frobenius norm with samples  [12, 1]. As we will see in the next section, the first step of algorithm 1 involves sampling according to a very specific distribution (similar to matrix sparsification), which has been designed for guaranteeing good error bounds for computing low rank approximation. For a comparison of various sampling distributions for the problem of low rank matrix recovery see [6].

Matrix completion: Matrix completion problem is to recover a rank- matrix from observing small number of () random entries. Nuclear norm minimization is shown to recover the matrix from uniform random samples if the matrix is incoherent111A matrix of rank- with SVD is incoherent if and for some constant . [4, 5, 29, 16] . Similar results are shown for other algorithms like OptSpace [23] and alternating minimization [21, 19, 20]. Recently [6] has given guarantees for recovery of any matrix under leverage score sampling from entries.

Distributed PCA: In distributed PCA, one wants to compute rank- approximation of a matrix that is stored across servers with small communication between servers. One popular model is row partition model where subset of rows are stored at each server. Algorithms in [13, 25, 15, 24] achieve communication complexity with relative error guarantees in Frobenius norm, under this model. Recently [22] have considered the scenario of arbitrary splitting of a matrix and given an algorithm that has communication complexity with relative error guarantees in Frobenius norm.

3 Low-rank Approximation of Matrices

In this section we will present our main contribution: a new randomized algorithm for computing low-rank approximation of any given matrix. Our algorithm first samples a few elements from the given matrix , and then rank- approximation is computed using only those samples. Algorithm 1 provides a detailed pseudo-code of our algorithm; we now comment on each of the two stages:

Sampling: A crucial ingredient of our approach is using the correct sampling distribution. Recent results in matrix completion [6] indicate that a small number () of samples drawn in a way biased by leverage scores222If SVD of then leverage scores of are and for all . can capture all the information in any exactly low-rank matrix. While this is indicative, here we have neither access to the leverage scores, nor is our matrix exactly low-rank. We approximate the leverage scores with the row and column norms ( and ), and account for the arbitrary high-rank nature of input by including an term in the sampling; the distribution is given in eq. (2). Computationally, our sampling procedure can be done in two passes and time.

Weighted alternating minimization: In our second step, we directly optimize over the factored form of the intended low-rank matrix, by minimizing a weighted squared error over the sampled elements from stage 1. That is, we first express the low-rank approximation as and then iterate over and alternatingly to minimize the weighted error over the sampled entries (see Sub-routine 2). Note that this is different from taking principal components of a 0-filled version of the sampled matrix. The weights give higher emphasis to elements with smaller sampling probabilities. In particular, the goal is to minimize the following objective function:

 Err(ˆMr)=∑(i,j)∈Ωwij(Mij−(ˆMr)ij)2, (1)

where when , else. For initialization of the WAltMin procedure, we compute SVD of (reweighed sampled matrix) followed by a trimming step (see Step 4, 5 of Sub-routine 2). Trimming step sets if and otherwise; and is the orthonormal matrix spanning the column space of . This step prevents heavy rows/columns from having undue influence.

We now provide our main result for low-rank approximation and show that Algorithm 1 can provide a tight approximation to while using a small number of samples .

Theorem 3.1.

Let be any given matrix () and let be the best rank- approximation to . Set the number of samples , where is any global constant, where is the -th singular value of . Also, set the number of iterations of WAltMin procedure to be . Then, with probability greater than for any constant , the output of Algorithm 1 with the above specified parameters , satisfies:

 ∥M−ˆMr∥≤∥M−Mr∥+ϵ∥M−Mr∥F+ζ.

That is, if , we have:

 ∥M−ˆMr∥≤∥M−Mr∥+2ϵ∥M−Mr∥F.

Note that our time and sample complexity depends quadratically on . Recent results in the matrix completion literature shows that such a dependence can be improved to by using a slightly more involved analysis [20]. We believe a similar analysis can be combined with our techniques to obtain tighter bounds; we leave a similar tighter analysis for future research as such a proof would be significantly more tedious and would take away from the key message of this paper.

3.1 Computation complexity:

In the first step we take 2 passes over the matrix to compute the sampling distribution (2) and sampling the entries based on this distribution. It is easy to show that this step would require time. Next, the initialization step of WAltMin procedure requires computing rank- SVD of which has at most non-zero entries. Hence, the procedure can be completed in time using standard techniques like power method. Note that by Lemma 3.2 we need top- singular vectors of only upto constant approximation. Further -th iteration of alternating minimization takes time. So, the total time complexity of our method is . As shown in Theorem 3.1, our method requires samples. Hence, the total run-time of our algorithm is: .

Remarks: Now we will discuss how to sample entries of using sampling method (2) in time. Consider the following multinomial based sampling model: sample the number of elements per row (say ) by doing draws using a multinomial distribution over the rows, given by . Then, sample elements of the row-, using over , with replacement.

The failure probability in this model is bounded by 2 times the failure probability if the elements are sampled according to (2) [4]. Hence, we can instead use the above mentioned multinomial model for sampling. Moreover, , and can be computed in time , so ’s can be sampled efficiently. Moreover, the multinomial distribution for all the rows can be computed in time , work for setting up the first term and term for changing the base distribution wherever is non-zero. Hence, the total time complexity is .

3.2 Proof Overview:

We now present the key steps in our proof of Theorem 3.1. As mentioned in the previous section, our algorithm proceeds in two steps: entry-wise sampling of the given matrix and then weighted alternating minimization (WAltMin) to obtain a low-rank approximation of .

Hence, the goal is to analyze the WAltMin procedure, with input samples obtained using (2), to obtain the bounds in Theorem 3.1. Now, WAltMin is an iterative procedure solving an inherently non-convex problem, . Hence, it is prone to local minimas or worse, might not even converge. However, recent results for low-rank matrix completion have shown that alternating minimization (with appropriate initialization) can indeed be analyzed to obtain exact matrix completion guarantees.

Our proof also follows along similar lines, where we show that the initialization procedure (step 4 of Sub-routine 2) provides an accurate enough estimate of and then at each step, we show a geometric decrease in distance to . However, our proof differs from the previous works in two key aspects: a) existing proof techniques of alternating minimization assume that each element is sampled uniformly at random, while we can allow biased and approximate sampling, b) existing techniques crucially use the assumption that is incoherent, while our proof avoids this assumption using the weighted version of AltMin.

We now present our bounds for initialization as well as for each step of the WAltMin procedure. Theorem 3.1 follows easily from the two bounds.

Lemma 3.2 (Initialization).

Let the set of entries be generated according to (2). Also, let . Then, the following holds (w.p. ):

 ∥RΩ(M)−M∥≤δ∥M∥F. (3)

Also, if , then the following holds (w.p. ):

 ∥(ˆU(0))i∥≤8√r√∥Mi∥2/∥M∥2F  and  dist(ˆU(0),U∗)≤12,

where is the initial iterate obtained using Steps 4, 5 of Sub-Procedure 2. , is the -th singular value of , .

Let be the best rank- approximation of . Then, Lemma 3.2 and Weyl’s inequality implies that:

 ∥M−Pr(RΩ(M))∥≤∥M−RΩ(M)∥+∥RΩ(M)−Pr(RΩ(M))∥≤∥M−Mr∥+2δ∥M∥F. (4)

Now, we can have two cases: 1) : In this case, setting in (4) already implies the required error bounds of Theorem 3.1333There is a small technicality here: alternating minimization can potentially worsen this bound. But the error after each step of alternating minimization can be effectively checked using a small cross-validation set and we can stop if the error increases.. 2) . In this regime, we will show now that alternating minimization reduces the error from initial to .

Lemma 3.3 (WAltMin Descent).

Let hypotheses of Theorem 3.1 hold. Also, let . Let be the -th step iterate of Sub-Procedure 2 (called from Algorithm 1), and let be the -th iterate (for ). Also, let and , where is a set of orthonormal vectors spanning . Then, the following holds (w.p. ):

 dist(V(t+1),V∗)≤12dist(U(t),U∗)+ϵ∥M−Mr∥F/σ∗r,

and , where is a set of orthonormal vectors spanning .

The above lemma shows that distance between and (and similarly, and ) decreases geometrically up to . Hence, after steps, the first error term in the bounds above vanishes and the error bound given in Theorem 3.1 is obtained.

Note that, the sampling distribution used for our result is a “hybrid” distribution combining leverage scores and the -style sampling. However, if is indeed a rank- matrix, then our analysis can be extended to handle the leverage score based sampling itself (). Hence our results also show that weighted alternating minimization can be used to solve the coherent-matrix completion problem introduced in [6].

3.3 Direct Low-rank Approximation of Matrix Product

In this section we present a new pass efficient algorithm for the following problem: suppose we are given two matrices, and desire a low-rank approximation of their product ; in particular, we are not interested in the actual full matrix product itself (as this may be unwieldy to store and use, and thus wasteful to produce in its entirety). One example setting where this arises is when one wants to calculate the joint counts between two very large sets of entities; for example, web companies routinely come across settings where they need to understand (for example) how many users both searched for a particular query and clicked on a particular advertisement. The number of possible queries and ads is huge, and finding this co-occurrence matrix from user logs involves multiplying two matrices – query-by-user and user-by-ad respectively – each of which is itself large.

We give a method that directly produces a low-rank approximation of the final product, and involves storage and manipulation of only the efficient factored form (i.e. one tall and one fat matrix) of the final intended low-rank matrix. Note that as opposed to the previous section, the matrix does not already exist and hence we do not have access to its row and column norms; so we need a new sampling scheme (and a different proof of correctness).

Algorithm: Suppose we are given an matrix and another matrix , and we wish to calculate a rank- approximation of the product . Our algorithm proceeds in two stages:

1. Choose a biased random set of elements as follows: choose an intended number (according to Theorem 3.4 below) of sampled elements, and then independently include each in with probability given by where

 qij := m⋅(∥Ai∥2n2∥A∥2F+∥Bj∥2n1∥B∥2F), (5)

Then, find , i.e. only the elements of the product that are in this set .

2. Run the alternating minimization procedure WAltMin, where is the number of iterations (again chosen according to Theorem 3.4 below). This produces the low-rank approximation in factored form.

Remarks: Note that the sampling distribution now depends only on the row norms of and the column norms of ; each of these can be found completely in parallel, with one pass over each row/column of the matrices / . A second pass, again parallelizable, calculates the element of the product, for . Once this is done, we are again in the setting of doing weighted alternating minimization over a small set of samples – the setting we had before, and as already mentioned this too is highly parallelizable and very fast overall. In particular, the computation complexity of the algorithm is (suppressing terms dependent on norms of and ), where .

We now present our theorem on the number of samples and iterations needed to make this procedure work with at least a constant probability.

Theorem 3.4.

Consider matrices and and let , where , is the -th singular value of and . Let be sampled using probability distribution (5). Then, the output of Sub-routine 2 satisfies (w.p. ):

Next, we show an application of our matrix-multiplication approach to approximating covariance matrices , where is a sample matrix. Note, that a rank- approximation to can be computed by computing low rank approximation of , , i.e., . However, as we show below, such an approach leads to weaker bounds as compared to computing low rank approximation of :

Corollary 3.5.

Let and let be sampled using probability distribution (5) with , the output of satisfy (w.p. ):

 ∥ˆMr−(YYT)r∥≤ϵ∥Y−Yr∥2F+ζ.

Further when we get:

 ∥ˆMr−(YYT)r∥≤ϵ∥∥YYT−(YYT)r∥∥F+ζ.

Now, one can compute in time with error . Where as computing low rank approximation of gives(from Corollary 3.5) , which can be much smaller than . The difference in error is a consequence of larger gap between singular values of compared to . For related discussion and applications see section 4.5 of  [17].

4 Distributed Principal Component Analysis

Modern large-scale systems have to routinely compute PCA of data matrices with millions of data points embedded in similarly large number of dimensions. Now, even storing such matrices on a single machine is not possible and hence most industrial scale systems use distributed computing environment to handle such problems. However, performance of such systems depend not only on computation and storage complexity, but also on the required amount of communication between different servers.

In particular, we consider the following distributed PCA setting: Let be a given matrix (assume but ). Also, let be row partitioned among servers and let be the matrix with rows of and rest filled with zeros, stored on -th server. Moreover, we assume that one of the servers act as Central Processor(CP) and in each round all servers communicate with the CP and the CP communicates back with all the servers. Now, the goal is to compute , an estimate of , such that the total communication (i.e. number of bits transferred) between CP and other servers is minimized. Note that, such a model is now standard for this problem and was most recently studied by [22].

Recently several interesting results [13, 15, 24, 22] have given algorithms to compute rank- approximation of , in the above mentioned distributed setting. In particular, [22] proposed a method that for row-partitioned model requires communication to obtain a relative Frobenius norm guarantee,

 ||M−~Mr||F≤(1+ϵ)||M−Mr||F.

In contrast, a distributed setting extension of our LELA algorithm 1 has linear communication complexity and computes rank- approximation , with . Now note that if and if scales with (which is a typical requirement), then our communication complexity can be significantly better than that of [22]. Moreover, our method provides spectral norm bounds as compared to relatively weak Frobenius bounds mentioned above.

Algorithm: The distributed version of our LELA algorithm depends crucially on the following observation: given , each row of can be updated independently. Hence, servers need to communicate rows of only. There also, we can use the fact that each server requires only rows of to update their corresponding . denote restriction of to rows in set and outside and similarly denote restriction of to rows .

We now describe the distributed version of each of the critical step of LELA algorithm. See Algorithm 3 for a detailed pseudo-code. For simplicity, we dropped the use of different set of samples in each iteration. Correspondingly the algorithm will modify to distributing samples into buckets and using one in each iteration. This simplification doesn’t change the communication complexity.

Sampling: For sampling, we first compute column norms and communicate to each server. This operation would require communication. Next, each server (server ) samples elements from its rows and stores locally. Note that because of independence over rows, the servers don’t need to transmit their samples to other servers.

Initialization: In the initialization step, our algorithm computes top right singular vector of by iterations . Now, note that computing requires server to access atmost columns of . Hence, the total communication from the CP to all the servers in this round is . Similarly, each column of is only sparse. Hence, total communication from all the servers to CP in this round is . Now, we need constant many rounds to get a constant factor approximation to SVD of , which is enough for good initialization in WAltMin procedure. Hence, total communication complexity of the initialization step would be .

Alternating Minimization Step: For alternating minimization, update to rows of is computed at the corresponding servers and the update to is computed at the CP. For updating at server , we use the following observation: updating requires atmost rows of . Hence, the total communication from CP to all the servers in the -th iteration is . Next, we make a critical observation that update can be computed by adding certain messages from each server (see Algorithm 3 for more details). Message from server to CP is of size . Hence, total communication complexity in each round is and total number of rounds is .

We now combine the above given observations to provide error bounds and communication complexity of our distributed PCA algorithm:

Theorem 4.1.

Let the matrix be distributed over servers according to the row-partition model. Let . Then, the algorithm 3 on completion will leave matrices at server and at CP such that the following holds (w.p. ): , where . This algorithm has a communication complexity of real numbers.

As discussed above, each update to and are computed exactly as given in the WAltMin procedure (Sub-routine 2). Hence, error bounds for the algorithm follows directly from Theorem 3.1. Communication complexity bounds follows by observing that w.h.p.

Remark: The sampling step given above suggests another simple algorithm where we can compute in a distributed fashion and communicate the samples to CP. All the computation is performed at CP afterwards. Hence, the total communication complexity would be