Low Rank Approximation with Entrywise \ell_{1}-Norm Error

We study the -low rank approximation problem, where for a given matrix and approximation factor , the goal is to output a rank- matrix for which

 ∥A−ˆA∥1≤α⋅minrank-k matrices\leavevmode\nobreak A′∥A−A′∥1,

where for an matrix , we let . This error measure is known to be more robust than the Frobenius norm in the presence of outliers and is indicated in models where Gaussian assumptions on the noise may not apply. The problem was shown to be NP-hard by Gillis and Vavasis and a number of heuristics have been proposed. It was asked in multiple places if there are any approximation algorithms.

We give the first provable approximation algorithms for -low rank approximation, showing that it is possible to achieve approximation factor in time, where denotes the number of non-zero entries of . If is constant, we further improve the approximation ratio to with a -time algorithm. Under the Exponential Time Hypothesis, we show there is no -time algorithm achieving a -approximation, for an arbitrarily small constant, even when .

We give a number of additional results for -low rank approximation: nearly tight upper and lower bounds for column subset selection, CUR decompositions, extensions to low rank approximation with respect to -norms for and earthmover distance, low-communication distributed protocols and low-memory streaming algorithms, algorithms with limited randomness, and bicriteria algorithms. We also give a preliminary empirical evaluation.

## 1 Introduction

Two well-studied problems in numerical linear algebra are regression and low rank approximation. In regression, one is given an matrix , and an vector , and one seeks an which minimizes under some norm. For example, for least squares regression one minimizes . In low rank approximation, one is given an matrix , and one seeks a rank- matrix which minimizes under some norm. For example, in Frobenius norm low rank approximation, one minimizes . Algorithms for regression are often used as subroutines for low rank approximation. Indeed, one of the main insights of [DMM06b, DMM06a, Sar06, DMM08, CW09] was to use results for generalized least squares regression for Frobenius norm low rank approximation. Algorithms for -regression, in which one minimizes , were also used [BD13, SW11] to fit a set of points to a hyperplane, which is a special case of entrywise -low rank approximation, the more general problem being to find a rank- matrix minimizing .

Randomization and approximation were introduced to significantly speed up algorithms for these problems, resulting in algorithms achieving relative error approximation with high probability. Such algorithms are based on sketching and sampling techniques; we refer to [Woo14b] for a survey. For least squares regression, a sequence of work [Sar06, CW13, MM13, NN13, LMP13, BDN15, Coh16] shows how to achieve algorithms running in time. For Frobenius norm low rank approximation, using the advances for regression this resulted in time algorithms. For -regression, sketching and sampling-based methods [Cla05, SW11, CDMI13, CW13, MM13, LMP13, WZ13, CW15b, CP15] led to an time algorithm.

Just like Frobenius norm low rank approximation is the analogue of least squares regression, entrywise -low rank approximation is the analogue of -regression. Despite this analogy, no non-trivial upper bounds with provable guarantees are known for -low rank approximation. Unlike Frobenius norm low rank approximation, which can be solved exactly using the singular value decomposition, no such algorithm or closed-form solution is known for -low rank approximation. Moreover, the problem was recently shown to be NP-hard [GV15]. A major open question is whether there exist approximation algorithms, sketching-based or otherwise, for -low rank approximation. Indeed, the question of obtaining betters algorithms was posed in section 6 of [GV15], in [Exc13], and as the second part of open question 2 in [Woo14b], among other places. The earlier question of NP-hardness was posed in Section 1.4 of [KV09], for which the question of obtaining approximation algorithms is a natural followup. The goal of our work is to answer this question.

We now formally define the -low rank approximation problem: we are given an matrix and approximation factor , and we would like, with large constant probability, to output a rank- matrix for which

 ∥A−ˆA∥1≤α⋅minrank-k% matrices\leavevmode\nobreak\ A′∥A−A′∥1, (1)

where for an matrix , we let . This notion of low rank approximation has been proposed as a more robust alternative to Frobenius norm low rank approximation [KK03, KK05, KLC15, Kwa08, ZLS12, BJ12, BD13, BDB13, MXZZ13, MKP13, MKP14, MKCP16, PK16], and is sometimes referred to as -matrix factorization or robust PCA. -low rank approximation gives improved results over Frobenius norm low rank approximation since outliers are less exaggerated, as one does not square their contribution in the objective. The outlier values are often erroneous values that are far away from the nominal data, appear only a few times in the data matrix, and would not appear again under normal system operation. These works also argue -low rank approximation can better handle missing data, is appropriate in noise models for which the noise is not Gaussian, e.g., it produces the maximum likelihood estimator for Laplacian noise [Gao08, KAC08, VT01], and can be used in image processing to prevent image occlusion [YZD12].

To see that -low rank approximation and Frobenius norm low rank approximation can give very different results, consider the matrix , where is any matrix with . The best rank- approximation with Frobenius norm error is given by , where is the first standard unit vector. Here ignores all but the first row and column of , which may be undesirable in the case that this row and column represent an outlier. Note . If, for example, is the all s matrix, then is a rank- approximation for which , and therefore this solution is a much better solution to the -low rank approximation problem than , for which

Despite the advantages of -low rank approximation, its main disadvantage is its computationally intractability. It is not rotationally invariant and most tools for Frobenius low rank approximation do not apply. To the best of our knowledge, all previous works only provide heuristics. We provide hard instances for previous work in Section L, showing these algorithms at best give a -approximation (though even this is not shown in these works). We also mention why a related objective function, robust PCA [WGR09, CLMW11, NNS14, NYH14, CHD16, ZZL15], does not give a provable approximation factor for -low rank approximation. Using that for an matrix , , a Frobenius norm low rank approximation gives a approximation for -low rank approximation. A bit better is to use algorithms for low rank approximation with respect to the sum of distances, i.e., to find a rank- matrix minimizing , where for an matrix , , where is the -th row of . A sequence of work [DV07, FMSW10, FL11, SV12, CW15a] shows how to obtain an -approximation to this problem in time, and using that results in an -approximation.

There are also many variants of Frobenius norm low rank approximation for which nothing is known for -low rank approximation, such as column subset selection and CUR decompositions, distributed and streaming algorithms, algorithms with limited randomness, and bicriteria algorithms. Other interesting questions include low rank approximation for related norms, such as -low rank approximation in which one seeks a rank- matrix minimizing . Note for these are also more robust than the SVD.

### 1.1 Our Results

We give the first efficient algorithms for -low rank approximation with provable approximation guarantees. By symmetry of the problem, we can assume . We first give an algorithm which runs in time and solves the -low rank approximation problem with approximation factor . This is an exponential improvement over the previous approximation factor of , provided is not too large, and is polynomial time for every . Moreover, provided , our time is optimal up to a constant factor as any relative error algorithm must spend time. We also give a hard instance for our algorithm ruling out approximation for arbitrarily small constant , and hard instances for a general class of algorithms based on linear sketches, ruling out approximation.

Via a different algorithm, we show how to achieve an -approximation factor in
time. This is useful for constant , for which it gives an -approximation in time, improving the -approximation for constant of our earlier algorithm. The approximation ratio of this algorithm, although for constant , depends on . We also show one can find a rank- matrix in time for constant for which , where is an absolute constant independent of . We refer to this as a bicriteria algorithm. Finally, one can output a rank- matrix , instead of a rank- matrix , in time with the same absolute constant approximation factor, under an additional assumption that the entries of are integers in the range for an integer . Unlike our previous algorithms, this last algorithm has a bit complexity assumption.

Under the Exponential Time Hypothesis (ETH), we show there is no -time algorithm achieving a -approximation, for an arbitrarily small constant, even when . The latter strengthens the NP-hardness result of [GV15].

We also give a number of results for variants of -low rank approximation which are studied for Frobenius norm low rank approxiation; prior to our work nothing was known about these problems.

Column Subset Selection and CUR Decomposition: In the column subset selection problem, one seeks a small subset of columns of for which there is a matrix for which is small, under some norm. The matrix provides a low rank approximation to which is often more interpretable, since it stores actual columns of , preserves sparsity, etc. These have been extensively studied when the norm is the Frobenius or operator norm (see, e.g., [BMD09, DR10, BDM11] and the references therein). We initiate the study of this problem with respect to the -norm. We first prove an existence result, namely, that there exist matrices for which any subset of columns satisfies , where is an arbitrarily small constant. This result is in stark contrast to the Frobenius norm for which for every matrix there exist columns for which the approximation factor is . We also show that our bound is nearly optimal in this regime, by showing for every matrix there exists a subset of columns providing an -approximation. One can find such columns in time by enumerating and evaluating the cost of each subset. Although this is exponential in , we show it is possible to find columns providing an -approximation in polynomial time for every .

We extend these results to the CUR decomposition problem (see, e.g., [DMM08, BW14]), in which one seeks a factorization for which is a subset of columns of , is a subset of rows of , and is as small as possible. In the case of Frobenius norm, one can choose columns and rows, have rank, have be at most times the optimal cost, and find the factorization in time [BW14]. Using our column subset selection results, we give an time algorithm choosing columns and rows, for which rank, and for which is times the cost of any rank- approximation to .

-Low Rank Approximation and EMD-Low Rank Approximation: We also give the first algorithms with provable approximation guarantees for the -low rank approximation problem, , in which we are given an matrix and approximation factor , and would like, with large constant probability, to output a rank- matrix for which

 ∥A−ˆA∥pp≤α⋅minrank-k matrices\leavevmode\nobreak\ A′∥A−A′∥pp, (2)

where for an matrix , . We obtain similar algorithms for this problem as for -low rank approximation. For instance, we obtain an time algorithm with approximation ratio . We also provide the first low rank approximation with respect to sum of earthmover distances (of the rows of and ) with a approximation factor. This low rank error measure was used, e.g., in [SL09]. Sometimes such applications also require a non-negative factorization, which we do not provide.

Distributed/Streaming Algorithms, and Algorithms with Limited Randomness: There is a growing body of work on low rank approximation in the distributed (see, e.g., [TD99, QOSG02, BCL05, BRB08, MBZ10, FEGK13, PMvdG13, KVW14, BKLW14, BLS16, BWZ16, WZ16]) and streaming models (see, e.g., [McG06, CW09, KL11, GP13, Lib13, KLM14, Woo14a]), though almost exclusively for the Frobenius norm. One distributed model is the arbitrary partition model [KVW14] in which there are servers, each holding an matrix , and they would like to output a matrix for which is as small as possible (or, a centralized coordinator may want to output this). We give -communication algorithms achieving a -approximation for -low rank approximation in the arbitrary partition model, which is optimal for this approximation factor (see [BW14] where lower bounds for Frobenius norm approximation with multiplicative approximation were shown - such lower bounds apply to low rank approximation). We also consider the turnstile streaming model [Mut05] in which we receive positive or negative updates to its entries and wish to output a rank- factorization at the end of the stream. We give an algorithm using space to achieve a -approximation, which is space-optimal for this approximation factor, up to the degree of the factor. To obtain these results, we show our algorithms can be implemented using random bits.

We stress for all of our results, we do not make assumptions on such as low coherence or condition number; our results hold for any input matrix .

We report a promising preliminary empirical evaluation of our algorithms in Section L.

###### Remark 1.1.

We were just informed of the concurrent and independent work [CGK16], which also obtains approximation algorithms for -low rank approximation. That paper obtains a -approximation in time. Their algorithm is not polynomial time once , whereas we obtain a polynomial time algorithm for every (in fact time). Our approximation factor is also , which is an exponential improvement over theirs in terms of . In [CGK16] they also obtain a -approximation in time. In contrast, we obtain an -approximation in time. The dependence in [CGK16] on in the approximation ratio is exponential, whereas ours is polynomial.

### 1.2 Technical Overview

Initial Algorithm and Optimizations: Let be a rank- matrix for which . Let be a factorization for which is and is . Suppose we somehow knew and consider the multi-response -regression problem , where denote the -th columns of and , respectively. We could solve this with linear programming though this is not helpful for our argument here.

Instead, inspired by recent advances in sketching for linear algebra (see, e.g., [Woo14b] for a survey), we could choose a random matrix and solve . If is an approximate minimizer of the latter problem, we could hope is an approximate minimizer of the former problem. If also has a small number of rows, then we could instead solve , that is, minimize the sum of Euclidean norms rather than the sum of -norms. Since , we would obtain a -approximation to the problem . A crucial observation is that the solution to is given by , which implies that is in the row span of . If also were oblivious to , then we could compute without ever knowing . Having a low-dimensional space containing a good solution in its span is our starting point.

For this to work, we need a distribution on oblivious matrices with a small number of rows, for which an approximate minimizer to is also an approximate minimizer to . It is unknown if there exists a distribution on with this property. What is known is that if has rows, then the Lewis weights (see, e.g., [CP15] and references therein) of the concatenated matrix give a distribution for which the optimal for the latter problem is a -approximation to the former problem; see also earlier work on -leverage scores [Cla05, DDH09] which have rows and the same -approximation guarantee. Such distributions are not helpful here as (1) they are not oblivious, and (2) the number of rows gives an approximation factor, which is much larger than what we want.

There are a few oblivious distributions which are useful for single-response -regression for column vectors [SW11, CDMI13, WZ13]. In particular, if is an matrix of i.i.d. Cauchy random variables, then the solution to is an -approximation to [SW11]. The important property of Cauchy random variables is that if and are independent Cauchy random variables, then is distributed as a Cauchy random variable times , for any scalars . The approximation arises because all possible regression solutions are in the column span of which is -dimensional, and the sketch gives an approximation factor of to preserve every vector norm in this subspace. If we instead had a multi-response regression problem the dimension of the column span of would be , and this approach would give an -approximation. Unlike Frobenius norm multi-response regression , which can be bounded if is a subspace embedding for and satisfies an approximate matrix product theorem [Sar06], there is no convenient linear-algebraic analogue for the -norm.

We first note that since regression is a minimization problem, to obtain an -approximation by solving the sketched version of the problem, it suffices that (1) for the optimal , we have , and (2) for all , we have .

We show (1) holds for and any number of rows of . Our analysis follows by truncating the Cauchy random variables for and , so that their expectation exists, and applying linearity of expectation across the columns. This is inspired from an argument of Indyk [Ind06] for embedding a vector into a lower-dimensional vector while preserving its -norm; for single-response regression this is the statement that , implied by [Ind06]. However, for multi-response regression we have to work entirely with expectations, rather than the tail bounds in [Ind06], since the Cauchy random variables , while independent across , are dependent across . Moreover, our -approximation factor is not an artifact of our analysis - we show in Section G that there is an input matrix for which with probability , there is no -dimensional space in the span of achieving a -approximation, for a Cauchy matrix with rows, where is an arbitrarily small constant. This shows -inapproximability. Thus, the fact that we achieve -approximation instead of is fundamental for a matrix of Cauchy random variables or any scaling of it.

While we cannot show (2), we instead show for all , if has rows. This suffices for regression, since the only matrices for which the cost is much smaller in the sketch space are those providing an approximation in the original space. The guarantee follows from the triangle inequality: and the fact that is known to not contract any vector in the column span of if has rows [SW11]. Because of this, we have , where we again use the triangle inequality. We also bound the additive term by using (1) above.

Given that contains a good rank- approximation in its row span, our algorithm with a slightly worse time and -approximation can be completely described here. Let and be independent matrices of i.i.d. Cauchy random variables, and let and be independent matrices of i.i.d. Cauchy random variables. Let

 X=(T1AR)†((T1AR)(T1AR)†(T1AT2)(SAT2)(SAT2)†)k(SAT2)†,

which is the rank- matrix minimizing , where for a matrix , is its best rank- approximation in Frobenius norm. Output as the solution to -low rank approximation of . We show with constant probability that is a -approximation.

To improve the approximation factor, after computing , we -project each of the rows of onto using linear programming or fast algorithms for -regression [CW13, MM13], obtaining an matrix of rank . We then apply the algorithm in the previous paragraph with replaced by . This ultimately leads to a -approximation.

To improve the running time from to , we show a similar analysis holds for the sparse Cauchy matrices of [MM13]; see also the matrices in [WZ13].

CUR Decompositions: To obtain a CUR decomposition, we first find a -approximate rank- approximation as above. Let be an matrix whose columns span those of , and consider the regression . Unlike the problem where was unknown, we know so can compute its Lewis weights efficiently, sample by them, and obtain a regression problem where is a sampling and rescaling matrix. Since

 ∥D1(B1V−A)∥1≤∥D1(B1V−B1V∗)∥1+∥D1(B1V∗−A)∥1,

where , we can bound the first term by using that is a subspace embedding if it has rows, while the second term is by a Markov bound. Note that . By switching to as before, we see that contains a -approximation in its span. Here is an actual subset of rows of , as required in a CUR decomposition. Moreover the subset size is . We can sample by the Lewis weights of to obtain a subset of rescaled columns of , together with a rank- matrix for which .

Algorithm for Small : Our CUR decomposition shows how we might obtain an -approximation for constant in time. If we knew the Lewis weights of , an -approximate solution to the problem would be an -approximate solution to the problem , where is a sampling and rescaling matrix of rows of . Moreover, an -approximate solution to is given by , which implies the rows of contain an -approximation. For small , we can guess every subset of rows of in time (if , by taking transposes at the beginning one can replace this with time). For each guess, we set up the problem . If is a sampling and rescaling matrix according to the Lewis weights of , then by a similar triangle inequality argument as for our CUR decomposition, minimizing gives an approximation. By switching to , this implies there is an -approximation of the form , where is an matrix of rank . By setting up the problem , one can sample from Lewis weights on the left and right to reduce this to a problem independent of and , after which one can use polynomial optimization to solve it in time. One of our guesses will be correct, and for this guess we obtain an -approximation. For each guess we can compute its cost and take the best one found. This gives an -approximation for constant , removing the -factor from the approximation of our earlier algorithm.

Existential Results for Subset Selection: In our algorithm for small , the first step was to show there exist rows of which contain a rank- space which is an -approximation.

While for Frobenius norm one can find rows with an -approximation in their span, one of our main negative results for -low rank approximation is that this is impossible, showing that the best approximation one can obtain with rows is for an arbitrarily small constant . Our hard instance is an matrix in which the first columns are i.i.d. Gaussian, and the remaining columns are an identity matrix. Here, can be twice the number of rows one is choosing. The optimal -low rank approximation has cost at most , obtained by choosing the first columns.

Let denote the first entries of the chosen rows, and let denote the first entries of an unchosen row. For , there exist many solutions for which . However, we can show the following tradeoff:

whenever , then .

Then no matter which linear combination of the rows of one chooses to approximate by, either one incurs a cost on the first coordinates, or since contains an identity matrix, one incurs cost on the last coordinates of .

To show the tradeoff, consider an . We decompose , where agrees with on coordinates which have absolute value in the range , and is zero otherwise. Here, is a constant, and denotes the restriction of to all coordinates of absolute value at least . Then , as otherwise we are done. Hence, has small support. Thus, one can build a small net for all vectors by choosing the support, then placing a net on it. For for , the support sizes are increasing so the net size needed for all vectors is larger. However, since has all coordinates of roughly the same magnitude on its support, its -norm is decreasing in . Since , this makes it much less likely that individual coordinates of can be large. Since this probability goes down rapidly, we can afford to union bound over the larger net size. What we show is that for any sum of the form , at most of its coordinates are at least in magnitude.

For to be at most , for at least coordinates , we must have . With probability , on at least coordinates. From the previous paragraph, it follows there are at least coordinates of for which (1) , (2) and (3) . On these , must be in an interval of width at distance at least from the origin. Since , for any value of the probability this happens on coordinates is at most . Since the net size for is small, we can union bound over every sequence coming from our nets.

Some care is needed to union bound over all possible subsets of rows which can be chosen. We handle this by conditioning on a few events of itself, which imply corresponding events for every subset of rows. These events are such that if is the chosen set of half the rows, and the remaining set of rows of , then the event that a constant fraction of rows in are close to the row span of is , which is small enough to union bound over all choices of .

Curiously, we also show there are some matrices for which any rank- approximation in the entire row span of cannot achieve better than a -approximation.

Bicriteria Algorithm: Our algorithm for small gives an -approximation in time for constant , but the approximation factor depends on . We show how one can find a rank- matrix for which , where is an absolute constant, and . We first find a rank- matrix for which for a factor . We can use any of our algorithms above for this.

Next consider the problem , and let be a best -low rank approximation to ; we later explain why we look at this problem. We can assume is an well-conditioned basis [Cla05, DDH09], since we can replace with and with for any invertible linear transformation . For any vector we then have , where . This implies all entries of are at most , as otherwise one could replace with and reduce the cost. Also, any entry of smaller than can be replaced with as this incurs additive error . If we round the entries of to integer multiples of , then we only have possibilities for each entry of , and still obtain an -approximation. We refer to the rounded as , abusing notation.

Let be a sampling and rescaling matrix with non-zero diagonal entries, corresponding to sampling by the Lewis weights of . We do not know , but handle this below. By the triangle inequality, for any ,

 ∥D(U∗V−(A−B1))∥1 = ∥D(U∗V−U∗V∗)∥1±∥D(U∗V∗−(A−B1))∥1 = Θ(1)∥U∗V−U∗V∗∥1±O(1)∥U∗V∗−(A−B1)∥1,

where the Lewis weights give and a Markov bound gives . Thus, minimizing gives a fixed constant factor approximation to the problem . The non-zero diagonal entries of can be assumed to be integers between and .

We guess the entries of and note for each entry there are only possibilities. One of our guesses corresponds to Lewis weight sampling by . We solve for and by the guarantees of Lewis weights, the row span of this provides an -approximation. We can find the corresponding via linear programming. As mentioned above, we do not know , but can enumerate over all and all possible . The total time is .

After finding , which has columns, we output the rank- space formed by the column span of . By including the column span of , we ensure our original transformation of the problem to the problem is valid, since we can first use the column span of to replace with . Replacing with ultimately results in a rank- output. Had we used instead of our output would have been rank but would have additive error . If we assume the entries of are in , then we can lower bound the cost , given that it is non-zero, by (if it is zero then we output ) using Lemma 4.1 in [CW09] and relating entrywise -norm to Frobenius norm. We can go through the same arguments above with replaced by and our running time will now be .

Hard Instances for Cauchy Matrices and More General Sketches: We consider a matrix , where and and is the identity. For an matrix of i.i.d. Cauchy random variables, , where is the first column of . For a typical column of , all entries are at most in magnitude. Thus, in order to approximate the first row of , which is , by for an , we need . Also with probability, for large enough, so by a net argument for all .

However, there are entries of that are very large, i.e., about one which is in magnitude, and in general about entries about in magnitude. These entries typically occur in columns of for which all other entries in the column are bounded by in magnitude. Thus, for about columns . For each such column, if , then we incur cost in approximating the first row of . In total the cost is , but the optimal cost is at most , giving a lower bound. We optimize this to a lower bound.

When is large this bound deteriorates, but we also show a lower bound for arbitrarily small constant . This bound applies to any oblivious sketching matrix. The idea is similar to our row subset selection lower bound. Let be as in our row subset selection lower bound, consider , and write in its full SVD. Then is in the row span of the top rows of , since only has non-zero singular values. Since the first columns of are rotationally invariant, has first columns i.i.d. Gaussian and remaining columns equal to . Call the first rows of the matrix . We now try to approximate a row of by a vector in the row span of . There are two issues that make this setting different from row subset selection: (1) no longer contains an identity submatrix, and (2) the rows of depend on the rows of . We handle the first issue by building nets for subsets of coordinates of rather than as before; since similar arguments can be applied. We handle the second issue by observing that if the number of rows of is considerably smaller than that of , then the distribution of had we replaced a random row of with zeros would be statistically close to i.i.d. Gaussian. Hence, typical rows of can be regarded as being independent of .

Limited Independence, Distributed, and Streaming Algorithms: We show for an matrix , if we left-multiply by an matrix in which each row is an independent vector of -wise independent Cauchy random variables, contains a <