Efficient volume sampling for row/column subset selection

Efficient volume sampling for row/column subset selection

Amit Deshpande
Microsoft Research India
amitdesh@microsoft.com
   Luis Rademacher
Computer Science and Engineering
Ohio State University
lrademac@cse.ohio-state.edu
Abstract

We give efficient algorithms for volume sampling, i.e., for picking -subsets of the rows of any given matrix with probabilities proportional to the squared volumes of the simplices defined by them and the origin (or the squared volumes of the parallelepipeds defined by these subsets of rows). This solves an open problem from the monograph on spectral algorithms by Kannan and Vempala (see Section of [15], also implicit in [1, 5]).

Our first algorithm for volume sampling -subsets of rows from an -by- matrix runs in arithmetic operations and a second variant of it for -approximate volume sampling runs in arithmetic operations, which is almost linear in the size of the input (i.e., the number of entries) for small .

Our efficient volume sampling algorithms imply the following results for low-rank matrix approximation:

  1. Given , in arithmetic operations we can find of its rows such that projecting onto their span gives a -approximation to the matrix of rank closest to under the Frobenius norm. This improves the -approximation of Boutsidis, Drineas and Mahoney [1] and matches the lower bound shown in [5]. The method of conditional expectations gives a deterministic algorithm with the same complexity. The running time can be improved to at the cost of losing an extra in the approximation factor.

  2. The same rows and projection as in the previous point give a -approximation to the matrix of rank closest to under the spectral norm. In this paper, we show an almost matching lower bound of , even for .

Keywords: volume sampling, low-rank matrix approximation, row/column subset selection

1 Introduction

Volume sampling, i.e., picking -subsets of the rows of any given matrix with probabilities proportional to the squared volumes of the simplicies defined by them, was introduced in [5] in the context of low-rank approximation of matrices. It is equivalent to sampling -subsets of with probabilities proportional to the corresponding by principal minors of any given by positive semidefinite matrix.

In the context of low-rank approximation, volume sampling is related to a problem called row/column-subset selection [1]. Most large data sets that arise in search, microarray experiments, computer vision, data mining etc. can be thought of as matrices where rows and columns are indexed by objects and features, respectively (or vice versa), and we need to pick a small subset of features that are dominant. For example, while studying gene expression data biologists want a small subset of genes that are responsible for a particular disease. Usual dimension reduction techniques such as principal component analysis (PCA) or random projection fail to do this as they typically output singular vectors or random vectors which are linear combinations of a large number of feature vectors. A recent article by Mahoney and Drineas [18] highlights the limitations of PCA and gives experimental data on practical applications of low-rank approximation based on row/column-subset selection.

2 Row/column-subset selection and volume sampling

While dealing with large matrices in practice, we seek smaller or low-dimensional representations of them which are close to them but can be computed and stored efficiently. A popular notion for low-dimensional representation of matrices is low-rank matrices, and the most popular metrics used to measure the closeness of two matrices are the Frobenius or Hilbert-Schmidt norm (i.e., the square root of the sum of squares of entries of their difference) and the spectral norm (i.e., the largest singular value of their difference). The singular value decomposition (SVD) tells us that any matrix can be written as

where , are orthonormal and are orthonormal. Moreover, the nearest rank- matrix to , let us call it , under both the Frobenius and the spectral norm, is given by

In other words, the rows of are projections of the rows of onto . Because of this, most dimension reduction techniques based on the singular value decomposition, e.g., principal component analysis (PCA), are interpreted as giving ’s as the dominant vectors, which happen to be linear combinations of a large number of the rows or feature vectors of .

The row-subset selection problem we consider in this paper is: Can we pick a -subset of the rows (or feature vectors) of so that projecting onto their span is almost as good as projecting onto ?

Several row-sampling techniques have been considered in the past as an approximate but faster alternative to the singular value decomposition, in the context of streaming algorithms and large data sets that cannot be stored in random access memory [8, 7, 5]. The first among these is the squared-length sampling of rows introduced by Frieze, Kannan and Vempala [8]. Another sampling scheme due to Drineas, Mahoney and Muthukrishnan [7] uses the singular values and singular vectors to decide the sampling probabilities. Later Deshpande, Rademacher, Vempala and Wang [5] introduced volume sampling as a generalization of squared-length sampling.

Definition 1.

Given , volume sampling is defined as picking a -subset of with probability proportional to

where denotes the -th row of , denotes the row-submatrix of given by rows with indices , and denotes the convex hull.

The application of volume sampling to low-rank approximation and, more importantly, to the row-subset selection problem, is given by the following theorem shown in [5]. It says that picking a subset of rows according to volume sampling and projecting all the rows of onto their span gives a -approximation to the nearest rank- matrix to .

Theorem 2.

[5] Given any ,

when is picked according to volume sampling, denotes the matrix obtained by projecting all the rows of onto , and is the matrix of rank closest to under the Frobenius norm.

As we will see later, this easily implies

Theorem 2 gives only an existence result for row-subset selection and we also know a matching lower bound that says this is the best we can possibly do.

Theorem 3.

[5] For any , there exists a matrix such that picking any -subset of its rows gives

However, no efficient algorithm was known for volume sampling prior to this work. An algorithm mentioned in Deshpande and Vempala [6] does -approximate volume sampling in time , which means that plugging it in Theorem 2 can only guarantee -approximation instead of . Finding an efficient algorithm for volume sampling is mentioned as an open problem in the recent monograph on spectral algorithms by Kannan and Vempala (see Section of [15]).

Boutsidis, Drineas and Mahoney [1] gave an alternative approach to row-subset selection (without going through volume sampling) and here is a re-statement of the main theorem from their paper which uses columns instead of rows.

Theorem 4.

[1] For any , a -subset of its rows can be found in time such that

Row/column-subset selection problem is related to rank-revealing decompositions considered in linear algebra [12, 19], and the previous best algorithmic result for row-subset selection in the spectral norm case was given by a result of Gu and Eisenstat [12] on strong rank-revealing QR decompositions. The following theorem is a direct consequence of [12] as pointed out in [1].

Theorem 5.

Given , an integer and , there exists a -subset of the columns of such that

Moreover, this subset can be found in time .

In the context of volume sampling, it is interesting to note that Pan [19] has used an idea of picking submatrices of locally maximum volume (or determinants) for rank-revealing matrix decompositions. We refer the reader to [19] for details.

The results of Goreinov, Tyrtyshnikov and Zamarashkin [10, 11] on pseudo-skeleton approximations of matrices look at submatrices of maximum determinants as good candidates for row/column-subset selection.

Theorem 6.

[10] If can be written as

where is the by submatrix of of maximum determinant. Then,

Because of this relation between row/column-subset selection and the related ideas about picking submatrices of maximum volume, Çivril and Magdon-Ismail [3, 4] looked at the problem of picking a -subset of rows of a given matrix such that is maximized. They show that this problem is NP-hard [3] and moreover, it is NP-hard to even approximate it within a factor of , for some constant [4]. This is interesting in the light of our results because we show that even though finding the row-submatrix of maximum volume is NP-hard, we can still sample them with probabilities proportional to their volumes in polynomial time.

2.1 Our results

Our main result is a polynomial time algorithm for exact volume sampling. In Section 4, we give an outline of our Algorithm 1, followed by two possible subroutines given by Algorithms 2 and 3 that could be plugged into it.

Theorem 7 (polynomial-time volume sampling).

The randomized algorithm given by the combination of the algorithm outlined in Algorithm 1 with Algorithm 2 as its subroutine, when given a matrix and an integer , outputs a random -subset of the rows of according to volume sampling, using arithmetic operations.

The basic idea of the algorithm is as follows: instead of picking a -subset, pick an ordered -tuple of rows according to volume sampling (i.e., volume sampling suitably extended to all -tuples such that for any fixed -subset, all its permutations are all equally likely). We observe that the marginal distribution of the first coordinate of such a random tuple can be expressed in terms of coefficients of the characteristic polynomials of and , where is the matrix obtained by projecting each row of orthogonal to the -th row . Using this interpretation, it is easy to sample the first index of the -tuple with the right marginal probability. Now we project the rows of orthogonal to the chosen row and repeat to pick the next row, until we have picked of them.

The algorithm just described informally, if implemented as stated, would have a polynomial dependence in , and , for some low-degree polynomial. We can do better and get a linear dependence in by working with in place of and computing the projected matrices using rank- updates (Theorem 7), while still having a polynomial time guarantee and sampling exactly. It would be even faster to perform rank-1 updates to the characteristic polynomial itself, but that requires the computation of the inverse of a polynomial matrix (Proposition 18), and it is not clear to us at this time that there is a fast enough exact algorithm that works for arbitrary matrices. Jeannerod and Villard [14] give an algorithm to invert a generic -by- matrix with entries of degree , with a power of two, in time . This would lead to the computation of all marginal probabilities for one row in time (a variation of Algorithm 3 and its analysis).

Instead, if we are willing to be more practical, while sacrificing our guarantees, then we can perform rank- updates to the characteristic polynomial by using the singular value decomposition (SVD). In [9], an algorithm with cost arithmetic operations is given for the singular value decomposition but the SVD cannot be computed exactly and we do not know how its error propagates in our algorithm which uses many such computations. If the SVD of an -by- matrix can be computed in time , this leads to a nearly-exact algorithm for volume sampling in time . See Proposition 18 for details.

Volume sampling was originally defined in [5] to prove Theorem 2, in particular, to show that any matrix contains rows in whose span lie the rows of a rank- approximation to that is no worse than the best in the Frobenius norm. Efficient volume sampling leads to an efficient selection of rows that satisfy this guarantee, in expectation. In Section 5, we use the method of conditional expectations to derandomize this selection. This gives an efficient deterministic algorithm (Algorithm 4) for row-subset selection with the following guarantee in the Frobenius norm. This guarantee immediately implies a guarantee in the spectral norm, as follows:

Theorem 8 (deterministic row subset selection).

Deterministic Algorithm 4, when given a matrix and an integer , outputs a -subset of the rows of , using arithmetic operations, such that

This improves the -approximation of Boutsidis, Drineas and Mahoney [1] for the Frobenius norm case and matches the lower bound shown in Theorem 3 due to [5].

The superlinear dependence on might be too slow for some applications, while it might be acceptable to perform volume sampling or row/column-subset selection approximately. Our volume sampling algorithm (Algorithm 1) can be made faster, while losing on the exactness, by using the idea of random projection that preserves volumes of subsets. Magen and Zouzias [17] have the following generalization of the Johnson-Lindenstrauss lemma: for points in there exists a random projection of them into , where , that preserves the volumes of simplices formed by subsets of or fewer points within . Therefore, we get a -approximate volume sampling algorithm that requires -time to do the random projection (by matrix multiplication) and then time for volume sampling on the new -by- matrix (according to Theorem 7).

Theorem 9 (fast volume sampling).

Using random projection for dimensionality reduction, the polynomial time algorithm for volume sampling mentioned in Theorem 7 (i.e., Algorithm 1 with Algorithm 2 as its subroutine), gives -approximate volume sampling, using

arithmetic operations.

Finally, we show a lower bound for row/column-subset selection in the spectral norm that almost matches our upper bound in terms of the dependence on .

Theorem 10 (lower bound).

There exists a matrix such that

where is the matrix obtained by projecting each row of onto the span of its -th row .

3 Preliminaries and notation

For , let denote the set . For any matrix , we denote its rows by . For , let be the row-submatrix of given by the rows with indices in . By we denote the linear span of and let be the matrix obtained by projecting each row of onto . Hence, is the matrix obtained by projecting each row of orthogonal to .

Throughout the paper we assume . This assumption is not needed most of the time, but justifies sometimes working with instead of and, more generally, some choices in the design of our algorithms. It is also partially justified by our use of a random projection as a preprocessing step that makes small.

The singular values of are defined as the positive square-roots of the eigenvalues of (or , up to some extra singular values equal to zero), and we denote them by . Well-known identities of the singular values like

can be generalized into the following lemma.

Lemma 11.

(Proposition 3.2 in [5]) For any ,

where are the singular values of , i.e., eigenvalues of , and

is the characteristic polynomial of . Using , we can alternatively use in the above formula, for .

Let be the exponent of the arithmetic complexity of matrix multiplication. We use that there is an algorithm for computing the characteristic polynomial of an -by- matrix using arithmetic operations [2, Section 16.6].

Here is another lemma that we will need about dividing determinants into products of two determinants.

Lemma 12.

Let , , and . Then

Proof.

Without loss of generality, we can reduce ourselves to the case where is all rows of the given matrix: Let , . We have . Then what we want to prove can be rewritten as:

To show this, we consider two cases. If is singular, then both sides of the equality are zero. If is invertible, then we can perform block Gaussian elimination and write

applied to . Writing the determinants of the block-triangular matrices gives

Now, the projection of the rows of a matrix onto the row-space of a matrix can be written as

so that , and

This completes the proof. ∎

Finally, a well-known lemma about how the determinant of a matrix changes under a rank- update.

Lemma 13 (matrix determinant lemma).

For any invertible and ,

4 Efficient volume sampling algorithms

We first outline our volume sampling algorithm to convince the reader that volume sampling can be done in polynomial time. In the subsequent subsections, we give improved subroutines to get faster implementations of the same idea.

The main idea behind our algorithm is based on Lemma 14 about the marginal probabilities encountered in volume sampling. To explain this, it is more convenient to look at volume sampling defined as a distribution on -tuples instead of -subsets, where each of the permutations of a -subset is equally likely, i.e., for any ,

Then the marginal probabilities have the following interpretation in terms of the coefficients of certain characteristic polynomials.

Lemma 14.

Let such that , for a random -tuple from the extended volume sampling over -tuples. Let , and . Then,

Proof.

With this lemma in hand, let us consider the following outline of our algorithm. We will later give two more efficient implementations of this outline, depending on how the ’s are computed.

Algorithm 1.

Outline of our volume sampling algorithm

Input: a matrix and .

Output: a subset of rows of picked with probability proportional to .

  1. Initialize and . For to do:

    1. For to compute:

      where is a matrix obtained by projecting each row of orthogonal to .

    2. Pick with probability proportional to . Let and .

  2. Output .

Now we show the correctness of the algorithm:

Proposition 15.

The probability that our volume sampling algorithm outlined above picks a -subset is proportional to . This algorithm can be implemented with a cost of arithmetic operations.

Proof.

By Lemma 14, for any such that , the probability that our algorithm picks a sequence of rows indexed in that order is equal to

Otherwise, the probability is zero because in the execution of the algorithm, for some step . This proves the correctness of our algorithm.

Given that one can compute the characteristic polynomial of an -by- matrix in (see Section 3), our outline can be implemented with the following count of arithmetic operations: for every and , to compute , in total for . Thus, volume sampling in . ∎

4.1 Efficient volume sampling without SVD

Here we present the first (faster) subroutine for computing the marginal probabilities ’s within the volume sampling algorithm outlined in Section 4. The two main ideas behind this subroutine are: (1) We can work with instead of . Assuming , this saves on running time. (2) Each is a rank- update of and therefore, once we have , it can be used to compute all efficiently.

Algorithm 2.

First subroutine for marginal probabilities

Input: .

Output: .


For to do:

  1. Compute the matrix by the following formula

  2. Compute the characteristic polynomial of and output

Proposition 16.

For any given , the Algorithm 2 above computes in arithmetic operations.

Proof.

can be computed in time . Observe that since is obtained by projecting each row of orthogonal to ,

and therefore,

So once we have , for each , can be computed in time and the characteristic polynomial of can be computed in time [2, Section 16.6]. By Lemma 11, and hence, the above subroutine results into an time algorithm for volume sampling. ∎

Theorem 17 (same as Theorem 7).

The randomized algorithm given by the combination of the algorithm outlined in Algorithm 1 with Algorithm 2 as its subroutine, when given a matrix and an integer , outputs a random -subset of the rows of according to volume sampling, using arithmetic operations.

Proof.

The proof follows by combining Proposition 15 and Proposition 16, and since we compute all the ’s simultaneously in each round in arithmetic operations, the total number of arithmetic operations is . ∎

4.2 Efficient volume sampling using SVD

Taking further the idea that each is a rank- update of , we can give a faster algorithm based on the singular value decomposition of . Given the singular value decomposition of a matrix and using the matrix determinant lemma (Lemma 13), one can give a precise formula for how the characteristic polynomial changes under a rank- update. Using this subroutine in the volume sampling algorithm outlined in Section 4 we get an algorithm for nearly-exact volume sampling (depending on the precision of the computed SVD) in time , where is the running time of SVD on an -by- matrix.

Algorithm 3.

Second subroutine for marginal probabilities.

Input: .

Output: .

  1. Compute the (thin) singular value decomposition , say and , and keep the singular values and define . Also keep the columns of , i.e., the left singular vectors .

  2. Compute the polynomial products

  3. For to output:

Proposition 18.

In the real arithmetic model and given exact and , using the Algorithm 3 as a subroutine inside Algorithm 1 outlined for volume sampling, we get an algorithm for volume sampling. If is the running time for computing the singular value decomposition of -by- matrices, the algorithm runs in time .

Proof.

Using the matrix determinant lemma (Lemma 13), the characteristic polynomial of can be written as

Thus,

Once we have the singular value decomposition of , and can all be computed in time using polynomial products. This is because there are at most non-zero ’s. Thus, and all the for can be computed in time and then using the above formula we get . ∎

4.3 Approximate volume sampling in nearly linear time

Magen and Zouzias [17] showed that the random projection lemma of Johnson and Lindenstrauss can be generalized to preserve volumes of subsets after embedding. Here is a restatement of Theorem 1 of [17] using instead of in their original statement.

Theorem 19.

[17] For any , and , there is

and there is a mapping such that

for all such that , where has its -th row as . Moreover, is a linear mapping given by multiplication with a random by matrix with i.i.d. Gaussian entries, so computing takes time .

Theorem 20 (same as Theorem 9).

Using random projection for dimensionality reduction, the polynomial time algorithm for volume sampling mentioned in Theorem 7 (i.e., Algorithm 1 with Algorithm 2 as its subroutine), gives -approximate volume sampling, using

arithmetic operations.

Proof.

Using Theorem 19 and doing volume sampling of -subsets of rows from gives -approximation to the volume sampling of -subsets of rows from . This can be done in two steps: first, we compute using matrix multiplication in time and second, we do volume sampling on using the algorithm from Subsection 4.1. Overall, it takes time , which is equal to

Moreover, this can be implemented using only one pass over the matrix with extra space . ∎

5 Derandomized row/column-subset selection

Our derandomized row-subset selection algorithm is based on a derandomization of the volume sampling algorithm in Section 4, using the method of conditional expectations. Again, it may be easier to consider volume sampling extended to random -tuples where