Sampling and multilevel coarsening algorithms for fast matrix approximations

This work was supported by NSF under the grant NSF/CCF-1318597.

Shashanka Ubaru and Yousef Saad
Department of Computer Science and Engineering, University of Minnesota at Twin Cities, MN 55455. Email: ubaru001@umn.edu, saad@umn.edu.

June 25, 2019
Abstract

This paper addresses matrix approximation problems for matrices that are large, sparse, and/or representations of large graphs. To tackle these problems, we consider algorithms that are based primarily on coarsening techniques, possibly combined with random sampling. A multilevel coarsening technique is proposed which utilizes a hypergraph associated with the data matrix and a graph coarsening strategy based on column matching. Theoretical results are established that characterize the quality of the dimension reduction achieved by a coarsening step, when a proper column matching strategy is employed. We consider a number of standard applications of this technique as well as a few new ones. Among the standard applications, we first consider the problem of computing the partial SVD, for which a combination of sampling and coarsening yields significantly improved SVD results relative to sampling alone. We also consider the column subset selection problem, a popular low rank approximation method used in data-related applications, and show how multilevel coarsening can be adapted for this problem. Similarly, we consider the problem of graph sparsification and show how coarsening techniques can be employed to solve it. Numerical experiments illustrate the performance of the methods in various applications.

Key words. Singular values, SVD, randomization, subspace iteration, coarsening, multilevel methods.

AMS subject classifications. 15A69, 15A18

1 Introduction

Many modern applications related to data involve very large datasets, but the relevant information in these datasets lies in a low-dimensional subspace. In many of these applications, the data matrices are sparse and/or are representations of large graphs. In recent years, there has been a surge of interest in approximating large matrices in a variety of different ways, such as by low rank approximations [16, 25, 37], graph sparsification [55, 27], and compression [32]. Low rank approximations include the partial singular value decomposition (SVD) [25] and Column Subset Selection (the CSS problem) [7]. A variety of methods have been developed to efficiently compute partial SVDs of matrices [49, 23], a problem that has been studied for a few decades. However, traditional methods for partial SVD computations cannot cope with very large data matrices. Such datasets prohibit even the use of rather ubiquitous methods such as the Lanczos or subspace iteration algorithms [49, 50], since these algorithms require consecutive accesses to the whole matrix multiple times. Computing such matrix approximations is even harder in scenarios where the matrix under consideration receives frequent updates in the form of new columns or rows.

Much recent attention has been devoted to a class of ‘random sampling’ techniques [15, 16, 25] whereby an approximate partial SVD is obtained from a small subset of the matrix only, or possibly a few subsets. Random sampling is well established theoretically and is proven to give good results in some situations; see [37] for a review. In this paper we will consider random sampling methods as one of the tools to downsample very large datasets. However, because randomized methods assume no prior information on the data and are independent of the input matrix, they are often termed “data-oblivious” [1]. Because of this feature, they can be suboptimal in many situations, since they do not exploit any available information or structure in the matrix. One of the goals of this work is to show that multilevel graph coarsening techniques [26] can be good alternatives to randomized sampling.

Coarsening a graph (or a hypergraph) $G = (V, E)$ means finding a ‘coarse’ approximation $\hat{G} = (\hat{V}, \hat{E})$ to $G$ with $|\hat{V}| < |V|$, which is a reduced representation of the original graph $G$ that retains as much of the structure of the original graph as possible. Multilevel coarsening refers to the technique of recursively coarsening the original graph to obtain a succession of smaller graphs that approximate the original graph $G$. Several methods exist in the literature for coarsening graphs and hypergraphs [26, 28, 9, 29]. These techniques are more expensive than down-sampling with column norm probabilities [16], but they are more accurate. On the other hand, coarsening methods are inexpensive compared to the popular leverage score based sampling [17], which is more accurate than norm sampling. For very large matrices, a typical algorithm would first perform randomized sampling to reduce the size of the problem and then utilize a multilevel coarsening technique for computing an approximate partial SVD of the reduced matrix.

Our Contribution

In this paper, we present a multilevel coarsening technique that utilizes a hypergraph associated with the data matrix and a coarsening strategy based on column matching, and we discuss various applications of this technique. We begin by discussing different approaches to compute the partial SVD of large matrices, starting with random sampling methods. We also consider incremental sampling, where we start with small samples and then increase the size until a certain criterion is satisfied. The second approach is to replace random sampling with a form of multilevel coarsening. A middle-ground solution is to start with random sampling and then utilize multilevel coarsening on the resulting sampled subset. The coarsening techniques exploit inherent redundancies and structures in the matrix and perform better than randomized sampling in many cases, as is confirmed by the experiments. We establish a theoretical error analysis for a class of coarsening techniques. We also show how the SVD update approach, see [65], or subspace iteration can be used after the sampling or coarsening step to improve the SVD results. This approach is useful when an accurate SVD of a large matrix is desired.

The second low rank approximation problem considered in this paper is the column subset selection problem (CSSP) [7, 66], or CUR decomposition [36, 17]. Popular methods for the CSSP use leverage score sampling for sampling/selecting the columns. Computing the leverage scores requires a partial SVD of the matrix, and this may be expensive, particularly for large matrices and when the (numerical) rank is not small. In this work, we show how graph coarsening techniques can be adapted for column subset selection (CSSP). The coarsening approach is an inexpensive alternative for this problem and performs well in many situations.

The third problem we consider is that of graph sparsification [31, 55, 27]. Here, given a large (possibly dense) graph $G$, we wish to obtain a sparsified graph $\hat{G}$ that has significantly fewer edges than $G$ but still maintains important properties of the original graph. Graph sparsification allows one to operate on large (dense) graphs with a reduced space and time complexity. In particular, we are interested in spectral sparsifiers, where the Laplacian of $\hat{G}$ spectrally approximates the Laplacian of $G$ [56, 27, 67]. That is, the spectral norm of the Laplacian of the sparsified graph is close to the spectral norm of the Laplacian of $G$, within a certain additive or multiplicative factor. Such spectral sparsifiers can help approximately solve linear systems with the Laplacian of $G$ and approximate effective resistances, spectral clusterings, random walk properties, and a variety of other quantities. We again show how graph coarsening techniques can be adapted to achieve graph sparsification. We also present a few new applications for coarsening methods, see Section 2.

Outline

The outline of this paper is as follows. Section 2 discusses a few applications of graph coarsening. Section 3 describes existing popular algorithms that are used for low rank approximation. The graph coarsening techniques and the multilevel algorithms are described in Section 4. In particular, we present a hypergraph coarsening technique based on column matching. We also discuss methods to improve the SVD obtained from randomized and coarsening methods. In Section 5, we establish a theoretical error analysis for the coarsening method. We also discuss the existing theory for randomized sampling and subspace iteration. Numerical experiments illustrating the performance of these methods in a variety of applications are presented in Section 6.

2 Applications

We present a few applications of (multilevel) coarsening methods. In these applications, we typically encounter large matrices, and these are often sparse and/or representations of graphs.

i. Latent Semantic Indexing

Latent semantic indexing (LSI) is a popular text mining technique for analyzing a collection of documents [13, 33, 5, 30]. Given a user’s query, the method is used to retrieve the documents from the collection that are most relevant to the query. The truncated SVD [5] and related methods [30] are popular tools used in LSI applications. The argument is that a low rank approximation preserves the important underlying structure associated with terms and documents, and removes the noise or variability in word usage [16]. Multilevel coarsening for LSI was considered in [51]. In this work, we revisit this idea and show how hypergraph coarsening can be employed in this application.

ii. Projective clustering

Several projective clustering methods such as Isomap [58], Local Linear Embedding (LLE) [47], spectral clustering [40], subspace clustering [43, 18], Laplacian eigenmaps [4], and others involve a partial eigen-decomposition or SVD computation of a graph Laplacian. Various kernel based learning methods [39] also involve SVD computations of large graph Laplacians. In most applications today, the number of data points is large and computing the singular vectors (eigenvectors) is expensive. Graph coarsening is a handy tool to reduce the number of data points in these applications, see [20, 41] for results.

iii. Eigengene analysis

Analysis of gene expression DNA microarray data has become an important tool when studying a variety of biological processes [2, 46, 44]. In a microarray dataset, we have expression levels for a large number of genes (from individuals possibly belonging to different populations), and a series of arrays probe genome-wide expression levels in different samples, possibly under different experimental conditions. The data is large, with many individuals and gene expressions, but is known to be of low rank. Hence, it has been shown that a small number of eigengenes and eigenarrays (a few singular vectors) are sufficient to capture most of the gene expression information [2]. Article [44] showed how column subset selection (CSSP) can be used for selecting a subset of gene expressions that describe the population well in terms of the spectral information captured by the reduction. In this work, we show how hypergraph coarsening can be adapted to choose a good (small) subset of genes in this application.

iv. Multilabel Classification

The last application we consider is that of multilabel classification in machine learning applications [60, 61]. In the multilabel classification problem, we are given a set of labeled training data $\{(x_i, y_i)\}_{i=1}^n$, where each $x_i$ is an input feature vector for a data instance which belongs to one or more classes, and the $y_i$ are vectors indicating the corresponding labels (classes) to which the data instances belong. A label vector $y_i$ has a one in the $j$-th coordinate if the instance belongs to the $j$-th class. We wish to learn a mapping (prediction rule) between the features and the labels, in order to be able to predict the class label vector $y$ of a new data point $x$. Such multilabel classification problems occur in many domains such as computer vision, text mining, and bioinformatics [59, 57], and modern applications involve a large number of labels.

A popular approach to handle classification problems with many classes is to begin by reducing the effective number of labels by means of so-called embedding-based approaches. The label dimension is reduced by projecting label vectors onto a low dimensional space, based on the assumption that the label matrix has low rank. The reduction is achieved in different ways, for example, by using the SVD in [57] and column subset selection in [6]. In this work, we demonstrate how hypergraph coarsening can be employed to reduce the number of classes, and yet achieve accurate learning and prediction.

Article [54] discusses a number of methods that rely on clustering the data first in order to build a reduced dimension representation. This can be viewed as a top-down approach, whereas coarsening is a bottom-up method.

3 Background

In this section, we review three popular classes of methods used for calculating the partial SVD of matrices. The first class is based on randomized sampling. We also consider the column subset selection (CSSP) and graph sparsification problems using randomized sampling, in particular leverage score sampling. The second class is the set of methods based on subspace iteration, and the third is the set of SVD-updating algorithms [68, 65]. We consider the latter two classes of methods as tools to improve the results obtained by sampling and coarsening methods. Hence, we are particularly interested in the situation where the matrix under consideration receives updates in the form of new columns. In fact, when coupled with the multilevel algorithms (discussed in Section 4), these updates are not small since the number of columns can double.

3.1 Random sampling

Randomized algorithms have become popular in recent years due to their broad applications and the related theoretical analysis developed, which gives results that are independent of the matrix spectrum. Several ‘randomized embedding’ and ‘sketching’ methods have been proposed for low rank approximation and for computing the partial SVD [38, 35, 25, 62], starting with the seminal work of Frieze et al. [21]. Drineas et al. [15, 16] presented randomized subsampling algorithms, where a submatrix (certain columns of the matrix) is randomly selected based on a certain probability distribution. Their method samples the columns based on column norms. Given a matrix $A \in \mathbb{R}^{m\times n}$, they sample its columns such that the $j$-th column is sampled with the probability $p_j$ given by
$$p_j = \beta\, \frac{\|a_j\|_2^2}{\|A\|_F^2},$$
where $\beta$ is a positive constant and $a_j$ is the $j$-th column of $A$. Using the above distribution, $c$ columns are selected, and the subsampled matrix $C$ is formed by scaling the selected columns by $1/\sqrt{c\, p_j}$. Then, the SVD of $C$ is computed. The approximations obtained by this randomization method yield reasonable results only when there is a sharp decay in the singular value spectrum.
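As an illustration of this scheme, the following is a minimal NumPy sketch of norm-based column sampling with the $1/\sqrt{c\,p_j}$ scaling (taking $\beta = 1$); the function name and the choice of sampling with replacement are our own, not taken from the cited references.

import numpy as np

def norm_column_sampling(A, c, rng=None):
    # Sample c columns of A with probabilities proportional to squared
    # column norms and rescale them by 1/sqrt(c * p_j).
    rng = np.random.default_rng() if rng is None else rng
    sq_norms = np.linalg.norm(A, axis=0) ** 2
    p = sq_norms / sq_norms.sum()            # p_j = ||a_j||^2 / ||A||_F^2
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])      # scaling makes E[C C^T] = A A^T
    return C, idx

# Example: approximate the top singular values of A from the sketch C.
A = np.random.rand(1000, 400)
C, _ = norm_column_sampling(A, c=80)
print(np.linalg.svd(C, compute_uv=False)[:5])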

3.2 Column Subset Selection

Another popular dimensionality reduction method which we consider in this paper is column subset selection (CSSP) [7]. If a subset of the rows is also selected, then the method leads to the CUR decomposition [36]. These methods can be viewed as extensions of the randomized sampling based algorithms. Let $A \in \mathbb{R}^{m\times n}$ be a large data matrix whose columns we wish to select, and suppose $V_k \in \mathbb{R}^{n\times k}$ is the matrix whose columns are the top $k$ right singular vectors of $A$. Then, the leverage score of the $j$-th column of $A$ is given by
$$\ell_j = \frac{1}{k}\, \big\| [V_k]_{j,:} \big\|_2^2,$$
the scaled squared norm of the $j$-th row of $V_k$. In leverage score sampling, the columns of $A$ are then sampled using the probability distribution $p_j = \ell_j$ (note that $\sum_j \ell_j = 1$). The most popular methods for the CSSP use these leverage scores as the probability distribution for column selection [17, 7, 36, 8]. Greedy subset selection algorithms have also been proposed based on the right singular vectors of the matrix [44, 3]. However, these methods may be expensive since one needs to compute the top $k$ singular vectors. In this work, we examine how the coarsened matrix, i.e., the columns obtained by graph coarsening, performs in the CSSP.
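A small NumPy sketch of leverage score based column selection is given below; computing the exact top-$k$ right singular vectors via a dense SVD, as done here for simplicity, is precisely the expensive step that coarsening tries to avoid. The function name and the without-replacement sampling are our own choices.

import numpy as np

def leverage_score_selection(A, k, c, rng=None):
    # Select c columns of A with probabilities given by rank-k leverage scores.
    rng = np.random.default_rng() if rng is None else rng
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                           # rows of Vk = top-k right singular vectors
    scores = (Vk ** 2).sum(axis=0) / k       # l_j = (1/k) ||j-th row of V_k||^2, sums to 1
    idx = rng.choice(A.shape[1], size=c, replace=False, p=scores)
    return A[:, idx], idx, scores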

3.3 Graph Sparsification

Sparsification of large graphs has several computational (cost and space) advantages and has hence found many applications [31, 34, 53, 55, 56]. Given a large graph $G = (V, E)$ with $n$ vertices, we wish to find a sparse approximation to this graph that preserves certain information of the original graph, such as the spectral information [56, 27], structures like clusters within the graph [31, 34], etc. Let $B \in \mathbb{R}^{|E| \times n}$ be the vertex-edge incidence matrix of the graph $G$, where the row corresponding to an edge $e = (u, v)$ has nonzero entries (of opposite sign) in columns $u$ and $v$, and zeros elsewhere. The corresponding Laplacian of the graph is then given by $L = B^\top B$.

The spectral sparsification problem involves computing a weighted subgraph $\hat{G}$ of $G$ such that if $\hat{L}$ is the Laplacian of $\hat{G}$, then $x^\top \hat{L} x$ is close to $x^\top L x$ for any $x \in \mathbb{R}^n$. Many methods have been proposed for the spectral sparsification of graphs, see e.g., [55, 56, 27, 67]. A popular approach is to perform row sampling of the matrix $B$ using leverage score sampling [27]. Considering the SVD $B = U \Sigma V^\top$, the leverage score for the $i$-th row of $B$ can be computed as $\ell_i = \|U_{i,:}\|_2^2$ using the rows of $U$. This leverage score is related to the effective resistance of the corresponding edge [55]. By sampling the rows of $B$ according to their leverage scores, it is possible to obtain a matrix $\hat{B}$ such that $\hat{L} = \hat{B}^\top \hat{B}$ and $x^\top \hat{L} x$ is close to $x^\top L x$ for any $x$. In Section 4, we show how the rows of $B$ can be selected via coarsening.
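For concreteness, here is a small scipy.sparse sketch that builds an (unweighted) vertex-edge incidence matrix and the Laplacian $L = B^\top B$ from an edge list; the $\pm 1$ sign convention and the function name are our own choices for this illustration.

import numpy as np
import scipy.sparse as sp

def incidence_matrix(edges, n):
    # Row for edge (u, v) has +1 in column u and -1 in column v.
    rows = np.repeat(np.arange(len(edges)), 2)
    cols = np.asarray(edges).ravel()
    vals = np.tile([1.0, -1.0], len(edges))
    return sp.csr_matrix((vals, (rows, cols)), shape=(len(edges), n))

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
B = incidence_matrix(edges, n=4)
L = (B.T @ B).toarray()    # graph Laplacian: degrees on the diagonal, -1 on edges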

3.4 Subspace iteration

Subspace iteration is a well-established method used for solving eigenvalue and singular value problems [23, 49]. We review this algorithm as it will be exploited later as a tool to improve the SVD results obtained by sampling and coarsening methods. A known advantage of the subspace iteration algorithm is that it is very robust and tolerates changes in the matrix [50]. This is important in our context. Let us consider a general matrix $A \in \mathbb{R}^{m\times n}$, not necessarily associated with a graph. The subspace iteration algorithm can easily be adapted to the situation where a previous SVD is available for a smaller version of $A$ with fewer rows or columns, obtained by subsampling or coarsening for example. Indeed, let $A_1$ be a column-sampled version of $A$. In Matlab notation we represent this as $A_1 = A(:, J_1)$, where $J_1$ is a subset of the column index set $\{1, \dots, n\}$. Let $A_2 = A(:, J_2)$ be another subsample of $A$, where we assume that $J_1 \subset J_2$. Then, given the singular vectors of $A_1$, we can perform a few steps of subspace iteration updates on $A_2$ as shown in Algorithm 1.

Start: $U \leftarrow U_1$, the left singular vectors from the SVD of $A_1$ (or any orthonormal initial basis).
for $iter = 1, 2, \dots$ do
     $V \leftarrow A_2^\top U$
     $V \leftarrow \mathrm{qr}(V)$
     $U \leftarrow A_2 V$ ;
     $U \leftarrow \mathrm{qr}(U)$
     if condition then
          // Diagonalize $U^\top A_2 V$ to obtain current estimate of singular vectors and values
          $[F, \Sigma, G] \leftarrow \mathrm{svd}(U^\top A_2 V)$ ; $U \leftarrow U F$ ; $V \leftarrow V G$ ;
     end if
end for
Algorithm 1 Incremental Subspace Iteration
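A minimal NumPy version of this update might look as follows; the fixed iteration count and the Rayleigh-Ritz step via a small SVD are standard choices and are not claimed to match the paper's implementation exactly.

import numpy as np

def subspace_iteration_update(A2, U0, n_iter=3):
    # Refine an orthonormal basis U0 (e.g., left singular vectors of a sampled
    # or coarsened matrix) into approximate top singular triplets of A2.
    U, _ = np.linalg.qr(U0)
    for _ in range(n_iter):
        V, _ = np.linalg.qr(A2.T @ U)        # update right subspace
        U, _ = np.linalg.qr(A2 @ V)          # update left subspace
    # Rayleigh-Ritz: diagonalize the small projected matrix U^T A2 V.
    F, sigma, Gt = np.linalg.svd(U.T @ A2 @ V)
    return U @ F, sigma, V @ Gt.T

# Example: start from the SVD of a column subsample A1 = A[:, :60].
A = np.random.rand(500, 300)
U1, _, _ = np.linalg.svd(A[:, :60], full_matrices=False)
U, s, V = subspace_iteration_update(A, U1[:, :20])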

3.5 SVD updates from subspaces

A well known algorithm for updating the SVD is the ‘updating algorithm’ of Zha and Simon [68]. Given a matrix $A \in \mathbb{R}^{m\times n}$ and its rank-$k$ partial SVD $A \approx U_k \Sigma_k V_k^\top$, the matrix is updated by adding $p$ columns $D \in \mathbb{R}^{m\times p}$ to it, resulting in a new matrix $A_D = [A,\ D]$. The algorithm first computes

(1)   $(I - U_k U_k^\top)\, D = \hat{U}_p R,$

the (truncated) QR decomposition of $(I - U_k U_k^\top) D$, where $\hat{U}_p$ has orthonormal columns and $R$ is upper triangular. Given (1), one can observe that

(2)   $[A,\ D] \approx [U_k,\ \hat{U}_p] \begin{bmatrix} \Sigma_k & U_k^\top D \\ 0 & R \end{bmatrix} \begin{bmatrix} V_k & 0 \\ 0 & I_p \end{bmatrix}^\top,$

where $I_p$ denotes the $p$-by-$p$ identity matrix. Thus, if $\Theta_k$, $F_k$, and $G_k$ are the matrices corresponding to the $k$ dominant singular values of the middle matrix $H_D$ in (2) and their left and right singular vectors, respectively, then the desired updates $\hat{\Sigma}_k$, $\hat{U}_k$, and $\hat{V}_k$ are given by

(3)   $\hat{\Sigma}_k = \Theta_k, \qquad \hat{U}_k = [U_k,\ \hat{U}_p]\, F_k, \qquad \hat{V}_k = \begin{bmatrix} V_k & 0 \\ 0 & I_p \end{bmatrix} G_k.$

The QR decomposition in the first step, eq. (1), can be expensive when the updates are large, so an improved version of this algorithm was proposed in [65], where this factorization is replaced by a low rank approximation of the same matrix. That is, for a rank $r$, we compute a rank-$r$ approximation $(I - U_k U_k^\top) D \approx X_r \Theta_r Y_r^\top$. Then, the middle matrix in the update equation (2) becomes

$\begin{bmatrix} \Sigma_k & U_k^\top D \\ 0 & \Theta_r Y_r^\top \end{bmatrix},$

with $X_r$ playing the role of $\hat{U}_p$. The idea is that the update $D$ will likely be of low rank outside the subspace spanned by the previous top singular vectors. Hence, a low rank approximation of $(I - U_k U_k^\top) D$ suffices, thus reducing the cost.

In low rank approximation applications, the rank $r$ of this update will typically be much smaller than the number of added columns, and it can be estimated inexpensively using recently proposed numerical rank estimation methods [63, 64].
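The Zha-Simon construction (1)-(3) translates almost directly into code; the sketch below is a straightforward dense NumPy version (not the low-rank variant of [65]), with function and variable names of our own choosing.

import numpy as np

def zha_simon_update(Uk, Sk, Vk, D):
    # Update a rank-k SVD  A ~ Uk @ diag(Sk) @ Vk.T  after appending columns D,
    # following (1)-(3): QR of the part of D outside range(Uk), an SVD of the
    # small middle matrix H_D, then rotation of the enlarged bases.
    k, p = len(Sk), D.shape[1]
    Up, R = np.linalg.qr(D - Uk @ (Uk.T @ D))                 # eq. (1)
    H = np.block([[np.diag(Sk), Uk.T @ D],
                  [np.zeros((p, k)), R]])                     # middle matrix in (2)
    F, Theta, Gt = np.linalg.svd(H)
    U_new = np.hstack([Uk, Up]) @ F[:, :k]                    # eq. (3)
    V_big = np.block([[Vk, np.zeros((Vk.shape[0], p))],
                      [np.zeros((p, k)), np.eye(p)]])
    V_new = V_big @ Gt.T[:, :k]
    return U_new, Theta[:k], V_new        # [A, D] ~ U_new @ diag(Theta[:k]) @ V_new.T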

4 Coarsening

The previous section discussed randomization methods, which work well in certain situations, for example, when there is a good gap in the spectrum or a sharp spectral decay. An alternative way to reduce the matrix dimension, particularly when the matrices are associated with graphs, is to coarsen the data with the help of graph coarsening, perform all computations on the resulting reduced size matrix, and then project back to the original space. Similarly to the idea of sampling columns and computing the SVD of the smaller sampled matrix, in the coarsening methods we compute the SVD from the matrix corresponding to the coarser data. It is also possible to then wind back up and correct the SVD gradually, in a way similar to V-cycle techniques in multigrid [51]; this is illustrated in Figure 1 (left). See, for example, [70, 51, 20, 45] for a few illustrations where coarsening is used in data-related applications.

Figure 1: Left: Coarsening / uncoarsening procedure; Right: A sample hypergraph.

Before coarsening, we first need to build a graph representing the data. This first step may be expensive in some cases, but for data represented by sparse matrices, the graph is available from the data itself in the form of a standard graph or a hypergraph. For dense data, we need to set up a similarity graph; see [10] for a fast algorithm to achieve this. This paper focuses on sparse data, such as the datasets arising in text mining, gene expression analysis, and multilabel classification, to mention a few examples. In such cases, the data is represented by a (rectangular) sparse matrix, and it is most convenient to use hypergraph models [70] for coarsening.

4.1 Hypergraph Coarsening

Hypergraphs extend the classical notion of graphs. A hypergraph $H = (V, E)$ consists of a set of vertices $V$ and a set of hyperedges $E$ [9, 70]. In a standard graph an edge connects two vertices, whereas a hyperedge may connect an arbitrary subset of vertices. A hypergraph can be canonically represented by a boolean matrix $A$, where the vertices in $V$ and the hyperedges (nets) in $E$ are represented by the columns and rows of $A$, respectively. This is called the row-net model. Each hyperedge, a row of $A$, connects the vertices whose corresponding entries in that row are nonzero. An illustration is provided in Figure 1 (right).

Given a (sparse) dataset of $n$ entries in $\mathbb{R}^m$ represented by a matrix $A \in \mathbb{R}^{m\times n}$, we can consider a corresponding hypergraph $H = (V, E)$ with vertex set $V$ corresponding to the columns of $A$. Several methods exist for coarsening hypergraphs, see, e.g., [9, 29]. In this work, we consider a hypergraph coarsening based on column matching, which is a modified version of the maximum-weight matching method, e.g., [9, 14]. The modified approach follows the maximum-weight matching method and computes the nonzero inner products $\langle a_i, a_j \rangle$ between pairs of vertices (columns) $i$ and $j$. Note that the inner product between two vectors is related to the angle $\theta_{ij}$ between them, i.e., $\langle a_i, a_j \rangle = \|a_i\| \|a_j\| \cos\theta_{ij}$. Next, we match two vertices only if the angle between the vertices (columns) satisfies $\tan\theta_{ij} \le \epsilon$, for a constant $\epsilon$. Another feature of the algorithm is that it applies a scaling to the coarsened columns in order to reduce the error. In summary, we combine two columns $a_i$ and $a_j$ if the angle between them satisfies $\tan\theta_{ij} \le \epsilon$. We replace the two columns $a_i$ and $a_j$ by
$$\left(1 + \cos^2\theta_{ij}\right)^{1/2} a_i \quad \text{or} \quad \left(1 + \cos^2\theta_{ij}\right)^{1/2} a_j,$$
the one that has more nonzeros. This minor modification provides some control over the coarsening procedure through the parameter $\epsilon$ and, more importantly, it helps establish theoretical results for the method, see Section 5.

Input: $A \in \mathbb{R}^{m\times n}$, $\epsilon \in (0, 1)$.
Output: Coarse matrix $\hat{A}$.
Set $S := \{1, \dots, n\}$ (unmatched columns); $\hat{A} := [\,]$.
repeat
     Randomly pick $i \in S$; $S := S \setminus \{i\}$.
     Set $p_j := 0$ for all $j \in S$.
     for all $k$ with $a_{ki} \neq 0$ do
          for all $j \in S$ with $a_{kj} \neq 0$ do
               $p_j := p_j + a_{ki} a_{kj}$. // (*) accumulates $\langle a_i, a_j \rangle$
          end for
     end for
     $j := \arg\max_{l \in S} p_l$.
     $\cos\theta_{ij} := p_j / (\|a_i\| \, \|a_j\|)$.
     if [$\tan\theta_{ij} \le \epsilon$] then
          $c := (1 + \cos^2\theta_{ij})^{1/2}\, a_i$. // The denser of columns $a_i$ and $a_j$
          $S := S \setminus \{j\}$.
     else
          $c := a_i$.
     end if
     $\hat{A} := [\hat{A},\ c]$.
until $S = \emptyset$
Algorithm 2 Hypergraph coarsening by column matching.

The vertices can be visited in a random order, or in the ‘natural’ order in which they are listed. For each unmatched vertex $i$, all the unmatched neighbor vertices $j$ are explored and the inner product between $a_i$ and each $a_j$ is computed. This typically requires the data structures of $A$ and its transpose, in that fast access to both rows and columns is required. The vertex $j$ with the highest nonzero inner product is considered, and if the angle between the two columns satisfies $\tan\theta_{ij} \le \epsilon$, then $j$ is matched with $i$ and the procedure is repeated until all vertices have been matched. Algorithm 2 gives details on the procedure.

Note that the loop (*) computes the inner products of column $i$ with the candidate columns $j$ of $A$. The pairing used by the algorithm relies only on the sparsity pattern. It is clear that these entries can also be used to obtain a pairing based on the cosine of the angles between columns $i$ and $j$. The coarse column is defined as the ‘denser of columns $a_i$ and $a_j$’. In other models the sum $a_i + a_j$ is sometimes used.

Computing the cosine of the angle between column $i$ and all other columns is equivalent to computing the $i$-th row of $A^\top A$, in fact only the upper triangular part of the row. For sparse matrices, the computation of the inner products (cosine angles) between the columns can be achieved inexpensively by modifying the cosine algorithm in [48], developed for matrix blocks detection.
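A rough scipy.sparse sketch of one level of Algorithm 2 is given below. It forms the needed inner products through $A^\top A$ for brevity (rather than the on-the-fly loop (*)), uses the $\tan\theta_{ij} \le \epsilon$ test and the $(1+\cos^2\theta_{ij})^{1/2}$ scaling described above, and makes its own choices for details such as the visiting order; it is a sketch, not the authors' implementation.

import numpy as np
import scipy.sparse as sp

def coarsen_columns(A, eps=0.5, scale=True, rng=None):
    # One level of hypergraph coarsening by column matching (cf. Algorithm 2).
    rng = np.random.default_rng() if rng is None else rng
    A = sp.csc_matrix(A)
    G = (A.T @ A).tocsr()                    # inner products between columns
    norms = np.sqrt(G.diagonal()) + 1e-300
    unmatched = set(range(A.shape[1]))
    kept = []
    for i in rng.permutation(A.shape[1]):
        if i not in unmatched:
            continue
        unmatched.discard(i)
        row = G.getrow(i)
        best, best_cos = -1, 0.0
        for j, ip in zip(row.indices, row.data):
            if j in unmatched:
                c = abs(ip) / (norms[i] * norms[j])
                if c > best_cos:
                    best, best_cos = j, c
        col = A[:, [i]]
        # tan(theta) <= eps  <=>  cos(theta) >= 1 / sqrt(1 + eps^2)
        if best >= 0 and best_cos >= 1.0 / np.sqrt(1.0 + eps ** 2):
            unmatched.discard(best)
            if A[:, [best]].nnz > col.nnz:   # keep the denser of the two columns
                col = A[:, [best]]
            if scale:
                col = np.sqrt(1.0 + best_cos ** 2) * col
        kept.append(col)
    return sp.hstack(kept).tocsc()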

Computational Cost

The cost of computing all the inner products of column $a_i$ with the columns of $A$ is the sum of the numbers of nonzeros in the rows corresponding to the nonzero entries of $a_i$:
$$\mathrm{cost}(i) = \sum_{k \,:\, a_{ki} \neq 0} \big|\{\, j : a_{kj} \neq 0 \,\}\big|,$$
where $|\cdot|$ denotes cardinality. If $c_{\max}$ (resp. $r_{\max}$) is the maximum number of nonzeros in any column (resp. row), then an upper bound for the above cost is $c_{\max}\, r_{\max}$ per column, i.e., $n\, c_{\max}\, r_{\max}$ overall. This basic cost is equivalent to computing the upper triangular part of $A^\top A$. Several simplifications and improvements can be added to reduce the cost. First, we can skip the columns that are already matched. In this way, fewer inner products are computed as the algorithm progresses. In addition, since we only need the angle to satisfy $\tan\theta_{ij} \le \epsilon$, we can reduce the computational cost significantly by stopping the search as soon as we encounter a column whose angle with $a_i$ is smaller than the threshold. Article [11] uses the angle-based column matching idea for dense subgraph detection in graphs, and describes efficient methods to compute the inner products.

4.2 Multilevel SVD computations

Given a sparse matrix $A$, we can use Algorithm 2 to recursively coarsen the corresponding hypergraph in the row-net model level by level, and obtain a sequence of sparse matrices $A_1, A_2, \dots, A_r$ with $A_1 = A$, where $A_\ell$ corresponds to the coarse graph $H_\ell$ of level $\ell$ for $\ell = 1, \dots, r$, and $A_r$ represents the lowest level graph $H_r$. This provides a reduced size matrix which is likely a good representation of the original data. Note that recursive coarsening is inexpensive, since the inner products required at the further levels are already available from the first level of coarsening.

In the multilevel framework of hypergraph coarsening, we apply the matrix approximation method, say the SVD, to the coarsened data matrix $A_r \in \mathbb{R}^{m\times n_r}$ at the lowest level, where $n_r$ is the number of data items at the coarsest level ($n_r < n$). A low-rank matrix approximation can be viewed as a linear projection of the columns into a lower dimensional space. In other words, we obtain a projection matrix $U_k \in \mathbb{R}^{m\times k}$ ($k < m$) from $A_r$. Applying this linear projection to $A_r$ produces $U_k^\top A_r$, and one can expect that it preserves certain features of $A_r$. The same linear projection is then applied to the original data to obtain a reduced representation $U_k^\top A$ of the original data. The procedure is illustrated in Figure 1 (left). The multilevel idea is also used in the ConstantTimeSVD algorithm proposed in [16].
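The following short sketch puts the pieces together: coarsen a few levels, take a truncated SVD of the coarsest matrix, and project the full data. It relies on the hypothetical coarsen_columns() helper sketched after Algorithm 2 and on scipy's svds; it is an illustration under those assumptions, not the authors' code.

import scipy.sparse as sp
from scipy.sparse.linalg import svds

def coarse_svd_projection(A, k, levels=2, eps=0.5):
    # Multilevel coarsening followed by a truncated SVD at the coarsest level;
    # the left singular vectors are then used to project the original data.
    Ar = sp.csc_matrix(A, dtype=float)
    for _ in range(levels):
        Ar = coarsen_columns(Ar, eps=eps)    # helper sketched earlier (assumption)
    Uk, sk, _ = svds(Ar, k=k)                # SVD of the small coarse matrix
    Y = (A.T @ Uk).T                         # reduced representation U_k^T A
    return Uk, sk, Y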

Another strategy for reducing the matrix dimension is to mix the two techniques. Coarsening may still be exceedingly expensive for some types of data where there is no immediate graph available to exploit. In this case, a good strategy is to downsample first using a randomized method, then construct a graph and coarsen it. In Section 6, we compare the SVDs obtained from pure randomization methods against those obtained from coarsening and also from randomization followed by coarsening.

4.3 CSSP and graph sparsification

The multilevel coarsening technique presented above can be applied to the column subset selection problem (CSSP) as well as to the graph sparsification problem. We can use Algorithm 2 to coarsen the matrix, which is equivalent to selecting columns of the matrix. The only modification required in the algorithm is that the selected columns are not scaled. The coarse matrix $\hat{A}$ then contains a subset of the columns of the original matrix and yet preserves the structure of $A$.

For graph sparsification, we can apply the coarsening procedure to the matrix $B^\top$, i.e., coarsen the rows of the vertex-edge incidence matrix $B$, which yields a matrix $\hat{B}$ with fewer rows, i.e., a graph with fewer edges. The analysis in Section 5 shows how this coarsening strategy is indeed a spectral sparsifier, i.e., that $x^\top \hat{B}^\top \hat{B} x$ is close to $x^\top B^\top B x$. Since we achieve sparsification via matching, structures such as clusters within the original graph are also preserved.

4.4 Incremental SVD

Next, we explore some combined algorithms that improve the randomized sampling and coarsening SVD results significantly. The typical overall algorithm, which we call the incremental SVD algorithm, is sketched in Algorithm 3.

Start: select a few columns of $A$ by random sampling or coarsening; define $B$ as this sample of columns.
repeat
     Update (compute at the first pass) the SVD of $B$ via the SVD-update algorithm or subspace iteration.
     Add more columns of $A$ to $B$.
until converged
Algorithm 3 Incremental SVD

A version of this incremental algorithm has been briefly discussed in [24], where the basic randomized algorithm is combined with subspace iteration, see Algorithm 8.1 in the reference. For subspace iteration, we know that each iteration takes the computed subspace closer to the subspace spanned by the target singular vectors. If the initial subspace is close to the span of the actual top singular vectors, fewer iterations will be needed to get accurate results. The theoretical results established in the following section give us an idea of how close the subspace obtained from the coarsening technique is to the span of the top singular vectors of the matrix. In such cases, a few steps of subspace iteration will then yield very accurate results.

For the SVD-RR update method, it is known that the method performs well when the updates are of low rank and do not affect the dominant subspace, i.e., the subspace spanned by the top singular vectors of interest, too much [65]. Since the random sampling and coarsening methods return a good approximation to the dominant subspace, we can assume that the updates in the incremental SVD are of low rank, and that these updates likely affect the dominant subspace only slightly. Hence, the SVD-RR update gives accurate results.

5 Analysis

In this section, we establish theoretical results for the coarsening technique based on column matching. Suppose that in the coarsening strategy we combine two columns $a_i$ and $a_j$ if the angle $\theta_{ij}$ between them satisfies $\tan\theta_{ij} \le \epsilon$. We replace the two columns $a_i$ and $a_j$ by $(1 + \cos^2\theta_{ij})^{1/2} a_i$ (or $(1 + \cos^2\theta_{ij})^{1/2} a_j$, the one with more nonzeros). We then have the following key result.

Lemma 5.1.

Given $A \in \mathbb{R}^{m\times n}$, let $\hat{A}$ be the coarsened matrix of $A$ obtained by one level of coarsening of $A$, with columns $a_i$ and $a_j$ matched if $\tan\theta_{ij} \le \epsilon$, for $0 < \epsilon < 1$. Then,

(4)

for any $x \in \mathbb{R}^m$.

Proof.

Let $(i, j)$ be a pair of matched column indices, with $i$ the index of the column that is retained after scaling. We denote by $\mathcal{I}$ the set of all indices of the retained columns and by $\mathcal{J}$ the set of the remaining columns.

We know that , and also . Similarly, consider , where indices . Next, we have,

where the set consists of pairs of indices that are matched. Next, we consider the inner term in the summation. Let the column be decomposed as follows:

where and with a unit vector that is orthogonal to (hence, and has the same length). Then,

Let , then we have and using the fact that and , we get

Now, since our algorithm combines two columns only if (or ), we have

as and . We can further improve the bound to , provided . Thus, we have

    

The above lemma gives us bounds on the Rayleigh quotients of the coarsened matrix $\hat{A}$. This result helps establish the following error bounds.

Theorem 5.2.

Given $A \in \mathbb{R}^{m\times n}$, let $\hat{A}$ be the coarsened matrix of $A$ obtained by one level of coarsening of $A$, with columns $a_i$ and $a_j$ combined if $\tan\theta_{ij} \le \epsilon$, for $0 < \epsilon < 1$. Let $H_k$ be the matrix consisting of the top $k$ left singular vectors of $\hat{A}$ as columns. Then, we have

(5)
(6)

where $A_k$ is the best rank-$k$ approximation of $A$.

Proof.

Frobenius norm error: First, we prove the Frobenius norm error bound. We can express :

We get the above simplifications using the equalities: and . Let for be the columns of . Then, the second term in the above equation is .

From Lemma 5.1, we have for each $i$,

since the $u_i$'s are the singular vectors of $\hat{A}$. Summing over the singular vectors, we get

(8)

From the perturbation theory [23, Thm. 8.1.4], we have

for . Next, we have

from Lemma 5.1. Hence, summing over the singular values,

(9)

Combining (8) and (9), we get

Combining this relation with the Frobenius norm expansion above gives us the Frobenius norm error bound.

Spectral norm error: Next, we prove the spectral norm error bound. Let and let be the orthogonal complement of . For , let , where and . Then,

since and for any , so the first term is zero and for any . Next,

The remaining terms are bounded using Lemma 5.1, which yields the spectral norm bound.     

We observe that our main result (Theorem 5.2) is similar to the results developed for randomized sampling, see [15, 16]. For randomized sampling, the error decreases as the number of sampled columns increases. For coarsening, the error is smaller if the angles between the columns that are combined are smaller. The number of columns is related to these angles, which in turn depend on the structure of the matrix. Existing theoretical results for subspace iteration are discussed in the Appendix.

6 Numerical Experiments

This section describes a number of experiments that illustrate the performance of the different methods discussed. The latter part of the section focuses on the performance of the coarsening method in the applications discussed in Section 2.

6.1 SVD Comparisons

In the first set of experiments, we use three term-by-document datasets and compare the sampling, coarsening and combined methods to compute the SVD. The tests are with unweighted versions of the CRANFIELD dataset (1398 documents, 5204 terms), MEDLINE dataset (1033 documents, 8322 terms) and TIME dataset (425 documents, 13057 terms). We will use these three datasets in the experiments for column subset selection and in the latent semantic indexing application examples, which will give us an extensive evaluation of the performances of the methods compared.

Figure 2: Results for the datasets CRANFIELD (left), MEDLINE (middle), and TIME (right).

Figure 2 illustrates the following experiment with the three datasets. Results from four different methods are plotted. The first solid curve (labeled ‘exact’) shows singular values 20 to 50 of the matrix computed using the svds function in Matlab (the results obtained by the four methods for the top twenty singular values were similar). The diamond curve labeled ‘coarsen’ shows the singular values obtained by one level of coarsening using Algorithm 2. The star curve (labeled ‘rand’) shows the singular values obtained by random sampling using column norms, with a sample size equal to the size obtained with one level of coarsening. We note that the result obtained by coarsening is much better than that obtained by random sampling. However, we know that the approximations obtained by either sampling or coarsening cannot be highly accurate. In order to get improved results, we can invoke the incremental SVD algorithm, Algorithm 3. The curve with triangles labeled ‘coars+ZS’ shows the singular values obtained when the Zha–Simon algorithm was used to improve the results obtained by the coarsening algorithm. Here, we consider the singular vectors of the coarse matrix and use the remaining part of the matrix to update these singular vectors and singular values. We have also included the results obtained by one iteration of the power method [25], i.e., from the SVD of the matrix $(AA^\top)A\Omega$, where $\Omega$ is a random Gaussian matrix chosen so that $A\Omega$ has the same size as the coarse matrix. We see that the smaller singular values obtained from the coarsening algorithms are better than those obtained by the one-step power method.

Figure 3: Second set of results for the CRANFIELD (left) and the MEDLINE datasets (right).

As discussed in Section 4, a possible way of improving the SVD results obtained by a coarsening or random sampling step is to resort to subspace iteration or to use the SVD update algorithms, as in the first experiment. Figure 3 illustrates such results with the incremental SVD algorithms for the CRANFIELD (left) and MEDLINE (right) datasets. We have not reported the results for the TIME dataset since it is hard to distinguish the results obtained by the different algorithms in this case. First, subspace iteration is performed using the matrix $A$ and the singular vectors obtained from coarsening or random sampling. The curve ‘coars+subs’ (star) corresponds to the singular values obtained when subspace iteration was used to improve the SVD obtained by coarsening. Similarly, for the curve labeled ‘rand+subs’ (triangle up), subspace iteration was used with the singular vectors obtained from randomized sampling. We have also included the results when the SVD update algorithm was used to improve the SVD obtained by coarsening (‘coars+ZS’) and random sampling (‘rand+ZS’), respectively. These plots show that both the SVD update algorithm and subspace iteration improve the accuracy of the SVD significantly.

Dataset        Size n     Rank k   Columns c   Coarsen            Rand Sampl         Rand+Coars
                                                Err1      Err2     Err1      Err2     Err1      Err2
Kohonen        4470       50       1256         86.26     0.366    93.07     0.434    93.47     0.566
aft01          8205       50       1040         913.3     0.299    1006.2    0.614    985.3     0.598
FA             10617      30       1504         27.79     0.131    28.63     0.410    28.38     0.288
chipcool0      20082      30       2533         6.091     0.313    6.199     0.360    6.183     0.301
brainpc2       27607      30       865          2357.5    0.579    2825.0    0.603    2555.8    0.585
scfxm1-2b      33047      25       2567         2326.1    --       2328.8    --       2327.5    --
thermomechTC   102158     30       6286         2063.2    --       2079.7    --       2076.9    --
Webbase-1M     1000005    25       15625        --        --       3564.5    --       3551.7    --
Table 1: Low rank approximation: coarsening, random sampling, and rand+coarsening. Err1 = Frobenius norm error of the rank-k approximation; Err2 = average absolute normalized error in the top k singular values. Entries marked -- could not be computed (see text).
Figure 4: Mean absolute singular value errors (left) and Frobenius norm errors (right) for the three methods for the aft01 dataset.

Next, we compare the performance of coarsening and random sampling for computing low rank approximations of matrices. We also consider the combined method of sampling followed by coarsening discussed in the introduction and in Section 4. Table 1 shows comparison results between the three methods, namely coarsening, random sampling, and random sampling+coarsening, for low rank approximation of matrices from various applications. All matrices were obtained from the SuiteSparse matrix collection: https://sparse.tamu.edu/ [12] and are sparse. The errors reported are the Frobenius norm error in computing the rank-$k$ approximation and the average absolute normalized error in the singular values, for the rank $k$ listed in the third column. The size of the input matrix and the number of columns in the coarsened/subsampled matrix are listed in the second and fourth columns, respectively. For very large matrices, the exact singular values cannot be computed, hence we were unable to report Err2 for the last three matrices. For Webbase-1M (size 1,000,005), it is impractical to perform full coarsening. Hence, we only report errors for random sampling and random sampling+coarsening.

Figure 4 plots the two errors (mean absolute singular value error and Frobenius norm error) for the three methods for the aft01 dataset when different levels of coarsening were used, i.e., as the number of columns sampled/coarsened was varied. Here, for ‘rand+coars’, we proceed as follows. First, half of the columns are randomly sampled and then multilevel coarsening is performed with one level less than the pure coarsening method reported alongside. Hence, the rand+coars curve has one fewer data point. Coarsening clearly yields better results (lower errors) than the randomized sampling method. The combined method of random sampling+coarsening works well and performs better than randomized sampling in most cases. For a smaller number of columns, i.e., more levels of coarsening, the Frobenius norm error for rand+coarsen approaches that of full coarsening. However, note that the coarsening procedure is expensive compared to column norm sampling.

Dataset     Size    Rank   Columns   Levels   Coarsening error   levSamp error
CRAN        1398    25     88        4        496.96             501.32
                    50     88        4        467.49             477.25
                    150    175       3        375.40             383.23
MED         1033    50     65        4        384.91             376.23
                    100    130       4        341.51             339.01
TIME        425     25     107       2        411.71             412.77
                    50     107       2        371.35             372.66
                    50     54        3        389.69             391.91
Kohonen     4470    25     981       2        31.89              36.36
Erdos992    6100    50     924       3        100.9              99.29
FA          10617   50     2051      3        26.33              28.37
chipcool0   20082   100    1405      4        6.05               6.14
Table 2: CSSP: Coarsening versus leverage score sampling.

In all the above experiments, we used maximum matching for the coarsening. The choice of $\epsilon$, the parameter that decides the angle for matching, does not seem to affect the errors directly. If we choose a smaller $\epsilon$, we will have a larger coarse matrix (fewer columns are combined) and the error will be small. If we choose a larger $\epsilon$, more columns are combined and the results are typically equivalent to simply using maximum matching while ignoring the angle constraint. Thus, in general, the performance of the coarsening technique depends on the structure of the matrix considered. If more columns are close to each other, i.e., make small angles with each other, the coarsening technique will combine more columns; we can then choose a smaller $\epsilon$ and still obtain good results. If the matrix is very sparse, or if the columns make large angles with each other, coarsening might not yield a small coarse matrix since it will not be able to match many columns. Therefore, selecting the smallest $\epsilon$ that yields a small coarse matrix and yet leads to good approximations depends on the structure of the input matrix.

6.2 Column Subset Selection

In the following experiment, we compare the performance of the coarsening method against the leverage score sampling method for column subset selection. We report results for the same three term-by-document datasets used in the first set of experiments. We also include results obtained for a few sparse matrices from the SuiteSparse matrix collection.

Table 2 presents a few comparisons. The errors reported are the Frobenius norm errors $\|A - P_C A\|_F$, where $P_C$ is the orthogonal projector onto the range of $C$, and $C$ is the coarsened/sampled matrix computed by the multilevel coarsening method or by leverage score sampling of $A$ with the top $k$ singular vectors, for the rank $k$ reported in the table. The number of columns in each test is also reported and is the same for both methods. Recall that for the CSSP, the coarsening and sampling algorithms do not perform a post-scaling of the columns that are selected. We see that the multilevel coarsening method performs very well and is comparable with leverage score sampling in most cases. Note that standard leverage score sampling requires the computation of the top $k$ singular vectors, which can be substantially more expensive than coarsening, especially when $k$ is large.

6.3 Graph Sparsification

The next experiment illustrates how the coarsening method can be used for graph sparsification. We again compare the performance of the coarsening approach to the leverage score sampling method [27] for graph spectral sparsification. Recall that spectral sparsification amounts to computing a sparse graph $\hat{G}$ that approximates the original graph $G$ in the sense that the singular values of the Laplacian $\hat{L}$ of $\hat{G}$ are close to those of $L$, the Laplacian of $G$.

Dataset     Rows of $B$   Rows of $\hat{B}$   Sparsity ratio   Levels   Coarsening error   levSamp error
sprand      1290          332                 0.29             2        0.541              0.575
sprand      1951          499                 0.28             2        0.542              0.579
sprand      2676          679                 0.27             2        0.537              0.580
Maragal4    6005          460                 0.11             4        0.416              0.569
rosen1      12599         1738                0.18             3        0.482              0.304
G1          19176         2486                0.14             3        0.549              0.635
bibd13-6    25428         1619                0.08             4        0.901              0.920

Table 3: Graph sparsification: Coarsening versus leverage score sampling. Error = normalized mean absolute error in the singular values of $L$ and $\hat{L}$.

Table 3 lists the errors obtained when the coarsening and the leverage score sampling approaches were used to compute a sparse graph $\hat{G}$ for different sparse random graphs and a few matrices related to graphs from the SuiteSparse collection. Given a graph $G$, we can form a vertex-edge incidence matrix $B$, such that the Laplacian is $L = B^\top B$. Then, sampling/coarsening the rows of $B$ to get $\hat{B}$ gives us a sparse graph with Laplacian $\hat{L} = \hat{B}^\top \hat{B}$. The type of graph or its name is given in the first column of the table, and the number of rows in the corresponding vertex-edge incidence matrix $B$ is given in the second column. The number of rows in the coarse matrix $\hat{B}$ is listed in the third column. The ratios of sparsity in $\hat{B}$ and $B$ are also given; this ratio indicates the amount of sparsification achieved by sampling/coarsening. Since we have the same number of rows in the coarsened and sampled matrices $\hat{B}$, this ratio is the same for both methods. The error reported is the normalized mean absolute error in the singular values of $L$ and $\hat{L}$, which tells us how close the sparser matrix $\hat{L}$ is to $L$ spectrally. We see that in most cases the coarsening approach performs similarly to or better than leverage score sampling.

6.4 Applications

In this section, we illustrate the performance of the coarsening technique in the various applications introduced in Section 2.

6.4.1 Latent Semantic Indexing

The first application we consider is Latent Semantic Indexing (LSI) [13, 33]. In LSI, we have a term-document matrix $A \in \mathbb{R}^{m\times n}$, representing $n$ documents and $m$ terms that frequently occur in the documents, where $a_{ij}$ is the frequency of the $i$-th term in the $j$-th document. A query is an $m$-vector $q$, normalized to unit length, where the $i$-th component of the query vector is interpreted as the frequency with which the $i$-th term occurs in a topic. Typically, the number of topics to which the documents are related is smaller than the number of unique terms $m$. Hence, finding a set of topics that best describes the collection of documents for a given query $q$ corresponds to keeping only the top $k$ singular vectors of $A$ and obtaining a rank-$k$ approximation. The truncated SVD and related methods are often used in LSI applications. The argument is that a low rank approximation captures the important underlying intrinsic semantic structure associated with terms and documents, and removes the noise or variability in word usage [33]. In this experiment, we employ the coarsened SVD and leverage score sampling SVD algorithms to perform information retrieval by Latent Semantic Indexing (LSI) [51].

Given term-by-document data $A$, we normalize the data using TF-IDF (term frequency-inverse document frequency) scaling. We also normalize the columns to unit vectors. Query matching is the process of finding the documents most relevant to a given query $q$.

Figure 5: LSI results for the MEDLINE dataset on left and TIME dataset on the right.

Figure 5 plots the average precision against the dimension/rank $k$ for the MEDLINE and TIME datasets. When the term-document matrix is large, the computation of the SVD factorization can be expensive for large ranks $k$. The multilevel techniques find a smaller set of document vectors, denoted by $\hat{A}$, to represent $A$. For leverage score sampling, we sample columns using leverage scores with $k$ equal to the rank shown on the x-axis. Just as in standard LSI, we compute the rank-$k$ truncated SVD of $\hat{A} \approx U_k \Sigma_k V_k^\top$, where $k$ is the rank. The reduced representation of $A$ is then $Y = U_k^\top A$, and each query $q$ is transformed to a reduced representation $\hat{q} = U_k^\top q$. The similarity of $q$ and the documents is measured by the cosine distances between $\hat{q}$ and the columns of $Y$. This example clearly illustrates the advantage of the coarsening method over randomized sampling and leverage scores. The multilevel coarsening method performs better than the sampling method in this application, and in some cases it performs as well as the truncated SVD method. Multilevel coarsening algorithms for LSI applications have been discussed in [51], where additional details can be found.
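A compact sketch of this retrieval pipeline is shown below (dense NumPy arrays for simplicity); the basis is computed from the coarsened term-document matrix and reused to project both the documents and the query, as described above. The function name is our own.

import numpy as np
from scipy.sparse.linalg import svds

def lsi_retrieval(A, A_coarse, q, k):
    # Rank-k basis from the coarsened term-document matrix, then project
    # the full document collection and the query into that space.
    Uk, sk, _ = svds(A_coarse, k=k)
    Y = Uk.T @ A                              # reduced documents (k x n)
    qh = Uk.T @ q                             # reduced query
    sims = (qh @ Y) / (np.linalg.norm(qh) * np.linalg.norm(Y, axis=0) + 1e-12)
    return np.argsort(-sims)                  # document indices ranked by relevance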

Figure 6: Purity and entropy values versus dimensions for three types of clustering for ORL dataset.

6.4.2 Projective clustering

The next application we consider is a set of nonlinear projection based clustering techniques. We illustrate how the multilevel coarsening methods can be used for data reduction in this application. We consider three types of nonlinear projection methods, namely Isomap [58], Local Linear Embedding (LLE) [47], and Laplacian eigenmaps [4]. Multilevel algorithms have been used in the clustering application before; for example, article [41] uses a multilevel algorithm, based on MinMaxCut, for document clustering, and Fang et al. [20] applied multilevel algorithms to spectral clustering and manifold learning.

Given $n$ data points, most projective clustering methods start by constructing a graph with edges defined based on certain criteria, such as new distance metrics or manifolds, nearest neighbors, points lying on the same subspace, etc. The graph Laplacian corresponding to this graph is considered, and for a given dimension $d$, the top $d$ eigenvectors of a shifted Laplacian matrix, whose top eigenvectors correspond to the bottom eigenvectors of the original graph Laplacian, are used to cluster the points. We use the following two evaluation metrics to analyze the quality of the clusters obtained, namely purity and entropy [69], given by

$$\mathrm{Purity} = \sum_{i=1}^{K} \frac{n_i}{n} \Big( \frac{1}{n_i} \max_j n_i^j \Big)$$

and

$$\mathrm{Entropy} = \sum_{i=1}^{K} \frac{n_i}{n} \Big( -\frac{1}{\log K} \sum_{j} \frac{n_i^j}{n_i} \log \frac{n_i^j}{n_i} \Big),$$

where $K$ is the number of clusters, $n_i^j$ is the number of entries of class $j$ in cluster $i$, and $n_i$ is the number of data points in cluster $i$. Here, we assume that the labels indicating the class to which the data points belong are available.
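These two metrics are easy to compute from the true class labels and the cluster assignments; the short function below follows the formulas above (with entropy normalized by log K, one common convention), and its name and interface are our own.

import numpy as np

def purity_entropy(labels, clusters):
    # Clustering quality from true class labels and predicted cluster ids.
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    n = len(labels)
    K = len(np.unique(clusters))
    logK = np.log(K) if K > 1 else 1.0
    purity, entropy = 0.0, 0.0
    for c in np.unique(clusters):
        counts = np.unique(labels[clusters == c], return_counts=True)[1]
        frac = counts / counts.sum()
        purity += counts.max() / n
        entropy += (counts.sum() / n) * (-(frac * np.log(frac)).sum() / logK)
    return purity, entropy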

In Figure 6, we present results for the three types of projective clustering methods, viz. Isomap, LLE, and eigenmaps, when coarsening is used before the dimensionality reduction. The dataset used is the popular ORL face dataset [52], which contains 40 subjects and 10 grayscale images per subject, each of size 92×112, with various facial expressions (the matrix size is 10304×400). For the projective methods, we first construct a $k$-nearest-neighbor graph and use a range of embedding dimensions, shown on the x-axis. Note that even though the data is dense, the kNN graph is sparse. The figure presents the purity and entropy values obtained for the three projective clustering methods for these different dimensions with (circle) and without (triangle) coarsening the graph. The solid lines indicate the results when k-means was used directly on the data without dimensionality reduction. We see that the projective methods give improved clustering quality in terms of both purity and entropy, and coarsening further improves their results in many cases by reducing redundancy. This method was also discussed in [19], where additional results and illustrations with other applications can be found.

6.4.3 Genomics - Tagging SNPs

The third application we consider is that of DNA microarray gene analysis. The data from microarray experiments is represented as a matrix $X$, where $x_{ij}$ indicates whether the $j$-th expression level exists for gene $i$. Typically, the matrix could have entries