On Sparse Representations of Linear Operators and the Approximation of Matrix Products
Abstract
Thus far, sparse representations have been exploited largely in the context of robustly estimating functions in a noisy environment from a few measurements. In this context, the existence of a basis in which the signal class under consideration is sparse is used to decrease the number of necessary measurements while controlling the approximation error. In this paper, we instead focus on applications in numerical analysis, by way of sparse representations of linear operators with the objective of minimizing the number of operations needed to perform basic operations (here, multiplication) on these operators. We represent a linear operator by a sum of rank-one operators, and show how a sparse representation that guarantees a low approximation error for the product can be obtained from analyzing an induced quadratic form. This construction in turn yields new algorithms for computing approximate matrix products.
I Introduction
Operations on large matrices are a cornerstone of computational linear algebra. With a few exceptions such as the approximate Lanczos and power methods, most algorithms used by practitioners have aimed to optimize speed of computation under the constraint of obtaining an exact result. Recently, spurred by the seminal paper of Frieze et al. [1], there has been a greater interest in finding algorithms which sacrifice the precision of the result for a gain in the speed of execution.
Consider the low-rank approximation problem; i.e., finding a matrix of rank at most $k$ which approximates a given matrix $A$. The best matrix $A_k$, best in the sense that it minimizes $\|A - A_k\|$ for any unitarily invariant norm (e.g., spectral or Frobenius norms), can be obtained by computing the singular value decomposition (SVD) of $A$. (Throughout this paper, we adopt the Frobenius norm; we use the notation $A_k$ to denote the best rank-$k$ approximation to $A$ and $\tilde A_k$ to denote an approximation to it—it will be easy to avoid confusion with $a_k$, which is used to denote the $k$th column of $A$.) But in some instances, evaluating the SVD, which scales as $O(N^3)$ where $N$ is the largest dimension of $A$, may be too costly. Frieze et al. in [1] showed that $A_k$ can be reasonably well approximated by computing the SVD of a subset of the columns of $A$ only, where the columns are sampled according to their relative powers—i.e., with probability $p_i = \|a_i\|^2 / \|A\|_F^2$—with the expected error coming from using the approximation $\tilde A_k$ instead of $A_k$ being of the form $E\,\|A - \tilde A_k\|_F^2 \le \|A - A_k\|_F^2 + \varepsilon\, \|A\|_F^2$. In subsequent papers it has been argued that the additive error term in this bound may be large, and thus other sampling techniques have been introduced to obtain relative approximation error bounds (see, e.g., [2, 3]).
In this paper, we address the sparse representation of linear operators for the approximation of matrix products. An important object that will appear in our study is the so-called Nyström method (see Section II) to find a low-rank approximation to a positive kernel. This method, familiar to numerical analysts, has nowadays found applications beyond its original field, most notably in machine learning. In previous work [4], we proposed an approach for low-rank approximation in such applications; here we show that the task of finding optimal sparse representations of linear operators, in order to evaluate their product, is related to the Nyström extension of a certain positive definite kernel. We will then use this connection to derive and bound the error of two new algorithms, which our simulations indicate perform well in practice.
Related to our work is that of [5] for matrix products. In that paper, Drineas et al. showed that a randomized algorithm sampling $k$ columns of $A$ and rows of $B$ in proportion to their relative powers $\|a_i\|\,\|b_i\|$ yields an expected error of order $\|A\|_F\, \|B\|_F / \sqrt{k}$. Notice that this bound does not involve a low-rank approximation of $A$ or $B$. In contrast, we obtain a randomized algorithm bound in which the approximating rank of a kernel related to $A$ and $B$ appears explicitly.
The methods mentioned above are all adaptive, in the sense that they require some knowledge about $A$ and $B$. A very simple nonadaptive method is given by an application of the Johnson-Lindenstrauss Lemma: it is easy to show [6] that if $\Omega$ is a $k \times n$ matrix with independent unit Normal entries and $x, y \in \mathbb{R}^n$, then for $k$ sufficiently large we have that

$\frac{1}{k}\, (\Omega x)^T (\Omega y) \approx x^T y$

with high probability. Letting $a^i$ denote the $i$th row of $A$ and $b_j$ the $j$th column of $B$, we observe the following elementwise relation:

$(AB)_{ij} = \langle a^i, b_j \rangle \approx \frac{1}{k}\, \langle \Omega\, a^i, \Omega\, b_j \rangle,$

and thus we see that approximating $AB$ by $\frac{1}{k}\, A\, \Omega^T \Omega\, B$ yields a good result with high probability. Later we compare this method to our algorithms described below.
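This nonadaptive scheme is straightforward to sketch numerically. The snippet below (NumPy; the dimensions and sketch sizes are our own illustrative choices, not the paper's) forms the estimator $\frac{1}{k} A \Omega^T \Omega B$ and shows the error shrinking as the sketch size $k$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, p = 60, 300, 40                  # illustrative dimensions (our choice)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
AB = A @ B

def jl_product(A, B, k, rng):
    """Approximate A @ B via a k x n Gaussian projection Omega:
    since E[Omega.T @ Omega / k] = I, (A @ Omega.T)(Omega @ B) / k estimates A @ B."""
    Omega = rng.standard_normal((k, A.shape[1]))
    return (A @ Omega.T) @ (Omega @ B) / k

err_small_k = np.linalg.norm(AB - jl_product(A, B, 100, rng)) / np.linalg.norm(AB)
err_large_k = np.linalg.norm(AB - jl_product(A, B, 10000, rng)) / np.linalg.norm(AB)
print(err_small_k, err_large_k)        # error shrinks roughly like 1/sqrt(k)
```

Note that the projection is oblivious to $A$ and $B$, which is what makes the method nonadaptive.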
The remainder of this paper is organized as follows. In Section II, we briefly review the Nyström method used to approximate positive definite matrices [7, 8]. In Section III, we introduce the problem of approximating a matrix product and highlight two key aspects: the issue of best subset selection and the issue of optimal rescaling. We then solve the optimal rescaling problem and analyze a randomized and a deterministic algorithm for subset selection; we conclude with simulations and a brief discussion of algorithmic complexity.
II Approximation via the Nyström Method
To provide context for our results, we first introduce the so-called Nyström method to approximate the eigenvectors of a symmetric positive semidefinite (SPSD) matrix.
II-A The Nyström Method for Kernel Approximation
The Nyström method, familiar in the context of finite element methods, has found many applications in machine learning and computer vision in recent years (see, e.g., [8] and references therein). We give here a brief overview: let $k(x, y)$ be a positive semidefinite kernel on $[0, 1] \times [0, 1]$ and let $(\lambda_i, \varphi_i)$, $i = 1, 2, \dots$, denote pairs of eigenvalues and eigenfunctions such that

$\int_0^1 k(x, y)\, \varphi_i(y)\, dy = \lambda_i\, \varphi_i(x). \qquad (1)$
The Nyström extension is a method to approximate the eigenfunctions of the kernel based on a discretization of the interval $[0, 1]$. Define the points $x_j = j/n$, $j = 1, \dots, n$, so that the $x_j$'s are evenly spaced along the interval. Then form the Gram matrix $K_{ij} = k(x_i, x_j)$, which in turn is used to approximate (1) by the finite-dimensional spectral problem

$\frac{1}{n} \sum_{j=1}^{n} K_{ij}\, u(j) = \lambda^{(n)}\, u(i).$

The Nyström extension then uses an eigenpair $(\lambda^{(n)}, u)$ of this finite problem to give an estimate of the corresponding eigenfunction as follows:

$\hat\varphi(x) = \frac{1}{n\, \lambda^{(n)}} \sum_{j=1}^{n} k(x, x_j)\, u(j).$
This method can also be applied in the context of matrices. Let $G$ be an $n \times n$ SPSD matrix, partitioned as

$G = \begin{pmatrix} W & Y^T \\ Y & Z \end{pmatrix},$

where $W \in \mathbb{R}^{k \times k}$ and $k$ is typically much smaller than $n$. It is then possible to approximate eigenvectors and eigenvalues of $G$ by using the eigendecomposition of $W$ as follows. Define $U$ and $\Lambda$ via $W = U \Lambda U^T$, with $U$ orthogonal and $\Lambda$ diagonal. The Nyström extension then tells us that an approximation for $k$ eigenvectors of $G$ is given by

$\tilde U = \begin{pmatrix} U \\ Y U \Lambda^{-1} \end{pmatrix}.$
These approximations $\tilde U$ and $\Lambda$ in turn yield an approximation $\tilde G = \tilde U \Lambda \tilde U^T$ to $G$ as follows:

$\tilde G = \begin{pmatrix} W & Y^T \\ Y & Y W^{-1} Y^T \end{pmatrix}.$
The quality of this approximation can then be measured as the (e.g., Frobenius) norm of the Schur complement of $W$ in $G$:

$\|G - \tilde G\|_F = \|Z - Y W^{-1} Y^T\|_F.$
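A small numerical illustration of the matrix Nyström approximation described above (a NumPy sketch; the test matrix, its spectrum, and the block sizes are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 10

# Build an SPSD matrix G = Q diag(d) Q^T with rapidly decaying spectrum.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
d = np.logspace(0, -6, n)
G = (Q * d) @ Q.T

# Partition G as [[W, Y^T], [Y, Z]], with W the leading k x k block.
W, Y, Z = G[:k, :k], G[k:, :k], G[k:, k:]

# The Nyström extension U_tilde = [U; Y U Lambda^{-1}] (with W = U Lambda U^T)
# yields G_tilde = U_tilde Lambda U_tilde^T, whose trailing block is Y W^{-1} Y^T.
G_tilde = np.block([[W, Y.T], [Y, Y @ np.linalg.solve(W, Y.T)]])

# The approximation error is the norm of the Schur complement of W in G.
schur = Z - Y @ np.linalg.solve(W, Y.T)
err = np.linalg.norm(G - G_tilde)
print(err)
```

Since $G$ and $\tilde G$ agree on all blocks except the trailing one, the error reduces exactly to the Schur complement norm, as the sketch confirms.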
III Approximation of Matrix Products
Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$. We use the notation $a_i$ to denote the columns of $A$ and $b_i$ the rows of $B$. We can write the product $AB$ as the sum of $n$ rank-one matrices as follows:

$AB = \sum_{i=1}^{n} a_i\, b_i. \qquad (2)$
Our approach to estimating the product $AB$, akin to model selection in statistics, will consist of keeping only a few terms in the sum of (2); this entails choosing a subset of columns of $A$ and of rows of $B$, and rescaling their outer products as appropriate. To gain some basic insight into this problem, we may consider the following two extreme cases with $n = 2$ and $k = 1$. First, suppose that the rank-one terms $a_1 b_1$ and $a_2 b_2$ are collinear, i.e., $a_2 b_2 = c\, a_1 b_1$ for some scalar $c$. Then $AB = (1 + c)\, a_1 b_1$, hence we can recover the product without error by only keeping one term of the sum of (2) and rescaling it appropriately, provided that we know the correlation between the two terms. At the other extreme, if $a_1 b_1$ and $a_2 b_2$ are orthogonal, rescaling will not decrease the error no matter which term in (2) is kept.
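Both the decomposition (2) and the collinear extreme case are easy to verify numerically; the following sketch (NumPy, with small arbitrary dimensions of our choosing) does so:

```python
import numpy as np

# Decomposition (2): with a_i the columns of A and b_i the rows of B,
# A @ B equals the sum of the n rank-one matrices outer(a_i, b_i).
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
terms = [np.outer(A[:, i], B[i, :]) for i in range(3)]
assert np.allclose(A @ B, sum(terms))

# Extreme case: if a_2 b_2 = c * (a_1 b_1), a single rescaled term is exact.
A2 = np.column_stack([A[:, 0], 2.0 * A[:, 0]])   # collinear columns
B2 = np.vstack([B[0, :], 3.0 * B[0, :]])         # collinear rows
c = 2.0 * 3.0                                    # a_2 b_2 = 6 * a_1 b_1
assert np.allclose(A2 @ B2, (1 + c) * np.outer(A2[:, 0], B2[0, :]))
```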
Hence we see that there are two key aspects to the problem of sparse matrix product approximation as formulated above:
1) Optimal Model Selection: which rows/columns should be retained?

2) Optimal Reweighting: how should these rows/columns be rescaled?
As we show below, the latter of these problems can be solved exactly at relatively low computational cost. For the former, which is combinatorial in nature and seemingly much harder to solve, we give an efficient approximation procedure.
IV Solving the Optimal Reweighting Problem
We first consider the problem of optimal reweighting, conditioned upon a choice of subset. In particular, suppose that an oracle gives us the best subset $S$ of cardinality $k$ with which to estimate the product $AB$. Without loss of generality, we assume that $S = \{1, \dots, k\}$. We then have the following result characterizing how well one can estimate the product $AB$:
Theorem 1.
Let the SPSD matrix $G$ be defined as $G = (A^T A) \circ (B B^T)$ (i.e., $G_{ij} = (a_i^T a_j)(b_i\, b_j^T)$, where $\circ$ is the Hadamard or entrywise product of matrices) and have the partition

$G = \begin{pmatrix} W & Y^T \\ Y & Z \end{pmatrix}, \qquad (3)$

where $S = \{1, \dots, k\}$ without loss of generality, and $W$ is the corresponding $k \times k$ principal submatrix.
Then the best approximation to the product $AB$ using the terms $\{a_i b_i\}_{i \in S}$ is given by

$\widehat{AB} = \sum_{i \in S} w_i\, a_i\, b_i, \qquad (4)$

where the weight vector $w = (w_1, \dots, w_k)^T$ is given by

$w = \mathbf{1}_k + W^{-1} Y^T \mathbf{1}_{n-k},$

and $\mathbf{1}_d$ denotes the $d$-dimensional vector whose entries are all equal to one.
Moreover, the squared approximation error in Frobenius norm is given by

$\varepsilon^2(S) = \|AB - \widehat{AB}\|_F^2 = \mathbf{1}_{n-k}^T\, S_G(W)\, \mathbf{1}_{n-k},$

with $S_G(W) = Z - Y W^{-1} Y^T$ the Schur complement of $W$ in $G$.
This result tells us how well we can approximate the product $AB$ granted that we know only a few rows/columns of $A$ and $B$, and their correlations with the remaining rows and columns. It also allows us to characterize the best subset of size $k$; it is the subset that minimizes $\mathbf{1}^T S_G(W)\, \mathbf{1}$.
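The statement of Theorem 1 can be checked numerically. The sketch below (NumPy; the dimensions are arbitrary choices of ours, and the weight formula $w = \mathbf{1}_k + W^{-1} Y^T \mathbf{1}_{n-k}$ is the closed form derived in the proof) compares the achieved squared error with the Schur-complement expression:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p, k = 13, 10, 14, 4          # m, p >= n so that G is positive definite
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# G = (A^T A) o (B B^T): G_ij = (a_i . a_j)(b_i . b_j).
G = (A.T @ A) * (B @ B.T)
W, Y, Z = G[:k, :k], G[k:, :k], G[k:, k:]

# Optimal weights (assumed closed form): w = 1_k + W^{-1} Y^T 1_{n-k}.
w = np.ones(k) + np.linalg.solve(W, Y.T @ np.ones(n - k))

approx = sum(w[i] * np.outer(A[:, i], B[i, :]) for i in range(k))
err2 = np.linalg.norm(A @ B - approx) ** 2

# Theorem 1 predicts err2 = 1^T (Z - Y W^{-1} Y^T) 1.
schur = Z - Y @ np.linalg.solve(W, Y.T)
ones = np.ones(n - k)
print(err2, ones @ schur @ ones)    # the two quantities agree up to rounding
```

As a sanity check, the optimally reweighted sum can never do worse than plain truncation (all weights equal to one), since the latter is a feasible choice of weights.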
Proof:
Given the subset $S = \{1, \dots, k\}$, we seek the best scaling factors $w_i$ to minimize the squared approximation error $\varepsilon^2 = \|AB - \sum_{i \in S} w_i\, a_i b_i\|_F^2$. By distributing the product and using the linearity of the trace, we get

$\varepsilon^2 = \mathrm{tr}\big[(AB)^T AB\big] - 2 \sum_{i \in S} w_i\, \mathrm{tr}\big[(a_i b_i)^T AB\big] + \sum_{i, j \in S} w_i w_j\, G_{ij}, \qquad (5)$

where we made use of the following equality:

$\mathrm{tr}\big[(a_i b_i)^T (a_j b_j)\big] = (a_i^T a_j)(b_i\, b_j^T) = G_{ij}.$

We now work towards rewriting (5) in a more manageable form. First, using the fact that $AB = \sum_{j=1}^n a_j b_j$, we see that

$\mathrm{tr}\big[(a_i b_i)^T AB\big] = \sum_{j=1}^{n} G_{ij} = (G\, \mathbf{1}_n)_i. \qquad (6)$

Similarly, using (6), we get after an easy computation that

$\mathrm{tr}\big[(AB)^T AB\big] = \mathbf{1}_n^T\, G\, \mathbf{1}_n.$

We can thus rewrite (5) in the more manageable form

$\varepsilon^2 = \mathbf{1}_n^T G\, \mathbf{1}_n - 2\, w^T \big(W \mathbf{1}_k + Y^T \mathbf{1}_{n-k}\big) + w^T W w. \qquad (7)$
The optimal weight vector is now obtained by setting the gradient of (7) with respect to $w$ to zero. Hence we obtain

$w = W^{-1}\big(W \mathbf{1}_k + Y^T \mathbf{1}_{n-k}\big) = \mathbf{1}_k + W^{-1} Y^T \mathbf{1}_{n-k},$

which proves the first part of the statement.
For the second part, substituting the optimal $w$ into (7), we find that at the optimum the error is

$\varepsilon^2 = \mathbf{1}_n^T G\, \mathbf{1}_n - \mathbf{1}_n^T \tilde G\, \mathbf{1}_n, \qquad \tilde G = \begin{pmatrix} W & Y^T \\ Y & Y W^{-1} Y^T \end{pmatrix},$

where we see that $\tilde G$ is the Nyström approximation of $G$ as described in Section II. Using Lemma 1 below to identify $\mathbf{1}_n^T G\, \mathbf{1}_n$ with $\|AB\|_F^2$, and noting that $G$ and $\tilde G$ differ only in their trailing block, we have

$\varepsilon^2 = \mathbf{1}_n^T (G - \tilde G)\, \mathbf{1}_n = \mathbf{1}_{n-k}^T \big(Z - Y W^{-1} Y^T\big)\, \mathbf{1}_{n-k},$

which finishes the proof of the Theorem.
∎
The proof of the second part of Theorem 1 is based on the identity proven below:
Lemma 1.
Let $A$ and $B$ be real matrices of dimensions $m \times n$ and $n \times p$, respectively, and let $\mathbf{1}$ be the $n$-dimensional vector with all entries equal to one. The following identity holds:

$\|AB\|_F^2 = \mathbf{1}^T \big[(A^T A) \circ (B B^T)\big]\, \mathbf{1}. \qquad (8)$
Proof:
Recall that we can write the product $AB$ as a sum of rank-one terms as follows:

$AB = \sum_{i=1}^{n} a_i\, b_i.$

We thus have, by definition of the Frobenius norm, that

$\|AB\|_F^2 = \mathrm{tr}\big[(AB)^T AB\big] = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{tr}\big[b_i^T\, a_i^T a_j\, b_j\big].$

Using the invariance of the trace with respect to cyclic permutations, the last equation yields

$\|AB\|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} (a_i^T a_j)(b_i\, b_j^T) = \mathbf{1}^T \big[(A^T A) \circ (B B^T)\big]\, \mathbf{1},$

and the relation (8) is proved. ∎
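The identity (8) is easily checked numerically; a minimal NumPy sketch (dimensions are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 7))
B = rng.standard_normal((7, 6))

# G = (A^T A) o (B B^T): G_ij = (a_i . a_j)(b_i . b_j).
G = (A.T @ A) * (B @ B.T)
ones = np.ones(7)

lhs = np.linalg.norm(A @ B, "fro") ** 2
rhs = ones @ G @ ones
print(lhs, rhs)   # identical up to rounding, per identity (8)
```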
IV-A Approximating the Optimal Subset Selection Procedure
Having shown a solution to the optimal reweighting problem according to Theorem 1, we now turn our attention to the companion problem of optimal subset selection. In order to minimize the approximation error, we have to find the subset whose associated Schur complement has the lowest possible power along the one-dimensional subspace spanned by the all-ones vector $\mathbf{1}$. Determining the eigenvectors and eigenvalues of this Schur complement, and relating them to those of $A$ and $B$, is not an easy task. Here we present two approximations: one based on a random choice of subsets, and an alternative “greedy” approach which yields a worst-case error bound.
IV-A1 Random Choice of Subset
We first discuss a random oracle which outputs a subset $S$ with probability $p(S)$ defined below. Recall our earlier definition of the matrix $G = (A^T A) \circ (B B^T)$ according to Theorem 1; this approach is motivated by the expression of the resultant squared error, conditioned upon having chosen a subset $S$, as $\varepsilon^2(S) = \mathbf{1}^T S_G(W)\, \mathbf{1}$. Since $S_G(W)$ is positive semidefinite, its trace is larger than its largest eigenvalue, and we can bound this error as follows:

$\varepsilon^2(S) = \mathbf{1}^T S_G(W)\, \mathbf{1} \le (n - k)\, \lambda_{\max}\big(S_G(W)\big) \le (n - k)\, \mathrm{tr}\, S_G(W). \qquad (9)$

Note that equality is obtained when $S_G(W)$ has rank one with $\mathbf{1}$ as eigenvector, and hence this bound is tight. We have investigated in [9] an algorithm to minimize $\mathrm{tr}\, S_G(W)$, which has been shown to be effective in the context of low-rank covariance matrix approximation.
Returning to our random oracle, note that both $A^T A$ and $B B^T$ are positive definite (for $A$ and $B$ of full rank $n$), and thus by the Schur Product Theorem [10], $G$ is also positive definite. From this, we conclude:

1) There exists a matrix $X$ such that $G = X^T X$;

2) All the principal minors of $G$ are positive.

Consequently, we assume here that an oracle returns a subset $S$ of cardinality $k$ with probability

$p(S) = \frac{\det W(S)}{C}, \qquad (10)$

where $W(S)$ denotes the $k \times k$ principal submatrix of $G$ indexed by $S$, $C = \sum_{|S'| = k} \det W(S')$ is a normalizing constant, and the second fact above ensures that this probability distribution is well defined. We may then adapt the following result from the proof of Theorem 1 in [4]:
Theorem 2.
Let $G$ be a positive quadratic form with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$. If $S$ is chosen with probability $p(S)$ as in (10), then

$E\big[\mathrm{tr}\, S_G(W)\big] \le (k + 1) \sum_{i = k + 1}^{n} \lambda_i. \qquad (11)$
Combining (9) with (11) leads, via Jensen’s inequality, to an upper bound on the average error of this approach to random subset selection:

$E\big[\varepsilon(S)\big] \le \sqrt{(n - k)(k + 1) \sum_{i = k + 1}^{n} \lambda_i} = \sqrt{(n - k)(k + 1)\, \mathrm{tr}(G - G_k)},$

where $G_k$ denotes the optimal rank-$k$ approximation to $G$ obtained by truncating its singular value decomposition, so that $\sum_{i = k+1}^{n} \lambda_i = \mathrm{tr}(G - G_k)$.
Despite the appearance of the factor $(n - k)(k + 1)$ in this bound, it serves to relate the resultant approximation quality to the ranks of $A$ and $B$, a feature reinforcing the well-foundedness of the accompanying algorithm we present below. In particular, if $\mathrm{rank}(G) \le k$, then the approximation error is zero as expected. For practical reasons, we may also wish to relate this error to the eigenvalues of $A$ and $B$. To this end, let $C$ and $D$ be two matrices such that the product $CD$ is defined, and let $\sigma_i(CD)$ (resp. $\sigma_i(C)$, $\sigma_i(D)$) be the singular values of $CD$ (resp. $C$, $D$) sorted in nonincreasing order. We then have the following majorization relation [11]:

$\sum_{i=1}^{t} \sigma_i(CD) \le \sum_{i=1}^{t} \sigma_i(C)\, \sigma_i(D), \qquad t = 1, 2, \dots$

In particular, the singular values of $A^T A$ and $B B^T$ are the squares of the singular values of $A$ and $B$ respectively:

$\sigma_i(A^T A) = \sigma_i^2(A), \qquad \sigma_i(B B^T) = \sigma_i^2(B). \qquad (12)$

We may then conclude from (12) that

$\sum_{i = k+1}^{n} \lambda_i(G) \le \sum_{i = k+1}^{n} \sigma_i^2(A)\, \sigma_i^2(B).$
Although the approach presented above relies on an oracle to sample subsets in proportion to $\det W(S)$, we will subsequently outline a realizable algorithm based on these results.
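For small $n$ the distribution (10) can be realized by brute-force enumeration of all $k$-subsets, as the following sketch illustrates (NumPy; the dimensions are our own illustrative choices, and in practice one would resort to an approximate sampler such as the Metropolis-Hastings scheme used in Section V):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, k = 8, 3
A = rng.standard_normal((9, n))    # m, p >= n so that G is positive definite
B = rng.standard_normal((n, 7))
G = (A.T @ A) * (B @ B.T)

# p(S) is proportional to det W(S), the k x k principal minor of G indexed by S.
subsets = list(combinations(range(n), k))
minors = np.array([np.linalg.det(G[np.ix_(S, S)]) for S in subsets])
probs = minors / minors.sum()      # well defined: all principal minors positive

S = subsets[rng.choice(len(subsets), p=probs)]
print(S)
```

Enumeration costs $\binom{n}{k}$ determinant evaluations, which is why this brute-force version is only viable for tiny problems.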
IV-A2 Deterministic Choice of Subset
Recall that Theorem 1 indicates we should ensure that the diagonal terms of the trailing block $Z$ are kept as small as possible. Hence, as a deterministic approximation to the optimal subset selection procedure, we may take $S$ such that it contains the indices of the $k$ largest terms $G_{ii} = \|a_i\|^2\, \|b_i\|^2$. While yielding only a worst-case error bound, this approach has the advantage of being easily implementable (as it does not require sampling according to $p(S)$); it also appears to perform well in practice [9]. The resulting greedy procedure, summarized as Algorithm 1, first selects these indices (Step 1) and then applies the optimal reweighting of Theorem 1 to the selected terms.
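A minimal sketch of this greedy procedure (our NumPy reconstruction from the description above: Step 1 selects the indices of the $k$ largest diagonal entries $G_{ii}$, after which the optimal reweighting of Theorem 1 is applied; the function name and dimensions are ours):

```python
import numpy as np

def greedy_product_approx(A, B, k):
    """Greedy selection plus optimal reweighting, as a sketch.

    Step 1: keep the indices of the k largest G_ii = ||a_i||^2 ||b_i||^2.
    Step 2: reweight via w = 1 + W^{-1} Y^T 1 (Theorem 1).
    For simplicity the full matrix G is formed here, although only its
    W and Y blocks are actually required."""
    n = A.shape[1]
    diag = np.sum(A**2, axis=0) * np.sum(B**2, axis=1)
    S = np.sort(np.argsort(diag)[-k:])
    rest = np.setdiff1d(np.arange(n), S)
    G = (A.T @ A) * (B @ B.T)
    W = G[np.ix_(S, S)]
    Y = G[np.ix_(rest, S)]
    w = np.ones(k) + np.linalg.solve(W, Y.T @ np.ones(n - k))
    return sum(w[t] * np.outer(A[:, i], B[i, :]) for t, i in enumerate(S))

rng = np.random.default_rng(6)
A = rng.standard_normal((35, 30))
B = rng.standard_normal((30, 32))
rel_err = np.linalg.norm(A @ B - greedy_product_approx(A, B, 10)) / np.linalg.norm(A @ B)
print(rel_err)
```

Note that when $k = n$ all terms are kept, the correction term vanishes, and the product is recovered exactly.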
Since the error term $\mathbf{1}^T S_G(W)\, \mathbf{1}$ is the sum of all the entries of the Schur complement, we can look to bound its largest element. To this end, we have the following result:
Lemma 2.
The largest entry of $S_G(W) = Z - Y W^{-1} Y^T$ is smaller than the largest diagonal element of $Z$ in (3).
This lemma confirms that a good error-minimization strategy is to make sure that the diagonal terms of $Z$ are as small as possible, or equivalently to take $S$ such that it contains the indices of the $k$ largest $G_{ii}$, as per Algorithm 1.
The proof of Lemma 2 is based on the following set of simple results:
Lemma 3.
If $M$ is a positive definite matrix, then the largest entry of $M$ is positive and located on the diagonal of $M$.
Proof:
Since $M$ is positive definite, we know there exists a matrix $X$ such that $M = X^T X$, i.e., $M_{ij} = x_i^T x_j$ with $x_i$ the columns of $X$. By the Cauchy-Schwarz inequality, we have

$M_{ij}^2 = (x_i^T x_j)^2 \le \|x_i\|^2\, \|x_j\|^2 = M_{ii}\, M_{jj},$

from which we deduce that one of the following inequalities has to be satisfied:

$M_{ij} \le M_{ii} \quad \text{or} \quad M_{ij} \le M_{jj}. \qquad (13)$

Now if we suppose that the largest entry of $M$ is not a diagonal element, the relations of (13) yield a contradiction, and hence the largest entry of $M$ is on its main diagonal. ∎
The entries of $S_G(W)$, the Schur complement of $W$ in $G$, can be characterized explicitly according to the following formula:
Lemma 4 (CrabtreeHaynsworth [12]).
Let $W$ be a nonsingular leading principal submatrix of $G$ obtained by keeping the rows and columns with indices $1, \dots, k$. Then $S_G(W)$, the Schur complement of $W$ in $G$, is given elementwise by

$\big(S_G(W)\big)_{ij} = \frac{\det G\big(\{1, \dots, k, k + i\},\, \{1, \dots, k, k + j\}\big)}{\det W}, \qquad (14)$

where $G(I, J)$ denotes the submatrix of $G$ with rows indexed by $I$ and columns indexed by $J$.
Furthermore, it is possible to bound the diagonal entries of $S_G(W)$ as follows:
Lemma 5 (Fischer’s Lemma [10]).
If $M = \begin{pmatrix} W & Y^T \\ Y & Z \end{pmatrix}$ is a positive definite matrix, then $\det M \le \det W\, \det Z$.
We are now ready to give the proof of Lemma 2:
Proof:
By Lemma 3, the largest entry of the positive definite matrix $S_G(W)$ lies on its main diagonal. By the Crabtree-Haynsworth formula (14), its $i$th diagonal entry is $\det G(\{1, \dots, k, k+i\}) / \det W$, and applying Fischer’s Lemma to the positive definite principal submatrix $G(\{1, \dots, k, k+i\})$ bounds this quantity by $\det W \cdot Z_{ii} / \det W = Z_{ii}$. Hence the largest entry of $S_G(W)$ is smaller than the largest diagonal element of $Z$. ∎
Lemma 2 can be further refined to give a worst-case error bound for deterministic matrix product approximation, conditioned on a choice of subset and the corresponding optimal reweighting procedure. Appealing to the inequality of arithmetic and geometric means to further bound the elements of $S_G(W)$, the results of Theorem 1 and Lemmas 3–5 yield

$\varepsilon^2(S) = \mathbf{1}^T S_G(W)\, \mathbf{1} \le (n - k)\, \mathrm{tr}\, S_G(W) \le (n - k) \sum_{i \notin S} G_{ii},$

a bound that is minimized precisely by retaining the indices of the $k$ largest diagonal entries of $G$.
V Simulation Studies and Complexity
V-A Experimental Results
We now present preliminary experimental results and discuss the computational complexity of the algorithms under consideration. Three sets of experiments were performed, in which we compared the performance of four subset selection methods: a baseline uniform sampling on subsets; sampling according to the row/column powers [5]; sampling in proportion to the principal minors of $G$ according to (10); and selecting greedily according to Step 1 of Algorithm 1. We also compared the choice of reweighting following subset selection, in one case applying the optimal reweighting of Theorem 1 and in the other simply reweighting according to the row/column powers (see [1, 5]):

$\widehat{AB} = \sum_{i \in S} \frac{1}{k\, p_i}\, a_i\, b_i, \qquad p_i = \frac{\|a_i\|\, \|b_i\|}{\sum_{j=1}^{n} \|a_j\|\, \|b_j\|}. \qquad (15)$
To test these various experimental conditions, we drew 200 random matrix pairs $A$ and $B$ in total, each matrix having independent unit Normal entries. We then averaged the error of the randomized algorithms over 20 trials per matrix product, and report relative error in decibels, $10 \log_{10}\big(\|AB - \widehat{AB}\|_F^2 / \|AB\|_F^2\big)$, for each test condition.
In the first set of experiments, shown in Figure 1, we compare the four different algorithms for subset selection described above, applied in conjunction with a reweighting according to row/column powers. The highest-error method in this case corresponds to choosing the subset uniformly at random, and thus should be understood as a baseline measure of performance as a function of the approximant rank $k$. It can also be seen that sampling according to the relative powers of the rows/columns of $A$ and $B$, and sampling via a Metropolis-Hastings algorithm (with independent proposal distributions taken in proportion to row/column powers), yield similar results, with both improving upon the baseline performance. The best results in this case are obtained by the greedy subset selection method indicated by Step 1 of Algorithm 1.
In a second set of experiments, we followed the same procedure as above to compare subset selection procedures, but applied the optimal reweighting of Theorem 1 rather than a rescaling according to row/column powers. Performance in this case is (as expected) seen to be better overall, but with the ordering of the methods unchanged. As our final experiment, we compared the method of Algorithm 1 (greedy subset selection followed by optimal rescaling) to two nonadaptive methods: choosing rows/columns of $A$ and $B$ uniformly at random and rescaling according to (15), and the simple Johnson-Lindenstrauss random projection approach outlined in Section I. These nonadaptive methods can be seen to yield significantly worse performance than Algorithm 1, suggesting its potential as a practical method of selecting sparse representations of linear operators that yield low approximation errors for the resultant matrix products.
We conclude with a brief discussion of the algorithmic complexity of Algorithm 1. First, assume without loss of generality that $m, p \le n$, and recall that straightforward matrix multiplication requires $O(mnp)$ operations, though the best algorithm known so far (the Coppersmith-Winograd algorithm [13]) can perform this operation in $O(n^{2.376})$ for square matrices. Evaluating the diagonal entries $G_{ii} = \|a_i\|^2\, \|b_i\|^2$ in Algorithm 1 requires the computation of $n$ pairs of inner products of $m$- and $p$-dimensional vectors, and hence requires $O(n(m + p))$ operations. Extracting the $k$ largest elements of a set of size $n$, as is necessary to construct $S$, can be done efficiently using a variation on the Quicksort algorithm (see [14]) in expected time $O(n)$. The matrix $W$ is symmetric and its diagonal is a restriction of $\{G_{ii}\}$; hence forming it requires the computation of an additional $O(k^2)$ inner products, and thus $O(k^2 (m + p))$ operations. Evaluating $Y^T \mathbf{1}_{n-k}$ requires $O(k(n - k)(m + p))$ operations, taking into account the fact that terms of the sum also appear in $W$. Finally, evaluating $W^{-1}\big(Y^T \mathbf{1}_{n-k}\big)$ can be done using Gaussian elimination in $O(k^3)$ operations. Hence the overall complexity is given by $O(kn(m + p) + k^3)$.
References
 [1] A. M. Frieze, R. Kannan, and S. Vempala, “Fast Monte-Carlo algorithms for finding low-rank approximations,” in Proc. 39th IEEE Sympos. Found. Comp. Sci. (FOCS), 1998, pp. 370–378.
 [2] A. Deshpande and S. Vempala, “Adaptive sampling and fast low-rank matrix approximation,” in Proc. 10th Internat. Worksh. Randomizat. Computat. (RANDOM’06), 2006.
 [3] T. Sarlos, “Improved approximation algorithms for large matrices via random projection,” in Proc. 47th IEEE Sympos. Found. Comp. Sci. (FOCS), 2006, pp. 143–152.
 [4] M.-A. Belabbas and P. J. Wolfe, “Spectral methods in machine learning: New strategies for very large datasets,” submitted to Proc. Natl. Acad. Sci. USA, 2007.
 [5] P. Drineas, R. Kannan, and M. W. Mahoney, “Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication,” SIAM J. Comput., vol. 36, pp. 132–157, 2006.
 [6] R. I. Arriaga and S. Vempala, “An algorithmic theory of learning: Robust concepts and random projection,” in Proc. 40th IEEE Sympos. Found. Comp. Sci. (FOCS), 1999, pp. 616–623.
 [7] C. K. I. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Neural Information Processing Systems, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., pp. 682–688. MIT Press, 2001.
 [8] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral grouping using the Nyström method,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 2, pp. 214–225, 2004.
 [9] M.-A. Belabbas and P. J. Wolfe, “Fast low-rank approximation for covariance matrices,” in Proc. 2nd IEEE Intl. Worksh. Computat. Adv. Multi-Sensor Adapt. Process., 2007, in press.
 [10] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, 1999.
 [11] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, 1994.
 [12] D. E. Crabtree and E. V. Haynsworth, “An identity for the Schur complement of a matrix,” Proc. Amer. Math. Soc., vol. 22, pp. 364–366, 1969.
 [13] D. Coppersmith and S. Winograd, “Matrix multiplication via arithmetic progressions,” J. Symbol. Computat., vol. 9, no. 3, pp. 251–280, 1990.
 [14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, 2001.