The Input/Output Complexity of Sparse Matrix Multiplication^{†}^{†}thanks: This work is supported by the Danish National Research Foundation under the Sapere Aude program.
Abstract
We consider the problem of multiplying sparse matrices (over a semiring) where the number of nonzero entries is larger than main memory. In the classical paper of Hong and Kung (STOC ’81) it was shown that to compute a product of dense matrices, I/Os are necessary and sufficient in the I/O model with internal memory size and memory block size .
In this paper we generalize the upper and lower bounds of Hong and Kung to the sparse case. Our bounds depend of the number of nonzero entries in and , as well as the number of nonzero entries in .
We show that can be computed using I/Os, with high probability. This is tight (up to polylogarithmic factors) when only semiring operations are allowed, even for dense rectangular matrices: We show a lower bound of I/Os.
While our lower bound uses fairly standard techniques, the upper bound makes use of “compressed matrix multiplication” sketches, which is new in the context of I/Oefficient algorithms, and a new matrix product size estimation technique that avoids the “no cancellation” assumption.
1 Introduction
In this paper we consider the fundamental problem of multiplying matrices that are sparse, that is, the number of nonzero entries in the input matrices (but not necessarily the output matrix) is much smaller than the number of entries. Matrix multiplication is a fundamental operation in computer science and mathematics, due to the wide range of applications and reductions to it — e.g. computing the determinant and inverse of a matrix, or Gaussian elimination. Matrix multiplication has also seen lots of use in nonobvious applications such as bioinformatics [24], computing matchings [22, 18] and algebraic reasoning about graphs, e.g. cycle counting [2, 3].
Matrix multiplication in the general case has been widely applied and studied in a pure math context for decades. In an algorithmic context matrix multiplication is known to be computable using ring operations, for some constant between 2 and 3. The first improvement over the trivial cubic algorithm was achieved in in the seminal work of Strassen [23] showing and most recently Vassilevska Williams [26] improved this to .
Matrix multiplication over a semiring, where additive inverses cannot be used, is better understood. In the I/O model introduced by Aggarwal and Vitter [1] the optimal matrix multiplication algorithm for the dense case already existed (see Section 1.2) and since then sparsedense and sparsesparse combinations of vector and matrix products have been studied, e.g. in [6, 12, 19].
The main contribution of this paper is a tight bound for matrix multiplication over a semiring in terms of the number of nonzero entries in the input and output matrices, generalizing the classical result of Hong and Kung on dense matrices [14] to the sparse case.
1.1 Preliminaries
Let and be matrices of rows and columns and let every entry for semiring . Further for matrix let denote row of and let denote column of . The matrix product , where each entry , is given as . A nonzero term is referred to as an elementary product. We say that there is no cancellation of terms when implies that for all . For sparse semiring matrix multiplication, the number of entry pairs with nonzero product measures the number of operations performed up to a constant factor assuming optimal representation of the matrices. Specifically, let be the number of such nonzero pairs of matrix entries. Finally let denote the number of nonzero entries of matrix . When no explicit base is stated, logarithms in this paper are base .
External memory model. This model of computation [1] is an abstraction of a twolevel memory hierachy: We have an internal memory holding data items (“words”) and a disk of infinite size holding the remaining data. Transfers between internal memory and disk happen in blocks of words, and a word must be in internal memory to be manipulated. The cost of an algorithm in this model is the number of block transfers (I/Os) done by the algorithm. We will use as shorthand for the sorting complexity of data items in the external memory model and notation to suppress polylogarithmic factor in input size and matrix dimension .
We assume that a word is big enough to hold a matrix element from a semiring as well as the matrix coordiantes of that element, i.e., a block holds matrix elements. We restrict attention to algorithms that work with semiring elements as an abstract type, and can only copy them, and combine them using semiring operations. We refer to this restriction as the semiring I/O model. Our upper bound uses a slight extension of this model in which equality check is allowed, which allows us to take advantage of cancellations, i.e., inner products in the matrix product that are zero in spite of nonzero elementary products.
The problem we solve. Given matrices and containing and nonzero semiring elements, respectively, we wish to output a sparse representation of the matrix product in the external memory model. We are dealing with sparse matrices represented as a list of tuples of the form , where is a (nonzero) matrix entry. To produce output we must call a function for every nonzero entry of . We only allow to be called once on each output element, but impose no particular order on the sequence of outputs.
We note that the algorithm could be altered to write the entire output before termination by, instead of calling , simply writing the output element to a disk buffer, outputting all elements using additional I/Os. However, in some applications such as database systems (see [5]) there may not be a need to materialize the matrix product on disk, so we prefer the more general method of generating output.
1.2 Related work
The external memory model was introduced by Aggarwal and Vitter in their seminal paper [1], where they provide tight bounds for a collection of central problems.
An I/Ooptimal matrix multiplication algorithm for dense semiring matrices was achieved by Hong and Kung [14]: Group the matrices into submatrices where constant is picked such that three matrices fit into internal memory. This reduces the problem to matrix products that fit in main memory, costing I/Os each, and hence in total [10]. Hong and Kung also provided a tight lower bound that holds for algorithms that work over a semiring. (It does not apply to algorithms that make use of subtraction, such as fast matrix multiplication methods, for which the blocking method described above yields an I/O complexity of I/Os.)
For sparse matrix multiplication the previously best upper bound [5], shown for Boolean matrix products but claimed for any semiring, is .
It seems that this bound requires “no cancellation of terms” (or more specifically, the output sensitivity is with respect to the number of output entries that have a nonzero elementary product). Our new upper bound of this paper improves upon this: The Monte Carlo algorithm of Theorem 1.1 has strictly lower I/O complexity for the entire parameter space and makes no assumptions about cancellation.
An important subroutine in our algorithm is densevector sparse matrix multiplication: For a vector and sparse matrix we can compute their product using optimal I/Os [6]  this holds for arbitrary layouts of the vector and matrix on disk.
Our algorithm has an interesting similarity to Williams and Yu’s recent output sensitive matrix multiplication algorithm [25, Section 6]. Their algorithm works by splitting the matrix product into 4 submatrices of equal dimension, running a randomized test to determine which of these subproblems contain a nonzero entry. Recursing on the nonzero submatrices, they arrive at an output sensitive algorithm. We perform a similar recursion, but the splitting is computed differently in order to recurse in a balanced manner, such that each subproblem at a given level of the recursion outputs approximately the same number of entries in the matrix product.
Size estimation of the number of nonzeros in matrix products was used by Cohen [9, 8] to compute the order of multiplying several matrices to minimize the total number of operations. For constant error probability this algorithm uses operations in the RAM model to perform the size estimation. For Amossen et al [4] improved the running time to be expected in the RAM model and expected in the I/O model. Contrary to the approaches of [4, 9, 8] our new size estimation algorithm presented in Section 2 is able to deal with cancellation of terms, and it uses I/Os. Informally, the main idea of our size estimation algorithm is to multiply a sequence of vectors with certain properties onto but in the computationally inexpensive order , in order to produce linear sketches of the rows (columns) of .
1.3 New results
We present a new upper bound in the I/O model for sparse matrix multiplication over semirings. Our I/O complexity is at least a factor of roughly better than that of [5]. We show the following theorem:
Theorem 1.1
Let and be matrices with entries from a semiring , and let , . There exist algorithms (a) and (b) such that:

emits the set of nonzero entries of with probability at least , using I/Os.

emits the set of nonzero entries of , and uses I/Os.
For every and , using I/Os we can determine with probability at least if one of the two I/O bounds is significantly lower, i.e., distinguish between and .
The above theorem makes no assumptions about cancellation of terms. In particular, can be smaller than the number of output entries that have nonzero elementary products.
Our second main contribution is a new lower bound on sparse matrix multiplication in the semiring I/O model.
Theorem 1.2
For all positive integers and there exist matrices and with , , such that computing in the semiring I/O model requires I/Os.
Since we can determine and run the algorithm satisfying the minimum complexity of the lower bound, our bounds are tight.
Paper structure. Section 2 describes a new size estimation algorithm which we will use as a subprocedure for our sparse matrix multiplication algorithm. The new size estimation algorithm may be of independent interest since to the knowledge of the authors there are no published size estimation procedures that handle cancellation of terms. Section 3 first describes a simple output insensitive algorithm in Section 3.1, algorithm (b) of Theorem 1.1. Then we describe how algorithm (a) of Theorem 1.1 works: In Section 3.4 we describe how to divide the sparse matrix product into small enough subproblems (with respect to output size), and Section 3.3 desribes how a version of Pagh’s “compressed matrix multiplication” algorithm yields an I/O efficient algorithm for subproblems with a small output. Finally in Section 4 we show the new tight lower bound of Theorem 1.2.
2 Matrix output size estimation
We present a method to estimate column/row sizes of a matrix product , represented as a sparse matrix. In particular, for a column (or analogously row ) we are interested in estimating the number of nonzeros (. We note that there are no assumptions about (absence of) cancellation of terms in the following. We show the existence of the following algorithm.
Lemma 1
Let and be matrices with entries from semiring , and let . We can compute estimates using I/Os and RAM operations such that with probability at least it holds that for all .
We note that Lemma 1 by symmetry can give the same guarantees for rows of the matrix product, which is done analogously by applying the algorithm to the product . Further, from Lemma 1 we have, following from combining of all column estimates, an estimate of .
Corollary 1
Let and be matrices with entries from semiring , and let . We can compute in I/Os and RAM operations such that with probability at least it holds that .
At a high level, the algorithm is similar in spirit to Cohen [9, 8], but uses linear sketches (see e.g. [11, 15]) that serve the purpose of capturing cancellation of terms.
We will make use of a wellknown sketching method [11, 16], where denotes the number of nonzero entries in a vector . Let be a data stream of items of the form , where and . The stream defines a vector indexed by (which can also be thought of as a matrix), where entry is the sum of all ring elements that occurred with index in the stream.
For a matrix the number of distinct indices is the sum of distinct indices over all column vectors . One can compute in space [11, 16] a linear sketch over that can output a number , where with constant probability.
Highlevel algorithm description.
We compute a linear sketch followed by the matrix product . From for a given we can distinguish between a column having more than and less than nonzero entries  we repeat this procedure for suitable values of to achieve the final estimate. We use the following distinguishability result:
Fact 2.1
([16], Section 2.1) There exists a projection matrix such that for each frequency vector we can be estimate from . In particular, for fixed , with probability we can distinguish the cases and using space .
We will apply this distinguishability sketch to the columns of the product , since implies and analogously for the second case. This follows trivially from the definition of and the number of nonzeroes in a matrix product. From LFact 2.1 we have a sketch which multiplied with a matrix we can for the columns distinguish from with probability .
Proof
(Corollary 1) Let be a distinguishability sketch as described in LFact 2.1. To ensure that for every of the columns in we can distinguish the two cases with probability at least it is sufficient to invoke the algorithm from LFact 2.1 with . By the union bound over the error probabilities we have . By linearity of we have that from we can for all columns distinguish the cases and .
Also by linearity, the order of operations in the computation of is , hence the computation of can be seen as densevector sparsematrix multiplications.
For dense vector and sparse matrix we can compute in I/Os [6]. Letting computing for a value of has I/O complexity
(1) 
We note that for sparse matrices this bound is . Analogously, the number of RAM operations needed to compute for a specific is .
Since for a given we can now using I/Os given in (1) distinguish and we simply repeat this procedure for values , which yields a estimate of the number of nonzeroes in each column, from which we have the desired estimate of the total number of nonzeroes.
We note that the algorithm of Corollary 1 can be obtained using any linear sketch in I/O complexity , where is the space complexity of the sketch used.
3 Cacheaware upper bound
As in the previous section let and be matrices with entries from a semiring , and let be the input size.
3.1 Output insensitive algorithm
We first describe algorithm (a) of Theorem 1.1, which is insensitive to the number of output entries . It works as follows: First put the entries of in columnmajor order by lexicographic sorting. For every row of with more than nonzeros, compute the vectormatrix product in time using the algorithm of [6]. There can be at most such rows, so the total time spent on this is . The remaining rows of are then gathered in groups with between and nonzero entries per group. In a single scan of (using columnmajor order) we can compute the product of each such row with the matrix . The number of I/Os is for each of the at most groups, so the total complexity is .
3.2 Monte Carlo algorithm overview
We next describe algorithm (b) of Theorem 1.1 The algorithm works by first performing a step of color coding, the purpose of which is to split the matrix product into submatrices, each of which can be computed efficiently. Roughly, the idea is to color the rows of and columns of , forming submatrices and corresponding to each color, such that every matrix product has roughly nonzero elements. Then, a “compressed” matrix multiplication algorithm (described by Lemma 2) is used to compute every product by a single scan over the matrices and . The number of colors needed to achieve this, to be specified later, depends on an estimate of , found using Corollary 1. It turns out to simplify the algorithm if we deviate slightly from the above coloring, by using different colorings of the rows of for each . That is, we will work with different decompositions of , and compute products of the form . Also, there might be rows of and columns of that we cannot color because they generate too many entries in the output. However, it turns out that we can afford to handle such rows/columns in a direct way using vectormatrix multiplication.
3.3 Compressed matrix multiplication in the I/O model
Let be a suitably small constant, and define . We now describe an I/Oefficient algorithm for matrix products with nonzeros. If is stored in columnmajor order and is stored in rowmajor order, the algorithm makes just a single scan over the matrices.
The algorithm is a variation of the one found in [19], adapted to the semigroup I/O model. Specifically, for some constant and let be pairwise independent hash functions. The algorithm computes the following polynomials of degree at most :
It is not hard to see that the polynomial can be computed in a single scan over column of , using space . Similarly, we can compute the polynomial in space by scanning row of . As soon as both polynomials have been computed, we multiply them and add the result to the sum of products that will eventually be equal to . This requires additional space , for a total space usage of .
Though a computationally less expensive approach is described in [19], we present a simple method that (without using any I/Os) uses the polynomials , , to compute the set of entries in with probability . For every and , to compute the value of consider the coefficient of in , for . For suitably chosen , with probability the value is found in the majority of these coefficients. The majority coefficient can be computed using just equality checks among semigroup elements [7]. The analysis in [19] gives us, for a suitable choice of and , the following:
Lemma 2
Suppose matrix is stored in columnmajor order, and is stored in rowmajor order. There exists an algorithm in the semiring I/O model augmented with equality test, and an absolute constant , such that if the algorithm outputs the nonzero entries of with probability , using just a single scan over the input matrices.
3.4 Computing a balanced coloring
We wish to assign every row and column a color from . Let color set contain rows that are assigned color and for such a color assigned to rows of of let color set contain columns that are assigned color . Also, let be the input matrix restricted to contain only elements in rows from (and analogously for and ).
The goal of the coloring step is to assign the colors such that for every pair of color sets , it holds that . This can be seen as coloring the rows of once and the columns of times.
Lemma 3
Let and be matrices with nonzero entries.
Using I/Os a coloring with colors can be computed that assigns a
color to rows of and for each such color , assigns colors to columns of such that:

For every it holds that .

Rows from and columns form that are not in some color sets and has had their nonzero output entries emitted.
Proof
At a high level, the coloring will be computed by recursively splitting the matrix rows in two disjoint parts to form matrices and where contains the nonzeros from the first rows, for some , and contains the nonzeros from the last rows. Row number , the “splitting row”, will be removed from consideration by generating the corresponding part of the output using I/Oefficient vectormatrix multiplication. We wish to choose such that:

.

.
And after recursive levels of such splits, we will have disjoint sets of rows from . For each such set we then compute disjoint column sets of in the same manner, and we argue below that this gives us subproblems with output size , where each subproblem corresponds exactly to a pair of color sets as described above.
In order to compute the row number around which to perform the split, we invoke the estimation algorithm from Corollary 1 with such that for every row in we have access to an estimate where it holds with probability at least (for fixed chosen to get sufficiently low error probability):
(2) 
In particular for any set of rows we have that
(3) 
We will now argue that if we can create a split of the rows such that (1) and (2) hold, then when the splitting procedure terminates after recursive levels, we have that for each pair of colors it is the case that . Consider the case where each split is done with the maximum positive error possible, i.e., on recursive level we have divided the nonzeros into subproblems where each are of size at most . Remember that is the number of colors. After recursive levels we have subproblem size:
(4)  
(5)  
(6) 
The main observation to see that we get the right subproblem size as in (6) is that for each recursion we decrease the output size by a factor . For (4) we use and (5) follows from . The analysis for the case where each split is done with the maximum negative error possible is analogous and thus omitted.
We will now argue that with access to the estimates as in (2) we can always construct a split such that (1) and (2) hold. Let partitions 1 and 2 be denoted and and be the estimate of the total number of outputs for the current subproblem. Create by examining rows one at a time. If the estimated number of nonzeros of is less than then add to . Otherwise perform densevector sparsematrix multiplication using I/Os [6]and emit every nonzero of that product  this eliminates the row vector from matrix as all outputs generated by row has now been emitted. Because of (3) we have that the remaining rows of can now be placed in partition and the sum of their outputs will be at most . The procedure and analysis is equivalent for the case of columns. From (6) we had that even with splits of nonzeros then the subproblem size is the desired after all splits are done.
In terms of I/O complexity consider first the coloring of all rows in . First we perform the size estimates of Corollary 1 in such that we know where to split. Then we perform splits and each split also outputs the output entries for a specific row using densevector sparsematrix multiplication, hence this split takes I/Os. Finally for each of the sets of rows of we partition columns of in the same manner, first by invoking size estimations taking due to the sum of the nonzeros in the subproblems being at most . Then for each of the row sets we perform splits and output a column from . This step takes time and hence in total we use
3.5 I/O Complexity Analysis
Next, we will use Lemma 3 for the algorithm that shows part (b) of Theorem 1.1.
We summarize the steps taken and their cost in the external memory model.
Proof
(Theorem 1.1, part (b)) The algorithm first estimates with parameters and which by Corollary 1 uses I/Os. We then perform the coloring, outputting some entries of and dividing the remaining entries into balanced sets for . By Lemma 3 this uses I/Os. Finally we invoke the compressed matrix multiplication algorithm from Lemma 2 on each subproblem. This is possible since each subproblem has at most nonzeros entries in the output. The total cost of this is I/Os, since each nonzero entry in and is part of at most products, and the cost of each product is simply the cost of scanning the input.
4 Lower bound
Our lower bound generalizes that of Hong and Kung [14] on the I/O complexity of dense matrix multiplication. We extend the technique of [14] while taking inspiration from lower bounds in [13, 21, 20]. The closest previous work is the lower bound in [20] on the I/O complexity of triangle enumeration in a graph, but new considerations are needed due to the fact that cancellations can occur.
Like the lower bound of Hong and Kung [14], our lower bound holds in a semiring model where:

A memory block holds up to matrix entries (from the semiring), and internal memory can hold memory blocks.

Semiring elements can be multiplied and added, resulting in new semiring elements.

No other operations on semiring elements (e.g. division, subtraction, or equality test) are allowed.
The model allows us to store sparse matrices by listing just nonzero matrix entries and their coordinates. We note that our algorithm respects the constraints of the semiring model with one small exception: It uses equality checks among semiring elements.
We require the algorithm to work for every semiring, and in particular over fields of infinite size such as the real numbers, and for arbitrary values of nonzero entries in and . Since only addition and multiplication are allowed, we can consider each output value as a polynomial over nonzero entries of the input matrices. By the SchwartzZippel theorem [17, Theorem 7.2] we know that two polynomials agree on all inputs if and only if they are identical. Since we are working in the semiring model, the only way to get the term in an output polynomial is to directly multiply these input entries. That means that to compute an output entry we need to compute a polynomial that is identical to the sum of elementary products . It is possible that the computation of this polynomial involves other nonzero terms, but these are required to cancel out.
We now argue that for every and there exist matrices and with and , for which every execution of an external memory algorithm in the semiring model must use I/Os. Our lower bound holds for the best possible execution, i.e., even if the algorithm has been tailored to the structure of the input matrices.
The hard instance for the lower bound is a dense matrix product, which maximizes the number of elementary products. In particular, since we ignore constant factors we may assume that and are integers. Let be a by dense matrix, and let be a by dense matrix. Without loss of generality, every semiring element that is stored during the computation is either: 1) An input entry, or 2) Part of a sum that will eventually be emitted as the value of a unique nonzero element .
This is because these are the only values that can be used to compute an output entry (making use of the fact that additive and multiplicative inverses cannot be computed). This implies that every output entry can be traced through the computation, and it is possible to pinpoint the time in the execution where an elementary product is computed and stored in internal memory.
We use the following lemma from [13]:
Lemma 4
In space the number of elementary products that can be computed and stored is at most .
Following [20], observe that any execution of an I/O efficient algorithm can be split into phases of I/Os. By doubling the memory size to we find an equivalent execution where every read I/O happens at the beginning of the phase (before any processing takes place), and every write I/O happens at the end of the phase. For every phase we can therefore identify the set of at most input and output entries that involved in the phase.
If all values needed for emitting a particular output entry are present in a phase there may not be any storage location that can be associated with it. We first account for such direct outputs: Each direct output requires two vectors of length to be stored in main memory. In each phase we can store at most such vectors, resulting in at most output pairs. So the number of phases needed to emit, say, outputs would be at least , using I/Os. This means that to output a substantial portion of in this way we need at least this number of I/Os.
Next, we focus on output entries for which an elementary product is written to disk in some phase. By Lemma 4 the number of elementary products computed and stored is at most . If the total number of elementary products is then we need at least phases of I/Os each. Considering output entries in our hard instance, these contain elementary products.
Since outputs are needed either in the direct or the indirect way, the number of I/Os needed becomes the minimum of the two lower bounds we get Theorem 1.2.
References
 [1] A. Aggarwal and J. S. Vitter. The Input/Output complexity of sorting and related problems. Commun. ACM, 31(9):1116–1127, 1988.
 [2] N. Alon, R. Yuster, and U. Zwick. Colorcoding. J. ACM, 42(4):844–856, July 1995.
 [3] N. Alon, R. Yuster, and U. Zwick. Finding and counting given length cycles. Algorithmica, 17(3):209–223, 1997.
 [4] R. R. Amossen, A. Campagna, and R. Pagh. Better size estimation for sparse matrix products. APPROX/RANDOM’10, pages 406–419, Berlin, Heidelberg, 2010. SpringerVerlag.
 [5] R. R. Amossen and R. Pagh. Faster joinprojects and sparse matrix multiplications. In Proceedings of the 12th International Conference on Database Theory, ICDT ’09, pages 121–126, New York, NY, USA, 2009. ACM.
 [6] M. Bender, G. Brodal, R. Fagerberg, R. Jacob, and E. Vicari. Optimal sparse matrix dense vector multiplication in the I/Omodel. Theory of Computing Systems, 47(4):934–962, 2010.
 [7] R. S. Boyer and J. S. Moore. MJRTY  A fast majority vote algorithm. Technical Report AI8132, The University of Texas at Austin, Department of Computer Sciences, Feb. 1 1981.
 [8] E. Cohen. Estimating the size of the transitive closure in linear time. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, SFCS ’94, pages 190–200, Washington, DC, USA, 1994. IEEE Computer Society.
 [9] E. Cohen. Structure prediction and computation of sparse matrix products. Journal of Combinatorial Optimization, 2(4):307–332, 1998.
 [10] E. D. Demaine. Cacheoblivious algorithms and data structures. In Lecture Notes from the EEF Summer School on Massive Data Sets. BRICS, University of Aarhus, Denmark, June 27–July 1 2002.
 [11] P. Flajolet and G. N. Martin. Probabilistic Counting Algorithms for Data Base Applications. Journal of Computer and System Sciences.
 [12] G. Greiner and R. Jacob. The i/o complexity of sparse matrix dense matrix multiplication. In LATIN 2010: Theoretical Informatics, pages 143–156. Springer, 2010.
 [13] D. Irony, S. Toledo, and A. Tiskin. Communication lower bounds for distributedmemory matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–1026, 2004.
 [14] H. JiaWei and H. T. Kung. I/o complexity: The redblue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, STOC ’81, pages 326–333, New York, NY, USA, 1981. ACM.
 [15] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twentyninth ACM SIGMODSIGACTSIGART Symposium on Principles of Database Systems, PODS ’10, pages 41–52, New York, NY, USA, 2010. ACM.
 [16] A. McGregor. Algorithms for Signals, book draft. 2013.
 [17] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New York, NY, USA, 1995.
 [18] K. Mulmuley, U. Vazirani, and V. Vazirani. Matching is as easy as matrix inversion. Combinatorica, 7(1):105–113, 1987.
 [19] R. Pagh. Compressed matrix multiplication. ACM Trans. Comput. Theory, 5(3):9:1–9:17, Aug. 2013.
 [20] R. Pagh and F. Silvestri. The input/output complexity of triangle enumeration. arXiv preprint arXiv:1312.0723, 2013.
 [21] A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Spaceround tradeoffs for mapreduce computations. In Proceedings of the 26th ACM international conference on Supercomputing, pages 235–244. ACM, 2012.
 [22] M. O. Rabin and V. V. Vazirani. Maximum matchings in general graphs through randomization. J. Algorithms, 10(4):557–567, Dec. 1989.
 [23] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.
 [24] S. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.
 [25] R. Williams and H. Yu. Finding orthogonal vectors in discrete structures, chapter 135, pages 1867–1877.
 [26] V. V. Williams. Multiplying matrices faster than coppersmithwinograd. In Proceedings of the Fortyfourth Annual ACM Symposium on Theory of Computing, STOC ’12, pages 887–898, New York, NY, USA, 2012. ACM.