Asymptotically tight lower bounds are derived for the I/O complexity of a general class of hybrid algorithms computing the product of square matrices, combining the "Strassen-like" fast matrix multiplication approach with computational complexity Θ(n^{log_2 7}) and "standard" matrix multiplication algorithms with computational complexity Θ(n^3). We present a novel and tight lower bound for the I/O complexity of a class of "uniform, non-stationary" hybrid algorithms when executed in a two-level storage hierarchy with M words of fast memory, where n_0 denotes the threshold size of subproblems which are computed using standard algorithms with algebraic complexity Θ(n^3).
The lower bound is actually derived for the more general class of "non-uniform, non-stationary" hybrid algorithms, which allow recursive calls to have a different structure even when they refer to the multiplication of matrices of the same size and in the same recursive level, although the quantitative expressions become more involved. Our results are the first I/O lower bounds for these classes of hybrid algorithms. All presented lower bounds apply even if the recomputation of partial results is allowed and are asymptotically tight.
The proof technique combines the analysis of Grigoriev's flow of the matrix multiplication function, combinatorial properties of the encoding functions used by fast Strassen-like algorithms, and an application of the Loomis-Whitney geometric theorem for the analysis of standard matrix multiplication algorithms. Extensions of the lower bounds to a parallel model with P processors are also discussed.
1 Introduction
Data movement plays a critical role in the performance of computing systems, in terms of both time and energy. This technological trend [28] appears destined to continue, as physical limitations on minimum device size and on maximum message speed lead to inherent costs when moving data, whether across the levels of a hierarchical memory system or between processing elements of a parallel system [10]. While the communication requirements of algorithms have been widely investigated in the literature, obtaining significant and tight lower bounds based on such requirements remains an important and challenging task.
In this paper, we focus on the I/O complexity of a general class of hybrid algorithms for computing the product of square matrices. Such algorithms combine fast algorithms with 2×2 base case similar to Strassen's matrix multiplication algorithm [35], with algebraic (or computational) complexity Θ(n^{log_2 7}), with standard (or classic) matrix multiplication algorithms with algebraic complexity Θ(n^3). Further, these algorithms allow recursive calls to have a different structure, even when they refer to the multiplication of matrices in the same recursive level and of the same input size. These algorithms are referred to in the literature as "non-uniform, non-stationary". This class includes, for example, algorithms that optimize for input sizes [15, 16, 20]. Matrix multiplication is a pervasive primitive utilized in many applications.
While of actual practical importance, to the best of our knowledge, no characterization of the I/O complexity of such algorithms has been presented before this work. This is likely due to the fact that the irregular nature of hybrid algorithms and, hence, the irregular structure of the corresponding Computational Directed Acyclic Graphs (CDAGs), complicates the analysis of the combinatorial properties of the CDAG, which is the foundation of many of the I/O lower bound techniques presented in the literature (e.g., [8, 18, 29]).
The technique used in this work overcomes such challenges and yields asymptotically tight I/O lower bounds which hold even if recomputation of intermediate values is allowed.
Previous and Related Work:
Strassen [35] showed that two n × n matrices can be multiplied with O(n^{log_2 7}) operations, where log_2 7 ≈ 2.81, hence with asymptotically fewer than the n^3 arithmetic operations required by the straightforward implementation of the definition of matrix multiplication. This result has motivated a number of efforts which have led to increasingly faster algorithms, at least asymptotically, with the current record being O(n^{2.373}) [24].
I/O complexity has been introduced in the seminal work by Hong and Kung [18]; it is essentially the number of data transfers between the two levels of a memory hierarchy with a fast memory of M words and a slow memory with an unbounded number of words. Hong and Kung presented techniques to develop lower bounds to the I/O complexity of computations modeled by computational directed acyclic graphs (CDAGs). The resulting lower bounds apply to all the schedules of the given CDAG, including those with recomputation, that is, where some vertices of the CDAG are evaluated multiple times. Among other results, they established an Ω(n^3/√M) lower bound to the I/O complexity of standard, definition-based matrix multiplication algorithms, which matched a known upper bound [13]. The techniques of [18] have also been extended to obtain tight communication bounds for the definition-based matrix multiplication in some parallel settings [4, 21, 33] and for the special case of "sparse matrix multiplication" [27]. Ballard et al. generalized the results on matrix multiplication of Hong and Kung [18] in [7, 6] by using the approach proposed in [21] based on the Loomis-Whitney geometric theorem [25, 36].
In an important contribution, Ballard et al. [8] obtained an I/O lower bound for Strassen's algorithm using the "edge expansion approach". The authors extended their technique to a class of "Strassen-like" fast multiplication algorithms and to fast recursive multiplication algorithms for rectangular matrices [5]. This result was later generalized to increasingly broader classes of "Strassen-like" algorithms by Scott et al. [31] using the "path routing" technique, and by De Stefani [14] using a combination of the concept of Grigoriev's flow of a function and the "dichotomy width" technique [9]. While the previously mentioned results hold only under the restrictive assumption that no intermediate result may be computed more than once (i.e., the no-recomputation assumption), in [11] Bilardi and De Stefani introduced the first asymptotically tight I/O lower bound which holds if recomputation is allowed. Their technique was later extended to the analysis of Strassen-like algorithms with larger base cases [26], and to the analysis of Toom-Cook integer multiplication algorithms [12].
A parallel, “communication avoiding” implementation of Strassen’s algorithm whose performance matches the known lower bound [8, 31], was proposed by Ballard et al. [3]. A communication efficient algorithm for the special case of sparse matrices based on Strassen’s algorithm was presented in [22].
In [32], Scott derived a lower bound for the I/O complexity of a class of uniform, non-stationary algorithms combining Strassen-like algorithms with recursive standard algorithms. This result holds only under the restrictive no-recomputation assumption.
To the best of our knowledge, ours is the first work presenting asymptotically tight I/O lower bounds for non-uniform, non-stationary hybrid algorithms for matrix multiplication that hold when recomputation is allowed.
Our results:
We present the first I/O lower bound for a class ℋ of non-uniform, non-stationary hybrid matrix multiplication algorithms when executed in a two-level storage hierarchy with M words of fast memory. Algorithms in ℋ combine fast Strassen-like algorithms with 2×2 base case and algebraic complexity Θ(n^{log_2 7}), and standard algorithms based on the definition with algebraic complexity Θ(n^3). These algorithms allow recursive calls to have a different structure, even when they refer to the multiplication of matrices in the same recursive level and of the same input size. The result in Theorem 7 relates the I/O complexity of algorithms in ℋ to the number and the input size of a suitably selected set of the subproblems generated by the algorithms themselves.
We also present, in Theorem 9, a novel lower bound for the I/O complexity of algorithms in a subclass ℋ_{n_0} of ℋ composed of uniform, non-stationary hybrid algorithms, where n_0 denotes the threshold size of subproblems which are computed using standard algorithms with algebraic complexity Θ(n^3).
The previous result by Scott [32] covers only a subclass of ℋ_{n_0}, composed of uniform, non-stationary algorithms combining Strassen-like algorithms with the recursive standard algorithm, and holds only assuming that no intermediate value is recomputed. Instead, all our bounds allow for recomputation of intermediate values and are asymptotically tight. As the matching upper bounds do not recompute any intermediate value, we conclude that using recomputation may reduce the I/O complexity of the considered classes of hybrid algorithms by at most a constant factor.
Our proof technique is of independent interest since it exploits to a significant extent the "divide and conquer" nature exhibited by many algorithms. Our approach combines elements from the "G-flow" I/O lower bound technique originally introduced by Bilardi and De Stefani with an application of the Loomis-Whitney geometric theorem, which has been used by Irony et al. to study the I/O complexity of standard matrix multiplication algorithms [21], in order to recover information related to the concept of Minimum set introduced in Hong and Kung's method. We follow the dominator set approach pioneered by Hong and Kung in [18]. However, we focus the dominator analysis only on a select set of target vertices which, depending on the algorithm structure, correspond either to the outputs of the sub-CDAGs associated with subproblems of a suitable size (i.e., chosen as a function of the fast memory capacity M) computed using a fast Strassen-like algorithm, or to the vertices corresponding to the elementary products evaluated by a standard (definition-based) matrix multiplication algorithm.
We derive our main results for the hierarchical memory model (or external memory model). Our results generalize to parallel models with P processors. For these parallel models, we derive lower bounds for the "bandwidth cost", that is, for the number of messages (and, hence, the number of memory words) that must be either sent or received by at least one processor during the CDAG evaluation.
Paper organization:
In Section 2 we outline the notation and the computational models used in the rest of the presentation. In Section 3 we rigorously define the class ℋ of hybrid matrix multiplication algorithms being considered. In Section 4 we discuss the construction and important properties of the CDAGs corresponding to algorithms in ℋ. In Section 5 we introduce the concept of Maximal SubProblem (MSP) and describe the properties which lead to the I/O lower bounds for algorithms in ℋ presented in Section 6.
2 Preliminaries
We consider algorithms that compute the product C = AB of two square matrices A and B with entries from a ring R. To each matrix we associate a set of variables, one for each of its entries. We refer to the number of entries of a matrix as its "size". We denote the entry on the i-th row and j-th column of matrix A as A[i][j].
In this work we focus on algorithms whose execution can be modeled as a computational directed acyclic graph (CDAG) G = (V, E). Each vertex v ∈ V represents either an input value or the result of a unit-time operation (i.e., an intermediate result or one of the output values) which is stored using a single memory word. For example, each of the input (resp., output) vertices of G corresponds to one of the entries of the factor matrices A and B (resp., to the entries of the product matrix C). The directed edges in E represent data dependencies. That is, two vertices u and v are connected by an edge directed from u to v if and only if the value corresponding to u is an operand of the unit-time operation which computes the value corresponding to v. A directed path connecting vertices u and v is an ordered sequence of vertices starting with u and ending with v, such that there is an edge in E directed from each vertex in the sequence to its successor.
We say that G' = (V', E') is a sub-CDAG of G = (V, E) if V' ⊆ V and E' ⊆ E. Note that, according to this definition, every CDAG is a sub-CDAG of itself. We say that two sub-CDAGs G' = (V', E') and G'' = (V'', E'') of G are vertex disjoint if V' ∩ V'' = ∅. Analogously, two directed paths in G are vertex disjoint if they do not share any vertex.
When analyzing the properties of CDAGs we make use of the concept of dominator set originally introduced in [18]. We use the following – slightly different – definition:
Definition 1 (Dominator set).
Given a CDAG G = (V, E), let I ⊆ V denote the set of its input vertices. A set D ⊆ V is a dominator set for V' ⊆ V with respect to I if every path from a vertex in I to a vertex in V' contains at least a vertex of D. When V' = V, D is simply referred to as "a dominator set for G".
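The dominator property can be checked mechanically on a small CDAG: D dominates V' with respect to the inputs I exactly when, after the vertices of D are removed, no vertex of V' remains reachable from I. A minimal sketch of this check (the graph representation and names are illustrative, not taken from the paper):

```python
from collections import deque

def is_dominator(edges, inputs, targets, candidate):
    """Return True iff `candidate` is a dominator set for `targets`
    with respect to `inputs`: every directed path from an input
    vertex to a target vertex must contain a vertex of `candidate`.
    Equivalently, with `candidate` deleted, no target is reachable
    from any input."""
    blocked = set(candidate)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    # BFS from all unblocked input vertices, never entering blocked ones.
    frontier = deque(v for v in inputs if v not in blocked)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in targets:
            return False  # found an input-to-target path avoiding `candidate`
        for w in adj.get(u, []):
            if w not in blocked and w not in seen:
                seen.add(w)
                frontier.append(w)
    return True
```

Note that a target vertex belonging to the candidate set is covered automatically, since every path ending at it then contains a candidate vertex.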
I/O model and machine models:
We assume that sequential computations are executed on a system with a two-level memory hierarchy, consisting of a fast memory or cache of size M (measured in memory words) and a slow memory of unlimited size. An operation can be executed only if all its operands are in cache. We assume that each entry of the input matrices and of the intermediate results (including entries of the output matrix) is maintained in a single memory word (the results trivially generalize if multiple memory words are used).
Data can be moved from the slow memory to the cache by read operations and, in the other direction, by write operations. These operations are also called I/O operations. We assume the input data to be stored in slow memory at the beginning of the computation. The evaluation of a CDAG in this model can be analyzed by means of the “redblue pebble game” [18]. The number of I/O operations executed when evaluating a CDAG depends on the “computational schedule”, that is, it depends on the order in which vertices are evaluated and on which values are kept in/discarded from cache.
The I/O complexity of a CDAG is defined as the minimum number of I/O operations over all possible computational schedules. We further consider a generalization of this model known as the "External Memory Model" by Aggarwal and Vitter [2], where B ≥ 1 values stored in consecutive slow memory locations can be moved between cache and slow memory with a single I/O operation. For B = 1, this model clearly reduces to the red-blue pebble game.
Given an algorithm, we only consider "parsimonious execution schedules", that is, schedules such that: (i) each time an intermediate result (excluding the output entries of C) is computed, such value is then used to compute at least one of the values of which it is an operand before being removed from the memory (either the cache or slow memory); and (ii) any time an intermediate result is read from slow memory to cache, such value is then used to compute at least one of the values of which it is an operand before being removed from the memory or moved back to slow memory using a write I/O operation. Clearly, any non-parsimonious schedule can be reduced to a parsimonious schedule by removing all the steps which violate the definition of parsimonious computation; the reduced schedule has no more computational and I/O operations than the original one. Hence, restricting the analysis to parsimonious computations leads to no loss of generality.
We also consider a parallel model where P processors, each with a local memory of size M, are connected by a network. We do not, however, make any assumption on the initial distribution of the input data nor regarding the balance of the computational load among the processors. Processors can exchange point-to-point messages, with every message containing up to B memory words. For this parallel model, we derive lower bounds for the number of messages that must be either sent or received by at least one processor during the CDAG evaluation. The notion of "parsimonious execution schedules" straightforwardly extends to this parallel model.
3 Hybrid matrix multiplication algorithms
In this work, we consider a family of hybrid matrix multiplication algorithms obtained by hybridizing the following two classes of algorithms:
Standard matrix multiplication algorithms: This class includes all the square matrix multiplication algorithms which, given the input factor matrices A and B, satisfy the following properties:

The n³ elementary products A[i][k] · B[k][j], for i, j, k ∈ {1, 2, …, n}, are directly computed;

Each of the entries C[i][j] is computed by summing the values of the elementary products A[i][k] · B[k][j], for k ∈ {1, 2, …, n}, through a summation tree by additions and subtractions only;

The evaluations of the C[i][j]'s are independent of each other. That is, the internal vertex sets of the summation trees of all the C[i][j]'s are disjoint from each other.
This class, also referred to in the literature as classic, naive or conventional algorithms, corresponds to the class studied by Hong and Kung [18] (for the sequential setting) and by Irony et al. [21] (for the parallel setting).
Algorithms in this class have computational complexity Θ(n³). This class includes, among others, the sequential iterative definition-based algorithm, the sequential recursive divide-and-conquer algorithm based on block partitioning, and parallel algorithms such as Cannon's "2D" algorithm [13], the "2.5D" algorithm by Solomonik and Demmel [34], and "3D" algorithms [1, 23].
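As a concrete illustration of the defining properties above, the following sketch computes every elementary product A[i][k]·B[k][j] directly and accumulates the products of each output entry independently (a generic illustration of the class, not any specific cited algorithm):

```python
def standard_matmul(A, B):
    """Definition-based matrix multiplication: all n^3 elementary
    products A[i][k]*B[k][j] are computed directly, and each C[i][j]
    is obtained by summing its own n elementary products, independently
    of the other entries."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]  # one elementary product
    return C
```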
Fast Strassen-like matrix multiplication algorithms with 2×2 base case: This class includes algorithms following a structure similar to that of Strassen's algorithm [35] (see Appendix B) and Winograd's variation [37] (which reduces the leading coefficient of the arithmetic complexity from 7 to 6). Algorithms in this class generally follow three steps:

Encoding: Generate the inputs, of size n/2 × n/2, of seven subproblems as linear sums of the input matrices;

Recursive multiplications: Compute (recursively) the seven generated matrix multiplication subproducts;

Decoding: Compute the entries of the product matrix via linear combinations of the outputs of the seven subproblems.
Algorithms in this class have algebraic complexity Θ(n^{log_2 7}), which is optimal for algorithms with 2×2 base case [37].
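For concreteness, one level of Strassen's original 2×2 base case [35] can be sketched as follows: the linear combinations forming the operands of m1, …, m7 constitute the encoding step, the seven multiplications are the recursive subproblems (scalar products stand in for the recursively multiplied blocks), and the final linear combinations are the decoding step:

```python
def strassen_2x2(A, B):
    """One level of Strassen's 2x2 scheme: 7 multiplications
    instead of the 8 used by the definition-based algorithm."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # Encoding + the seven (sub)products.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Decoding: linear combinations of the seven products.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```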
Remarkably, the only properties of relevance for the analysis of the I/O complexity of algorithms in these classes are those used in the characterization of the classes themselves.
In this work we consider a general class ℋ of non-uniform, non-stationary hybrid square matrix multiplication algorithms, which allow mixing of schemes from the fast Strassen-like class with algorithms from the standard class. Given an algorithm, let P denote the problem corresponding to the computation of the product of the input matrices A and B. Consider an "instruction function" which, given P as input, returns either (a) an indication of the algorithm from the standard class which is to be used to compute P, or (b) an indication of the fast Strassen-like algorithm to be used to recursively generate seven subproblems, together with the instruction functions for each of the seven subproblems. We refer to the class of non-uniform, non-stationary algorithms which can be characterized by means of such instruction functions as ℋ. Algorithms in ℋ allow recursive calls to have a different structure, even when they refer to the multiplication of matrices in the same recursive level. E.g., some of the subproblems with the same size may be computed using algorithms from the standard class while others may be computed using recursive algorithms from the fast class. This class includes, for example, algorithms that optimize for input sizes (for sizes that are not an integer power of a constant integer).
We also consider a subclass ℋ_{n_0} of ℋ constituted by uniform, non-stationary hybrid algorithms which allow mixing of schemes from the fast Strassen-like class for the initial recursion levels, and then cut the recursion off once the size of the generated subproblems is smaller than or equal to a set threshold n_0, switching to algorithms from the standard class. Algorithms in this class are uniform, i.e., subproblems of the same size are all either recursively computed using a scheme from the fast class, or all computed using algorithms from the standard class.
This corresponds to scenarios of actual practical interest, as the use of Strassen-like algorithms is mostly beneficial for large input sizes. As the size of the input of the recursively generated subproblems decreases, the asymptotic benefit of fast algorithms is lost due to the increasing relative impact of the constant multiplicative factors, and algorithms in the standard class exhibit lower actual algebraic complexity. For a discussion of such hybrid algorithms and their implementation issues we refer the reader to [16, 20] (sequential model) and [15] (parallel model).
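A uniform, non-stationary hybrid of this kind can be sketched as follows: the Strassen recursion is cut off once the operand side reaches a threshold n0, below which the standard definition-based algorithm takes over (a minimal sketch assuming power-of-two matrix sides; the names are illustrative):

```python
def hybrid_matmul(A, B, n0):
    """Uniform, non-stationary hybrid: Strassen-like recursion is cut
    off once the operand side is <= n0, where the standard
    definition-based algorithm takes over."""
    n = len(A)
    if n <= n0:  # threshold reached: switch to the standard class
        return [[sum(A[i][k] * B[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    h = n // 2
    def quad(M):  # split M into four blocks of side h
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    a11, a12, a21, a22 = quad(A)
    b11, b12, b21, b22 = quad(B)
    # Encoding + seven recursive multiplications (Strassen's scheme).
    m1 = hybrid_matmul(add(a11, a22), add(b11, b22), n0)
    m2 = hybrid_matmul(add(a21, a22), b11, n0)
    m3 = hybrid_matmul(a11, sub(b12, b22), n0)
    m4 = hybrid_matmul(a22, sub(b21, b11), n0)
    m5 = hybrid_matmul(add(a11, a12), b22, n0)
    m6 = hybrid_matmul(sub(a21, a11), add(b11, b12), n0)
    m7 = hybrid_matmul(sub(a12, a22), add(b21, b22), n0)
    # Decoding: recombine the seven block products.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    return ([r1 + r2 for r1, r2 in zip(c11, c12)] +
            [r1 + r2 for r1, r2 in zip(c21, c22)])
```

Choosing n0 as a function of the machine (rather than asymptotically) is precisely the practical tuning question discussed in [15, 16, 20].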
4 The CDAG of algorithms in ℋ
Let G denote the CDAG that corresponds to an algorithm in ℋ used to multiply input matrices A and B. The challenge in the characterization of G comes from the fact that, rather than considering a single algorithm, we want to characterize the CDAGs corresponding to the entire class ℋ. Further, the class ℋ is composed of a rich variety of vastly different and irregular algorithms. Despite such variety, we show a general template for the construction of G and we identify some of its properties which crucially hold regardless of the implementation details of the algorithm and, hence, of G.
Construction:
G can be obtained by using a recursive construction that mirrors the recursive structure of the algorithm itself. Let P denote the entire matrix multiplication problem computed by the algorithm. Consider first the case for which, according to the instruction function, P is to be computed using an algorithm from the standard class. As we do not fix a specific algorithm, we do not correspondingly have a fixed CDAG. The only feature of interest for the analysis is that, in this case, the CDAG corresponds to the execution of an algorithm from the standard class for input matrices of size n × n.
Consider instead the case for which, according to the instruction function, P is to be computed using an algorithm from the fast class. In the base case n = 2, the problem is computed without generating any further subproblems. As an example, we present in Figure 0(a) the base case for Strassen's original algorithm [35]. Otherwise, the instruction function specifies the divide-and-conquer scheme to be used to generate the seven subproblems, and the instruction function for each of them. The sub-CDAGs of G corresponding to each of the seven subproblems are constructed according to the respective instruction functions, following recursively the steps discussed previously. G can then be constructed by composing the seven sub-CDAGs. Disjoint copies of an encoder sub-CDAG are used to connect the input vertices of G, which correspond to the values of the input matrix A (resp., B), to the appropriate input vertices of the seven sub-CDAGs; the output vertices of the sub-CDAGs (which correspond to the outputs of the seven subproducts) are connected to the appropriate output vertices of the entire CDAG using copies of the decoder sub-CDAG. We present an example of such recursive construction in Figure 0(b).
Properties of G:
While the actual internal structure of G, and, in particular, the structure of the encoder and decoder sub-CDAGs, depends on the specific Strassen-like algorithm being used, all versions share some properties of great importance. Let Enc denote an encoder CDAG for the base case of a fast multiplication algorithm, with X (resp., Y) denoting the set of its input (resp., output) vertices, and with its edges directed from X towards Y.
Lemma 1 (Lemma 3.3 [26]).
Let Enc denote an encoder graph for a fast matrix multiplication algorithm with 2×2 base case. There are no two vertices in Y with identical neighbor sets.
While the correctness of this lemma can be simply verified by inspection in the case of Strassen's algorithm [35], Lemma 1 generalizes the statement to all encoders corresponding to fast matrix multiplication algorithms with 2×2 base case. From Lemma 1 we have:
Lemma 2.
Let P' and P'' be any two subproblems generated by an algorithm in ℋ, each with input size greater than that of the base case, such that P' is not recursively generated while computing P'' and vice versa. Then, the sub-CDAGs of G corresponding, respectively, to P' and to P'' are vertex disjoint.
The following lemma, originally introduced for Strassen's algorithm in [11] and then generalized to Strassen-like algorithms with larger base cases in [26], captures a connectivity property of encoder sub-CDAGs.
Lemma 3 (Lemma 3.1 [26]).
Given an encoder CDAG for any Strassen-like algorithm with 2×2 base case, for any subset Y' of its output vertices, there exists a subset X' of its input vertices, with |X'| ≥ ⌈|Y'|/2⌉, such that there exist |X'| vertex-disjoint paths connecting the vertices in X' to vertices in Y'.
5 Maximal subproblems and their properties
For an algorithm in ℋ, let P denote a subproblem generated by the algorithm. In our presentation we consider the entire matrix multiplication problem as an improper subproblem generated by the algorithm. We refer to the ancestor subproblems of P as the sequence of subproblems P_1, P_2, …, P_k generated by the algorithm such that P_{i+1} was recursively generated to compute P_i, for i = 1, 2, …, k−1, and such that P was recursively generated to compute P_k. Clearly, if P is the entire problem, then P has no ancestors.
Towards studying the I/O complexity of algorithms in ℋ, we focus on the analysis of a particular set of subproblems.
Definition 2 (Maximal SubProblems (MSP)).
Let an algorithm in ℋ be used to multiply matrices A and B. If the input size of the entire problem is below the memory-dependent threshold used in the following, we say that the algorithm does not generate any Maximal SubProblem (MSP).
Let P be a subproblem generated by the algorithm with input size at least the threshold, and such that all its ancestor subproblems are computed, according to the respective instruction functions, using algorithms from the fast class. We say that:

P is a Type 1 MSP if, according to the instruction function, P is computed using an algorithm from the standard class. If the entire problem is to be solved using an algorithm from the standard class, we say that the entire problem is the unique (improper) Type 1 MSP generated by the algorithm.

P is a Type 2 MSP if, according to the instruction function, P is computed by generating 7 subproblems according to the recursive scheme corresponding to an algorithm from the fast (Strassen-like) class, and if the generated subproblems have input size strictly smaller than the threshold. If the entire problem uses a recursive algorithm from the fast class to generate 7 subproblems with input size smaller than the threshold, we say that the entire problem is the unique, improper, Type 2 MSP generated by the algorithm.
In the following we denote as N_1 (resp., N_2) the number of Type 1 (resp., Type 2) MSPs generated by the algorithm. Let P_i denote the i-th MSP generated by the algorithm and let G_i denote the corresponding sub-CDAG of G. We denote as A_i and B_i (resp., C_i) the input factor matrices (resp., the output product matrix) of P_i.
Properties of MSPs and their corresponding sub-CDAGs:
By Definition 2, we have that for each pair of distinct MSPs P_i and P_j, P_i is not recursively generated by the algorithm in order to compute P_j, or vice versa. Hence, by Lemma 2, the sub-CDAGs of G that correspond each to one of the MSPs generated by the algorithm are vertex disjoint.
In order to obtain our I/O lower bound for algorithms in ℋ, we characterize the minimum size of a dominator set for arbitrary subsets of suitably chosen target vertices.
Lemma 4 (Proof in Appendix A.1).
Let G be the CDAG corresponding to an algorithm in ℋ which admits Type 1 MSPs. For each Type 1 MSP, let the associated input set denote the set of input vertices of the corresponding sub-CDAG, each corresponding to an entry of the input matrices A_i and B_i. Further, we define the union of these input sets over all Type 1 MSPs.
Let in such that , with , for , and such that if and only if .^{1}^{1}1Here we use as convention that . Any dominator set of satisfies .
Lemma 5 (Proof in Appendix A.2).
Let be the CDAG corresponding to an algorithm which admits Type 2 MSPs. Further let denote the set of vertices corresponding to the entries of the output matrices of the Type 2 MSPs. Given any subset in with , any dominator set of satisfies .
For each Type 1 MSP generated by the algorithm (in general, different Type 1 MSPs may have different input sizes), we denote the set of variables whose values correspond to the elementary products it computes. Further, we denote the set of vertices corresponding to those variables, and we consider the union of these vertex sets over all Type 1 MSPs.
Lemma 6 (Proof in Appendix A.3).
For any Type 1 MSP generated by the algorithm, consider a subset of its elementary products. Let V_A (resp., V_B) denote the subset of the vertices corresponding to entries of A_i (resp., B_i) which are multiplied in at least one of the selected elementary products. Then any dominator of the vertices corresponding to the selected elementary products with respect to the vertices in V_A ∪ V_B is such that
6 I/O lower bounds for algorithms in ℋ and ℋ_{n_0}
Theorem 7.
Let an algorithm in ℋ be used to multiply two square n × n matrices A and B. If it is run on a sequential machine with a cache of size M such that up to B memory words stored in consecutive memory locations can be moved from cache to slow memory and vice versa using a single memory operation, its I/O complexity satisfies:
(1) 
where the first quantity denotes the total number of internal elementary products computed by the Type 1 MSPs generated by the algorithm, and the second denotes the total number of Type 2 MSPs generated by the algorithm.
If run on P processors, each equipped with a local memory of size M, and where for each I/O operation it is possible to move up to B words, the algorithm's I/O complexity satisfies:
(2) 
Proof.
We prove the results in (1) and (2) for the case B = 1; they then trivially generalize to generic B. We first prove the result for the sequential case in (1). The bound for the parallel case in (2) will be obtained as a simple generalization.
A first term of the bound follows trivially from the fact that, as in our model the input matrices A and B are initially stored in slow memory, it is necessary to move the entire input to the cache at least once, using at least 2n² I/O operations. If the algorithm does not generate any MSPs, the statement in (1) is trivially verified. In the following, we assume that at least one MSP is generated.
Let G denote the CDAG associated with the algorithm according to the construction in Section 4. By definition, and from Lemma 2, the sub-CDAGs of G corresponding each to one of the MSPs generated by the algorithm are vertex-disjoint; hence, their vertex sets are pairwise disjoint subsets of the vertex set of G.
By Definition 2, the MSPs generated by the algorithm have input (resp., output) matrices of size greater than or equal to the respective thresholds. Recall that we denote the set of vertices which correspond to the outputs of the Type 2 MSPs as in Lemma 5.
Let 𝒞 be any computation schedule for the sequential execution of G using a cache of size M. We partition 𝒞 into non-overlapping segments such that during each segment either (a) a fixed number of distinct values corresponding to vertices associated with the elementary products of Type 1 MSPs are explicitly computed (i.e., not loaded from slow memory), or (b) an equal number of distinct values corresponding to output vertices of Type 2 MSPs are evaluated for the first time. (For simplicity of presentation, we assume divisibility.) The number of such segments is bounded from below accordingly.
Below we show that the number of I/O operations executed during each segment is suitably bounded from below, both in case (a) and in case (b), from which the theorem follows.
Case (a): For each Type 1 MSP, consider the set of vertices corresponding to its elementary products that are computed during the segment. As the sub-CDAGs corresponding each to one of the Type 1 MSPs are vertex-disjoint, so are these sets; hence, they constitute a partition of the set of elementary-product vertices computed during the segment. Let A_i and B_i (resp., C_i) denote the input matrices (resp., output matrix) of the i-th Type 1 MSP. We say that an entry of C_i is "active during the segment" if any of the elementary multiplications contributing to it corresponds to one of the vertices computed during the segment. Further, we say that an entry of A_i (resp., B_i) is "accessed during the segment" if any of the elementary multiplications in which it is an operand corresponds to one of the vertices computed during the segment.
Our analysis makes use of the following property of standard matrix multiplication algorithms:
Lemma 8 (Loomis-Whitney inequality [21, Lemma 2.2]).
Let V_A (resp., V_B) denote the set of vertices corresponding to the entries of A_i (resp., B_i) which are accessed during the segment, and let V_C denote the set of vertices corresponding to the entries of C_i which are active during the segment. Then the number of elementary products evaluated during the segment is at most
(3) √(|V_A| · |V_B| · |V_C|).
Lemma 8 is a reworked version of a property originally presented by Irony et al. [21, Lemma 2.2], which itself is a consequence of the Loomis-Whitney geometric theorem [25].
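In its discrete form, the Loomis-Whitney theorem in dimension three bounds the cardinality of a set of lattice points by the geometric mean of the cardinalities of its three orthogonal projections; instantiated for matrix multiplication it reads as follows (a standard formulation, with the symbols of Lemma 8):

```latex
% V  : set of elementary products, viewed as triples (i, j, k);
% V_A: projection of V onto the (i, k) plane -- entries of A accessed;
% V_B: projection of V onto the (k, j) plane -- entries of B accessed;
% V_C: projection of V onto the (i, j) plane -- entries of C active.
\[
  |V| \;\le\; \sqrt{|V_A| \cdot |V_B| \cdot |V_C|}.
\]
```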
Let an entry of C_i be active during the segment. In order to compute it entirely during the segment (i.e., without using partial accumulation of its defining summation), it is necessary to evaluate all of its elementary products during the segment itself. Thus, only a bounded number of the entries of C_i can be entirely computed during the segment.
Let an entry of C_i be active but not entirely computed during the segment. There are two possible scenarios:

is computed during : The computation thus requires a partial accumulation of to have been previously computed and either held in the cache at the beginning of , or moved to the cache using a read I/O operation during ;

is not computed during : As is a parsimonious computation, the partial accumulation of obtained from the elementary products computed during must either remain in the cache at the end of , or be moved to slow memory using a write I/O operation during ;
In both cases, any partial accumulation either held in memory at the beginning (resp., end) of or read from slow memory to cache (resp., written from cache to slow memory) during is, by definition, not shared between multiple entries in .
Let denote the sub-CDAG of corresponding to the Type 1 MSP . In the following, we denote by the set of vertices of corresponding to the values of such partial accumulators. For each of the at least entries of which are active but not entirely computed during , either one of the entries of the cache must be occupied at the beginning of , or one I/O operation is executed during . Let . As, by Lemma 2, the sub-CDAGs corresponding to the Type 1 MSPs are vertex-disjoint, so are the sets . Let . We have:
(4) 
where the last step follows from the fact that, by Definition 2, .
From Lemma 8, the set of vertices (resp., ) which correspond to entries of (resp., ) accessed during satisfies . Hence, at least entries from the input matrices of are accessed during . Let denote the set of vertices corresponding to the entries of the input matrices of . From Lemma 6, there exists a set , with , such that the vertices in are connected by vertex-disjoint paths to the vertices in . Let . As, by Lemma 2, the sub-CDAGs corresponding to the Type 1 MSPs are vertex-disjoint, so are the sets for . Hence
By Lemma 4, any dominator of must be such that
Hence, we can conclude that any dominator of must be such that
(5) 
Consider the set of vertices of corresponding to the at most values stored in the cache at the beginning of and to the at most values loaded into the cache from the slow memory (resp., written into the slow memory from the cache) during by means of a read (resp., write) I/O operation. Clearly, .
In order for the values in to be computed during , there must be no path connecting any vertex in (and, hence, in ) to any input vertex of which does not have at least one vertex in ; that is, must admit a subset such that is a dominator set of . Note that, as the values corresponding to vertices in are actually computed during (i.e., not loaded from memory using a read I/O operation), does not include vertices in itself. Further, as argued above, must include all the vertices in the set corresponding to values of partial accumulators of the active output values of Type 1 MSPs during .
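The dominator-set condition used throughout this argument (in the Hong-Kung sense: every path from an input vertex to a target vertex must meet the set) can be made concrete on a toy CDAG. The following sketch is illustrative only; the graph, vertex names, and helper function are our assumptions, not the paper's CDAG.

```python
import itertools

def is_dominator(succs, inputs, targets, D):
    """Return True if every path from an input vertex to a target vertex
    passes through a vertex of D (a dominator set in the Hong-Kung sense).
    Implemented as reachability from the inputs with D removed."""
    seen = {v for v in inputs if v not in D}
    frontier = list(seen)
    while frontier:
        v = frontier.pop()
        for w in succs.get(v, ()):
            if w not in D and w not in seen:
                seen.add(w)
                frontier.append(w)
    return not (set(targets) & seen)

# Toy CDAG for the inner product c = a1*b1 + a2*b2.
succs = {"a1": ["m1"], "b1": ["m1"], "a2": ["m2"], "b2": ["m2"],
         "m1": ["c"], "m2": ["c"]}
inputs = ["a1", "b1", "a2", "b2"]
verts = ["a1", "b1", "a2", "b2", "m1", "m2", "c"]

# Smallest dominator set of {c}, found by brute force over vertex subsets.
min_dom = next(set(D) for r in range(len(verts) + 1)
               for D in itertools.combinations(verts, r)
               if is_dominator(succs, inputs, ["c"], set(D)))
assert min_dom == {"c"}
```

In the proof, the roles of the toy sets are played by the cached and transferred values: must cut every input-to-target path, which is what forces the lower bound on its size.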
By construction, and are vertex disjoint. Hence, from (4) and (5) we have:
As, by construction, , we have:
(6) 
By studying its derivative, after suitably accounting for the minimum, we find that the right-hand side of (6) is minimized for . Hence we have:
.
Whence , which implies , as stated above.
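The minimization step above can be illustrated numerically. As a hedged sketch (the constants a and b below are placeholders we introduce for illustration, not the actual terms of (6)), bounds of this shape typically take the form f(x) = a/x + b·x; the derivative f'(x) = -a/x² + b vanishes at the unique minimizer x* = √(a/b), which a direct numeric scan confirms:

```python
import math

def argmin_scan(a, b, lo, hi, steps=100_000):
    """Numerically locate the minimizer of f(x) = a/x + b*x on [lo, hi]
    by evaluating f on an evenly spaced grid."""
    xs = (lo + i * (hi - lo) / steps for i in range(steps + 1))
    return min(xs, key=lambda x: a / x + b * x)

# Placeholder constants: for a, b > 0 the minimizer is x* = sqrt(a / b).
a, b = 12.0, 3.0
x_star = math.sqrt(a / b)            # closed-form minimizer from f'(x) = 0
x_num = argmin_scan(a, b, 0.5, 8.0)  # numeric confirmation
assert abs(x_num - x_star) < 1e-3
```
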
Case (b): In order for the values in to be computed during , there must be no path connecting any vertex in to any input vertex of which does not have at least one vertex in ; that is, must be a dominator set of .
From Lemma 5, any dominator set of any subset with satisfies , whence , which implies as stated above. This concludes the proof for the sequential case in (1).
The proof of the bound for the parallel model in (2) follows from the observation that at least one of the processors, denoted as , must compute at least values corresponding to vertices in or values corresponding to vertices in (or both). The bound follows by applying the argument discussed for the sequential case to the computation executed by . ∎
Note that if is such that the product is entirely computed using an algorithm from the standard class (resp., a fast matrix multiplication algorithm), the bounds of Theorem 7 correspond asymptotically to the results of [18] for the sequential case and of [21] for the parallel case (resp., to the results in [11]).
The bound in (2) can be further generalized to a slightly different model in which each of the processors is equipped with a cache memory of size and a slow memory of unbounded size. In this case, the I/O complexity of the algorithms in corresponds to the total number of messages received plus the number of I/O operations (i.e., reads and writes) used to move data between cache and slow memory, executed by at least one of the processors.
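Before turning to the bound for uniform, non-stationary algorithms, it may help to see a concrete member of that class. The following sketch (the threshold name N0 and all code identifiers are ours, chosen for illustration) runs a Strassen-like recursion on power-of-two sizes and switches to the standard cubic algorithm once subproblems reach the threshold, which is exactly the uniform switch between fast and standard schemes analyzed below:

```python
N0 = 4  # assumed threshold: subproblems of size <= N0 use the standard algorithm

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def standard(A, B):
    """Standard O(n^3) matrix multiplication on list-of-lists matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def split(M):
    """Split a square matrix into its four equal quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def hybrid(A, B):
    """Uniform hybrid scheme: Strassen recursion above the threshold N0,
    standard multiplication below it (power-of-two sizes assumed)."""
    n = len(A)
    if n <= N0:
        return standard(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = hybrid(add(A11, A22), add(B11, B22))
    M2 = hybrid(add(A21, A22), B11)
    M3 = hybrid(A11, sub(B12, B22))
    M4 = hybrid(A22, sub(B21, B11))
    M5 = hybrid(add(A11, A12), B22)
    M6 = hybrid(sub(A21, A11), add(B11, B12))
    M7 = hybrid(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

With this structure, every subproblem at a given recursive level is handled the same way, so the scheme is uniform; choosing a different rule at each level (Strassen-like or standard) while keeping it identical within a level yields the uniform, non-stationary class.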
I/O lower bound for uniform, non-stationary algorithms in :
For the subclass of uniform, non-stationary algorithms , given the values of and , it is possible to compute a closed-form expression for the values of and . (The constant terms in Theorem 9 assume that and are powers of two; if that is not the case, the statement holds with minor adjustments.) Then, by applying Theorem 7 we have:
Theorem 9.
Let be an algorithm to multiply two square matrices . If run on a sequential machine with a cache of size , where up to memory words stored in consecutive memory locations can be moved between cache and slow memory using a single memory operation, ’s I/O complexity satisfies:
(7) 
If run on processors, each equipped with a local memory of size , where each I/O operation can move up to memory words, ’s I/O complexity satisfies:
(8) 
Proof.
The statement follows by bounding the values , and for , and by applying the general result from Theorem 7. In order to simplify the presentation, in the following we assume that the values are powers of two. If that is not the case, the theorem holds with some minor adjustments to the constant multiplicative factor.
Let be the smallest value in such that . By definition of , at each of the recursive levels generates subproblems of size .

If , generates
Type 1 MSPs, each with input size . As, by Definition 2, the Type 1 MSPs are input-disjoint, we have:

Otherwise, if , generates
Type 2 MSPs, each with input size