Improving the numerical stability of
fast matrix multiplication
Abstract
Fast algorithms for matrix multiplication, namely those that perform asymptotically fewer scalar operations than the classical algorithm, have been considered primarily of theoretical interest. Apart from Strassen’s original algorithm, few fast algorithms have been efficiently implemented or used in practical applications. However, there exist many practical alternatives to Strassen’s algorithm with varying performance and numerical properties. Fast algorithms are known to be numerically stable, but because their error bounds are slightly weaker than the classical algorithm, they are not used even in cases where they provide a performance benefit.
We argue in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and we explore ways to improve the accuracy both theoretically and empirically. The numerical accuracy of fast matrix multiplication depends on properties of the algorithm and of the input matrices, and we consider both contributions independently. We generalize and tighten previous error analyses of fast algorithms and compare their properties. We discuss algorithmic techniques for improving the error guarantees from two perspectives: manipulating the algorithms, and reducing input anomalies by various forms of diagonal scaling. Finally, we benchmark performance and demonstrate our improved numerical accuracy.
Key words. Practical fast matrix multiplication, error bounds, diagonal scaling
1 Introduction
After Strassen’s discovery of an algorithm for dense matrix-matrix multiplication in 1969 [25] that reduced the computational complexity from the classical O(n^3) (for multiplying two n x n matrices) to O(n^{log_2 7}) ≈ O(n^{2.81}), there has been extensive effort to understand fast matrix multiplication, based on algorithms with computational complexity exponent less than 3. From a theoretical perspective, there remains a gap between the best known lower bound [21] and the best known upper bound [14] on the exponent. From a practical perspective, it is unlikely that the techniques for obtaining the best upper bounds on the exponent can be translated to practical algorithms that will execute faster than the classical one for reasonably sized matrices. In this paper, we are interested in the numerical stability of practical algorithms that have been demonstrated to outperform the classical algorithm (as well as Strassen’s in some instances) on modern hardware [3].
Nearly all fast matrix multiplication algorithms are based on recursion, using a recursive rule that defines a method for multiplying matrices of fixed dimensions m x k and k x n using R scalar multiplications. In this work, we use the notation ⟨m,k,n⟩ for these algorithms. For practical algorithms, these fixed dimensions need to be very small, as they define the factors by which the dimensions of subproblems are reduced within the recursion. Many such algorithms have been recently discovered [3, 24]. Most fast algorithms share a common bilinear structure and can be compactly represented by three matrices that we denote by U, V, and W, following the notation of Bini and Lotti [4]. Many key properties of the practicality of an algorithm, including its numerical stability, can be derived quickly from this representation. We also note that, because recursive subproblems are again matrix multiplications, different recursive rules can be combined arbitrarily. Following the terminology of Ballard et al. [2] and Demmel et al. [12], we refer to algorithms that vary recursive rules across different recursive levels and within each level as nonuniform, nonstationary algorithms. If an algorithm uses the same rule for every subproblem in each recursive level but varies the rule across levels, we call it a uniform, nonstationary algorithm; those defined by only one rule are called stationary algorithms.
Fast matrix multiplication is known to yield larger numerical errors than the classical algorithm. The forward error guarantee for the classical algorithm is componentwise: the error bound for each entry in the output matrix depends only on the dot product between the corresponding row and column of the input matrices. Fast algorithms perform computations involving other input matrix entries that do not appear in a given dot product (their contributions eventually cancel out), and therefore the error bounds for these algorithms depend on more global properties of the input matrices. Thus, fast algorithms with no modification are known to exhibit so-called normwise stability [4] (sometimes referred to as Brent stability [23]), while the classical algorithm exhibits the stronger componentwise stability, which is unattainable for fast algorithms [23].
Our main goals in this paper are to explore means for improving the theoretical error bounds of fast matrix multiplication algorithms as well as to test the improvements with numerical experiments, focusing particularly on those algorithms that yield performance benefits in practice. For computing C = A·B, where A and B are n x n matrices, normwise stability bounds for full recursion take the following form:

‖Ĉ − C‖ ≤ f(n) · ε · ‖A‖ · ‖B‖ + O(ε²),

where ‖A‖ = max_{i,j} |A_{ij}| is the max-norm, ε is the machine precision, and f(n) is a polynomial function that depends on the algorithm [4, 12, 16].¹ (¹Here and elsewhere, the O(ε²) term hides dependence on dimensions and norms of the input matrices.) For example, f(n) = n² for the classical algorithm, with no assumption on the ordering of the dot product computations. We note that f(n) is independent of the input matrices, and ‖A‖ · ‖B‖ is independent of the algorithm. In this paper, we explore ways of improving each factor separately. Our main contributions include:

- generalizing and tightening previous error analyses of stationary fast algorithms to bound f(n) in terms of the number of recursive steps used and two principal quantities derived from U, V, and W;
- presenting and comparing the stability quantities of recently discovered practical algorithms;
- exploring means of improving algorithmic stability through algorithm selection and nonuniform, nonstationary combination of algorithms;
- presenting diagonal scaling techniques to improve accuracy for inputs with entries of widely varying magnitudes; and
- showing empirical results of the effects of the various improvement techniques on both error and performance.
The structure of the remainder of the paper is as follows. We describe related work in Section 2 and introduce our notation for fast matrix multiplication algorithms in Section 3. Section 4 presents the error analysis bounding f(n) for general fast algorithms, and Section 5 discusses the implications of the bounds for known practical algorithms. We present diagonal scaling techniques in Section 6, showing how to reduce the contribution of the input matrices to the error bound. Finally, we discuss our results in Section 7 and provide directions for improving implementations of fast matrix multiplication algorithms.
2 Related Work
Bini and Lotti [4] provide the first general error bound for fast matrix multiplication algorithms, and their analysis provides the basis for our results. Demmel et al. [12] generalize Bini and Lotti’s results and show that all fast algorithms are stable. A more complete summary of the numerical stability of fast algorithms, with a detailed discussion of Strassen’s algorithm along with Winograd’s variant, appears in Higham’s textbook [16, Chapter 23]. We discuss these previous works in more detail and compare them to our error bounds in Section 4.
Castrapel and Gustafson [8] and D’Alberto [9] discuss means of improving the numerical stability of Strassen’s algorithm (and Winograd’s variant) using the flexibility of nonuniform, nonstationary algorithms. Castrapel and Gustafson propose general approaches to such algorithms, and D’Alberto provides a specific improvement in the case of two or more levels of recursion.
Smirnov [24] describes strategies for discovering practical fast algorithms and presents several new algorithms, including a rank-23 algorithm for ⟨3,3,3⟩ with the fewest known nonzeros and an algorithm for ⟨3,3,6⟩ that yields a better exponent than Strassen’s. Similar techniques are used by Benson and Ballard [3], and they demonstrate performance improvements over the classical and Strassen’s algorithms for both single-threaded and shared-memory multithreaded implementations. Laderman et al. [20] and later Kaporin [18, 19] consider another form of practical algorithms, ones that can achieve fewer floating point operations than the Strassen-Winograd variant for certain matrix dimensions. Kaporin demonstrates better numerical stability than Strassen-Winograd and shows comparable performance. However, because the base case dimensions proposed are relatively large (e.g., 13 or 20), we suspect that the performance will not be competitive on today’s hardware. Further, because the [U, V, W] representations are not readily available, we do not consider these types of algorithms in this work.
Dumitrescu [13] proposes a form of diagonal scaling to improve the error bounds for Strassen’s algorithm. We refer to his approach as outside scaling and discuss it in more detail in Section 6. Higham [16] points out that inside scaling can also affect the error bound but does not propose a technique for improving it. Demmel et al. [11] and Ballard et al. [1] state (without proof) improved error bounds using either inside or outside diagonal scaling.
3 Fast Matrix Multiplication Algorithms
Fast algorithms for matrix multiplication are those that perform fewer arithmetic operations than the classical algorithm in an asymptotic sense, achieving a computational complexity exponent less than 3 for the square case. We consider such fast algorithms to be practical if they have been (or likely can be) demonstrated to outperform the most efficient implementations of the classical algorithm on current hardware [3]. From a practical viewpoint, because matrices arising in current applications have limited size, we can consider a fast algorithm’s recursive rule as being applied only a few times. In light of this viewpoint, we state our algorithms (and error bounds) in terms of the number of recursive levels rather than the dimension of the base case, where the number of recursive levels need not be considered a fixed quantity. In the rest of this section, we present the notation and terminology for the fast algorithms we consider in this paper.
3.1 Base Case Algorithms
A bilinear noncommutative algorithm that computes the product of an m x k matrix A and a k x n matrix B (C = A·B) using R nonscalar (active) multiplications is determined by an mk x R matrix U, a kn x R matrix V, and an mn x R matrix W such that

(1)   C_i = Σ_{r=1}^{R} W_{ir} · ( Σ_{j=1}^{mk} U_{jr} A_j ) ∗ ( Σ_{j=1}^{kn} V_{jr} B_j )

for 1 ≤ i ≤ mn. Here, the single indices of entries of A and B assume column-major order, the single indices of entries of C assume row-major order, and ∗ signifies an active multiplication. We will refer to the dimensions of such an algorithm with the notation ⟨m,k,n⟩, to the rank of the algorithm by R, and to the set of coefficients that determine the algorithm with the notation [U, V, W].
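As a concrete sanity check of this bilinear form, the short sketch below encodes Strassen’s ⟨2,2,2⟩ rule in the [U, V, W] form (a standard choice used purely for illustration; the coefficient matrices are the usual Strassen coefficients) and verifies that it reproduces the classical product:

```python
import numpy as np

# Strassen's <2,2,2> rule in [U, V, W] form, as in (1). Rows of U and V
# index entries of A and B in column-major order (a11, a21, a12, a22);
# rows of W index entries of C in row-major order (c11, c12, c21, c22);
# columns index the R = 7 active multiplications M1..M7.
U = np.array([[1, 0, 1, 0, 1, -1,  0],
              [0, 1, 0, 0, 0,  1,  0],
              [0, 0, 0, 0, 1,  0,  1],
              [1, 1, 0, 1, 0,  0, -1]], dtype=float)
V = np.array([[1, 1, 0, -1, 0, 1, 0],
              [0, 0, 0,  1, 0, 0, 1],
              [0, 0, 1,  0, 0, 1, 0],
              [1, 0, -1, 0, 1, 0, 1]], dtype=float)
W = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  0, 1, 0,  1, 0, 0],
              [0,  1, 0, 1,  0, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2))

a = A.flatten(order="F")             # column-major vectorization of A
b = B.flatten(order="F")             # column-major vectorization of B
m_prods = (U.T @ a) * (V.T @ b)      # the R = 7 active multiplications
C = (W @ m_prods).reshape(2, 2)      # row-major vectorization of C

assert np.allclose(C, A @ B)
# nnz(U) + nnz(V) + nnz(W) = 36, matching the nnz reported for Strassen's
# algorithm in the table of principal quantities.
nnz = sum(int(np.count_nonzero(X)) for X in (U, V, W))
print(nnz)  # 36
```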
3.2 Stationary Algorithms
Now we consider multiplying an M x K matrix A by a K x N matrix B. We will assume that M, K, and N are powers of m, k, and n, respectively; otherwise, we can always pad the matrices with zeros and the same analysis will hold. The fast algorithm proceeds recursively by first partitioning A into mk submatrices of size (M/m) x (K/k) and B into kn submatrices of size (K/k) x (N/n) and then following (1) by matrix blocks, i.e.,

(2)   C_i = Σ_{r=1}^{R} W_{ir} · ( Σ_{j=1}^{mk} U_{jr} A_j ) × ( Σ_{j=1}^{kn} V_{jr} B_j )

for 1 ≤ i ≤ mn, where × signifies a recursive call to the algorithm. Here, we are using single subscripts on matrices as an index for the column- or row-major ordering of the matrix blocks. The algorithms in this class of fast matrix multiplication are called stationary algorithms because they use the same algorithm at each recursive step [12]. However, we do not assume that stationary algorithms recurse all the way to a base case of dimension 1; we assume only that the base case computation (of whatever dimension) is performed using the classical algorithm. Thus, a stationary algorithm is defined by the triplet of matrices [U, V, W] along with a number ℓ of recursive levels used before switching to the classical algorithm.
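For illustration, a stationary algorithm with ℓ recursive levels can be sketched as follows, using Strassen’s ⟨2,2,2⟩ rule as the recursive rule and numpy’s matmul as the classical base case. This is a simplified sketch assuming dimensions divisible by 2^ℓ; a real implementation would handle padding and rectangular rules:

```python
import numpy as np

def strassen(A, B, levels):
    """Stationary fast algorithm: apply Strassen's rule `levels` times,
    then multiply the base-case blocks with the classical algorithm."""
    if levels == 0:
        return A @ B                          # classical base case
    m = A.shape[0] // 2
    k = A.shape[1] // 2
    n = B.shape[1] // 2
    A11, A12, A21, A22 = A[:m, :k], A[:m, k:], A[m:, :k], A[m:, k:]
    B11, B12, B21, B22 = B[:k, :n], B[:k, n:], B[k:, :n], B[k:, n:]
    f = lambda X, Y: strassen(X, Y, levels - 1)   # recursive call
    M1 = f(A11 + A22, B11 + B22)
    M2 = f(A21 + A22, B11)
    M3 = f(A11, B12 - B22)
    M4 = f(A22, B21 - B11)
    M5 = f(A11 + A12, B22)
    M6 = f(A21 - A11, B11 + B12)
    M7 = f(A12 - A22, B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
for ell in (0, 1, 2, 3):          # any number of levels gives the product
    assert np.allclose(strassen(A, B, ell), A @ B)
```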
3.3 Uniform, Nonstationary Algorithms
In contrast to the stationary algorithms discussed above, uniform, nonstationary algorithms employ a different fast algorithm, in the sense of (2) and (1), at each recursive level [2]. The fast algorithm is the same within a given recursive level. Specifically, we will consider uniform, nonstationary algorithms with ℓ steps of recursion, so the algorithm is specified by matrices U_i, V_i, and W_i of dimensions m_i k_i x R_i, k_i n_i x R_i, and m_i n_i x R_i, for 1 ≤ i ≤ ℓ.
Uniform, nonstationary algorithms are of particular interest for maximizing performance. The fastest algorithm for a particular triplet of dimensions , , and may depend on many factors; the same algorithm may not be optimal for the recursive subproblems of smaller dimensions. Assuming performance is fixed for a given triplet of dimensions, the flexibility of nonstationary algorithms allows for performance optimization over a given set of fast algorithms. However, in parallel and more heterogeneous settings, better performance may be obtained by the greater generality of nonuniform, nonstationary algorithms, described in the next section.
3.4 Nonuniform, Nonstationary Algorithms
The final class of matrix multiplication algorithms we consider are nonuniform, nonstationary algorithms. In contrast to the previous case, nonuniform, nonstationary algorithms use different algorithms within a single recursive level as well as across recursive levels [2], though we restrict the dimensions of the partitions used by the fast algorithms at a given recursive level to be the same. To define such algorithms, we must specify a rule for every internal node in the recursion tree, a total of 1 + R_1 + R_1 R_2 + ... + R_1 R_2 ... R_{ℓ-1} recursive rules. We use superscript notation to denote a recursive node at level j by its path (i_1, i_2, ..., i_j): the node lies in top-level subtree i_1, in second-level subtree i_2, and so on.
We demonstrate in Section 4.5 that the flexibility of these algorithms allows for an improvement in the numerical stability of multilevel recursive algorithms. We suspect that they also provide a performance benefit over stationary algorithms, though this has not yet been demonstrated empirically.
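One way to organize such an algorithm in code is to key the choice of rule by the node’s path in the recursion tree. The sketch below is a toy illustration with only two rule choices, Strassen’s ⟨2,2,2⟩ rule and the classical 2x2 block rule; the `rules` dictionary and its string labels are our own hypothetical convention, not notation from the literature:

```python
import numpy as np

def multiply(A, B, rules, path=(), max_level=2):
    """Non-uniform, non-stationary multiply: each internal node of the
    recursion tree, identified by `path`, looks up its own rule."""
    if len(path) == max_level:
        return A @ B                  # leaves use the classical algorithm
    m = A.shape[0] // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    f = lambda i, X, Y: multiply(X, Y, rules, path + (i,), max_level)
    if rules.get(path, "strassen") == "classical":
        # classical 2x2 block rule: rank 8, one recursive call per product
        return np.block(
            [[f(0, A11, B11) + f(1, A12, B21), f(2, A11, B12) + f(3, A12, B22)],
             [f(4, A21, B11) + f(5, A22, B21), f(6, A21, B12) + f(7, A22, B22)]])
    M1 = f(0, A11 + A22, B11 + B22)
    M2 = f(1, A21 + A22, B11)
    M3 = f(2, A11, B12 - B22)
    M4 = f(3, A22, B21 - B11)
    M5 = f(4, A11 + A12, B22)
    M6 = f(5, A21 - A11, B11 + B12)
    M7 = f(6, A12 - A22, B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

rng = np.random.default_rng(2)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
# Strassen at the root, but the subtree under the first product is classical.
rules = {(): "strassen", (0,): "classical"}
assert np.allclose(multiply(A, B, rules), A @ B)
```

Every choice of rules computes the same product in exact arithmetic; the rules differ only in cost and in rounding-error behavior.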
4 Error Analysis
The work by Bini and Lotti [4] provides the basic framework for the forward error analysis of fast matrix multiplication algorithms. They provide general bounds for any square, stationary bilinear algorithm with power-of-two coefficients (so that there is no error in scalar multiplications), assuming that full recursion is used (a base case of dimension 1). Demmel et al. [12] extend the work of Bini and Lotti by (1) accounting for errors induced by the scalar multiplications in bilinear algorithms, (2) analyzing uniform, nonstationary bilinear fast matrix multiplication algorithms (algorithms that use different fast matrix multiplication routines at different levels of recursion), and (3) analyzing group-theoretic fast matrix multiplication algorithms. The bounds provided by Demmel et al. also assume square algorithms and that full recursion is used. Higham [16] provides bounds for Strassen’s original algorithm as well as Winograd’s variant in terms of the base case dimension n₀ at which the recursion switches to the classical algorithm. Higham’s bounds are also slightly tighter (in the case of Strassen’s and Winograd’s algorithms) than the general bounds previously mentioned. We note that any matrix multiplication algorithm satisfying the componentwise error bound must perform Ω(n³) arithmetic operations; that is, we cannot get the same componentwise error bounds even when using just one step of recursion of a fast algorithm [23].
The goal of the error analysis provided in this section is to generalize the previous work in two main directions and to tighten the analysis, particularly in the case that the nonzeros of U, V, and W are not all ±1. First, we will consider rectangular fast algorithms; that is, instead of considering recursive rules for multiplying two square matrices, we consider the more general set of rules for multiplying an m x k matrix by a k x n matrix. Second, we will state our general bounds in terms of the number ℓ of levels of recursion used. Motivated by the results of recently discovered practical algorithms [3, 24], we would like to understand the theoretical error guarantees of an algorithm in terms of its [U, V, W] representation. The recent performance results show that rectangular algorithms have practical value (even for multiplying square matrices) and that, for performance reasons, typically only a small number of recursive steps are used in practice. Several recently discovered practical algorithms include fractional power-of-two coefficients (e.g., ±1/2 and ±1/4), and we expect that other currently undiscovered useful algorithms will include fractional coefficients that are not powers of two. Therefore, we make no assumptions on the entries of U, V, and W, and we derive principal quantities below that can be tighter than the analogous quantities in the previous works by Bini and Lotti [4] and Demmel et al. [12], particularly in the case of fractional coefficients. This sometimes leads to much sharper error bounds (see Example 4).
Finally, we point out that our representation of nonuniform, nonstationary algorithms is more convenient than that of previous work. Careful choices of nonuniform, nonstationary algorithms have been shown to improve numerical stability over stationary approaches [9]. Bini and Lotti’s bounds [4] can be applied to such algorithms in terms of the global representation, but the size of that representation grows quickly with the number of recursive levels. Our representation (see Section 3.4) and error bound (see Section 4.5), given in terms of the base case rule used at each node in the recursion tree, allow for a more efficient search over combinations of rules and have led to automatic discovery of more stable algorithms (see Section 5.2).
After defining the principal quantities of interest and specifying our model of computation, the rest of this section provides forward error bounds for each of the types of fast algorithms defined in Section 3. We warn the reader that there are notational similarities and (sometimes subtle) inconsistencies with previous work, as a result of our tightening of the analysis.
4.1 Principal quantities
Following the approach of Bini and Lotti [4], we identify two principal quantities associated with a fast algorithm that, along with the dimensions of the algorithm and the number of levels of recursion, determine its theoretical error bounds. The two principal quantities can be easily computed from the [U, V, W] representation, and we define them in terms of the following vectors:

(3)   α_r = Σ_j 1[U_{jr} ≠ 0],   β_r = Σ_j 1[V_{jr} ≠ 0],   γ_i = Σ_r 1[W_{ir} ≠ 0],

(4)   a_r = Σ_j |U_{jr}|,   b_r = Σ_j |V_{jr}|,

for 1 ≤ r ≤ R and 1 ≤ i ≤ mn, where 1[·] is the Boolean-valued indicator function with value 1 for true and 0 for false. That is, α is the vector of numbers of nonzeros in the columns of U, β is the vector of numbers of nonzeros in the columns of V, γ is the vector of numbers of nonzeros in the rows of W, a is the vector of column 1-norms of U, and b is the vector of column 1-norms of V. When U and V have entries in {0, ±1}, a = α and b = β.
Definition 1.
The prefactor vector e is defined entrywise by

(5)   e_i = γ_i + max_{r : W_{ir} ≠ 0} (α_r + β_r)

for 1 ≤ i ≤ mn, and the prefactor is defined as E = max_{1 ≤ i ≤ mn} e_i.
Definition 2.
The stability vector q is defined entrywise by

(6)   q_i = Σ_{r : W_{ir} ≠ 0} |W_{ir}| · a_r · b_r

for 1 ≤ i ≤ mn, and the stability factor is defined as Q = max_{1 ≤ i ≤ mn} q_i.
The principal quantities for several fast algorithms are listed in Table 1.
Bini and Lotti [4] provide a definition of e for two different summation algorithms: sequential summation and serialized divide-and-conquer (see Section 4.2). We choose the looser of these two bounds (sequential summation) for generality and simpler notation. However, our results are easily converted to the tighter case. Demmel et al. use the serialized divide-and-conquer algorithm in their analysis. Bini and Lotti’s analysis does not account for scalar (nonactive) multiplication by elements of U, V, and W, so their parameter q depends only on the nonzero structure, rather than the magnitudes of the elements of these matrices (cf. (4) and Definition 2). Demmel et al. do account for the multiplication by elements of U, V, and W. However, their parameter q is identical to that of Bini and Lotti, and their bound includes an additional multiplicative factor depending on the number of recursive levels ℓ and the max-norms of U, V, and W.
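To make these quantities concrete, the sketch below computes α, β, γ, a, and b for Strassen’s ⟨2,2,2⟩ representation and then forms E and Q as E = max_i [γ_i + max_{r: W_ir ≠ 0}(α_r + β_r)] and Q = max_i Σ_{r: W_ir ≠ 0} |W_ir| a_r b_r. These formulas are our reading of the definitions above; for Strassen’s algorithm they give E = 8 and Q = 12, the values listed in the table of principal quantities:

```python
import numpy as np

# Strassen's <2,2,2> representation: rows of U/V index entries of A/B in
# column-major order, rows of W index entries of C in row-major order.
U = np.array([[1, 0, 1, 0, 1, -1,  0],
              [0, 1, 0, 0, 0,  1,  0],
              [0, 0, 0, 0, 1,  0,  1],
              [1, 1, 0, 1, 0,  0, -1]], dtype=float)
V = np.array([[1, 1, 0, -1, 0, 1, 0],
              [0, 0, 0,  1, 0, 0, 1],
              [0, 0, 1,  0, 0, 1, 0],
              [1, 0, -1, 0, 1, 0, 1]], dtype=float)
W = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  0, 1, 0,  1, 0, 0],
              [0,  1, 0, 1,  0, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]], dtype=float)

alpha = np.count_nonzero(U, axis=0)      # nonzeros per column of U
beta  = np.count_nonzero(V, axis=0)      # nonzeros per column of V
gamma = np.count_nonzero(W, axis=1)      # nonzeros per row of W
a = np.abs(U).sum(axis=0)                # column 1-norms of U
b = np.abs(V).sum(axis=0)                # column 1-norms of V

active = W != 0                          # which products contribute to C_i
e = gamma + np.where(active, alpha + beta, -np.inf).max(axis=1)
q = (np.abs(W) * (a * b)).sum(axis=1)    # |W_ir| = 0 for inactive products

E, Q = int(e.max()), float(q.max())
print(E, Q)  # 8 12.0
```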
4.2 Model of arithmetic and notation
We follow the notation of Demmel et al. [12]. Let Δ be the set of all errors bounded in magnitude by ε (machine precision) and let Θ_k = { Π_{i=1}^{k} (1 + δ_i) : δ_i ∈ Δ }. We assume the standard model of rounded arithmetic, where the computed value of a op b, for op ∈ {+, −, ×}, is (a op b)(1 + δ) for some δ ∈ Δ. We use the set operation notation: Θ_{k1} · Θ_{k2} = Θ_{k1+k2}, Θ_{k1} ⊆ Θ_{k2} for k1 ≤ k2, and x Θ_k + y Θ_k ⊆ (|x| + |y|) Θ_k.
We define θ_k = max { |θ − 1| : θ ∈ Θ_k } and note that θ_k = kε + O(ε²) as ε → 0. Furthermore, we will not distinguish between singleton sets and an element when using this notation, e.g., fl(a + b) = (a + b) Θ_1. Finally, we will use the standard hat or fl(·) notation to denote a computed value, e.g., Ĉ = fl(C).
Under this arithmetic, the following fact for summation will be useful in our analysis:

(7)   fl( Σ_{i=1}^{n} x_i ) ∈ Σ_{i=1}^{n} x_i Θ_{n−1},

where the algorithm for summation is simply to accumulate the terms one at a time, in sequential order. By using a serialized divide-and-conquer summation, we can instead achieve

(8)   fl( Σ_{i=1}^{n} x_i ) ∈ Σ_{i=1}^{n} x_i Θ_{⌈log₂ n⌉}.

For generality, we will assume the more pessimistic bound (7). Our results can easily be modified for the error bound in (8).
We will also use the following property for sums of terms carrying previously accumulated error:

(9)   fl( Σ_{i=1}^{n} x_i Θ_{k_i} ) ⊆ Σ_{i=1}^{n} x_i Θ_{k_max + n − 1},   where k_max = max_i k_i.
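The sequential bound can be checked numerically: accumulating n terms one at a time carries at most n − 1 rounding factors per term, so the absolute error is at most (n − 1) ε Σ|x_i| to first order. The sketch below verifies this in single precision against a double precision reference (a toy check, not part of the paper’s experiments):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000).astype(np.float32)

s32 = np.float32(0.0)
for xi in x:                       # sequential accumulation, as in (7)
    s32 = np.float32(s32 + xi)

exact = float(np.sum(x.astype(np.float64)))   # stand-in for exact arithmetic
err = abs(float(s32) - exact)
n = x.size
eps = float(np.finfo(np.float32).eps)
bound = (n - 1) * eps * float(np.abs(x).astype(np.float64).sum())
assert err <= bound                # the first-order bound holds comfortably
```

In practice the observed error is far below the bound, since individual rounding errors partially cancel; the bound describes the worst case.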
4.3 Forward error analysis of stationary algorithms
The following theorem states the forward error bound for a stationary algorithm in terms of the principal quantities E and Q defined in Section 4.1, which can be readily determined from its [U, V, W] representation. The sources of error are floating point error accumulation and possible growth in magnitude of intermediate quantities. The floating point error accumulation depends in part on E and grows at worst linearly in the number of recursive levels ℓ. The growth of intermediate quantities depends on Q and grows exponentially in ℓ, which typically dominates the bound. Table 1 shows the values of these quantities for a variety of algorithms.
Table 1. Principal quantities for several fast algorithms: base case dimensions ⟨m,k,n⟩, reference, mkn, rank R, number of nonzeros (nnz) in [U, V, W], prefactor E, stability factor Q, and stability exponent.

⟨m,k,n⟩             | ref.     | mkn | R  | nnz | E  | Q     | stab. exp.
⟨2,2,2⟩ (classical) | —        |  8  |  8 |  24 |  4 |   2   | 1
⟨2,2,2⟩             | [25]     |  8  |  7 |  36 |  8 |  12   | 3.58
⟨2,2,3⟩ (a)         | [25]     | 12  | 11 |  48 |  8 |  12   | 3.03
⟨2,2,3⟩ (a)         | [25]     | 12  | 11 |  48 |  9 |  13   | 3.03
⟨2,2,4⟩ (a)         | [25]     | 16  | 14 |  72 |  8 |  12   | 2.94
⟨2,2,4⟩ (a)         | [25]     | 16  | 14 |  72 | 12 |  24   | 2.94
⟨3,2,3⟩             | appendix | 18  | 15 |  94 | 10 |  20   | 3.21
⟨3,2,3⟩             | appendix | 18  | 15 |  94 | 11 |  23   | 3.21
⟨3,3,3⟩             | [24]     | 27  | 23 | 139 | 15 |  29   | 3.07
⟨4,2,3⟩             | [3]      | 24  | 20 | 130 | 14 |  34   | 3.38
⟨4,2,3⟩             | [3]      | 24  | 20 | 130 | 14 |  30   | 3.38
⟨4,2,3⟩             | [3]      | 24  | 20 | 130 | 14 |  35   | 3.38
⟨4,4,2⟩             | appendix | 32  | 26 | 257 | 22 |  89   | 3.90
⟨4,4,2⟩             | appendix | 32  | 26 | 257 | 23 |  92   | 3.93
⟨4,3,3⟩             | [3]      | 36  | 29 | 234 | 23 | 100   | 3.66
⟨4,3,3⟩             | [3]      | 36  | 29 | 234 | 18 |  71   | 3.66
⟨3,3,6⟩             | [24]     | 54  | 40 | 960 | 39 | 428   | 4.69
⟨3,3,6⟩             | [24]     | 54  | 40 | 960 | 48 | 728.5 | 4.69

(a) These algorithms correspond to straightforward generalizations of Strassen’s ⟨2,2,2⟩ algorithm, either using two copies of the algorithm or one copy combined with the classical algorithm.
Theorem 3.
Suppose that C = A · B, where A is M x K and B is K x N, and C is computed using ℓ recursive steps of the fast matrix multiplication algorithm (2), with the classical algorithm used to multiply the (M/m^ℓ) x (K/k^ℓ) matrices by the (K/k^ℓ) x (N/n^ℓ) matrices at the base cases of the recursion. Then the computed matrix Ĉ satisfies

‖Ĉ − C‖ ≤ ( E · ℓ + K/k^ℓ ) · ε · Q^ℓ · (K/k^ℓ) · ‖A‖ · ‖B‖ + O(ε²),

where ‖A‖ = max_{i,j} |A_{ij}| is the max-norm.
Proof.
We begin by analyzing how relative errors propagate as we form the S and T matrices, the inputs to the active multiplications. Let a superscript index in brackets denote a matrix formed at the specified level of recursion. Following (7), at the first recursive level the computed factors Ŝ_r and T̂_r carry errors of Θ_{α_r − 1} and Θ_{β_r − 1}, respectively, since forming them requires α_r − 1 and β_r − 1 additions, where α and β are defined in (3).
This error propagates as we recurse. At the i-th level of recursion, the inputs to the fast algorithm are given as sums of S and T matrices from the previous level, each with a possible accumulated error, for some index sets and integers. Following (7) and (2), the algorithm simply accumulates additional factors of Θ_{α_r − 1} and Θ_{β_r − 1} before the matrices are passed to the subsequent level of recursion. Thus, at the i-th level of recursion, we have
(10) 
with accumulated error growing linearly in the level i. Note that in exact arithmetic,
(11) 
where S_r and T_r represent recursive orderings of the subblocks of A and B, respectively.
We now use the classical algorithm to multiply the computed S and T matrices at the leaves of the recursion. Because the inner dimension of each leaf-level matrix multiplication is K/k^ℓ, from (10) and (7) we accumulate another factor of Θ_{K/k^ℓ} to obtain the computed leaf products M̂_r.
Next, the computed M matrices are added to form Ĉ following (2). At the i-th level of recursion, sums of computed matrices, for appropriate index sets and with accumulated error, are added together to form the intermediate computed quantities. In the final step at the top of the recursion tree, we have the final summation for each output block, where γ is as defined in (3). Following (9), each summation adds its number of additions to the error count of its worst summand. Likewise, a factor governed by γ is accumulated at every recursive step, and the contributed error from the M matrices comes from the leaf (that is involved in the summation) with maximum error; a leaf matrix is involved in the summation for an output block if the corresponding entries of the W matrices along its root-to-leaf path are nonzero. Thus, we have the total accumulated error for each output block.
In order to determine the largest accumulated error, we compute the maximum over all output blocks i:
where E is given in Definition 1.
We now compute the forward error bound for each block of the output matrix, using the exact representation (11).
In order to determine the largest intermediate quantity, we compute the maximum over all output blocks i:
where Q is given in Definition 2.
Maximizing over the two factors separately, we obtain our result. We note that the two quantities may not achieve their maxima for the same output block i, but we ignore the possible looseness, as the overall bound will typically be dominated by the value of Q.
Note that if ℓ = log_k K (full recursion), the bound in Theorem 3 becomes

‖Ĉ − C‖ ≤ E · (log_k K) · ε · Q^{log_k K} · ‖A‖ · ‖B‖ + O(ε²),

which is the bound provided by Demmel et al. [12], assuming m = k = n, M = K = N, all nonzeros of U have the same value, all nonzeros of V have the same value, and all nonzeros of W have the same value. If ℓ = 0 (no recursion), we get the familiar bound

‖Ĉ − C‖ ≤ K² · ε · ‖A‖ · ‖B‖ + O(ε²).
Example 4.
Because our definition of q (Definition 2) accounts for the magnitudes of the entries of U, V, and W in situ, the bound from Theorem 3 can be tighter than previous analyses [4, 12] when U, V, or W has entries outside {0, ±1}. As an example, we consider a ⟨4,4,2⟩ algorithm, where the U and V matrices have fractional entries [3] (see the appendix). For this algorithm, the stability factor Q computed according to Definition 2 is smaller than the analogous quantity from previous work, which counts nonzeros without regard to magnitude.
4.4 Forward error analysis of uniform, nonstationary algorithms
Recall that uniform, nonstationary algorithms use a single fast algorithm within each recursive level. We denote the prefactor vector, stability vector, and partition dimensions of the algorithm at level i by e^{(i)}, q^{(i)}, and m_i, k_i, and n_i, and we write E_i and Q_i for the corresponding prefactor and stability factor. Using an analysis similar to that of Section 4.3, we get the following stability bound for this class of algorithms:
Theorem 5.
Suppose that C = A · B is computed by a uniform, nonstationary algorithm with ℓ recursive steps of fast matrix multiplication, with the fast algorithm [U_i, V_i, W_i] used at level i and the classical algorithm used to multiply the matrices at the base case of the recursion. Then the computed matrix Ĉ satisfies

‖Ĉ − C‖ ≤ ( Σ_{i=1}^{ℓ} E_i + K/(k_1 k_2 ⋯ k_ℓ) ) · ε · ( Π_{i=1}^{ℓ} Q_i ) · ( K/(k_1 k_2 ⋯ k_ℓ) ) · ‖A‖ · ‖B‖ + O(ε²).
Proof.
The proof is similar to the proof of Theorem 3. The largest accumulation error now satisfies
and the largest intermediate growth quantity satisfies
4.5 Forward error analysis of nonuniform, nonstationary algorithms
We now consider nonstationary algorithms where the algorithm may be nonuniform at every given recursive level of fast matrix multiplication. That is, at any node in the recursion tree, we may choose a different fast algorithm. For simplicity, we assume that at level in the recursion tree, all algorithms have the same partitioning scheme and rank (so that the representations have the same dimensions across all values ) and that after levels of recursion, all leaf nodes use the classical algorithm.
In the case of stationary algorithms, one [U, V, W] triplet defines the entire algorithm; in the case of uniform, nonstationary algorithms, ℓ choices of triplets define the entire algorithm; in this case, we have much more flexibility and can choose 1 + R_1 + R_1 R_2 + ⋯ + R_1 R_2 ⋯ R_{ℓ−1} different fast algorithms (the number of internal nodes of the recursion tree). Recall that we use the path (i_1, i_2, …, i_j) as a superscript to refer to the algorithm used at a node at level j in the recursion tree, where i_1 defines subtree membership at level 1, i_2 defines subtree membership at level 2, and so on.
Our analysis of these algorithms is fundamentally the same: we bound the accumulated error (the analogue of E) and then bound the growth of the intermediate quantities (the analogue of Q). However, maximizing over all output blocks is not as straightforward and cannot be simplified as cleanly as in the previous cases. In particular, we define the largest accumulation error recursively over the recursion tree, where
This expression does not simplify as before. Note that for a block of the output matrix, a node at level j of the recursion tree accumulates error for the additions/subtractions required by its W matrix at that node plus the maximum accumulated error from any of the combined terms. The leaf-level terms of the recursive expression reflect the number of additions and subtractions required to produce the factor matrices S and T at the leaf nodes, and the error accumulated during the classical matrix multiplications is included in that base case.
Likewise, the largest intermediate growth quantity is defined recursively, where
which we can simplify to
where the a and b vectors are defined as in (4). Note that we cannot simplify further as in the uniform case.
In Section 5.2, we use nonuniform, nonstationary algorithms to improve the numerical stability of fast matrix multiplication algorithms.
5 Algorithm selection
Theorem 3 immediately provides several options for improving the numerical stability of fast matrix multiplication. First, we can look for algorithms with smaller E and Q. Since prior work on finding fast algorithms focuses on performance, this provides a new dimension for algorithm design, and in Section 5.1 we compare several stationary algorithms with the same base case dimensions as a first step in this direction. We then extend this analysis to nonuniform, nonstationary algorithms in Section 5.2. Second, we can reduce the number of recursive levels before switching to standard matrix multiplication. However, fewer recursive levels means an asymptotically slower algorithm; we examine this tradeoff in Section 5.3. Finally, we can also reduce the contribution ‖A‖ · ‖B‖ of the input matrices by pre- and post-processing the data, and we provide several such strategies in Section 6.
5.1 Searching for better stationary algorithms
Typically, the only quantity of interest when searching for fast matrix multiplication algorithms is the rank of the solution, which controls the asymptotic complexity. However, we can also search for algorithms that minimize the E and Q quantities while maintaining the same rank. This will improve the numerical stability of the algorithm without sacrificing (asymptotic) performance. We will also consider the number of nonzeros (nnz) in the solution, i.e., the sum of the numbers of nonzero entries in U, V, and W, as this affects the constant in the asymptotic complexity and has a noticeable impact on empirical performance [3]. Thus, the parameters of interest for these algorithms form a performance-stability 3-tuple (nnz, E, Q). In general, the number of nonzeros is positively correlated with E and Q, since these quantities directly depend on the nonzero patterns of U, V, and W (see (5) and (6)).
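As a small illustration of this three-way tradeoff, the sketch below filters a list of (nnz, E, Q) tuples down to the Pareto-optimal ones, i.e., those not dominated in all three coordinates by another tuple. The tuples are the ones discussed in this section; the helper itself is our own illustrative code, not part of the search software:

```python
def pareto(tuples):
    """Return the tuples not dominated coordinate-wise by any other tuple
    (smaller is better in every coordinate)."""
    keep = []
    for t in tuples:
        dominated = any(all(o[i] <= t[i] for i in range(3)) and o != t
                        for o in tuples)
        if not dominated:
            keep.append(t)
    return keep

# (nnz, E, Q) tuples reported in this section; the last one is sub-optimal.
algs = [(130, 14, 34), (138, 12, 34), (134, 13, 32), (156, 26, 132)]
print(pareto(algs))  # [(130, 14, 34), (138, 12, 34), (134, 13, 32)]
```

Each of the three surviving tuples achieves exactly one of the per-coordinate minima, which is the sense in which they are called semioptimal below.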
We first examined the base case ⟨4,2,3⟩, which has outperformed Strassen’s algorithm in practice [3]. We found 479 algorithms with rank 20 using numerical low-rank tensor decomposition search techniques [3]. Of these, there were 208 distinct performance-stability tuples. The smallest nnz, E, and Q quantities over all algorithms were 130, 12, and 32, and the corresponding algorithms had performance-stability tuples (130, 14, 34), (138, 12, 34), and (134, 13, 32). No algorithm we found had parameters that achieved more than one of these minima, so we call these three algorithms semioptimal. Consequently, there is a theoretical tradeoff between performance and stability. We note that although this list of algorithms is not exhaustive, they are the only publicly available algorithms.² (²All of our algorithms, as well as the software for finding them, are publicly available at https://github.com/arbenson/fast-matmul.)
We tested the stability of these algorithms by computing products of random matrices A and B, with entries drawn from each of two random distributions. In addition to the three semioptimal algorithms described above, we also compared against an algorithm with a much worse performance-stability tuple of (156, 26, 132), which we call a suboptimal algorithm. For each pair of matrices, we ran the four algorithms with varying numbers of recursive levels. Our goal here is to compare the errors of different algorithms with the same base case and varying numbers of recursive levels; we are not claiming that any of these algorithms are the best to use for these problem dimensions.
To estimate the error, we computed the reference product C = AB using the classical algorithm in quadruple precision arithmetic. All other computations used double precision arithmetic. Overall, we computed the errors for 100 random pairs (A, B) from each distribution. LABEL:fig:errors423 reports the maximum error over the 100 trials for each algorithm and each number of recursive levels, as well as the upper bound on the error from LABEL:thm:stationary_analysis. We see the following results:

- The error bounds are still pessimistic, even with the improved analysis from LABEL:thm:stationary_analysis. Furthermore, the error bounds for the three semioptimal algorithms are quite similar.
- The true error increases with the number of recursive levels, as predicted by LABEL:thm:stationary_analysis and modeled by the error bound.
- For both distributions, the suboptimal algorithm has larger errors than the semioptimal algorithms, as modeled by the error bound.
- The differences among the semioptimal algorithms depend on the matrices. For one distribution, there is a clear difference in error between the algorithms; interestingly, the (134, 13, 32) semioptimal algorithm has larger errors than the (130, 14, 34) algorithm, even though the former has strictly better prefactor and stability factor parameters. For the other distribution, the errors of the semioptimal algorithms are practically indistinguishable.
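The measurement setup described above can be sketched in miniature. In this sketch, float32 plays the role of the working (double) precision and float64 the role of the quadruple-precision reference; the matrix size, distribution, and use of one level of Strassen recursion are our arbitrary choices, not the paper's exact experiment.

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's recursion; classical multiply on the half-size blocks."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

rng = np.random.default_rng(0)
n = 64
A, B = rng.random((n, n)), rng.random((n, n))  # higher-precision inputs
C_ref = A @ B                                  # higher-precision reference product
A32, B32 = A.astype(np.float32), B.astype(np.float32)
err_fast = np.max(np.abs(strassen_one_level(A32, B32) - C_ref))
err_classical = np.max(np.abs(A32 @ B32 - C_ref))
print(err_fast, err_classical)
```

Both errors are on the order of the working-precision unit roundoff times the problem size; the fast algorithm's error is typically a modest multiple of the classical one.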
We also considered the ⟨2,2,3⟩ base case, which has optimal rank 11 [5]. One known algorithm that achieves the optimal rank uses Strassen's algorithm on a square subblock and classical matrix multiplication on the remaining subblocks. The base case of the algorithm is small enough that we could use a SAT solver [10] to find over 10,000 rank-11 algorithms (ignoring symmetries). We found that the combination of Strassen's algorithm with the classical algorithm had a strictly smaller performance-stability triple than all of the other rank-11 solutions. We conclude that this algorithm is likely optimal in both a performance and a stability sense for the class of algorithms whose coefficients lie in {-1, 0, 1}.
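Assuming the base case in question is ⟨2,2,3⟩ (the dimensions consistent with an optimal rank of 11), the combination of Strassen's algorithm with the classical algorithm can be sketched as follows: Strassen handles the leading 2x2 block of B with 7 scalar multiplications, and the classical formula handles the remaining column with 4, for 11 in total. Function names are our own.

```python
import numpy as np

def strassen_2x2(A, B):
    """Strassen's 7-multiplication formula for a 2x2 times 2x2 product."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

def multiply_2x2_by_2x3(A, B):
    """2x2 times 2x3 using 11 scalar multiplications."""
    C = np.empty((2, 3))
    C[:, :2] = strassen_2x2(A, B[:, :2])  # 7 multiplications on the 2x2 block
    C[:, 2] = A @ B[:, 2]                 # 4 multiplications on the last column
    return C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 3))
assert np.allclose(multiply_2x2_by_2x3(A, B), A @ B)
```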
5.2 Searching for better nonuniform, nonstationary algorithms
Stationary algorithms benefit from their simplicity, but nonuniform, nonstationary algorithms provide a broader search space for algorithms with better numerical properties. We provide several examples below.
Example 6.
D’Alberto [9] describes a nonuniform, nonstationary approach using Strassen’s algorithm that obtains a smaller stability factor than the stationary algorithm with two levels of recursion. Strassen’s algorithm, with the coefficient matrices U, V, and W as given in LABEL:app:strassen, has stability factor 12; two levels of recursion with a stationary approach yield a two-level stability factor of 144. D’Alberto shows that a two-level stability factor of 96 can be obtained with a nonuniform approach using one variant of Strassen’s algorithm: the variant is applied at some nodes of the recursion tree, while the original algorithm is used at the remaining nodes. Similar improvements can be made based on the Strassen-Winograd algorithm, which has a slightly larger stability factor.
A more generic nonuniform approach is described in a patent by Castrapel and Gustafson [8]. They consider eight variants of the Strassen-Winograd algorithm; the correctness of these variants can be derived from the work of Johnson and McLoughlin [17, Equation (6)]. Castrapel and Gustafson suggest using random, round-robin, or matrix-dependent selections among the variants to distribute the error more evenly, but they do not prove that any particular technique reduces the stability factor.
Example 7.
We can improve the two-level stability factor for the ⟨3,2,3⟩ case in a similar manner. The smallest stability factor we have discovered for this case is achieved by the algorithm given in LABEL:app:fast323. Compared to the stability factor obtained by applying that algorithm uniformly for two levels, we can achieve a two-level stability factor of 352 by using three variants of the algorithm: the original algorithm at some nodes of the recursion tree and two transformed variants at the remaining nodes. We suspect that better two-level stability factors are achievable.
5.3 Performance and stability tradeoffs with a small number of recursive levels
In addition to searching for better algorithms, we may also consider the effect of the number of recursive levels on numerical stability. We now consider the performance and stability of fast matrix multiplication algorithms across several base cases and several numbers of recursive levels. LABEL:tab:principal_quantities summarizes the best known (to us) stability factors for several practical base case dimensions. The columns of the table represent the relevant performance and stability parameters for each algorithm, all of which can be computed from the algorithm's coefficient representation.
The rank and the number of nonzeros (nnz), along with the number of recursive levels used, determine the number of floating point operations performed by the stationary version of the algorithm. The rank can be compared to the product of the three base case dimensions, which is the rank of the classical algorithm for that base case. The stability factor and prefactor are computed using LABEL:def:stability and LABEL:def:prefactor, respectively; for a given base case we report the algorithm with the best known stability factor along with that algorithm's prefactor. We do not report a base case and its transpose separately because the best algorithms for each have identical nnz, stability factor, and prefactor parameters, due to transformations corresponding to transposition of the matrix multiplication.
Although we stress that these algorithms will be used with only a few levels of recursion, we also report the asymptotic stability exponent (stab. exp.) in order to compare algorithms across different base case dimensions. If an algorithm for a square base case of dimension n0 with stability factor q is used on square matrices of dimension n, recursing down to subproblems of constant dimension, the bound of LABEL:thm:stationary_analysis can be simplified to

(12)  ‖Ĉ − C‖ ≤ c · n^(log_{n0} q) · ‖A‖ ‖B‖ ε + O(ε²),

where c is a constant that depends in part on the base case dimension n0. In this case, the stability exponent is log_{n0} q. We note that the first two rows of LABEL:tab:principal_quantities match the results of Bini and Lotti [4, Table 2]. The most stable rank-23 algorithm of which we are aware is a cyclic rotation of the one given by Smirnov [24]. In the case of rectangular base cases, we assume a uniform, nonstationary algorithm that cycles among algorithms for the three cyclic rotations of the base case, where the three recursive rules are transformations of each other, either by cyclic rotations or by transposition (for more details, see LABEL:app:fast442 and LABEL:app:fast323).
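Assuming the simplified bound grows like n^(log_{n0} q), which is consistent with Higham's classical analysis of Strassen's algorithm (exponent log2 12), the stability exponent is a one-line computation; the helper name is ours.

```python
import math

# Hypothetical helper: stability exponent log_{n0}(q) for a square base case of
# dimension n0 and one-level stability factor q, under the assumption that the
# simplified error bound grows like n ** exponent.
def stability_exponent(n0, q):
    return math.log(q) / math.log(n0)

# Strassen (n0 = 2, q = 12): Higham's well-known exponent log2(12).
print(round(stability_exponent(2, 12), 3))  # 3.585
```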
LABEL:fig:stability_performance shows the distribution of relative instability and percentage of classical flops for the algorithms in LABEL:tab:principal_quantities, for a fixed number of recursive levels. We measure both terms asymptotically. Ignoring the quadratic cost of additions, the percentage of classical flops is the ratio of the algorithm's rank to the product of the base case dimensions, raised to the power of the number of recursive levels. For large matrix dimensions and small ε, we can ignore the lower-order terms of the bound in LABEL:thm:stationary_analysis, and we define the relative instability to be the factor by which the error bound exceeds that of the classical algorithm. In general, most algorithms follow a narrow log-linear tradeoff between these two parameters. However, there is still room to select algorithms for a fixed number of recursion levels: for example, an algorithm in the figure attains roughly the same stability as Strassen's algorithm while performing fewer floating point operations.
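The flop-count ratio described above can be sketched as follows (the helper name is ours): each recursive level replaces the m*k*n base case multiplications of the classical algorithm with the fast algorithm's R multiplications, so after L levels the ratio of multiplication counts is (R / (m*k*n))**L.

```python
# Ratio of the fast algorithm's multiplication count to the classical
# algorithm's, ignoring the quadratic cost of additions.
def flop_ratio(rank, m, k, n, levels):
    return (rank / (m * k * n)) ** levels

# Strassen (rank 7 on the <2,2,2> base case):
for L in (1, 2, 3):
    print(L, flop_ratio(7, 2, 2, 2, L))  # 0.875, 0.765625, 0.669921875
```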
6 Scaling
We now turn our attention to strategies for pre- and postprocessing the matrices in order to improve numerical stability. The error bounds from LABEL:sec:bounds can be summarized by the following elementwise absolute error bound