Improving the numerical stability offast matrix multiplication

# Improving the numerical stability of fast matrix multiplication

Grey Ballard. This author’s work was supported by an appointment to the Sandia National Laboratories Truman Fellowship in National Security Science and Engineering, sponsored by Sandia Corporation (a wholly owned subsidiary of Lockheed Martin Corporation) as Operator of Sandia National Laboratories under its U.S. Department of Energy Contract No. DE-AC04-94AL85000. Sandia National Laboratories, Livermore, California. Current address: Computer Science Department, Wake Forest University, Winston Salem, North Carolina (ballard@wfu.edu)    Austin R. Benson. This author’s work was supported by an Office of Technology Licensing Stanford Graduate Fellowship. Institute for Computational and Mathematical Engineering, Stanford University, Stanford, California. (arbenson@stanford.edu)    Alex Druinsky This material is based upon work supported by the US Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research (ASCR), Applied Mathematics program under contract number DE-AC02-05CH11231. Lawrence Berkeley National Laboratory, Berkeley, California. (adruinsky@lbl.gov)    Benjamin Lipshitz San Bruno, California (benjamin.lipshitz@gmail.com).    Oded Schwartz The Hebrew University, Jerusalem, Israel (odedsc@cs.huji.ac.il) Research is supported by grants 1878/14 and 1901/14 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities) and grant 3-10891 from the Ministry of Science and Technology, Israel; the Einstein Foundation and the Minerva Foundation; the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI); a grant from the United States-Israel Binational Science Foundation (BSF), Jerusalem, Israel; and the HUJI Cyber Security Research Center in conjunction with the Israel National Cyber Bureau in the Prime Minister’s Office.
###### Abstract

Fast algorithms for matrix multiplication, namely those that perform asymptotically fewer scalar operations than the classical algorithm, have been considered primarily of theoretical interest. Apart from Strassen’s original algorithm, few fast algorithms have been efficiently implemented or used in practical applications. However, there exist many practical alternatives to Strassen’s algorithm with varying performance and numerical properties. Fast algorithms are known to be numerically stable, but because their error bounds are slightly weaker than the classical algorithm, they are not used even in cases where they provide a performance benefit.

We argue in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and we explore ways to improve the accuracy both theoretically and empirically. The numerical accuracy of fast matrix multiplication depends on properties of the algorithm and of the input matrices, and we consider both contributions independently. We generalize and tighten previous error analyses of fast algorithms and compare their properties. We discuss algorithmic techniques for improving the error guarantees from two perspectives: manipulating the algorithms, and reducing input anomalies by various forms of diagonal scaling. Finally, we benchmark performance and demonstrate our improved numerical accuracy.

Key words. Practical fast matrix multiplication, error bounds, diagonal scaling

## 1 Introduction

After Strassen’s discovery of an algorithm for dense matrix-matrix multiplication in 1969 [25] that reduced the computational complexity from the classical (for multiplying two matrices) to , there has been extensive effort to understand fast matrix multiplication, based on algorithms with computational complexity exponent less than 3. From a theoretical perspective, there remains a gap between the best known lower bound [21] and best known upper bound [14] on the exponent. From a practical perspective, it is unlikely that the techniques for obtaining the best upper bounds on the exponent can be translated to practical algorithms that will execute faster than the classical one for reasonably sized matrices. In this paper, we are interested in the numerical stability of practical algorithms that have been demonstrated to outperform the classical algorithm (as well as Strassen’s in some instances) on modern hardware [3].

Nearly all fast matrix multiplication algorithms are based on recursion, using a recursive rule that defines a method for multiplying matrices of fixed dimension by scalar multiplications. In this work, we use the notation for these algorithms. For practical algorithms, these fixed dimensions need to be very small, typically , as they define the factors by which the dimensions of subproblems are reduced within the recursion. Many such algorithms have been recently discovered [3, 24]. Most fast algorithms share a common bilinear structure and can be compactly represented by three matrices that we denote by , following the notation of Bini and Lotti [4]. Many key properties of the practicality of an algorithm, including its numerical stability, can be derived quickly from its representation. We also note that, because recursive subproblems are again matrix multiplications, different recursive rules can be combined arbitrarily. Following the terminology of Ballard et al. [2] and Demmel et al. [12], we refer to algorithms that vary recursive rules across different recursive levels and within each level as non-uniform, non-stationary algorithms. If an algorithm uses the same rule for every subproblem in each recursive level but varies the rule across levels, we call it a uniform, non-stationary algorithm; those defined by only one rule are called stationary algorithms.

Fast matrix multiplication is known to yield larger numerical errors than the classical algorithm. The forward error guarantee for the classical algorithm is component-wise: the error bound for each entry in the output matrix depends only on the dot product between the corresponding row and column of the input matrices. Fast algorithms perform computations involving other input matrix entries that do not appear in a given dot product (their contributions eventually cancel out), and therefore the error bounds for these algorithms depend on more global properties of the input matrices. Thus, fast algorithms with no modification are known to exhibit so-called norm-wise stability [4] (sometimes referred to as Brent stability [23]) while the classical algorithm exhibits the stronger component-wise stability, which is unattainable for fast algorithms [23].

Our main goals in this paper are to explore means for improving the theoretical error bounds of fast matrix multiplication algorithms as well as to test the improvements with numerical experiments, focusing particularly on those algorithms that yield performance benefits in practice. For computing , where is and is , norm-wise stability bounds for full recursion take the following form:

 ∥^C−C∥≤falg(K)∥A∥∥B∥ϵ+O(ϵ2),

where is the max-norm, is the machine precision, and is a polynomial function that depends on the algorithm [4, 12, 16].111Here and elsewhere, the term hides dependence on dimensions and norms of input matrices. For example, for the classical algorithm, with no assumption on the ordering of dot product computations. We note that is independent of the input matrices, and is independent of the algorithm. In this paper, we explore ways of improving each factor separately. Our main contributions include:

1. generalizing and tightening previous error analysis of stationary fast algorithms to bound in terms of the number of recursive steps used and two principal quantities derived from ;

2. presenting and comparing the stability quantities of recently discovered practical algorithms;

3. exploring means of improving algorithmic stability through algorithm selection and non-uniform, non-stationary combination of algorithms;

4. presenting diagonal scaling techniques to improve accuracy for inputs with entries of widely varying magnitudes; and

5. showing empirical results of the effects of the various improvement techniques on both error and performance.

The structure of the remainder of the paper is as follows. We describe related work in LABEL:sec:rel_work and introduce our notation for fast matrix multiplication algorithms in LABEL:sec:alg. LABEL:sec:bounds presents the error analysis for bounding for general fast algorithms, and LABEL:sec:selection discusses the implications of the bounds on known practical algorithms. We present diagonal scaling techniques in LABEL:sec:scaling, showing how to reduce the contribution of the input matrices to the error bound. Finally, we discuss our results in LABEL:sec:discussion and provide directions for improving implementations of fast matrix multiplication algorithms.

## 2 Related Work

Bini and Lotti [4] provide the first general error bound for fast matrix multiplication algorithms, and their analysis provides the basis for our results. Demmel et al. [12] generalize Bini and Lotti’s results and show that all fast algorithms are stable. A more complete summary of the numerical stability of fast algorithms, with a detailed discussion of Strassen’s algorithm along with Winograd’s variant, appears in Higham’s textbook [16, Chapter 23]. We discuss these previous works in more detail and compare them to our error bounds in LABEL:sec:bounds.

Castrapel and Gustafson [8] and D’Alberto [9] discuss means of improving the numerical stability of Strassen’s algorithm (and Winograd’s variant) using the flexibility of non-uniform, non-stationary algorithms. Castrapel and Gustafson propose general approaches to such algorithms, and D’Alberto provides a specific improvement in the case of two or more levels of recursion.

Smirnov [24] describes strategies for discovering practical fast algorithms and presents several new algorithms, including a rank-23 algorithm for with the fewest known nonzeros and an algorithm for that yields a better exponent than Strassen’s. Similar techniques are used by Benson and Ballard [3], and they demonstrate performance improvements over the classical and Strassen’s algorithm for both single-threaded and shared-memory multi-threaded implementations. Laderman et al. [20] and later Kaporin [18, 19] consider another form of practical algorithms, ones that can achieve fewer floating point operations than the Strassen-Winograd variant for certain matrix dimensions. Kaporin demonstrates better numerical stability than Strassen-Winograd and shows comparable performance. However, because the base case dimensions proposed are relatively large (e.g., 13 or 20), we suspect that the performance will not be competitive on today’s hardware. Further, because the representations are not readily available, we do not consider these types of algorithms in this work.

Dumitrescu [13] proposes a form of diagonal scaling to improve the error bounds for Strassen’s algorithm. We refer to his approach as outside scaling and discuss it in more detail in LABEL:sec:scaling. Higham [16] points out that inside scaling can also affect the error bound but does not propose a technique for improving it. Demmel et al. [11] and Ballard et al. [1] state (without proof) improved error bounds using either inside or outside diagonal scaling.

## 3 Fast Matrix Multiplication Algorithms

Fast algorithms for matrix multiplication are those that perform fewer arithmetic operations than the classical algorithm in an asymptotic sense, achieving a computational complexity exponent less than 3 for the square case. We consider such fast algorithms to be practical if they have been (or likely can be) demonstrated to outperform the most efficient implementations of the classical algorithm on current hardware [3]. From a practical viewpoint, because matrices arising in current applications have limited size, we can consider a fast algorithm’s recursive rule being applied only a few times. In light of this viewpoint, we state our algorithms (and error bounds) in terms of the number of recursive levels rather than the dimension of the base case, where the number of recursive levels need not be considered a fixed quantity. In the rest of this section, we state the notational and terminology of the fast algorithms we consider in this paper.

### 3.1 Base Case Algorithms

A bilinear non-commutative algorithm that computes a product of an matrix and a matrix () using non-scalar (active) multiplications is determined by a matrix , a matrix , and a matrix such that

 ck=R∑r=1wkrmr,wheremr:=sr⋅tr,sr:=M0K0∑i=1uirai,tr:=K0N0∑j=1vjrbj, (1)

for . Here, the single indices of entries of and assume column-major order, the single indices of entries of assume row-major order, and signifies an active multiplication. We will refer to the dimensions of such an algorithm with the notation , the rank of the algorithm by , and the set of coefficients that determine the algorithm with the notation .

### 3.2 Stationary Algorithms

Now we consider multiplying an matrix by a matrix . We will assume that , , and are powers of , , and ; otherwise, we can always pad the matrices with zeros and the same analysis will hold. The fast algorithm proceeds recursively by first partitioning into submatrices of size and into submatrices of size and then following LABEL:eqn:bilinear_alg by matrix blocks, i.e.,

 Ck=R∑r=1wkrMr,whereMr:=Sr⋅Tr,Sr:=M0K0∑i=1uirAi,Tr:=K0N0∑j=1vjrBj (2)

for , where signifies a recursive call to the algorithm. Here, we are using single subscripts on matrices as an index for the column- or row-major ordering of the matrix blocks. The algorithms in this class of fast matrix multiplication are called stationary algorithms because they use the same algorithm at each recursive step [12]. However, we do not assume that stationary algorithms recurse all the way to a base case of dimension 1; we assume only that the base case computation (of whatever dimension) is performed using the classical algorithm. Thus, a stationary algorithm is defined by the triplet of matrices along with a number of recursive levels used before switching to the classical algorithm.

### 3.3 Uniform, Non-Stationary Algorithms

In contrast to the stationary algorithms discussed above, uniform, non-stationary algorithms employ a different fast algorithm, in the sense of LABEL:eqn:recursive_alg and LABEL:eqn:bilinear_alg, at each recursive level [2]. The fast algorithm is the same at a given recursive level. Specifically, we will consider uniform, non-stationary algorithms with steps of recursion, so the algorithm is specified by matrices , , of dimensions , , , for .

Uniform, non-stationary algorithms are of particular interest for maximizing performance. The fastest algorithm for a particular triplet of dimensions , , and may depend on many factors; the same algorithm may not be optimal for the recursive subproblems of smaller dimensions. Assuming performance is fixed for a given triplet of dimensions, the flexibility of non-stationary algorithms allows for performance optimization over a given set of fast algorithms. However, in parallel and more heterogeneous settings, better performance may be obtained by the greater generality of non-uniform, non-stationary algorithms, described in the next section.

### 3.4 Non-Uniform, Non-Stationary Algorithms

The final class of matrix multiplication algorithms we consider are non-uniform, non-stationary algorithms. In contrast to the previous case, non-uniform, non-stationary algorithms use different algorithms within a single recursive level as well across recursive levels [2], though we restrict the dimension of the partition by the fast algorithms at a given recursive level to be the same. To define such algorithms, we must specify for every node in the recursion tree, a total of recursive rules. We use superscript notation to denote a recursive node at level , in the top-level subtree , second level subtree , and so on.

We demonstrate in LABEL:sec:non_uniform_non_stationary that the flexibility of these algorithms allows for an improvement in the numerical stability of multi-level recursive algorithms. We suspect that they also provide a performance benefit over stationary algorithms, though this has never been demonstrated empirically.

## 4 Error Analysis

The work by Bini and Lotti [4] provides the basic framework for the forward error analysis of fast matrix multiplication algorithms. They provide general bounds for any square, stationary bilinear algorithm with power-of-two coefficients (so that there is no error in scalar multiplications), assuming that full recursion is used (a base case of dimension 1). Demmel et al. [12] extend the work of Bini and Lotti by (1) accounting for errors induced by the scalar multiplications in bilinear algorithms, (2) analyzing uniform, non-stationary bilinear fast matrix multiplication algorithms (algorithms that use different fast matrix multiplication routines at different levels of recursion), and (3) analyzing group-theoretic fast matrix multiplication algorithms. The bounds provided by Demmel et al. also assume square algorithms and that full recursion is used. Higham [16] provides bounds for Strassen’s original algorithm as well as Winograd’s variant in terms of the base case dimension , where the recursion switches to the classical algorithm. Higham’s bounds are also slightly tighter (in the case of Strassen’s and Winograd’s algorithms) than the general bounds previously mentioned. We note that any matrix multiplication algorithm satisfying the component-wise error bound must perform at least arithmetic operations; that is, we cannot get the same component-wise error bounds even when using just one step of recursion of a fast algorithm [23].

The goal of the error analysis provided in this section is to generalize the previous work in two main directions and to tighten the analysis particularly in the case that nonzeros of , , and are not all . First, we will consider rectangular fast algorithms; that is, instead of considering recursive rules for multiplying two matrices, we consider the more general set of rules for multiplying an matrix by a matrix. Second, we will state our general bounds in terms of the number of levels of recursion used. Motivated by the results of recently discovered practical algorithms [3, 24], we would like to understand the theoretical error guarantees of an algorithm in terms of its representation. The recent performance results show that rectangular algorithms have practical value (even for multiplying square matrices) and that, for performance reasons, typically only a small number of recursive steps are used in practice. Several recently discovered practical algorithms include fractional power-of-two coefficients (e.g., , , ), and we expect that other currently undiscovered useful algorithms will include fractional coefficients that are not powers of two. Therefore, we make no assumptions on the entries of , , and , and we derive principal quantities below that can be tighter than the analogous quantities in the previous works by Bini and Lotti [4] and Demmel et al. [12], particularly in the case of fractional coefficients. This sometimes leads to much sharper error bounds (see LABEL:ex:better_emax).

Finally, we point out that our representation of non-uniform, non-stationary algorithms is more convenient than previous work. Careful choices of non-uniform, non-stationary algorithms have been shown to improve the numerical stability over stationary approaches (see LABEL:ex:nunsStr) [9]. Bini and Lotti’s bounds [4] can be applied to such algorithms in terms of the global representation, but the size of that representation grows quickly with the number of recursive levels. Our representation (see LABEL:sec:nuns) and error bound (see LABEL:sec:non_uniform_non_stationary), given in terms of the base case rule used at each node in the recursion tree, allows for more efficient search of combinations of rules and has led to automatic discovery of more stable algorithms (see LABEL:ex:nuns323).

After defining the principal quantities of interest and specifying our model of computation, the rest of this section provides forward error bounds for each of the types of fast algorithms defined in LABEL:sec:alg. We warn the reader that there are notational similarities and (sometimes subtle) inconsistencies with previous work, as a result of our tightening of the analysis.

### 4.1 Principal quantities

Following the approach of Bini and Lotti [4], we identify two principal quantities associated with a fast algorithm that, along with the dimensions of the algorithm and the number of levels of recursion, determine its theoretical error bounds. The two principal quantities can be easily computed from the representation, and we define them in terms of the following vectors:

 αr:=M0K0∑i=1I(uir≠0)βr:=K0N0∑j=1I(vjr≠0)γk:=R∑r=1I(wkr≠0) (3)
 ar:=M0K0∑i=1|uir|br:=K0N0∑j=1|vjr| (4)

for and , where is the Boolean-valued indicator function with value 1 for true and 0 for false. That is, is the vector of numbers of nonzeros in the columns of , is the vector of numbers of nonzeros in the columns of , is the vector of numbers of nonzeros in the rows of , is the vector of column 1-norms of , and is the vector of column 1-norms of . When and have entries, and .

###### Definition 1.

The prefactor vector is defined entry-wise by

 qk=γk+maxr(αr+βr)I(wkr≠0) (5)

for , and the prefactor is defined as

 Q=maxkqk.

###### Definition 2.

The stability vector is defined entry-wise by

 ek=R∑r=1ar⋅br⋅|wkr| (6)

for , and the stability factor is defined as

 E=maxkek.

The principal quantities for several fast algorithms are listed in LABEL:tab:principal_quantities.

Bini and Lotti [4] provide a definition of for two different summation algorithms: sequential summation and serialized divide-and-conquer (see LABEL:sec:model). We choose the looser of these two bounds (sequential summation) for generality and simpler notation. However, our results are easily converted to the tighter case. Demmel et al.use the serialized divide-and-conquer algorithm in their analysis. Bini and Lotti’s analysis does not account for scalar (non-active) multiplication by elements of , , and , so their parameter depends only on the non-zero structure, rather than the magnitude of the elements in these matrices (cf. LABEL:eqn:ab and LABEL:def:stability). Demmel et al.do account for the multiplication by elements of , , and . However, their parameter is identical to that of Bini and Lotti, and their bound includes an additional factor of , where is the number of recursive levels and is the max-norm.

### 4.2 Model of arithmetic and notation

We follow the notation of Demmel et al. [12]. Let be the set of all errors bounded by (machine precision) and let . We assume the standard model of rounded arithmetic where the computed value of is for some . We use the set operation notation: , , and .

We define and note that as . Furthermore, we will not distinguish between singleton sets and an element when using this notation, e.g., . Finally, we will use the standard hat or notation to denote a computed value, e.g., or .

Under this arithmetic, the following fact for summation will be useful in our analysis

 fl(N∑i=1fl(ci⋅ai))∈(N∑i=1ci⋅ai)ΔN, (7)

where the algorithm for summation is simply to accumulate the terms one at a time, in sequential order. By using a serialized divide-and-conquer summation, we can also achieve

 fl(N∑i=1fl(ci⋅ai))∈(N∑i=1ci⋅ai)Δ1+⌈log2N⌉. (8)

For generality, we will assume the more pessimistic bound in LABEL:eqn:seq_summation_error. Our results can easily be modified for the error bounds in LABEL:eqn:dnc_summation_error.

We will also use the following property:

 fl(N∑i=1ciΔaj)∈(N∑i=1ci)ΔN+maxjaj. (9)

### 4.3 Forward error analysis of stationary algorithms

The following theorem states the forward error bound for a stationary algorithm in terms of the principal quantities and defined in LABEL:sec:pq, which can be readily determined from its representation. The sources of error are floating point error accumulation and possible growth in magnitude of intermediate quantities. The floating point error accumulation depends in part on and grows at worst linearly in . The growth of intermediate quantities depends on and grows exponentially in , which typically dominates the bound. LABEL:tab:principal_quantities shows the values of these quantities for a variety of algorithms.

###### Theorem 3.

Suppose that , where and is computed by using recursive steps of the fast matrix multiplication in LABEL:eqn:recursive_alg, with the classical algorithm used to multiply the matrices by the matrices at the base cases of the recursion. Then the computed matrix satisfies

 ∥^C−C∥≤(K/KL0+Q⋅L)(K/KL0)⋅EL∥A∥∥B∥ϵ+O(ϵ2),

where is the max-norm.

###### Proof.

We begin by analyzing how relative errors propagate as we form the and matrices. Let a superscript index in brackets denote a matrix formed at the specified level of recursion. Following LABEL:eqn:seq_summation_error, we have the following error at the first recursive level:

 ^S[1]r∈M0K0∑i=1uirAiΔαr,^T[1]r∈K0N0∑j=1vjrBjΔβr,

where and are defined in LABEL:eqn:abg.

This error propagates as we recurse. At the th level of recursion, the inputs to the fast algorithm are given as sums of matrices and , each with a possible error of and , respectively, for some index sets and and some integers and . Following LABEL:eqn:seq_summation_error and LABEL:eqn:recursive_alg, the algorithm simply accumulates an additional factor of and before the matrices are passed to the subsequent level of recursion. Thus, at the th level of recursion, we have

 ^S[L]r∈S[L]rΔαr1+⋯+αrL,^T[L]r∈T[L]rΔβr1+⋯+βrL, (10)

with . Note that in exact arithmetic,

 S[L]r=ML0KL0∑i=1ui1r1⋯uiLrLAi,T[L]r=KL0NL0∑j=1vj1r1⋯vjLrLBj, (11)

where and represent recursive orderings of the subblocks of and .

We now use the classical algorithm to multiply the computed and matrices at the leaves of the recursion. Because the inner dimension of each leaf-level matrix multiplication is , from LABEL:eqn:ST_error and LABEL:eqn:seq_summation_error we accumulate another factor of to obtain

 ^M[L]r∈S[L]rT[L]rΔχr+K/KL0,

where for .

Next, the computed matrices are added to form followingLABEL:eqn:recursive_alg. At the th level of recursion, sums of matrices , for appropriate index sets and including accumulated error for some integers , are added together to form the intermediate computed quantities . In the final step at the top of the recursion tree, we have

 ^Ck∈R∑r=1wkr^M[1]rΔγk,

where is as defined in LABEL:eqn:abg. Following LABEL:eqn:add_accum_err, if for some integers , then

 ^Ck∈R∑r=1wkrM[1]rΔγk+maxrxr⋅I(wkr≠0).

Likewise, a factor of is accumulated at every recursive step, and the contributed error from the matrices comes from the leaf (that is involved in the summation) with maximum error. Leaf matrix is involved in the summation for if , where and . Thus, we have

 ^Ck∈RL∑r=1wk1r1⋯wk1rLM[L]rΔμk+maxrχr⋅I(wk1r1⋯wkLrL≠0)+K/KL0,

where .

Let . In order to determine the largest accumulated error, we compute the maximum over all output blocks :

 maxkδk =K/KL0+maxk1,…,kL{μk+maxr1,…,rLχr⋅I(wk1r1⋯wkLrL≠0)} =K/KL0+maxk1{γk1+maxr1(αr1+βr1)I(wk1r1≠0)}⋅L=K/KL0+Q⋅L,

where is given in LABEL:def:prefactor.

We now compute the forward error bound for each block of the output matrix. We have , which implies (using LABEL:eqn:ST_exact)

 |Ek| ≤RL∑r=1∣∣wk1r1⋯wk1rLS[L]rT[L]r∣∣δkϵ+O(ϵ2)

Let . In order to determine the largest intermediate quantity, we compute the maximum over all output blocks :

 maxkξk =maxk1,…,kL∑r1,…,rL|wk1r1⋯wkLrL|∑i1,…,iL|ui1r1⋯uiLrL|∑j1,…,jL|vj1r1⋯vjLrL| =(maxk1∑r1|wk1r1|∑i1|ui1r1|∑j1|vj1r1|)L=EL,

where is given in LABEL:def:stability.

Computing by maximizing over and separately, we obtain our result. We note that the two quantities may not achieve their maxima for the same , but we ignore the possible looseness as the overall bound will typically be dominated by the value of .

Note that if (full recursion), the bound in LABEL:thm:stationary_analysis becomes

 ∥^C−C∥≤(1+Q⋅L)⋅ElogK0K∥A∥∥B∥ϵ+O(ϵ2)

which is the bound provided by Demmel et al. [12], assuming , , all nonzeros of have the same value, all nonzeros of have the same value, and all nonzeros of have the same value. If (no recursion), we get the familiar bound

 ∥^C−C∥≤K2∥A∥∥B∥ϵ+O(ϵ2).
###### Example 4.

Because our definition of (LABEL:def:stability) accounts for the magnitude of the entries , , and in situ, the bound from LABEL:thm:stationary_analysis can be tighter than previous analyses [4, 12] when , , or has entries outside of . As an example, we consider a algorithm, where the and matrices have entries in  [3] (see LABEL:app:fast442). For this algorithm, according to LABEL:def:stability is , while according to previous work is .

### 4.4 Forward error analysis of uniform, non-stationary algorithms

Recall that uniform, non-stationary algorithms use a single algorithm at each recursive level. We denote the prefactor vector, stability vector, and partition dimensions of algorithm at level by , , and , , and . Using a similar analysis to LABEL:sec:stationary, we get the following stability bound for this class of algorithms:

###### Theorem 5.

Suppose that is computed by a uniform, non-stationary algorithm with recursive steps of fast matrix multiplication, with the fast algorithm used at level and the classical algorithm used to multiply the matrices at the base case of the recursion. Then the computed matrix satisfies

 ∥^C−C∥≤⎛⎝K∏Ll=1K[l]0+L∑l=1Q[l]⎞⎠⎛⎝K∏Ll=1K[l]0⎞⎠⋅(L∏l=1E[l])∥A∥∥B∥ϵ+O(ϵ2).

###### Proof.

The proof is similar to the proof of LABEL:thm:stationary_analysis. The largest accumulation error now satisfies

 maxkδk

and the largest intermediate growth quantity satisfies

 maxkξk

### 4.5 Forward error analysis of non-uniform, non-stationary algorithms

We now consider non-stationary algorithms where the algorithm may be non-uniform at every given recursive level of fast matrix multiplication. That is, at any node in the recursion tree, we may choose a different fast algorithm. For simplicity, we assume that at level in the recursion tree, all algorithms have the same partitioning scheme and rank (so that the representations have the same dimensions across all values ) and that after levels of recursion, all leaf nodes use the classical algorithm.

In the case of stationary algorithms, one defines the entire algorithm; in the case of uniform non-stationary algorithms, choices of define the entire algorithm; in this case, we have much more flexibility and can choose different fast algorithms (the number of internal nodes of the recursion tree). Recall that we use the notation as a superscript to refer to the algorithm used at level in the recursion tree, where defines subtree membership at level 1, defines subtree membership at level 2, and so on, and defines the subtree node at the th level.

Our analysis of these algorithms is fundamentally the same—we bound the accumulated error () and then bound the number of terms (). However, maximizing over all output blocks is not as straightforward and cannot be simplified as cleanly as in the previous cases. In particular, we define the largest accumulation error recursively as , where

 δ[1]k =K∏Ll=1K[l]0+γ[1]k1+maxr1δ[2,r1]k⋅I(w[1]k1r1≠0), δ[2,r1]k =γ[2,r1]k2+maxr2δ[3,r1,r2]k⋅I(w[2,r1]k2r2≠0), ⋮ δ[l,r1,…,rl−1]k =γ[l,r1,…,rl−1]kl+maxrlδ[l+1,r1,…,rl]k⋅I(w[l,r1,…,rl−1]klrl≠0), ⋮ δ[L,r1,…,rL−1]k =γ[L,r1,…,rL−1]kL+maxrLχr⋅I(w[L,r1,…,rL−1]kLrL≠0), and χr =α[1]r1+β[1]r1+α[2,r1]r2+β[2,r1]r2+⋯+α[L,r1,…,rL−1]rL+β[L,r1,…,rL−1]rL.

This expression does not simplify as before. Note that for block of the output matrix, node at level of the recursion tree accumulates error for the additions/subtractions required by matrix at that node plus the maximum accumulated error from any of the combined terms. The expression for reflects the number of additions and subtractions required to produce the factor matrices and at the leaf nodes, and the error accumulated during the classical matrix multiplications is included in the definition of .

Likewise, the largest intermediate growth quantity is , where

which we can simplify to

where and vectors are defined as in LABEL:eqn:ab. Note that we cannot simplify further as in the uniform case.

In Section LABEL:sec:searching_nuns, we use non-uniform, non-stationary algorithms to improve the numerical stability of fast matrix multiplication algorithms.

## 5 Algorithm selection

LABEL:thm:stationary_analysis immediately provides several options for improving the numerical stability of fast matrix multiplication. First, we can look for algorithms with a smaller and . Since prior work on finding fast algorithms focuses on performance, this provides a new dimension for algorithm design. And in LABEL:sec:search_stationary, we compare several stationary algorithms for the same base case as a first step in this dimension of algorithm design. We then extend this analysis to non-uniform, non-stationary algorithms in LABEL:sec:searching_nuns. Second, we can reduce the number of recursive levels before using standard matrix multiplication However, fewer recursive levels means an asymptotically slower algorithm. We examine this tradeoff in LABEL:sec:recursive_levels. Finally, we can also reduce and by pre- and post-processing the data, and we provide several such strategies in LABEL:sec:scaling.

### 5.1 Searching for better stationary algorithms

Typically, the only quantity of interest for finding fast matrix multiplication algorithms is the rank of the solution, which controls the asymptotic complexity. However, we can also search for algorithms to minimize the and quantities while maintaining the same rank. This will improve the numerical stability of the algorithm without sacrificing (asymptotic) performance. We will also consider the number of non-zeros (nnz) in the solution, i.e., the sum of the number of non-zero entries in , , and , as this affects the constant in the asymptotic complexity and has noticeable impact on empirical performance [3]. Thus, the parameters of interest for these algorithms is a performance-stability 3-tuple (nnz, , ). In general, the number of non-zeros is positively correlated with and , since these quantities directly depend on the non-zero patterns of , , and (see LABEL:eqn:emax and LABEL:eqn:qmax).

We first examined the base case , which has out-performed Strassen’s algorithm in practice [3]. We found 479 algorithms with rank using numerical low-rank tensor decomposition search techniques [3]. Of these, there were 208 performance-stability tuples. The smallest nnz, , and quantities over all algorithms were 130, 12, and 32, and the corresponding algorithms had performance-stability tuples (130, 14, 34), (138, 12, 34), and (134, 13, 32). No algorithm we found had parameters that achieved more than one of these minima, so we call these three algorithms semi-optimal. Consequently, there is a theoretical trade-off between performance and stability. We note that although this list of algorithms is not exhaustive, they are the only publicly available algorithms.222All of our algorithms, as well as the software for finding them, are publicly available at https://github.com/arbenson/fast-matmul.

We tested the stability of these algorithms by computing the product of samples of random matrices and . The distributions were , and , . In addition to the three semi-optimal algorithms described above, we also compared against an algorithm with a much worse performance-stability tuple of (156, 26, 132), which we call a sub-optimal algorithm. For each pair of matrices, we ran the four algorithms with number of recursive levels . Our goal here is to compare the errors of different algorithms with the same base case and varying number of recursive levels—we are not claiming that any of these algorithms are the best to use for these problem dimensions.

To estimate , we computed using the classical algorithm in quadruple precision arithmetic. All other computations used double precision arithmetic. Overall, we computed the errors for 100 random pairs and for each distribution. LABEL:fig:errors423 reports the maximum error over the 100 trials for each algorithm and each number of recursive levels as well as the upper bound on the error from LABEL:thm:stationary_analysis. We see the following results:

1. The error bounds are still pessimistic, even with the improved analysis from LABEL:thm:stationary_analysis. Furthermore, the error bounds for the three semi-optimal algorithms are quite similar.

2. The true error increases with the number of recursive levels, as predicted by LABEL:thm:stationary_analysis and modeled by the error bound.

3. For both distributions, the sub-optimal algorithm has larger errors than the semi-optimal algorithms, as modeled by the error bound.

4. The difference between the semi-optimal algorithms depends on the matrices. For the distribution, there is a clear difference in error between the algorithms. Interestingly, the (134, 13, 32) semi-optimal algorithm has larger errors than the (130, 14, 34), even though the former algorithm has strictly better and parameters. For the distribution, the errors of the semi-optimal algorithms are practically indistinguishable.

We also considered the base case, which has optimal rank  [5]. One known algorithm that achieves the optimal rank uses Strassen’s algorithm on a sub-block and classical matrix multiplication on the remaining sub-blocks. The base case of the algorithm is small enough so that we could use a SAT solver [10] to find over 10,000 rank 11 algorithms (ignoring symmetries). We found that the combination of Strassen’s algorithm with the classical algorithm had a strictly smaller performance-stability triple than all of the other rank 11 solutions. We conclude that this algorithm is likely optimal in both a performance and a stability sense for the class of algorithms where the scalar multiplications are .

### 5.2 Searching for better non-uniform, non-stationary algorithms

Stationary algorithms benefit from their simplicity, but non-uniform, non-stationary algorithms provide a broader search space for algorithms with better numerical properties. We provide several examples below.

###### Example 6.

D’Alberto [9] describes a non-uniform, non-stationary approach using Strassen’s algorithm that obtains a smaller stability factor than the original stationary algorithm (for ). Strassen’s algorithm, with as given in LABEL:app:strassen, has stability vector ; two levels of recursion with a stationary approach yields a two-level stability vector of with maximum entry . D’Alberto shows that, for , a stability factor of 96 can be obtained with a non-uniform approach using one variant of Strassen’s algorithm. One way to achieve this stability factor is to use the alternative algorithm

for nodes , , and of the recursion tree, while using the original algorithm at nodes , , , , and . Similar improvements can be made based on the Strassen-Winograd algorithm, which has a slightly larger stability factor.

A more generic non-uniform approach is described in a patent by Castrapel and Gustafson [8]. They consider eight variants of the Strassen-Winograd algorithm, defined by

with . The correctness of these variants can be derived from the work of Johnson and McLoughlin [17, Equation (6)]. Castrapel and Gustafson suggest using random, round-robin, or matrix-dependent selections of algorithms to more evenly distribute the error, but they do not prove that any particular techniques will reduce the stability factor.

###### Example 7.

We can improve the two-level stability factor for the case in a similar manner. The smallest stability factor we have discovered for this case is , given by the in LABEL:app:fast323, which has stability vector

 e=[202021242041220].

Compared to a uniform two-level stability factor of , we can achieve a stability factor of 352 using 3 variants of the algorithm. We use the original algorithm at nodes , , , , , and , the variant

at nodes , , , and , and the variant

at nodes , , , , , and . We suspect that better two-level stability factors are achievable.

### 5.3 Performance and stability trade-offs with a small number of recursive levels

In addition to searching for better algorithms, we may also consider the effect of the number of recursive levels on the numerical stability. We now consider the performance and stability of fast matrix multiplication algorithms across several base cases and several values of . LABEL:tab:principal_quantities summarizes the best known (to us) stability factors () for several practical base case dimensions. The columns of the table represent the relevant performance and stability parameters for each algorithm, all of which can be computed from the representation.

The rank and the number of nonzeros (nnz), along with the number of recursive levels used, determine the number of floating point operations performed by the stationary version of the algorithm. The rank can be compared to the product , the rank of the classical algorithm for that base case. The quantities and are computed using LABEL:def:stability and LABEL:def:prefactor, respectively; for a given base case we report the algorithm with the best known along with that algorithm’s . We do not report both and because the best algorithms for each have identical nnz, , and parameters, due to transformations corresponding to transposition of the matrix multiplication.

Although we stress that these algorithms will be used with only a few levels of recursion, we also report the asymptotic stability exponent (stab. exp.) in order to compare algorithms across different base case dimensions. If an algorithm for a square base case is used on square matrices of dimension down to subproblems of constant dimension, the bound of LABEL:thm:stationary_analysis can be simplified to

 ∥^C−C∥≤c⋅NlogN0ElogN∥A∥∥B∥ϵ+O(ϵ2), (12)

where is a constant that depends in part on . In this case, the stability exponent is . We note that the first two rows of LABEL:tab:principal_quantities match the results of Bini and Lotti [4, Table 2]. The most stable rank-23 algorithm of which we are aware is a cyclic rotation of the one given by Smirnov [24]. In the case of rectangular base cases , we assume a uniform, non-stationary algorithm based on cyclic use of algorithms for , , and , where the three recursive rules are transformations of each other, either by cyclic rotations or transposition (for more details, see LABEL:app:fast442 and LABEL:app:fast323).

LABEL:fig:stability_performance shows the distribution of relative instability and percentage of classical flops for the algorithms in LABEL:tab:principal_quantities, for . We measure both terms asymptotically. Ignoring the quadratic cost of additions, the percentage of classical flops is given by . For large matrix dimension and small, we can ignore by LABEL:thm:stationary_analysis, and we define the relative instability to be , which is the factor by which the error bound exceeds that of the classical algorithm. In general, most algorithms follow a narrow log-linear trade-off between these two parameters. However, there is still room to select algorithms for a fixed number of recursion levels. For example, with , the algorithm has roughly the same stability and does fewer floating point operations than Strassen’s algorithm.

## 6 Scaling

We now turn our attention to strategies for pre- and post-processing matrices in order to improve numerical stability. The error bounds from LABEL:sec:bounds can be summarized by the following element-wise absolute error bound

 |cij−^cij|≤falg(K)