Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations
Abstract
Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrixvector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent handcrafting these algorithms and implementations is necessary, what structural priors they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrixvector multiplication as products of sparse matrices, we introduce a parameterization of divideandconquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the CooleyTukey FFT algorithm to machine precision, for dimensions up to . Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hiddenlayer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR10 by 3.9 points—the first time a structured approach has done so—with 4X faster inference speed and 40X fewer parameters.
1 Introduction
Structured linear transformations, such as the discrete Fourier transform (DFT), discrete cosine transform (DCT), and Hadamard transform, are a workhorse of machine learning, with applications ranging from data preprocessing, feature generation, and kernel approximation, to image and language modeling (convolutions). To date, these transformations rely on carefully designed algorithms, such as the famous fast Fourier transform (FFT) algorithm, and on specialized implementations (e.g., FFTW and cuFFT). Moreover, each specific transform requires handcrafted implementations for every platform (e.g., Tensorflow and PyTorch lack the fast Hadamard transform), and it can be difficult to know when they are useful. Ideally, these barriers would be addressed by automatically learning the most effective transform for a given task and dataset, along with an efficient implementation of it. Such a method should be capable of recovering a range of fast transforms with high accuracy and realistic sizes given limited prior knowledge. It is also preferably composed of differentiable primitives and basic operations common to linear algebra/machine learning libraries, that allow it to run on any platform and be integrated into modern ML frameworks such as PyTorch/Tensorflow. More fundamentally, this problem ties into the foundational question of understanding the minimal prior knowledge needed to learn highspeed systems, in the spirit of modern trends toward relaxing manually imposed structure (i.e., AutoML). Recent progress in this vein of learning computational primitives includes addition/multiplication gates [41] and the Strassen matrix multiplication algorithm [42].
We propose a method that addresses this problem for a class of important transforms that includes the aforementioned examples. A key challenge lies in defining or parameterizing the space of transforms and corresponding fast algorithms, which requires using a minimal amount of prior knowledge that captures important and interesting transforms while remaining learnable and efficient. Egner & Püschel [13, 14] previously posed this question and a novel combinatorial approach, but their solution only addresses a limited set of transforms (primarily DFT) and only on limited problem sizes. In particular, these approaches search through an exponentially large discrete space using a symbolic form of the matrix [13, 14] and recover the solution only up to dimensions . We instead draw two key lessons from the work of De Sa et al. [8], who characterize matrices with efficient matrixvector multiplication algorithms as being factorizable into products of sparse matrices.^{1}^{1}1This characterization was equivalently known in the language of arithmetic circuits [3]. Thus, the task of learning algorithms can be reduced to finding appropriate sparse matrix product representations of the transforms. They further show that divideandconquer schemes lead to fast multiplication algorithms for a surprisingly general set of structured matrices. Motivated by the broad applicability of this recursive structure, we propose a particular factorization using sequences of special block diagonal matrices, called butterfly matrices. Specific instances of butterfly structure have been used before—for example to implement the FFT algorithm or as a random orthogonal preconditioner [34] or projection [30]—but we use a relaxed representation that captures a much larger class of structures and can learn from data. This implicitly models a class of structured matrices with parameters and automatic fast multiplication in operations.
We empirically validate our method in two ways. First, we consider a specification of a transform (e.g., inputoutput pairs) and attempt to factorize it. We successfully recover a fast algorithm up to machine precision for several important transforms such as the DFT, Hadamard, DCT, and convolution for realistic sizes (dimensions up to ), while standard sparse and lowrank baselines cannot (Section 4.1). Beyond recovering famous transforms, we additionally incorporate this method in endtoend ML pipelines to learn fast and compressible latent transformations (Section 4.2). On the benchmark single hidden layer network, this parameterization exceeds the classification accuracy of a baseline fully connected layer on several datasets—such as by 3.9 points on CIFAR10 while using 40X fewer parameters—which is to our knowledge the first time a structured model has outperformed the unconstrained model for this task on a realistic dataset [40]. We also find that the addition of a lightweight butterfly layer improves the accuracy of a modern ResNet architecture by 0.43 points.
Finally, our method is simple with an easily implementable fast algorithm. We compare the training and inference speed of our implementation to specialized implementations of discrete transforms (Section 4.3). Our generic representation comes within 35X of implementations for specific transforms such as the DFT and DCT, while still being capable of learning a rich class of more general transforms.
2 Related Work
Fast transforms are crucial and ubiquitous in the machine learning pipelines, from data preprocessing, feature generation, and dimensionality reduction to compressing models. For example, the DFT and DCT form the basis of the melfrequency cepstral coefficients (MFCCs), a standard feature representation for speech recognition [20]. Stateoftheart kernel approximation methods leverage circulant matrices (i.e., convolution) [47] and the DFT and Hadamard transform [23, 48] for fast projection. Structured matrices, which are matrix representations of fast transforms, play a crucial role in designing fast neural network layers with few parameters [38, 10].
Given their importance, there have been significant efforts in finding more and more general classes of fast transforms. Traditional classes of structured matrices such as the Toeplitz, Hankel, Vandermonde, and Cauchy matrices are ubiquitous in engineering and signal processing [33]. These were generalized under the seminal notion of low displacement rank (LDR) introduced by Kailath et al. [21], and were later unified under a single class of displacement structure (the confluent Cauchylike matrices) introduced by Olshevsky & Shokrollahi [32] to solve the NevanlinnaPick interpolation problem. Another class of fast transforms that directly generalize the DFT and DCT are based on orthogonal polynomials [7], which find usage in areas from differential equations to optics. Both orthogonal polynomial transforms [12], and all of the previously introduced matrices with displacement rank structure, were significantly generalized under a single class by De Sa et al. [8]. Notably, almost all of the structured matrix classes mentioned here exhibit a form of recursive structure in their construction and superfast algorithms.
Since the product of sparse matrices immediately has a fast multiplication algorithm, the problem of sparse matrix factorization has been tackled in many settings. Sparse PCA [49] and dictionary learning [27] factor a matrix into two components, one of which is sparse. Sparse matrix factorization with more than two factors has also been considered, for example in the setting where the true matrix is the product of random sparse matrices [31], or in the context of learning multilayer sparse approximations [24, 25]. Our approach differs from these in that we focus on the recursive structure of the transforms—not just the sparsity of their factors—leading to sparse and structured transforms, and avoiding the discreteness problem inherent to learning sparsity.
Since most distinct transforms typically require significant work both to design fast algorithms and to efficiently implement them on different platforms, there have been attempts to automatically learn these fast algorithms. The field of algebraic signal processing [37] uses methods from representation theory of groups and algebras to automatically generate fast algorithms from the symbolic form of the transform matrix. However, these methods require search over a combinatoriallylarge discrete space, limiting their approaches to small matrices of size up to [14, 43]. Attempts to learn general algorithms such as matching [29], sorting [16], and traveling salesman [2] using differentiable architectures face a similar challenge of having to effectively explore a large discrete space. Thus, they only work for problems of size at most 100. By contrast, our approach relies on continuous optimization, and simplifies the discreteness of the problem into learning a simpler set of permutations, allowing us to recover fast algorithms for realistic dimensions.
Independently, there has been growing interest in compressed deep learning models, motivated by the goal of adapting them to resourceconstrained environments. A common approach for learning compressed models involves replacing the unconstrained weight matrices with a class of structured matrices and learning directly on the parametrization of that class. The most effective methods use matrix classes that are explicitly related to Fourier transforms [38], or employ highly specialized and complicated recursive algorithms [40]. As our method also implicitly defines a highly compressible subclass of matrices with linear parameters and efficient multiplication, it can be used as a dropin replacement for matrices in such endtoend ML models.
3 Recovering Fast Transforms
We now set up and describe our approach. We first reiterate the connection between fast algorithms and sparse matrix factorization, and briefly outline a quintessential divideandconquer algorithm (the FFT) as motivation.
We then elaborate the details of our method for learning particular recursive algorithms, including a core permutationlearning step that enables it to capture a wider range of structures. We also discuss the expressive power of these matrices, including which transforms they capture perfectly, and define a hierarchy of matrix classes built on butterflies that can theoretically capture richer recursive structures.
3.1 Preliminaries
Sparse factorizations
One method of constructing matrices with obvious fast matrixvector multiplication is as a product of sparse matrices, so that multiplication by an arbitrary vector will have cost proportional to the total sparsity of the matrices in the product.
Surprisingly, the converse is also true. The notion of sparse product width (SPW) [8], which roughly corresponds to the total sparsity of a factorization of a matrix, turns out to be equivalent to the length of the shortest straightline program describing a matrix (up to a constant). Hence it is an optimal descriptor of the algorithmic complexity of matrixvector multiplication on these types of models [3].
Given the general correspondence between sparse factorization and fast algorithms, we consider specific cases of discrete transforms and their recursive factorizations. This is a prototype for our parameterization of fast recursive algorithms in Section 3.2.
Case study: DFT
The Discrete Fourier Transform (DFT) transforms a complex input vector into a complex output vector by expressing the input in the basis of the complex exponentials:
Let denote a primitive th root of unity. The DFT can be expressed as matrix multiplication by the DFT matrix , where . The DFT of size can be reduced to two DFTs of size on the even indices and the odd indices:
where , , and is the diagonal matrix with entries . This recursive structure yields the efficient recursive CooleyTukey Fast Fourier Transform (FFT) algorithm. This computation can be written as a matrix factorization
where is the identity matrix, and the last factor is the permutation matrix that separates the even and the odd indices (e.g., mapping to ) (see Figure 2).
Unrolling the recursion, we obtain:
(1)  
The product of all the matrices on the left is called a butterfly matrix, and each factor is a block matrix of diagonal matrices called a butterfly factor. Figure 1 illustrates the sparsity pattern of the structured butterfly factors. One can also combine the block of permutation matrices on the right to obtain one permutation called the bitreversal permutation, which sorts the indices by the reverse of their binary representation (e.g. ).
Other transforms have similar recursive structure but differ in the entries of , and in the permutation. For example, the DCT involves separating the even and the odd indices, and then reversing the second half (e.g., ).
Appendix A provides some examples of how important transforms, such as the DFT, DCT, Hadamard, and convolutions, can factor as similar products of sparse matrices.
3.2 Recovering Fast Transform Algorithms
Many previous works attempt to compress generic matrices by sparsifying them. We note that allowing for products of matrices with a total sparsity budget is strictly more expressive than a single matrix with that sparsity, while retaining the same compression and computation complexity. Therefore one can hope to recover all fast algorithms by learning over the set of matrix products with a total sparsity budget. However, this parameterization is difficult to learn due to the discreteness of the sparsity constraint (Section 1,2). We instead use a class of matrices built as products of specific factors that captures the recursive nature of many fast algorithms.
A butterfly parametrization
For simplicity, we assume that the input size is a power of 2.^{2}^{2}2Otherwise, the input can be padded with zeros. Let the input vector be and let the linear transform of size be . Let be the matrix representation of the transform , where the field is either or . A general recursive structure is to separate the input vector into two halves by some permutation, apply the transform on each half, and combine the result in a linear manner by scaling by an diagonal matrix and adding the results. Written as matrix factorization:
where is some permutation matrix and are diagonal matrices. Inspired by the factors of the FFT, we call the matrix a butterfly factor, denoted by . Unrolling the recursion as in equation (1) gives the factorization , where is a butterfly matrix and is a permutation that can be written as the product of simpler block permutations. We also consider composing this module, hence learn either
(2) 
which we term the BP and the BPBP parametrization respectively. One dimensional convolutions (i.e. circulant matrices) are notably captured by BPBP, since they can be computed via an FFT, a componentwise product, then an inverse FFT (see Appendix A).
In the case of the FFT, as in Section 3.1, the entries of the butterfly factors are called twiddle factors, and the combined permutation is called the bitreversal permutation.
Learning a recursive permutation
The butterfly blocks in the BP or BPBP parametrization have a fixed sparsity pattern and their parameters can be directly optimized. However, the transforms we are interested in capturing frequently require different permutations as part of the “divide” step, which form a set of discrete objects that we must consider. We will restrict to learning over permutations that have a simple structure often encountered in these algorithms: we assume that the distribution factors into steps following the recursive layers. At each step in the recursion, the permutation is allowed to either keep the first half and second half intact or separate the even and the odd indices (e.g., ). Then, it can choose to reverse the first half (e.g., ) and can choose to reverse the second half (e.g., ). Thus at each step, there are 3 binary choices and hence 8 possible permutations. These are illustrated in Figure 2, where denotes the permutation matrix on elements that separates the even and odd elements, denotes the permutation matrix that reverses the first half, and denotes the permutation matrix that reverses the second half.
Instead of searching over discrete permutations, we parameterize the permutation as a categorical distribution of these permutations. The permutation at step is thus chosen as a convex combination of the possible choices:
This can be learned by representing this probability distribution for example via logits and the softmax.
We make the further simplification that the probabilities factor into the three components; conceptually, that the choices of choosing to be part of the product are independent of each other. This results in the representation
(3) 
Thus we learn the permutation via equation (3) by optimizing over logits and setting , where is the sigmoid function.
To encourage the distribution over permutations to be peaked, one can add entropy regularization [15] or semantic loss [44]. However, in our experiments below in Section 4.1 and Section 4.2, we found that these tricks are not necessary. For example, the learned transforms in Section 4.1 typically have weights .
Initialization
As the BP or BPBP construction is a product of many matrices, proper initialization is crucial to avoid exponential blowup in the size of the entries or condition numbers (i.e., the exploding/vanishing gradient problem [35]). We aim to initialize each butterfly factor to be close to unitary or orthogonal, so that the magnitude of the inputs and outputs to the transform are preserved. Note that each of the factors has exactly two nonzeroes in each row and column. Thus if each entry is initialized independently with mean and standard deviation , then . Hence we initialize each entry of to . For complex entries, we initialize the real and imaginary components independently from to maintain the same norm scale.
Comparison to related methods
Some previous works have examined similar butterfly matrices in numerical algebra or machine learning [34, 19, 30], mainly motivated by trying to parametrize cheap orthogonal matrices. Our parametrization, motivated by the goal of learning recursive transforms, differs in several ways from all previous works: {enumerate*}[label=0.]
We explicitly model and learn a permutation matrix P.
Our relaxation does not enforce the matrix to be orthogonal.
Our butterfly factors are ordered so that closer elements interact first (Figure 1), whereas some works (e.g.,[30]) reverse the order.
Every work has a different weighttying scheme; ours ties the blocks in each butterfly factor, leading to fewer parameters and a tighter recursive interpretation than for example [19].
Our main baseline for deep learning experiments is Thomas et al. [40], who define a special matrix class with a complicated recursive algorithm. While our BP method and theirs share some overlap (e.g., they both capture circulant matrices), they have a distinct parametrization, and the exact relation between the BP hierarchy and their LDRSD or LDRTD classes is unknown. From a practical standpoint, BP is significantly faster and simpler to implement than their methods.
3.3 Expressivity and the butterfly hierarchy
The butterfly matrix has a total of learnable parameters (the butterfly factors , , …, have , , …, entries respectively). The overall permutation has learnable parameters; we can also tie the logits of the probabilistic permutations—reflecting the fact that for some algorithms the reduction from size to is selfsimilar to the reduction from size to —reducing to just parameters.
We can define a natural hierarchy of matrix classes built on the BP primitive. This hierarchy covers a spectrum ranging from extremely structured matrices with a linear number of parameters, to the entire space of square matrices.
Definition 1.
For any dimension , let (BP) () denote the classes of matrices that can be expressed as
where each is a BP module as in equation (2), and (that is, and select the upper left entries of the BP product matrix). The subscript is understood to be if omitted.
Note that the BP and BPBP classes are equivalent to (BP) and (BP) respectively. We remark that and are both capable of being the identity, and thus (BP) (BP).
The BP hierarchy is expressive enough to theoretically represent many important transforms with low depth, as well as all matrices with linear depth:
Proposition 1.
(BP) captures the fast Fourier transform, the fast Hadamard transform, and their inverses exactly. (BP) captures the DCT, DST, and convolution exactly. All matrices are contained in (BP).
Proposition 1 is shown in Appendix B. We suggest some additional conjectures about the expressiveness of the BP hierarchy in Appendix D.
Even though the BP parameterization is expressive, it still retains the learnability characteristic of compressed parameterizations. In fact, neural networks comprising layers of BP and BPBP matrices still have VC dimension that is almost linear in the number of parameters (Appendix B), similar to networks with fullyconnected layers [1, 17] and LDR [40], which implies a corresponding sample complexity bound.
4 Empirical Evaluation
We evaluate the proposed approach to verify that our butterfly parameterization can both recover fast transforms and be integrated as an effective component in ML pipelines. In Section 4.1, we confirm that it automatically learns the fast algorithms for many discrete transforms commonly used in signal processing and machine learning. Section 4.2 further shows that it can be a useful component to increase the performance of deep learning models while ensuring fast multiplication and few parameters by design.
4.1 Discrete Transforms
Below we list several important classes of structured matrices. Some of them are directly captured by our parametrization and we expect that they can be recovered close to perfectly, thus providing a algorithm that closely approximates the naive matrix multiplication. Others are not perfectly captured by the BPBP class but still have recursive structure; for these, we expect that our method reconstructs them better than standard matrix compression methods (sparse, lowrank, and combinations) can.
Transforms
We describe the matrices we evaluate on and their applications; a standard reference is Proakis [36]. Their explicit formulas are in Appendix A, Table 3.

Discrete Fourier transform (DFT): arguably the most important computational tool in signal processing, the FFT is one of the top 10 algorithms of the 20th century [11].

Discrete cosine transform (DCT): it expresses the input vector in the basis of cosine functions. It finds use in lossy compression of audio (MP3) and images (JPEG), in speech processing, and in numerical methods of solving partial differential equations (PDEs).

Discrete sine transform (DST): similar to the DCT, it expresses the input vector as a linear combination of sine functions. It is widely employed in spectral methods to solve PDEs.

Convolution: widely used in statistics, image processing, computer vision, and natural language processing.

Hadamard transform: commonly used in quantum information processing algorithms, and in ML as a fast random projection or kernel approximation method.

Discrete Hartley transform: similar to the DFT, but it transforms real inputs to real outputs. It was designed as a more efficient option than the DFT in the common case of real data.
Methods
We assume that the transform is fullyspecified, e.g., from linearly independent inputoutput pairs from which the matrix representation can be computed.
To recover the fast algorithm of the transform, we wish to approximate with the product of one or more blocks of butterfly and permutation products, by minimizing the Frobenius norm of the difference:
(4) 
By design, this factorization yields a fast algorithm for the transform.
We also compare to standard baselines for matrix factorization, maintaining the same total sparsity budget (i.e. computation cost of a multiplication) for each:

Sparse: this is the same as choosing the largest entries where is the sparsity budget.

Lowrank: the sparsity budget is used in the parameters of the lowrank factors, which can be found with a truncated SVD.

Sparse + lowrank: is minimized, where is sparse and is lowrank, by solving a convex problem.^{3}^{3}3Although there is an extra addition, this can also be written as a sparse product of 3 matrices by adding auxiliary identity blocks. This is commonly known as robust PCA [4].
Experimental procedure
We use the Adam optimizer [22] to minimize the Frobenius norm of the error, and use Hyperband [26] to automatically tune the hyperparameters (learning rates, random seed for initialization). The runs are stopped early if the average per entry difference (aka RMSE) is low enough: we consider RMSE below 1e4 (corresponding to the objective in (4) below 1e8, while we use 32bit floats with precision around 6e8) to mean that we successfully recover the fast algorithms for the transforms to machine precision. For consistency, we consider the unitary or orthogonal scaling of these transforms such that they have norm on the order of 1.0. For the DCT and DST, we add another simple permutation for extra learnability. All transforms considered learn over BP except for convolution which uses BPBP. All methods are optimized over complex entries.
Since the forward mapping of our butterfly parameterization is differentiable with respect to the entries of the butterfly matrices and the logits of the permutations, gradients are easily obtained with the help of an autodifferentiation framework. We provide our code in PyTorch.
Quality
Figure 3 visualizes the lowest error found by Hyperband for various matrix dimensions and several methods. Full numerical results are provided in Appendix C. As shown, we successfully recover the fast algorithms for these transforms up to for convolution and for other transforms. For example, the matrix factorization procedure recovers the bitreversal permutation applied at the beginning of the CooleyTukey fast Fourier transform. It also discovers many other unconventional permutations that also lead to exact factorization of the FFT.
We note that there are other transforms not captured by our parameterization. Orthogonal polynomial transforms, such as the discrete Legendre transform (DLT), are known only to have fast algorithms. They follow a slightly more general divideandconquer decomposition that we elaborate on in Appendix A.6. As expected, we find that the butterfly parameterization does not perfectly capture the DLT, but does recover it slightly better than the baselines.
Figure 3 also includes a baseline row factoring a matrix of appropriately scaled i.i.d. Gaussian entries, to indicate typical errors for factoring an unstructured matrix.
4.2 Neural Network Compression
Many structured matrix approaches have been proposed to replace fullyconnected (FC) layers of neural networks, to speed up training and inference, and to reduce the memory consumption. These structured matrices are cleverly designed by combining commonly used fast transforms. For example, Fastfood [23] and Deep Fried Convnets [45] compose the fast Hadamard transform and fast Fourier transforms, and Sindhwani et al. [38] use Toeplitzlike matrices that can be written as a sequence of 2 or 4 FFTs. However, the design choice for these lightweight replacement layers is restricted by the set of known and implementable transforms.
On the first benchmark task of compressing a single hidden layer model, the real version of BPBP has better classification accuracy than a fullyconnected layer on all datasets tested, and uses more than 56X fewer parameters (Table 1); the complex version performs even better with a slight parameter increase. The previous best methods fail to achieve this on the more challenging CIFAR10 dataset at the same parameter budget [40]. We further demonstrate that this layer is effective as a lightweight addition to a largerscale ResNet architecture.
Fully connected
Previous work showed that structured matrix approaches based on the low displacement rank framework, including Toeplitzlike [38], LDRSD and LDRTD matrices [40], compare very favorably to other compression approaches. Following previous experimental settings [5, 38, 40], we compare our proposed classes to several baselines using dense structured matrices to compress the hidden layer of a single hidden layer neural network. Competing methods include simple lowrank factorizations [9], circulant matrices (equivalent to 1dimensional convolutions) [6], the adaptive Fastfood transform [45], and low displacement rank methods [38, 40] which implicitly define a structured matrix through a displacement equation and admit specialized fast divideandconquer algorithms [8]. Our implementation is built on top of the publicly available implementation of Thomas et al. [40] with the same hyperparameters, and we report their numbers for the competing baseline methods directly. We test on the three main datasets from Thomas et al. [40]: two challenging variants of MNIST—one with randomly rotated images and random background, the other with correlated background noise—and the standard CIFAR10 dataset.
Method  MNISTbgrot  MNISTnoise  CIFAR10  Compression factor 
Unstructured  44.08  65.15  46.03  1 
BPBP (complex, fixed permutation)  46.26  77.00  49.93  39.4 
BPBP (real, fixed permutation)  46.16  75.00  48.69  56.9 
LDRTD [40]  45.81  78.45  45.33  56.9 
Toeplitzlike [38]  42.67  75.75  41.78  56.9 
Fastfood [45]  38.13  63.55  39.64  78.7 
Circulant [6]  34.46  65.35  34.28  93.0 
Lowrank [9]  35.67  52.25  32.28  56.9 
Table 1 reports results for variants of our butterfly parametrization, compared to the unstructured matrix baseline and other structured matrix approaches. Notably, the butterfly methods achieve higher classification accuracy than the fullyconnected layer on all datasets and are highly competitive with the other approaches.
We note that improvements over unconstrained matrices can arise from lower generalization error due to fewer parameters (relating to VC bounds, Proposition 2), or better inductive bias encoded by the structured class. For example, convolutions are important in image tasks due to encoding shift equivariance, and Thomas et al. [40] hypothesize that their structured classes improve over FC layers through imposing approximate equivariance to more general transformations. Since our BP parametrization can represent arbitrary convolutions, it can encode these important priors.
ResNet
In addition to the standard single hidden layer benchmarks, we test the effect of using butterfly layers in a standard ResNet18 [18] implementation on the CIFAR10 dataset. This architecture is normally fully convolutional, ending with a FC layer of dimensions before the softmax. However, we experiment with adding an additional FC or structured layer right before this final FC layer. Table 2 shows that the ResNet18 architecture can benefit from an additional fully connected layer, and using a BPBP layer instead improves performance even more while adding a negligible (0.07% increase) number of parameters to the original model.
Last layer  None  FC  BPBP 

Accuracy  93.58 0.15  93.89 0.19  94.01 0.09 
4.3 Training and Inference Speed Comparison
By design, the BP parameterization yields a fast algorithm of complexity , no matter which transform it learns. Moreover, given the parameters of the BP model, it is easy to implement this fast algorithm (this can be done in 5 lines of Python, and our code provides a function to do this automatically). The BP parameterization captures many common transforms (Section 4.1), and its implementation makes no transformspecific optimizations. Nevertheless, our simple implementation is surprisingly competitive with handtuned kernels both for training and for inference (after the parameters of the BP model are learned and we wish to evaluate for new input ). In Figure 4, we compare the speed of the BP fast multiplication against specialized implementation of common transforms such as the FFT, DCT, and DST (all have complexity ), using dense matrix vector multiply (GEMV, complexity as a baseline. For training with realistic input sizes and batch size 256 on GPU, the training time (forward and backward) of butterfly matrix is 15% faster than dense matrix multiply (GEMM from cuBLAS) and within 40% of FFT (from cuFFT). For inference on CPU, the BP fast multiplication can be one or two orders of magnitude faster than GEMV, is within a factor of 5 of the FFT, and is within a factor of 3 of the DCT and the DST, across a range of input sizes. The GEMM/GEMV and the FFT are two of the most heavily tuned numerical routines.
5 Conclusion
We address the problem of automatically learning fast algorithms for a class of important linear transforms, through a parameterization of recursive algorithms via butterfly factorizations. We validate our method by learning transforms including the DFT, DCT, Hadamard transform, and convolutions up to machine precision and dimension . Finally, we show that the same method yields consistent performance improvements and substantial compression and speed increases as a component of endtoend machine learning models.
Acknowledgments
We thank Maximilian Lam for his help with early experiments.
We gratefully acknowledge the support of DARPA under Nos. FA87501720095 (D3M) and FA86501827865 (SDH), NIH under No. N000141712266 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity) and CCF1563078 (Volume to Velocity), ONR under No. N000141712266 (Unifying Weak Supervision), the Moore Foundation, NXP, Xilinx, LETICEA, Intel, Google, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, and American Family Insurance, and members of the Stanford DAWN project: Intel, Microsoft, Teradata, Facebook, Google, Ant Financial, NEC, SAP, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.
References
 Bartlett et al. [1999] Bartlett, P. L., Maiorov, V., and Meir, R. Almost linear VC dimension bounds for piecewise polynomial networks. In Advances in Neural Information Processing Systems, pp. 190–196, 1999.
 Bello et al. [2016] Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. 2016.
 Bürgisser et al. [2013] Bürgisser, P., Clausen, M., and Shokrollahi, M. A. Algebraic complexity theory, volume 315. Springer Science & Business Media, 2013.
 Candès et al. [2011] Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
 Chen et al. [2015] Chen, W., Wilson, J., Tyree, S., Weinberger, K., and Chen, Y. Compressing neural networks with the hashing trick. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2285–2294, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/chenc15.html.
 Cheng et al. [2015] Cheng, Y., Yu, F. X., Feris, R. S., Kumar, S., Choudhary, A., and Chang, S.F. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2857–2865, 2015.
 Chihara [2011] Chihara, T. An introduction to orthogonal polynomials. Dover Books on Mathematics. Dover Publications, 2011. ISBN 9780486479293. URL https://books.google.com/books?id=IkCJSQAACAAJ.
 De Sa et al. [2018] De Sa, C., Gu, A., Puttagunta, R., Ré, C., and Rudra, A. A twopronged progress in structured dense matrix vector multiplication. In Proceedings of the TwentyNinth Annual ACMSIAM Symposium on Discrete Algorithms, pp. 1060–1079. SIAM, 2018.
 Denil et al. [2013] Denil, M., Shakibi, B., Dinh, L., De Freitas, N., et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
 Ding et al. [2017] Ding, C., Liao, S., Wang, Y., Li, Z., Liu, N., Zhuo, Y., Wang, C., Qian, X., Bai, Y., Yuan, G., et al. CirCNN: accelerating and compressing deep neural networks using blockcirculant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408. ACM, 2017.
 Dongarra & Sullivan [2000] Dongarra, J. and Sullivan, F. Guest editorsâ introduction: The top 10 algorithms. Computing in Science & Engineering, 2(1):22–23, 2000.
 Driscoll et al. [1997] Driscoll, J. R., Healy, Jr., D. M., and Rockmore, D. N. Fast discrete polynomial transforms with applications to data analysis for distance transitive graphs. SIAM J. Comput., 26(4):1066–1099, August 1997. ISSN 00975397. doi: 10.1137/S0097539792240121. URL http://dx.doi.org/10.1137/S0097539792240121.
 Egner & Püschel [2001] Egner, S. and Püschel, M. Automatic generation of fast discrete signal transforms. IEEE Transactions on Signal Processing, 49(9):1992–2002, 2001.
 Egner & Püschel [2004] Egner, S. and Püschel, M. Symmetrybased matrix factorization. Journal of Symbolic Computation, 37(2):157–186, 2004.
 Grandvalet & Bengio [2005] Grandvalet, Y. and Bengio, Y. Semisupervised learning by entropy minimization. In Saul, L. K., Weiss, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 17, pp. 529–536. MIT Press, 2005.
 Grover et al. [2019] Grover, A., Wang, E., Zweig, A., and Ermon, S. Stochastic optimization of sorting networks via continuous relaxations. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1eSS3CcKX.
 Harvey et al. [2017] Harvey, N., Liaw, C., and Mehrabian, A. Nearlytight VCdimension bounds for piecewise linear neural networks. In Kale, S. and Shamir, O. (eds.), Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pp. 1064–1068, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR. URL http://proceedings.mlr.press/v65/harvey17a.html.
 He et al. [2016] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Jing et al. [2017] Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., and Soljačić, M. Tunable efficient unitary neural networks (eunn) and their application to rnns. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1733–1741. JMLR. org, 2017.
 Jurafsky & Martin [2014] Jurafsky, D. and Martin, J. H. Speech and language processing, volume 3. Pearson London, 2014.
 Kailath et al. [1979] Kailath, T., Kung, S.Y., and Morf, M. Displacement ranks of matrices and linear equations. Journal of Mathematical Analysis and Applications, 68(2):395–407, 1979.
 Kingma & Welling [2014] Kingma, D. P. and Welling, M. Autoencoding variational bayes. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April 2014.
 Le et al. [2013] Le, Q., Sarlos, T., and Smola, A. Fastfood  computing Hilbert space expansions in loglinear time. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 244–252, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/le13.html.
 Le Magoarou & Gribonval [2015] Le Magoarou, L. and Gribonval, R. Chasing butterflies: In search of efficient dictionaries. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3287–3291, April 2015. doi: 10.1109/ICASSP.2015.7178579.
 Le Magoarou & Gribonval [2016] Le Magoarou, L. and Gribonval, R. Flexible multilayer sparse approximations of matrices and applications. IEEE Journal of Selected Topics in Signal Processing, 10(4):688–700, June 2016. ISSN 19324553. doi: 10.1109/JSTSP.2016.2543461.
 Li et al. [2017] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel banditbased approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
 Mairal et al. [2009] Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., and Bach, F. R. Supervised dictionary learning. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21, pp. 1033–1040. Curran Associates, Inc., 2009. URL http://papers.nips.cc/paper/3448superviseddictionarylearning.pdf.
 Makhoul [1980] Makhoul, J. A fast cosine transform in one and two dimensions. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1):27–34, February 1980. ISSN 00963518. doi: 10.1109/TASSP.1980.1163351.
 Mena et al. [2018] Mena, G., Belanger, D., Linderman, S., and Snoek, J. Learning latent permutations with GumbelSinkhorn networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Byt3oJ0W.
 Munkhoeva et al. [2018] Munkhoeva, M., Kapushev, Y., Burnaev, E., and Oseledets, I. Quadraturebased features for kernel approximation. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9165–9174. Curran Associates, Inc., 2018.
 Neyshabur & Panigrahy [2013] Neyshabur, B. and Panigrahy, R. Sparse matrix factorization. arXiv preprint arXiv:1311.3315, 2013.
 Olshevsky & Shokrollahi [2000] Olshevsky, V. and Shokrollahi, M. A. Matrixvector product for confluent cauchylike matrices with application to confluent rational interpolation. In Proceedings of the ThirtySecond Annual ACM Symposium on Theory of Computing, May 2123, 2000, Portland, OR, USA, pp. 573–581, 2000. doi: 10.1145/335305.335380. URL http://doi.acm.org/10.1145/335305.335380.
 Pan [2001] Pan, V. Y. Structured Matrices and Polynomials: Unified Superfast Algorithms. SpringerVerlag New York, Inc., New York, NY, USA, 2001. ISBN 0817642404.
 Parker [1995] Parker, D. S. Random butterfly transformations with applications in computational linear algebra. 1995.
 Pascanu et al. [2013] Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318, 2013.
 Proakis [2001] Proakis, J. G. Digital signal processing: principles algorithms and applications. Pearson Education India, 2001.
 Puschel & Moura [2008] Puschel, M. and Moura, J. M. Algebraic signal processing theory. IEEE Transactions on Signal Processing, 56(8):3572–3585, 2008.
 Sindhwani et al. [2015] Sindhwani, V., Sainath, T., and Kumar, S. Structured transforms for smallfootprint deep learning. In Advances in Neural Information Processing Systems, pp. 3088–3096, 2015.
 Szegö [1967] Szegö, G. Orthogonal Polynomials. Number v. 23 in American Mathematical Society colloquium publications. American Mathematical Society, 1967. ISBN 9780821889527. URL https://books.google.com/books?id=3hcW8HBh7gsC.
 Thomas et al. [2018] Thomas, A., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning compressed transforms with low displacement rank. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9066–9078. Curran Associates, Inc., 2018.
 Trask et al. [2018] Trask, A., Hill, F., Reed, S. E., Rae, J., Dyer, C., and Blunsom, P. Neural arithmetic logic units. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 8046–8055. Curran Associates, Inc., 2018.
 Tschannen et al. [2018] Tschannen, M., Khanna, A., and Anandkumar, A. StrassenNets: Deep learning with a multiplication budget. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4985–4994. PMLR, 10–15 Jul 2018. URL http://proceedings.mlr.press/v80/tschannen18a.html.
 Voronenko & Puschel [2009] Voronenko, Y. and Puschel, M. Algebraic signal processing theory: Cooley–Tukey type algorithms for real DFTs. IEEE Transactions on Signal Processing, 57(1):205–222, 2009.
 Xu et al. [2018] Xu, J., Zhang, Z., Friedman, T., Liang, Y., and Van den Broeck, G. A semantic loss function for deep learning with symbolic knowledge. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5502–5511. PMLR, 10–15 Jul 2018. URL http://proceedings.mlr.press/v80/xu18h.html.
 Yang et al. [2015] Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476–1483, 2015.
 Ye & Lim [2016] Ye, K. and Lim, L.H. Every matrix is a product of toeplitz matrices. Foundations of Computational Mathematics, 16(3):577–598, Jun 2016. ISSN 16153383. doi: 10.1007/s102080159254z. URL https://doi.org/10.1007/s102080159254z.
 Yu et al. [2015] Yu, F. X., Kumar, S., Rowley, H. A., and Chang, S. Compact nonlinear maps and circulant extensions. CoRR, abs/1503.03893, 2015.
 Yu et al. [2016] Yu, F. X. X., Suresh, A. T., Choromanski, K. M., HoltmannRice, D. N., and Kumar, S. Orthogonal random features. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1975–1983. Curran Associates, Inc., 2016.
 Zou et al. [2006] Zou, H., Hastie, T., and Tibshirani, R. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.
Appendix A Matrix Factorizations of Linear Transforms
Table 3 summarizes the transforms considered in Section 4.1. In general, they transform a (real or complex) vector into another (real or complex) vector by expressing the input signal in terms of another set of basis.
Transform  Formula 

DFT  
DCT  
DST  
Convolution  
Hadamard  , 
Hartley  
Legendre  , 
Randn 
a.1 Discrete Cosine Transform (DCT) Matrix
The DCT of a vector is defined as
As described in Makhoul [28], the DCT of can be written in terms of the FFT of order . To do this, we permute into by separating the even and odd indices and reversing the odd indices (e.g. ), taking the FFT of to obtain , and multiplying each () by and taking the real part to get .
Written in terms of matrix factorization:
where takes the real part and is a permutation matrix (the permutation done at the beginning of the DCT). Recall that has the form
where is the bitreversal permutation matrix. can be combined with to form another butterfly factor . Thus the DCT has this factorization:
This is a BP factorization (with the additional final step of computing the real part) with the left BP performing the FFT and final scaling, the right butterfly matrix as the identity, and the right permutation matrix as the permutation at the beginning of the DCT.
a.2 Discrete Sine Transform (DST) Matrix
The DST of a vector is defined as
Just as with the DCT, we express the DST of in terms of the FFT of order . First, we permute into by separating the even and odd indices and reversing the odd indices (e.g. ). However, since sine is an odd function, we must negate those elements in the second half of . Next, we take the FFT of to obtain . Finally multiply each () by and take the real part to get .
Written in terms of matrix factorization:
where takes the real part, is the matrix and is a permutation matrix (the permutation done at the beginning of the DST). Recall that has the form
where is the bitreversal permutation matrix. We may combine with to obtain a new butterfly factor, which we call . Thus the DST has this factorization:
Note that any diagonal matrix (e.g. ) is trivially representable as a butterfly matrix. Hence, this factorization of the DST is a BP factorization (with the additional final step of computing the real part) with the left BP performing the FFT and final scaling, the right butterfly matrix as , and the right permutation matrix as the permutation at the beginning of the DST.
a.3 Hadamard Matrix
The Hadamard matrix (for powers of 2) is defined recursively as , and . Thus we have the recursive factorization:
which is a BP factorization with each butterfly factor, and with permutation matrix . Here, the entries of the butterfly factors may be real, instead of complex.
a.4 Convolution
Here we apply the decomposition of FFT to see if we can learn the decomposition of fast convolution.
Suppose we have a fixed vector and the linear map we’re interested in is . We can write this convolution with explicitly as a circulant matrix:
We can compute convolution by the DFT:
where denotes the inverse Fourier transform matrix where and denotes elementwise multiplication. Since is just some fixed vector, elementwise multiplication with is just multiplication by some fixed diagonal matrix . Then
Note that the inverse Fourier transform has the same algorithm, and thus the same factorization, as the Fourier transform (with different twiddle factors, instead of ). Hence, we can express
where is the bitreversal permutation. We may fold the into to obtain a new butterfly factor , and we may similarly fold the diagonal matrix into to obtain a new butterfly factor . Hence, our final factorization of convolution / the circulant matrix is :
which is a (BP) factorization.
Similarly, the skewcirculant matrix also lies in (BP):
a.5 Toeplitz Matrices
Let be the Toeplitz matrix:
Define to be:
Then, . Note that the inner matrix is a circulant matrix that can be decomposed into a (BP) factorization as described in A.4. Therefore, our final factorization for Toeplitz matrices is contained within (BP).
a.6 Orthogonal Polynomial Matrices
Although the ability to represent general orthogonal polynomial matrices in terms of butterfly matrices is left as an open problem, we nonetheless present an alternate sparse factorization.
Definition 2.
A family of polynomials is orthogonal over if:



for all
We say that is parameterized by real sequences (with and each ).
Definition 3.
Given a family of orthogonal polynomials , we may define the orthogonal polynomial matrix such that:
For sake of clarify, we formulate the decomposition using matrices of polynomials. We note that each polynomial entry with degree bounded by may be expanded into a Toeplitz convolution matrix if one desires matrices of real coefficients.
For a given family of orthogonal polynomials parameterized by , let be a transition matrix defined by:
For convenience of notation, let . Let be a transition product matrix defined by:
From these definitions, we see that for all ,
We use this to formulate the following decomposition of the orthogonal polynomial matrix .
(5) 
The first “stretched” identity matrix serves the function of selecting every other entry from the vector of polynomials to its right. We focus our attention on the middle matrix. Noting that = for any , we may represent this block matrix as:
(6) 
Notice that the left matrix in this last expression consists of two matrices with the same structure as the first expression, but of half the size. Hence, we may repeat the same decomposition on each of the submatrices. In general, the decomposition becomes:
(7) 
Discrete Legendre Transform
The Discrete Legendre Transform (DLT) of a vector is defined as:
where is the ’th Legendre polynomial. The Legendre polynomials are a family of orthogonal polynomials with:
Hence, the DLT may be factored as described above.
Appendix B Proofs
b.1 VC Dimension Bound for Neural Network with Butterfly Layers
Proposition 2.
Let denote the class of neural networks with layers, each is a butterfly layer using the BP or BPBP parameterization, with fixed permutation, total parameters, and piecewise linear activations. Let denote the corresponding classification functions, i.e. . The VC dimension of this class is
Because the parameters within a layer interact multiplicatively, the standard VC dimension bound for fullyconnected layers [1, 17] does not apply directly. However, a variant of the same argument applies to the case where degree of multiplicative interaction is not too high [40, Theorem 3].
We provide a short proof of the VC dimension bound for neural networks with BP or BP layers based on this result.
Proof.
Theorem 3 of Thomas et al. [40] requires that the entries of the linear layer, as polynomials of the parameters, has degree at most for some universal constant , where is the size of output of the th layer. In our case, the BP or BPBP parameterization with fixed permutation has total degree at most in the parameters of , where is the size of the layer, since each is a product of matrices. It thus satisfies the condition of the theorem, and so the VC dimension is bounded to be almost linear in the number of parameters:
∎
b.2 Proposition 1
Proof.

The inclusion of the DFT in (BP) is shown in the Case study in Section 3.1. The inverse Fourier Transform has the same structure except the twiddle factors of the form are replaced with and all entries of the first butterfly factor are scaled by .

The inclusion of the Hadamard Transform in (BP) is shown in Section A.3.

The inclusion of the DCT in (BP) is shown in Section A.1.

The inclusion of the DST in (BP) is shown in Section A.2.

The inclusion of the convolution in (BP) is shown in Section A.4.

The inclusion of all matrices in (BP) follows from the fact that every matrix may be expressed by a product of at most Toeplitz matrices [46]. From Section A.5, we may conclude that all Toeplitz matrices are in (BP). Therefore, by appending the BP modules from each Toeplitz matrix, we see that a total of BP modules are needed. By left multiplying each butterfly factor by the diagonal matrix with 1s in the upper half and 0s in the lower half, we ensure that the upper left entries of the final product are exactly the product of the upper left entries of each BP module, as required. This diagonal matrix may be absorbed into the adjacent butterfly factor. Hence, the factorization is in (BP).
∎
Appendix C Experimental Details and Results
c.1 Recovering Fast Transforms
In Section 4.1, given a matrix representation of a transform, we use the BP or BPBP parameter to recover a fast algorithm to the transform. We report in Table 4 the root mean square error (RMSE) for different transforms and for different values of .
Transform  N = 8  16  32  64  128  256  512  1024 

DFT  3.1e06  4.6e06  8.7e06  1.0e05  2.0e05  3.8e05  8.0e05  5.7e05 
DCT  4.4e06  1.1e05  8.6e06  1.2e05  2.1e05  1.9e05  3.1e05  7.3e05 
DST  1.1e06  7.5e06  4.6e05  5.1e05  3.0e05  2.1e05  3.6e05  4.6e05 
Convolution  4.0e06  2.5e05  6.4e05  7.6e05  5.9e05  7.8e05  6.3e05  1.9e02 
Hadamard  8.8e07  7.8e06  1.3e05  3.9e05  3.5e05  4.5e05  6.1e05  3.6e05 
Hartley  3.4e06  9.0e06  1.1e05  1.3e05  3.6e05  4.3e05  4.5e05  3.6e05 
Legendre  3.4e02  2.9e02  2.4e02  1.4e02  7.9e03  4.5e03  2.6e03  1.6e03 
Randn  1.4e01  1.6e01  1.4e01  1.1e01  8.4e02  6.1e02  4.4e02  3.1e02 
We use Hyperband [26] to tune the hyperparameters, which include the learning rate (from 0.0001 to 0.5), initialization, and whether to share the logits in the permutation block .
c.2 Fully connected network
The model is a network with a single hidden layer of dimensions , where is the input dimension, followed by a fullyconnected softmax layer. We build on top of the framework of Thomas et al. [40]^{4}^{4}4Available at https://github.com/HazyResearch/structurednets, replacing the unconstrained or structured matrix with our PyTorch BPBP implementation. The CIFAR10 dataset is a grayscale version of input size 1024 since the single hidden layer architecture receives a single channel as input. With the exception of learning rate, hyperparameters such as batch size 50, validation set comprising 15% of training data, and fixed momentum at 0.9 are fixed as reported in Appendix F.1 of their paper. For the BP methods, the learning rate was tested for the values ; parameters outside this range were found to be ineffective. For each method, Table 1 reports the test accuracy of the model with the highest validation accuracy.
c.3 Resnet
We build on top of the standard ResNet18 model from PyTorch.^{5}^{5}5Available at https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py The model is modified for CIFAR10 by reducing the kernel size and stride for the initial convolution to and respectively, and removing the first max pool layer. Weight decay of was used. The learning rate was initialized in , and decayed by every 25 epochs for 100 epochs total. For each method, Table 2 reports the mean and standard deviation of the test accuracies for the hyperparameters with the highest average validation accuracy.
c.4 Speed Comparison
In Section 4.3, we benchmark the speed of training and inference of butterfly factorizations.
For training, we compare our CUDA implementation of the fast algorithm for butterfly matrices with dense matrixmatrix multiply (GEMM from cuBLAS) and FFT (from cuFFT). The batch size is 256, and we measure the total time of the forward and backward pass. The experiment is run on a Tesla P100 GPU with 16GB of memory.
For inference, we compare our simple Python implementation of the fast algorithm for the BP parameterization, against dense matrixvector multiplication (GEMV), FFT, DCT, and DST. Our BP parameterization here refers to the product of a butterfly matrix and a fixed permutation (say, learned from data). We use the standard dense matrixvector multiplication implementation in Numpy (BLAS binding), the FFT implementation from Numpy, and the DCT and DST implementation from Scipy. We compare their speed in singlethreaded mode, running on a server Intel Xeon CPU E52690 v4 at 2.60GHz.
Results are shown in Figure 4.
Appendix D BP Hierarchy
In Definition 1, we defined the notion of a BP hierarchy, which we believes captures a natural class of matrices. To this point, we offer the following conjectures about the expressiveness of this hierarchy, supplementing the inclusion results of Proposition 1.
Conjecture 1.
For every fixed , there is a sufficiently large such that there is an matrix that is in (BP) but not in (BP).
The above conjecture is natural since one needs more parameters to specify a matrix in than . However, one has to be careful as this intuition is not correct for all . In particular, this conjecture asserts that there is a natural containment hierarchy between the BP classes. The condition of “sufficiently large ” is necessary in this conjecture because the space of all possible is exactly the space of all matrices. Therefore, for , BP = (BP) = (BP), .
Conjecture 2.
Let be an matrix such that for any , can be computed with an arithmetic circuit of size and depth . Then, is in (BP).