Butterfly Transform: An Efficient FFT Based Neural Architecture Design
Abstract
In this paper, we introduce the Butterfly Transform (BFT), a light-weight channel fusion method that reduces the computational complexity of pointwise convolutions from the O(n^2) of conventional solutions to O(n log n) with respect to the number of channels n, while improving the accuracy of the networks under the same range of FLOPs. The proposed BFT generalizes the Discrete Fourier Transform in a way that its parameters are learned at training time. Our experimental evaluations show that replacing channel fusion modules with BFT results in significant accuracy gains at similar FLOPs across a wide range of network architectures. For example, replacing channel fusion convolutions with BFT offers absolute top-1 accuracy improvements for MobileNetV1-0.25 and ShuffleNetV2-0.5 while maintaining the same number of FLOPs. Notably, ShuffleNetV2+BFT outperforms the state-of-the-art architecture search methods MNasNet [36] and FBNet [38]. We also show that the structure imposed by BFT has interesting properties that ensure the efficacy of the resulting network.
Keivan Alizadeh, Ali Farhadi, Mohammad Rastegari PRIOR @ Allen Institute for AI, University of Washington, XNOR.AI keivan@uw.edu, {ali, mohammad}@xnor.ai
Preprint. Under review.
1 Introduction
Devising convolutional neural networks (CNNs) that can run efficiently on resource-constrained edge devices has attracted several researchers. Current state-of-the-art efficient architecture designs are mainly structured to reduce the over-parameterization of CNNs [25, 16]. A common design choice is to reduce the FLOPs and parameters of a network by factorizing convolutional layers [18, 32, 28, 41], using a separable depthwise convolution, into two components: (1) spatial fusion, where each spatial channel is convolved independently by a depthwise convolution; and (2) channel fusion, where all the spatial channels are linearly combined by 1x1 convolutions, known as pointwise convolutions. During spatial fusion, the network learns features from the spatial planes, and during channel fusion the network learns relations between these features across channels. This is often implemented using k x k filters for the spatial fusion and 1x1 filters for the channel fusion. Inspecting the computational profile of these networks at inference time reveals that the computational burden of the spatial fusion is relatively negligible compared to that of the channel fusion. In fact, the computational complexity of the pointwise convolutions in the channel fusion is quadratic in the number of channels (O(n^2), where n is the number of channels).
These expensive pointwise convolutions during channel fusion are the main focus of this paper. The pointwise convolutions form a fully connected structure between neurons and can be efficiently implemented using a matrix multiplication. The literature on efficient matrix multiplication suggests imposing a structure over this matrix; low-rank [22] or circulant [2, 9] structures are a few examples that offer efficiencies in matrix multiplication. In the context of representing pointwise convolutions in a neural network, an ideal structure, we argue, should have the following properties. First, the structure should not impose significant limitations on the capacity of the network; in other words, an ideal structure should maintain high information flow through the network. This can be thought of as having a large bottleneck size (the bottleneck size of a network is defined as the minimum number of nodes that need to be removed to ensure that no path exists between an input and an output channel). Second, the structure should offer efficiency gains; this is often done by minimizing the FLOPs, which in our case translates into having fewer edges in the network's graph. Finally, in an ideal network structure, there should be at least one path from every input node to all output nodes. This enables cross-talk across channels during the fusion; without this property, some input nodes may not receive crucial signals during backpropagation.
In this paper, we introduce the Butterfly Transform (BFT), a light-weight channel fusion method with O(n log n) complexity with respect to the number of channels. BFT fuses all the channels in log n layers with O(n) operations at each layer. We show that BFT's network structure is an optimal structure (in terms of FLOPs) that satisfies all of the aforementioned properties of an ideal channel-fusion network. The structure of the BFT network is inspired by the butterfly operations in the Fast Fourier Transform (FFT). These butterfly operations have been heavily optimized on several hardware/software platforms [12, 5, 10], making BFT readily usable in a wide variety of applications.
Our experimental evaluations show that simply replacing the pointwise convolutions with BFT offers significant gains. We have observed that, under a similar number of FLOPs, the butterfly transform consistently improves the accuracy over the efficient scalings of the original CNN architectures. For example, using BFT in MobileNetV1-0.25 with 37M FLOPs achieves 53.6% top-1 accuracy on the ImageNet dataset [7], and using BFT in ShuffleNetV2-0.5 with 41M FLOPs achieves 61.33% top-1 accuracy.
2 Related Work
Deep neural networks suffer from intensive computations. Several approaches have been proposed to address efficient training and inference in deep neural networks.
Efficient CNN architecture designs:
Recent successes in visual recognition tasks, including object classification, detection, and segmentation, can be attributed to the exploration of different CNN designs [24, 33, 15, 23, 35, 20]. To make these network designs more efficient, they have factorized convolutions into different steps, enforcing distinct focuses on spatial and channel fusion [18, 32]. Further, other approaches extended the factorization schema with sparse structure either in channel fusion [28, 41] or spatial fusion [29]. [19] forced more connections between the layers of the network but reduced the computation by designing smaller layers. Our method follows the same direction of designing a sparse structure for channel fusion that enables lower computation with a minimal loss in accuracy.
Network pruning:
This line of work focuses on reducing the substantial redundant parameters in CNNs by pruning out either neurons or weights [13, 14]. Due to the unstructured sparsity of these models, the learned models cannot be used efficiently on standard compute platforms such as CPUs and GPUs. Therefore, other approaches in pruning focus on pruning out channels rather than individual neurons or weights [17, 43, 11]. These methods drop a channel by monitoring either the average weight values or the average activation values of each channel during training. Our method differs from these methods in that we enforce a predefined sparse channel structure to begin with, and we do not change the structure of the network during training.
Low-rank network design:
To reduce the computation in CNNs, [37, 25, 8, 22] exploit the fact that CNNs are over-parameterized. These models learn a linear low-rank representation of the parameters in the network, either by post-processing the trained weight tensors or by enforcing a linear low-rank structure during training. A few works enforce a non-linear low-rank structure using a circulant matrix design [2, 9]. These low-rank network structures achieve efficiency at the cost of lowering the information flow from the input channels to the output channels (i.e. they have a few bottleneck nodes), whereas our butterfly transform is a non-linear structured low-rank representation that maximizes the information flow.
Quantization:
Another approach to improving the efficiency of deep networks is the low-bit representation of network weights and neurons using quantization [34, 30, 39, 4, 42, 21, 1]. These approaches use fewer bits (instead of 32-bit high-precision floating points) to represent weights and neurons for the standard training procedure of a network. In the case of extremely low bit-width (1-bit), [30] had to modify the training procedure to find the discrete binary values for the weights and the neurons in the network. Our method is orthogonal to this line of work, and these methods are complementary to our network.
Neural architecture search:
Recently, neural architecture search methods, including reinforcement learning and genetic algorithms, have been proposed to automatically construct network architectures [44, 40, 31, 45, 36, 27]. These methods search over a huge network space (e.g. MNasNet [36] searches over 8K different design choices) using a dictionary of predefined search-space parameters, including different types of convolutional layers and kernel sizes, to identify a network structure, usually non-homogeneous, that satisfies optimization constraints such as inference time. Recent search-based methods [36, 3, 38] use MobileNetV2 [32] as a basic search block for automatic network design. The main computational bottleneck in most of the search-based methods is in the channel fusion, and our butterfly structure does not exist in any of the predefined blocks of these methods. Our efficient channel fusion can be augmented with these models to further improve their efficiency. Our experiments show that our proposed butterfly structure outperforms recent architecture-search-based models on small network design.
3 Model
In this section, we outline the details of our proposed model. As discussed above, the main computational bottleneck in current efficient neural architecture designs is in the channel fusion step, which is implemented by a pointwise convolution layer. The input to this layer is a tensor X of size n_in x h x w, where n_in is the number of channels and w, h are the width and height respectively. The size of the weight tensor is n_out x n_in x 1 x 1 and the output is a tensor Y of size n_out x h x w. Without loss of generality, we assume n = n_in = n_out. The complexity of a pointwise convolution layer is O(n^2 hw), and this is mainly influenced by the number of channels n. We propose a new layer design, the Butterfly Transform, that has O(n log n) complexity with respect to the number of channels. This design is inspired by the Fast Fourier Transform (FFT) algorithm, which has been widely used in computational engines for a variety of applications; there exist optimized hardware/software designs for the key operations of this algorithm, which are applicable to our method. In the following subsections we explain the problem formulation and the structure of our butterfly transform.
3.1 Pointwise Convolution as MatrixVector Products
A pointwise convolution can be defined as a function F: R^{n x h x w} -> R^{n x h x w} of the input tensor X and the weight tensor W as follows:

Y[k, x, y] = sum_{c=1}^{n} W[k, c] X[c, x, y]    (1)
This can be written as a matrix product by reshaping the input tensor X to a 2D matrix X_hat with size n x (hw) (each column vector in X_hat corresponds to a spatial vector X[:, x, y]) and reshaping the weight tensor to a 2D matrix W_hat with size n x n,

Y_hat = W_hat X_hat    (2)
where Y_hat is the n x (hw) matrix representation of the output tensor Y. This can be seen as a linear transformation of the vectors in the columns of X_hat using W_hat as a transformation matrix. Each such matrix-vector product has complexity O(n^2). By enforcing structure on this transformation matrix, one can reduce the complexity of the transformation. However, to be effective as a channel fusion transform, it is critical that this transformation respects the desirable characteristics detailed below.
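The reshaping above can be sketched in a few lines of NumPy; the names (`X`, `W`) follow the text, and the shapes are illustrative assumptions:

```python
import numpy as np

n, h, w = 8, 4, 4                      # channels, height, width (illustrative)
X = np.random.randn(n, h, w)           # input tensor
W = np.random.randn(n, n)              # 1x1 convolution weights (n_out = n_in = n)

# Pointwise convolution as a matrix product: reshape X to an n x (h*w)
# matrix whose columns are the spatial vectors X[:, x, y].
X_hat = X.reshape(n, h * w)
Y_hat = W @ X_hat                      # O(n^2) multiply-adds per spatial location
Y = Y_hat.reshape(n, h, w)             # back to an n x h x w tensor

# Sanity check against the direct definition of a 1x1 convolution.
Y_direct = np.einsum('kc,cxy->kxy', W, X)
assert np.allclose(Y, Y_direct)
```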
Ideal characteristics of a fusion network:
1) every-to-all connectivity: there must be at least one path between every input channel and all of the output channels. 2) maximum bottleneck size: the bottleneck size is defined as the minimum number of nodes in the network that, if removed, completely cut off the information flow from input channels to output channels (i.e. there would be no path from any input channel to any output channel). The largest possible bottleneck size in a multi-layer network with n input and n output channels is n. 3) small edge count: to reduce computation, we expect the network to have as few edges as possible. 4) equal out-degree within each layer: to enable an efficient matrix implementation of the network, all nodes within each layer must have the same out-degree (in this way, one can represent each layer as a fixed number of matrix multiplications, which is supported by fast linear algebra libraries (e.g. BLAS); otherwise, we would need sparse matrix operations, which are not as optimized).
Claim: A multi-layer network with these properties has at least Omega(n log n) edges.

Proof: Suppose there exist n_i nodes in layer i of an m-layer network. Removing all the nodes in one layer disconnects the inputs from the outputs; since the maximum possible bottleneck size is n, we have n_i >= n. Now suppose that the out-degree of each node at layer i is d_i. The number of nodes in layer m (the output layer) that are reachable from a single input channel is at most prod_{i=1}^{m} d_i. Because of the every-to-all connectivity, all n output nodes are reachable, therefore prod_{i=1}^{m} d_i >= n. The total number of edges is sum_{i=1}^{m} n_i d_i >= n sum_{i=1}^{m} d_i >= n m n^{1/m}, which is minimized when m = Theta(log n); the total number of edges is therefore Omega(n log n).
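The counting argument above can be written out explicitly; this is a sketch of the bound, with m denoting the number of layers:

```latex
\text{Let } n_i,\ d_i \text{ be the node count and out-degree of layer } i.\\
n_i \ge n \ \text{(bottleneck)}, \qquad \prod_{i=1}^{m} d_i \ge n \ \text{(every-to-all)}.\\
\#\text{edges} \;=\; \sum_{i=1}^{m} n_i d_i \;\ge\; n \sum_{i=1}^{m} d_i
\;\ge\; n\, m\, n^{1/m} \quad \text{(AM--GM)},\\
\min_m \; m\, n^{1/m} \;=\; \Theta(\log n) \ \text{at } m = \Theta(\log n)
\;\Longrightarrow\; \#\text{edges} \;=\; \Omega(n \log n).
```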
In the following section we present a network structure that satisfies all the ideal characteristics of a fusion network.
3.2 Butterfly Transform (BFT)
As mentioned above, we can reduce the complexity of a matrix-vector product by enforcing structure on the matrix. There are several ways to do so. Here we introduce a family of structured matrices that leads to a complexity of O(n log n) operations and parameters while maintaining accuracy.
Butterfly Matrix:
We define B^{(n,k)} as an n x n butterfly matrix of order n and base k, where k divides n:

B^{(n,k)} = [ D_{11} B_1^{(n/k)}  ...  D_{1k} B_k^{(n/k)}
              ...                      ...
              D_{k1} B_1^{(n/k)}  ...  D_{kk} B_k^{(n/k)} ]    (3)
where each B_j^{(n/k)} is a butterfly matrix of order n/k and base k, and each D_{ij} is an arbitrary (n/k) x (n/k) diagonal matrix. The matrix-vector product between a butterfly matrix B^{(n,k)} and a vector x in R^n is:
B^{(n,k)} x = [ sum_{j=1}^{k} D_{1j} B_j^{(n/k)} x_j
                ...
                sum_{j=1}^{k} D_{kj} B_j^{(n/k)} x_j ]    (4)
where x_j is a subsection of x obtained by breaking x into k equal-sized vectors. Therefore, the product can be simplified by factoring out the shared sub-products B_j^{(n/k)} x_j as follows:
B^{(n,k)} x = [ sum_{j=1}^{k} D_{1j} y_j
                ...
                sum_{j=1}^{k} D_{kj} y_j ]    (5)
where y_j = B_j^{(n/k)} x_j. Note that y_j is a smaller product between a butterfly matrix of order n/k and a vector of size n/k; therefore, we can use divide-and-conquer to recursively calculate the products y_j. Let T(n) be the computational complexity of the product between an n x n butterfly matrix and an n-D vector. From equation 5, the product can be calculated by k products of butterfly matrices of order n/k, whose complexity is kT(n/k), followed by the combination step, whose complexity is O(kn). The total complexity is therefore:
T(n) = k T(n/k) + O(kn)    (6)

T(n) = O(kn log_k n)    (7)
With a smaller choice of k we can achieve a lower complexity. Algorithm 1 illustrates the recursive procedure of a butterfly transform in the base-2 case.
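A minimal recursive sketch of the product in equation 5, assuming base k = 2 and a power-of-two n; the weight layout (one tuple of four diagonal vectors per recursion level, shared across the blocks of that level) is an illustrative simplification, not the authors' exact parameterization:

```python
import numpy as np

def butterfly_matvec(x, diags):
    """Multiply a learned base-2 butterfly matrix by x, recursively.

    diags: one entry per recursion level, each a tuple of four diagonal
    vectors (D11, D12, D21, D22), shared across all blocks of that
    level to keep the sketch short.
    """
    n = len(x)
    if n == 1:
        return x
    D11, D12, D21, D22 = (d[:n // 2] for d in diags[0])
    x1, x2 = x[:n // 2], x[n // 2:]
    # y_j = B_j^{(n/2)} x_j -- the two recursive sub-products of equation 5
    y1 = butterfly_matvec(x1, diags[1:])
    y2 = butterfly_matvec(x2, diags[1:])
    # Combine with the diagonal matrices: O(k*n) work per level.
    return np.concatenate([D11 * y1 + D12 * y2, D21 * y1 + D22 * y2])

n = 8
levels = int(np.log2(n))
rng = np.random.default_rng(0)
diags = [tuple(rng.standard_normal(n) for _ in range(4)) for _ in range(levels)]
x = rng.standard_normal(n)
y = butterfly_matvec(x, diags)
assert y.shape == (n,)
```

The recursion follows T(n) = 2 T(n/2) + O(n), i.e. O(n log n) total work.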
DFT transform is a specific case of BFT:
It can be shown that the Discrete Fourier Transform (DFT) is a specific case of our butterfly transform: a DFT is a butterfly transform whose transformation matrix is B^{(n,2)}, where the elements of the output vector are permuted by a radix-2 shuffle and the diagonal elements of the D_{ij} matrices are powers of the n-th root of unity w = e^{-2*pi*i/n} (the twiddle factors). Note that the complexity of the butterfly transform is independent of the row permutation. Therefore, the Fast Fourier Transform (FFT) is a specific case of our efficient butterfly transform of order n and base 2, whose complexity is O(n log n). BFT can be seen as a generalization of the DFT in which the parameters are learned during the training of the network. In the next section we discuss the network structure of BFT.
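The claim can be checked numerically: plugging the radix-2 shuffle and the twiddle factors into the butterfly recursion recovers the FFT. A sketch of the decimation-in-time form, compared against `numpy.fft.fft`:

```python
import numpy as np

def fft_butterfly(x):
    """Radix-2 Cooley-Tukey FFT: a butterfly transform whose diagonal
    entries are fixed to the twiddle factors w^j, w = exp(-2*pi*i/n)."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft_butterfly(x[0::2])   # radix-2 shuffle of the input
    odd = fft_butterfly(x[1::2])
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n)  # twiddle diagonal
    return np.concatenate([even + tw * odd, even - tw * odd])

x = np.random.randn(16)
assert np.allclose(fft_butterfly(x), np.fft.fft(x))
```

BFT replaces the fixed twiddle diagonals with free parameters learned by backpropagation.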
3.3 Butterfly Neural Network
The procedure explained in Algorithm 1 can be represented by a butterfly graph similar to the FFT's graph. The butterfly network structure has been used for function representation [26] and for fast factorizations approximating linear transformations [6]. We adopt this graph as an architecture design for the layers of a neural network. Figure 1 illustrates the architecture of a butterfly network of base k = 2 applied to an input tensor of size n x h x w. The left figure shows the recursive structure of BFT as a network. The right figure shows the constructed multi-layer network, which has log n butterfly layers (BFLayers). Note that the complexity of each butterfly layer is O(n) (2n operations); therefore, the total complexity of the BFT architecture is O(n log n). Each butterfly layer can be augmented with batch norm and non-linearity functions (e.g. ReLU, Sigmoid). In section 4.2 we study the effect of different choices of these functions. We found that neither batch norm nor non-linear functions (ReLU and Sigmoid) are effective within BFLayers. Batch norm is not effective mainly because its complexity is the same as that of a BFLayer, O(n); therefore, it doubles the computation of the entire transform. We use batch norm only at the end of the transform. The non-linear activations zero out almost half of the values in each BFLayer; since each output depends on the multiplication of values along its paths, propagating these zeros through the forward pass destroys all the information. The BFLayers can be internally connected with residual connections in different ways. In our experiments, we found that the best residual connections are the ones that connect the input of the first BFLayer to the output of the last BFLayer.
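The multi-layer view can be sketched iteratively: log2(n) BFLayers, each pairing every channel with one partner channel, at a cost of 2n multiply-adds per layer, plus the first-to-last residual connection the text found most effective. The parameter layout (an (n, 2) weight array per layer) is an illustrative assumption:

```python
import numpy as np

def bft_layers(x, weights, residual=True):
    """Apply log2(n) butterfly layers (BFLayers) to a channel vector x.

    weights[l] has shape (n, 2): each node in layer l linearly combines
    its own channel with one partner channel, so every layer costs 2n
    multiply-adds and the whole transform costs O(n log n).
    """
    n = len(x)
    x0, out = x, x.copy()
    stride, layer = n // 2, 0
    while stride >= 1:
        partner = np.arange(n) ^ stride        # butterfly pairing at this stride
        w = weights[layer]
        out = w[:, 0] * out + w[:, 1] * out[partner]
        stride //= 2
        layer += 1
    # First-to-last residual connection (see section 4.2).
    return out + x0 if residual else out

n = 8
rng = np.random.default_rng(1)
weights = [rng.standard_normal((n, 2)) for _ in range(int(np.log2(n)))]
x = rng.standard_normal(n)
y = bft_layers(x, weights)
assert y.shape == (n,)
```

The XOR pairing reproduces the classic FFT butterfly wiring, which gives exactly one path between every input and output channel.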
The butterfly network is an optimal structure satisfying all the characteristics of an ideal fusion network: there exists exactly one path between every input channel and every output channel, the out-degree of each node in the graph is exactly k, the bottleneck size is maximal (n), and the number of edges is O(n log n).
We use the BFT architecture as a replacement for the pointwise convolution layers (1x1 convs) in different CNN architectures, including MobileNet [18] and ShuffleNet [28]. Our experimental results show that, under the same number of FLOPs, the efficiency gained by BFT is more effective in terms of accuracy than shrinking the channel rate of the original model. We saw consistent accuracy improvements across several architecture settings.
Fusing channels using BFT instead of pointwise convolutions reduces the size of the computational bottleneck by a large margin. Figure 2 illustrates the percentage of operations performed by each block type during a forward pass through the network. Note that when BFT is applied, the share of the depthwise convolutions increases accordingly.
4 Experiment
In this section, we demonstrate the performance of the proposed butterfly transform on the image classification task. To showcase the strength of our method in designing very small networks, we compare the performance of the Butterfly Transform with pointwise convolutions in two state-of-the-art efficient architectures: (1) MobileNetV1 and (2) ShuffleNetV2. We also compare our results with other types of transforms that have O(n log n) computation (e.g. low-rank and circulant transforms). We further show that, at a small number of FLOPs (~14M), our method works better than state-of-the-art architecture search methods.
4.1 Image Classification
4.1.1 Implementation and dataset details:
Following common practice, we evaluate the performance of the Butterfly Transform on the ImageNet dataset at different levels of complexity, ranging from 14 MFLOPs to 41 MFLOPs. ImageNet contains 1.2M training samples and 50K validation samples.
For each architecture, we substitute the pointwise convolutions with butterfly transforms. To make a fair comparison between BFT and pointwise convolutions, we adjust the channel counts in the architectures (MobileNet and ShuffleNet) such that the number of FLOPs in both methods remains equal. We optimize our network by minimizing a cross-entropy loss using SGD. We found that using no weight decay gives better accuracy. We used base k = 4, because it has the same number of FLOPs as base k = 2 while the corresponding BFT has fewer BFLayers (log_4 n instead of log_2 n). Note that if k = n, then the complexity of BFT is the same as a pointwise convolution.
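The FLOPs bookkeeping above is easy to check; a small sketch counting multiply-adds per spatial position (the helper names are illustrative):

```python
import math

def pointwise_flops(n):
    return n * n                           # n^2 multiply-adds per spatial position

def bft_flops(n, k):
    # k*n multiply-adds per layer, log_k(n) layers
    return round(k * n * math.log(n, k))

n = 1024
assert bft_flops(n, 2) == bft_flops(n, 4)      # same cost; base 4 halves the layers
assert bft_flops(n, 2) < pointwise_flops(n)    # 20480 vs 1048576
assert bft_flops(n, n) == pointwise_flops(n)   # k = n degenerates to a dense 1x1 conv
```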
Weight initialization:
A common heuristic for weight initialization randomly initializes values in a range whose scale depends on the fan-in of the layer. In BFT, the effect of an input on an output is the product of all the edges along the path between them, so we initialize each edge such that the product of all the edges on a path from an input node to an output node has this scale; since every path has log n edges, each edge is initialized in a correspondingly rescaled range.
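A hedged sketch of this path-product-aware initialization: with m = log2(n) edges per path, each edge gets a scale whose m-th power equals the desired path-product scale. The target scale 1/sqrt(n) and the (n, 2)-per-layer weight layout are illustrative assumptions, not the authors' exact recipe:

```python
import numpy as np

def init_bft_weights(n, target_scale=None, rng=None):
    """Initialize the two edge weights per node in each of the log2(n)
    BFLayers so that the product of the edges along any input-to-output
    path has the scale of a standard dense initialization."""
    rng = rng or np.random.default_rng()
    m = int(np.log2(n))                      # edges per path
    target_scale = target_scale or 1.0 / np.sqrt(n)  # assumed target magnitude
    per_edge = target_scale ** (1.0 / m)     # so per_edge**m == target_scale
    return [rng.uniform(-per_edge, per_edge, size=(n, 2)) for _ in range(m)]

weights = init_bft_weights(1024)
# The product of m such edges stays on the order of 1/sqrt(n), instead of
# collapsing toward zero the way m independent U(-1/sqrt(n), 1/sqrt(n))
# edges multiplied together would.
assert abs(weights[0]).max() <= (1.0 / np.sqrt(1024)) ** (1.0 / 10) + 1e-9
```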
4.1.2 MobileNet + BFT
This is the model in which we replace the pointwise convolutions of MobileNetV1 with BFT. We compared MobileNet with width multiplier 0.25 (using 128, 160, and 224 input resolutions) against MobileNet+BFT with width multiplier 1.0 (using 96, 128, and 160 input resolutions). In Table 1(a) we see that, using BFT, we outperform pointwise convolutions by 5% in top-1 accuracy at 14 MFLOPs. Note that MobileNet+BFT at 24 MFLOPs has almost the same accuracy as MobileNet at 41 MFLOPs, which means it attains the same accuracy with almost half the FLOPs. This is achieved without changing the architecture at all, only the pointwise convolutions.
4.1.3 ShuffleNetV2 + BFT
Table 1(b) shows the results for ShuffleNetV2. We slightly adjusted the number of output channels to build ShuffleNetV2-1.25 and compared ShuffleNetV2-1.25+BFT against ShuffleNetV2-0.5. We compared these two methods at different input resolutions (128, 160, and 224), which gives FLOPs ranging from 14M to 41M. ShuffleNetV2-1.25+BFT achieves about 2.5% better accuracy than our implementation of ShuffleNetV2-0.5, which uses pointwise convolutions, and 1% better accuracy than the reported number for ShuffleNetV2 at 41 MFLOPs.


4.1.4 Comparison with neural architecture search
Our model applied to ShuffleNetV2 achieves higher accuracy than the state-of-the-art architecture search methods MNasNet [36] and FBNet [38] in the efficient network setting (~14M FLOPs). This is because those methods only search among a set of predefined blocks, and the most efficient channel fusion block among them is the pointwise convolution, which is more expensive than BFT. In Table 2(a), we show the top-1 accuracy comparison on the ImageNet dataset. One could further improve the accuracy of the architecture search models by including BFT as a searchable block.
4.1.5 Comparison with other architectures
To further illustrate the benefits of BFLayers, we compare them with other architectures that reduce the complexity of channel fusion to O(n log n). Here we study circulant [9] and low-rank [22] architectures.
Circulant architecture: In this architecture, the matrix that represents the pointwise convolution is a circulant matrix, in which each row is a cyclic shift of the previous one [9]. The product of a circulant matrix with a column vector can be computed in O(n log n) using the fast Fourier transform (FFT).
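The O(n log n) circulant product follows from the FFT diagonalization of circulant matrices; a sketch, checked against the dense matrix:

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by x,
    in O(n log n), via circular convolution in the Fourier domain."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

n = 8
rng = np.random.default_rng(2)
c, x = rng.standard_normal(n), rng.standard_normal(n)
# Dense reference: column j is c cyclically shifted down by j positions,
# so C[i, j] = c[(i - j) mod n].
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)
assert np.allclose(circulant_matvec(c, x), C @ x)
```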
Low-rank matrix: In this architecture, the matrix that represents the pointwise convolution is the product of an n x r matrix and an r x n matrix, with rank r = O(log n). The pointwise convolution can therefore be performed by two consecutive thin matrix products, and the total complexity is O(n log n).


As can be seen in Table 2(b), the butterfly transform outperforms both the circulant and the low-rank structures under the same number of FLOPs.
4.2 Ablation Study
Now, we study the different elements of our BFT model. As mentioned earlier, residual connections and non-linear activations can be augmented within the BFLayers. Here we show the performance of these elements in isolation on the CIFAR-10 dataset, using MobileNetV1 as the base network.
With/without non-linearity: As studied in [32], adding a non-linearity function such as ReLU to a narrow layer (one with few channels) reduces the accuracy, because it cuts half of the values of an internal layer to zero. In BFT, the effect of an input channel x_i on an output channel y_j is determined by the multiplication of all the edges on the path between x_i and y_j. Dropping a value to zero destroys all the information transferred between the two channels, and dropping half of the values of each internal layer destroys almost all the information in the entire layer. Figure 4 compares the learning curves of BFT models with and without non-linear activation functions.
With/without weight decay: We found that BFT is very sensitive to weight decay, because in BFT there is only one path from an input channel x_i to an output channel y_j, and the effect of x_i on y_j is determined by the multiplication of all the intermediate edges along that path. Pushing all weight values toward zero therefore significantly reduces the effect of x_i on y_j, which makes weight decay very destructive in BFT. Figure 3 illustrates the learning curves with and without weight decay on BFT.
Residual connections: The graphs obtained by replacing pointwise convolutions with BFT are very deep. A good practice for training this kind of network is to add residual connections. We experimented with three different types of residual connections: (1) First-to-Last, which connects the input of the first BFLayer to the output of the last BFLayer; (2) Every-other-Layer, which connects every other BFLayer; and (3) No-residual, in which there are no residual connections. We found that First-to-Last is the most effective type of residual connection; Table 3 compares these types of connections.

Model | Accuracy
No residual | 79.2
Every-other-Layer | 81.12
First-to-Last | 81.75
5 Conclusion
In this paper, we introduce BFT, an efficient transformation that can simply replace pointwise convolutions in various neural architectures to significantly improve the accuracy of the networks while maintaining the same number of FLOPs. The structure employed by BFT has interesting properties that ensure the efficacy of the proposed solution. The main focus of this paper is on small networks that are designed for resource-constrained devices; BFT shows bigger gains in smaller networks, and its utility diminishes for larger ones. The butterfly operations in BFT are extremely hardware friendly, and there exist several optimized hardware/software platforms for these operations.
References
 [1] Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. Yodann: An architecture for ultralow power binaryweight cnn acceleration. IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, 2018.
 [2] Alexandre Araujo, Benjamin Negrevergne, Yann Chevaleyre, and Jamal Atif. On the expressive power of deep fully circulant neural networks. arXiv preprint arXiv:1901.10255, 2019.
 [3] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
 [4] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [5] Kenneth Czechowski, Casey Battaglino, Chris McClanahan, Kartik Iyer, PK Yeung, and Richard Vuduc. On the communication complexity of 3d ffts and its implications for exascale. In Proceedings of the 26th ACM international conference on Supercomputing, pages 205–214. ACM, 2012.
 [6] Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, and Christopher Ré. Learning fast algorithms for linear transforms using butterfly factorizations. arXiv preprint arXiv:1903.05895, 2019.
 [7] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
 [8] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
 [9] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, et al. CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 395–408. ACM, 2017.
 [10] Yuri Dotsenko, Sara S Baghsorkhi, Brandon Lloyd, and Naga K Govindaraju. Autotuning of fast fourier transform on graphics processors. In ACM SIGPLAN Notices, pages 257–266. ACM, 2011.
 [11] Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert Mullins, and Chengzhong Xu. Dynamic channel pruning: Feature boosting and suppression. arXiv preprint arXiv:1810.05331, 2018.
 [12] Naga K Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High performance discrete fourier transforms on graphics processors. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, page 2. IEEE Press, 2008.
 [13] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [14] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
 [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [16] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, LiJia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
 [17] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
 [18] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [19] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In CVPR, 2018.
 [20] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [21] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran ElYaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
 [22] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 [23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [24] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a backpropagation network. In Advances in neural information processing systems, pages 396–404, 1990.
 [25] Chong Li and CJ Richard Shi. Constrained optimization based lowrank approximation of deep neural networks. In ECCV, 2018.
 [26] Yingzhou Li, Xiuyuan Cheng, and Jianfeng Lu. Butterflynet: Optimal function representation based on convolutional neural networks. arXiv preprint arXiv:1805.07451, 2018.
 [27] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, LiJia Li, Li FeiFei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
 [28] Ningning Ma, Xiangyu Zhang, HaiTao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
 [29] Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi. Espnetv2: A lightweight, power efficient, and general purpose convolutional neural network. In CVPR, 2019.
 [30] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 [31] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. Largescale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 2902–2911. JMLR. org, 2017.
 [32] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and LiangChieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
 [33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2014.
 [34] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameterfree training of multilayer neural networks with continuous or discrete weights. In NIPS, 2014.
 [35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [36] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platformaware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
 [37] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
 [38] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardwareaware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.
 [39] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
 [40] Lingxi Xie and Alan Yuille. Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1379–1388, 2017.
 [41] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
 [42] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 [43] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discriminationaware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 875–886, 2018.
 [44] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 [45] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.