Butterfly Transform: An Efficient FFT Based Neural Architecture Design


Keivan Alizadeh, Ali Farhadi, Mohammad Rastegari
PRIOR @ Allen Institute for AI,  University of Washington,  XNOR.AI
keivan@uw.edu, {ali, mohammad}@xnor.ai
Abstract

In this paper, we introduce the Butterfly Transform (BFT), a light-weight channel fusion method that reduces the computational complexity of point-wise convolutions from $\mathcal{O}(n^2)$ for conventional solutions to $\mathcal{O}(n\log n)$ with respect to the number of channels $n$, while improving the accuracy of the networks under the same range of FLOPs. The proposed BFT generalizes the Discrete Fourier Transform in a way that its parameters are learned at training time. Our experimental evaluations show that replacing channel fusion modules with BFT results in significant accuracy gains at similar FLOPs across a wide range of network architectures. For example, replacing channel fusion convolutions with BFT offers notable absolute top-1 improvements for MobileNetV1-0.25 and ShuffleNet V2-0.5 while maintaining the same number of FLOPs. Notably, ShuffleNet-V2+BFT outperforms the state-of-the-art architecture search methods MNasNet [36] and FBNet [38]. We also show that the structure imposed by BFT has interesting properties that ensure the efficacy of the resulting network.

 


Preprint. Under review.

1 Introduction

Devising convolutional neural networks (CNNs) that can run efficiently on resource-constrained edge devices has attracted several researchers. Current state-of-the-art efficient architecture designs are mainly structured to reduce the over-parameterization of CNNs [25, 16]. A common design choice is to reduce the FLOPs and parameters of a network by factorizing convolutional layers [18, 32, 28, 41], using a separable depth-wise convolution, into two components: (1) spatial fusion, where each spatial channel is convolved independently by a depth-wise convolution; and (2) channel fusion, where all the spatial channels are linearly combined by $1\times 1$-convolutions, known as point-wise convolutions. During spatial fusion, the network learns features from the spatial planes, and during channel fusion the network learns relations between these features across channels. This is often implemented using $3\times 3$ filters for the spatial fusion and $1\times 1$ filters for the channel fusion. Inspecting the computational profile of these networks at inference time reveals that the computational burden of the spatial fusion is relatively negligible compared to that of the channel fusion. In fact, the computational complexity of the point-wise convolutions in the channel fusion is quadratic in the number of channels ($\mathcal{O}(n^2)$, where $n$ is the number of channels).
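As a rough illustration (not taken from the paper), the following sketch compares the FLOP counts of the two fusion steps for a hypothetical layer; the function names, the $3\times 3$ kernel size, and the layer dimensions are assumptions made only for this example.

```python
# Rough FLOP count for one depth-wise separable convolution block
# (illustrative sketch; the layer sizes below are hypothetical).

def spatial_fusion_flops(n_channels, h, w, k=3):
    # one k x k depth-wise filter per channel
    return n_channels * h * w * k * k

def channel_fusion_flops(n_channels, h, w):
    # 1x1 point-wise convolution: every output channel mixes all input channels
    return n_channels * n_channels * h * w

if __name__ == "__main__":
    n, h, w = 256, 14, 14
    s = spatial_fusion_flops(n, h, w)
    c = channel_fusion_flops(n, h, w)
    print(f"spatial fusion:  {s:,} FLOPs")   # ~0.45 M
    print(f"channel fusion:  {c:,} FLOPs")   # ~12.8 M -> dominates the block
```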

These expensive point-wise convolutions during channel fusion are the main focus of this paper. The point-wise convolutions form a fully connected structure between neurons and can be efficiently implemented using a matrix multiplication. The literature on efficient matrix multiplication suggests imposing a structure over this matrix. Low-rank [22] or circulant [2, 9] structures are a few examples of structures that offer efficiencies in matrix multiplication. In the context of representing point-wise convolutions in a neural network, an ideal structure, we argue, should have the following properties. First, the structure should not impose significant limitations on the capacity of the network. In other words, an ideal structure should maintain high information flow through the network; this can be thought of as having a large bottleneck size (the bottleneck size of a network is defined as the minimum number of nodes that need to be removed to ensure that no path exists between an input and an output channel). Second, the structure should offer efficiency gains; this is often done by minimizing the FLOPs, which in our case translates into having fewer edges in the network's graph. Finally, in an ideal network structure, there should be at least one path from every input node to all output nodes. This enables cross-talk across channels during the fusion; without this property, some input nodes may not receive crucial signals during back-propagation.

In this paper, we introduce the Butterfly Transform (BFT), a light-weight channel fusion method with a complexity of $\mathcal{O}(n\log n)$ with respect to the number of channels $n$. BFT fuses all the channels in $\log n$ layers with $\mathcal{O}(n)$ operations at each layer. We show that BFT's network structure is an optimal structure (in terms of FLOPs) that satisfies all of the aforementioned properties of an ideal channel-fusion network. The structure of the BFT network is inspired by the butterfly operations in the Fast Fourier Transform (FFT). These butterfly operations have been heavily optimized on several hardware/software platforms [12, 5, 10], making BFT readily usable in a wide variety of applications.

Our experimental evaluations show that simply replacing the point-wise convolutions with BFT offers significant gains. We have observed that, under a similar number of FLOPs, the butterfly transform consistently improves the accuracy of efficient versions of the original CNN architectures. For example, using BFT in MobileNetV1 with 37M FLOPs achieves 53.6% top-1 accuracy on the ImageNet dataset [7], and using BFT in ShuffleNet V2 with 41M FLOPs achieves 61.33% top-1 accuracy.

2 Related Work

Deep neural networks are computationally intensive. Several approaches have been proposed to enable efficient training and inference in deep neural networks.

Efficient CNN architecture designs:

Recent successes in visual recognition tasks, including object classification, detection, and segmentation, can be attributed to the exploration of different CNN designs [24, 33, 15, 23, 35, 20]. To make these networks more efficient, such designs factorize convolutions into separate steps, enforcing distinct focuses on spatial and channel fusion [18, 32]. Further, other approaches extend this factorization schema with sparse structure, either in channel fusion [28, 41] or in spatial fusion [29]. CondenseNet [19] forces more connections between the layers of the network but reduces the computation by designing smaller layers. Our method follows the same direction of designing a sparse structure for channel fusion that enables lower computation with a minimal loss in accuracy.

Network pruning:

This line of work focuses on reducing the substantial redundancy in CNN parameters by pruning out either neurons or weights [13, 14]. Due to the unstructured sparsity of these models, the learned models cannot be used efficiently on standard compute platforms such as CPUs and GPUs. Therefore, other pruning approaches focus on pruning out channels rather than individual neurons or weights [17, 43, 11]. These methods drop a channel by monitoring either the average weight values or the average activation values of each channel during training. Our method differs from these methods in that we enforce a predefined sparse channel structure from the start, and we do not change the structure of the network during training.

Low-rank network design:

To reduce the computation in CNNs, [37, 25, 8, 22] exploit the fact that CNNs are over-parameterized. These models learn a linear low-rank representation of the parameters in the network, either by post-processing the trained weight tensors or by enforcing a linear low-rank structure during training. A few works enforce a non-linear low-rank structure using a circulant matrix design [2, 9]. These low-rank network structures achieve efficiency at the cost of reducing the information flow from the input channels to the output channels (i.e., they have few bottleneck nodes), whereas our butterfly transform is in fact a non-linear structured low-rank representation that maximizes information flow.

Quantization:

Another approach to improving the efficiency of deep networks is low-bit representation of network weights and neurons using quantization [34, 30, 39, 4, 42, 21, 1]. These approaches use fewer bits (instead of 32-bit high-precision floating point) to represent weights and neurons within the standard training procedure of a network. In the case of extremely low bitwidth (1-bit), [30] had to modify the training procedure to find the discrete binary values for the weights and the neurons in the network. Our method is orthogonal to this line of work, and these methods are complementary to our network.

Neural architecture search:

Recently, neural architecture search methods, including reinforcement learning and genetic algorithms, have been proposed to automatically construct network architectures [44, 40, 31, 45, 36, 27]. These methods search over a huge network space (e.g., MNasNet [36] searches over 8K different design choices) using a dictionary of pre-defined search-space parameters, including different types of convolutional layers and kernel sizes, to identify a network structure, usually non-homogeneous, that satisfies optimization constraints such as inference time. Recent search-based methods [36, 3, 38] use MobileNetV2 [32] as a basic search block for automatic network design. The main computational bottleneck in most search-based methods is in the channel fusion, and our butterfly structure does not exist in any of the predefined blocks of these methods. Our efficient channel fusion can be combined with these models to further improve their efficiency. Our experiments show that our proposed butterfly structure outperforms recent architecture-search-based models for small network design.

3 Model

In this section, we outline the details of our proposed model. As discussed above, the main computational bottleneck in current efficient neural architecture designs is the channel fusion step, which is implemented by a point-wise convolution layer. The input to this layer is a tensor $\mathbf{X}$ of size $n_{in}\times h\times w$, where $n_{in}$ is the number of input channels and $w$, $h$ are the width and height respectively. The size of the weight tensor $\mathbf{W}$ is $n_{out}\times n_{in}\times 1\times 1$ and the output tensor $\mathbf{Y}$ is of size $n_{out}\times h\times w$. Without loss of generality, we assume $n = n_{in} = n_{out}$. The complexity of a point-wise convolution layer is $\mathcal{O}(n^2 hw)$, and this is mainly influenced by the number of channels $n$. We propose a new layer design, the Butterfly Transform, that has $\mathcal{O}(n\log n \; hw)$ complexity. This design is inspired by the Fast Fourier Transform (FFT) algorithm, which has been widely used in computational engines for a variety of applications, and there exist optimized hardware/software designs for the key operations of this algorithm which are applicable to our method. In the following subsections we explain the problem formulation and the structure of our butterfly transform.

3.1 Point-wise Convolution as Matrix-Vector Products

A point-wise convolution can be defined as a function $\mathcal{F}$ of the input tensor $\mathbf{X}\in\mathbb{R}^{n\times h\times w}$ and the weight tensor $\mathbf{W}\in\mathbb{R}^{n\times n\times 1\times 1}$ as follows:

$$\mathbf{Y} = \mathcal{F}(\mathbf{X}, \mathbf{W}) \qquad (1)$$

This can be written as a matrix product by reshaping the input tensor to a 2-D matrix $\hat{\mathbf{X}}$ with size $n\times (hw)$ (each column vector in $\hat{\mathbf{X}}$ corresponds to a spatial vector $\mathbf{X}[:, y, x]$) and reshaping the weight tensor to a 2-D matrix $\hat{\mathbf{W}}$ with size $n\times n$,

$$\hat{\mathbf{Y}} = \hat{\mathbf{W}}\hat{\mathbf{X}} \qquad (2)$$

where $\hat{\mathbf{Y}}$ is the matrix representation of the output tensor $\mathbf{Y}$. This can be seen as a linear transformation of the vectors in the columns of $\hat{\mathbf{X}}$ using $\hat{\mathbf{W}}$ as a transformation matrix. This linear transformation is a matrix-vector product per spatial position, and its complexity is $\mathcal{O}(n^2)$ per position ($\mathcal{O}(n^2 hw)$ in total). By enforcing structure on this transformation matrix, one can reduce the complexity of the transformation. However, to be effective as a channel fusion transform, it is critical that this transformation respects the desirable characteristics detailed below.
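As a quick sanity check (not from the paper), the following sketch verifies that a $1\times 1$ convolution is equivalent to a single matrix product over the reshaped input; the tensor names and sizes are assumptions made for this example.

```python
import numpy as np

# Hypothetical sizes: n channels, h x w spatial resolution.
n, h, w = 8, 4, 4
X = np.random.randn(n, h, w)          # input tensor
W = np.random.randn(n, n)             # 1x1 convolution weights, W[i, j]

# Point-wise convolution computed directly per spatial position.
Y_direct = np.einsum('ij,jhw->ihw', W, X)

# The same operation as one matrix product on the reshaped input (equation 2).
X_hat = X.reshape(n, h * w)           # n x (hw)
Y_hat = W @ X_hat                     # n x (hw)
Y_matmul = Y_hat.reshape(n, h, w)

assert np.allclose(Y_direct, Y_matmul)
```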

Ideal characteristics of a fusion network:

1) Every-to-all connectivity: there must be at least one path between every input channel and all of the output channels. 2) Maximum bottleneck size: the bottleneck size is defined as the minimum number of nodes in the network that, if removed, completely cut off the information flow from input channels to output channels (i.e., there would be no path from any input channel to any output channel). The largest possible bottleneck size in a multi-layer network with $n$ input and $n$ output channels is $n$. 3) Small edge count: to reduce computation, we expect the network to have as few edges as possible. 4) Equal out-degree within each layer: to enable an efficient matrix implementation of the network, all nodes within each layer must have the same out-degree (in this way, each layer can be represented as a fixed number of matrix multiplications, which are supported by fast linear algebra libraries, e.g. BLAS; otherwise, we would need sparse matrix operations, which are not as optimized).

Claim: A multi-layer network with these properties has $\Omega(n\log n)$ edges.

Proof: Suppose there exist $n_i$ nodes in the $i$-th layer and the network has $m$ layers. Removing all the nodes in one layer disconnects the inputs from the outputs. Since the maximum possible bottleneck size is $n$, we have $n_i \ge n$ for every layer. Now suppose the out-degree of each node at layer $i$ is $d_i$. The number of nodes in the output layer that are reachable from a single input channel is at most $\prod_{i=1}^{m} d_i$. Because of the every-to-all connectivity, all $n$ nodes in the output layer are reachable, therefore $\prod_{i=1}^{m} d_i \ge n$. By the AM-GM inequality this implies $\sum_{i=1}^{m} d_i \ge m\, n^{1/m} = \Omega(\log n)$. The total number of edges is therefore $\sum_{i=1}^{m} n_i d_i \ge n \sum_{i=1}^{m} d_i = \Omega(n\log n)$.

In the following section we present a network structure that satisfies all the ideal characteristics of a fusion network.

3.2 Butterfly Transform (BFT)

As mentioned above, we can reduce the complexity of a matrix-vector product by enforcing structure on the matrix, and there are several ways to do so. Here we introduce a family of structured matrices that leads to $\mathcal{O}(n\log n)$ operations and parameters while maintaining accuracy.

Butterfly Matrix:

We define $\mathbf{B}^{(n,k)}$ as a butterfly matrix of order $n$ and base $k$, where $\mathbf{B}^{(n,k)}\in\mathbb{R}^{n\times n}$:

$$
\mathbf{B}^{(n,k)} =
\begin{bmatrix}
\mathbf{M}_1^{(\frac{n}{k},k)}\mathbf{D}_{11} & \cdots & \mathbf{M}_1^{(\frac{n}{k},k)}\mathbf{D}_{1k} \\
\vdots & \ddots & \vdots \\
\mathbf{M}_k^{(\frac{n}{k},k)}\mathbf{D}_{k1} & \cdots & \mathbf{M}_k^{(\frac{n}{k},k)}\mathbf{D}_{kk}
\end{bmatrix}
\qquad (3)
$$

where each $\mathbf{M}_i^{(\frac{n}{k},k)}$ is a butterfly matrix of order $\frac{n}{k}$ and base $k$, and each $\mathbf{D}_{ij}$ is an arbitrary $\frac{n}{k}\times\frac{n}{k}$ diagonal matrix. The matrix-vector product between a butterfly matrix $\mathbf{B}^{(n,k)}$ and a vector $\mathbf{x}\in\mathbb{R}^n$ is:

$$
\mathbf{B}^{(n,k)}\mathbf{x} =
\begin{bmatrix}
\sum_{j=1}^{k}\mathbf{M}_1^{(\frac{n}{k},k)}\mathbf{D}_{1j}\mathbf{x}_j \\
\vdots \\
\sum_{j=1}^{k}\mathbf{M}_k^{(\frac{n}{k},k)}\mathbf{D}_{kj}\mathbf{x}_j
\end{bmatrix}
\qquad (4)
$$

where $\mathbf{x}_j$ is a subsection of $\mathbf{x}$ that is obtained by breaking $\mathbf{x}$ into $k$ equal-sized vectors. Therefore, the product can be simplified by factoring out the $\mathbf{M}_i^{(\frac{n}{k},k)}$ as follows:

$$
\mathbf{B}^{(n,k)}\mathbf{x} =
\begin{bmatrix}
\mathbf{M}_1^{(\frac{n}{k},k)}\mathbf{y}_1 \\
\vdots \\
\mathbf{M}_k^{(\frac{n}{k},k)}\mathbf{y}_k
\end{bmatrix}
\qquad (5)
$$

where $\mathbf{y}_i = \sum_{j=1}^{k}\mathbf{D}_{ij}\mathbf{x}_j$. Note that $\mathbf{M}_i^{(\frac{n}{k},k)}\mathbf{y}_i$ is a smaller product between a butterfly matrix of order $\frac{n}{k}$ and a vector of size $\frac{n}{k}$; therefore, we can use divide-and-conquer to recursively calculate the product $\mathbf{B}^{(n,k)}\mathbf{x}$. Let $T(n)$ denote the computational complexity of the product between an $n\times n$ butterfly matrix and an $n$-D vector. From equation 5, the product can be calculated by $k$ products of butterfly matrices of order $\frac{n}{k}$, whose complexity is $kT(\frac{n}{k})$. The complexity of calculating $\mathbf{y}_i$ for all $i\in\{1,\dots,k\}$ is $\mathcal{O}(kn)$; therefore:

$$T(n) = kT\!\left(\frac{n}{k}\right) + \mathcal{O}(kn) \qquad (6)$$
$$T(n) = \mathcal{O}(kn\log_k n) \qquad (7)$$

With a smaller choice of $k$ we can achieve a lower complexity. Algorithm 1 illustrates the recursive procedure of a butterfly transform when $k=2$.

Function ButterflyTransform(W, X, n):    /* BFT as a recursive function */
    Data: W — weights containing $2n\log n$ numbers
    Data: X — an input containing $n$ numbers
    if $n = 1$ then
        return [X];
    Construct the diagonal matrices $D_{11}, D_{12}, D_{21}, D_{22}$ (each $\frac{n}{2}\times\frac{n}{2}$) using the first $2n$ numbers of $W$;
    Split the remaining $2n(\log n - 1)$ numbers into two sequences $W_1, W_2$ of length $n(\log n - 1)$ each;
    Split $X$ into $X_1, X_2$ (each of length $\frac{n}{2}$);
    $y_1 \leftarrow D_{11}X_1 + D_{12}X_2$;
    $y_2 \leftarrow D_{21}X_1 + D_{22}X_2$;
    $z_1 \leftarrow$ ButterflyTransform($W_1$, $y_1$, $n/2$);
    $z_2 \leftarrow$ ButterflyTransform($W_2$, $y_2$, $n/2$);
    return Concat($z_1$, $z_2$);

Algorithm 1: Recursive butterfly transform ($k = 2$)
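For concreteness, here is a minimal NumPy sketch of the recursive procedure in Algorithm 1 (base $k=2$). It is an illustrative implementation under the weight layout described above, not the authors' released code; the function name and the flat weight-vector layout are assumptions.

```python
import numpy as np

def butterfly_transform(W, X):
    """Apply a base-2 butterfly transform to a vector X of length n (a power of 2).

    W is a flat array of 2 * n * log2(n) weights: the first 2n entries hold the
    four diagonal matrices of the current level, the rest is split evenly
    between the two recursive calls (layout assumed for this sketch).
    """
    n = len(X)
    if n == 1:
        return X
    d11, d12, d21, d22 = W[:2 * n].reshape(4, n // 2)   # diagonals of D11, D12, D21, D22
    W1, W2 = W[2 * n:].reshape(2, -1)                   # weights of the two sub-transforms
    X1, X2 = X[:n // 2], X[n // 2:]
    y1 = d11 * X1 + d12 * X2
    y2 = d21 * X1 + d22 * X2
    z1 = butterfly_transform(W1, y1)
    z2 = butterfly_transform(W2, y2)
    return np.concatenate([z1, z2])

# Example: n = 8 channels -> 2 * 8 * log2(8) = 48 weights.
n = 8
W = np.random.randn(2 * n * int(np.log2(n)))
X = np.random.randn(n)
print(butterfly_transform(W, X).shape)  # (8,)
```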
DFT transform is a specific case of BFT:

It can be shown that the Discrete Fourier Transform (DFT) is a specific case of our butterfly transform. In fact, a DFT is a butterfly transform whose transformation matrix is a complex butterfly matrix $\mathbf{B}^{(n,2)}\in\mathbb{C}^{n\times n}$: the elements of the output vector are permuted by a radix-2 shuffle and the diagonal elements of the $\mathbf{D}_{ij}$ matrices are drawn from the $n$-th roots of unity $e^{-\frac{2\pi i t}{n}}$. Note that the complexity of the butterfly transform is independent of this row permutation. Therefore, the Fast Fourier Transform (FFT) is a specific case of our efficient butterfly transform of order $n$ and base $2$, whose complexity is $\mathcal{O}(n\log n)$. BFT can be seen as a more general transform than the DFT in that its parameters are learned during the training of the network. In the next section we discuss the network structure of BFT.
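As a reminder of why the DFT fits this recursive pattern, the standard radix-2 decomposition (a textbook identity, stated here for context rather than quoted from the paper) expresses an $n$-point DFT through two half-size DFTs combined with diagonal twiddle factors:

$$
\begin{aligned}
X_k &= E_k + \omega_n^{k}\, O_k,\\
X_{k+n/2} &= E_k - \omega_n^{k}\, O_k, \qquad k = 0,\dots,\tfrac{n}{2}-1,
\end{aligned}
$$

where $E_k$ and $O_k$ are the DFTs of the even- and odd-indexed samples and $\omega_n = e^{-2\pi i/n}$. In BFT, the fixed twiddle factors $\pm\omega_n^{k}$ are replaced by the learnable diagonal entries of the $\mathbf{D}_{ij}$ matrices.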

Figure 1: BFT Architecture: This figure illustrates the graph structure of the proposed butterfly transform. The left part shows the recursive procedure of the BFT applied to an input tensor, and the right part shows the expanded version of the recursive procedure as butterfly layers in the network.

3.3 Butterfly Neural Network

The procedure explained in Algorithm 1 can be represented by a butterfly graph similar to the FFT's graph. The butterfly network structure has been used for function representation [26] and for fast factorizations that approximate linear transformations [6]. We adopt this graph as an architecture design for the layers of a neural network. Figure 1 illustrates the architecture of a butterfly network of base $k=2$ applied to an input tensor of size $n\times h\times w$. The left part of the figure shows the recursive structure of the BFT as a network, and the right part shows the constructed multi-layer network, which has $\log n$ butterfly layers (BFLayers). Note that the complexity of each butterfly layer is $\mathcal{O}(n)$ ($2n$ operations); therefore, the total complexity of the BFT architecture is $\mathcal{O}(n\log n)$. Each butterfly layer can be augmented with batch norm and non-linearity functions (e.g., ReLU, Sigmoid). In Section 4.2 we study the effect of using different choices of these functions. We found that neither batch norm nor the non-linear functions (ReLU and Sigmoid) are effective within BFLayers. Batch norm is not effective mainly because its complexity is the same as that of a BFLayer, $\mathcal{O}(n)$, so it would double the computation of the entire transform; we therefore use batch norm only at the end of the transform. The non-linear activations ReLU and Sigmoid zero out almost half of the values in each BFLayer, and the multiplication of these values throughout the forward propagation destroys all the information. The BFLayers can be internally connected with residual connections in different ways. In our experiments, we found that the best residual connection is the one that connects the input of the first BFLayer to the output of the last BFLayer.
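As a concrete illustration of this layered view, here is a minimal PyTorch-style sketch of a BFT block with $\log_2 n$ BFLayers, a first-to-last residual connection, and batch norm only at the end. The class name, the two-weights-per-node parameterization, and the use of `BatchNorm2d` are assumptions made for this example, not the authors' released code.

```python
import torch
import torch.nn as nn

class ButterflyTransform(nn.Module):
    """Stack of log2(n) base-2 butterfly layers (BFLayers) with a first-to-last
    residual connection and batch norm only at the end (illustrative sketch)."""

    def __init__(self, n):
        super().__init__()
        assert (n & (n - 1)) == 0, "n must be a power of two"
        self.n = n
        self.num_layers = n.bit_length() - 1            # log2(n) BFLayers
        # two learnable weights per node per layer -> 2 * n * log2(n) parameters
        self.w_self = nn.Parameter(torch.randn(self.num_layers, n))
        self.w_pair = nn.Parameter(torch.randn(self.num_layers, n))
        self.bn = nn.BatchNorm2d(n)

    def forward(self, x):                               # x: (batch, n, h, w)
        idx = torch.arange(self.n, device=x.device)
        out = x
        for l in range(self.num_layers):
            stride = self.n >> (l + 1)                  # partner distance halves each layer
            pair = out[:, idx ^ stride]                 # partner channel of every channel
            out = (self.w_self[l].view(1, -1, 1, 1) * out
                   + self.w_pair[l].view(1, -1, 1, 1) * pair)
        return self.bn(out + x)                         # first-to-last residual, then BN

y = ButterflyTransform(64)(torch.randn(2, 64, 8, 8))
print(y.shape)                                          # torch.Size([2, 64, 8, 8])
```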

The butterfly network is an optimal structure satisfying all the characteristics of an ideal fusion network: there exists exactly one path from every input channel to every output channel, the out-degree of each node in the graph is exactly $k=2$, the bottleneck size is the maximum possible ($n$), and the number of edges is $2n\log n$, which matches the $\Omega(n\log n)$ lower bound.
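These graph properties are easy to check numerically. The sketch below (not from the paper) builds the base-2 butterfly connectivity pattern for a hypothetical $n$ and verifies every-to-all connectivity, the constant out-degree of 2, and the $2n\log_2 n$ edge count; the helper name is made up for the example.

```python
import numpy as np

def butterfly_layer_adjacency(n, layer):
    """0/1 adjacency of one base-2 butterfly layer: node i connects to i and i ^ stride."""
    A = np.zeros((n, n), dtype=int)
    stride = n >> (layer + 1)          # pair distance halves at each layer
    for i in range(n):
        A[i, i] = 1
        A[i, i ^ stride] = 1
    return A

n = 16
layers = [butterfly_layer_adjacency(n, l) for l in range(int(np.log2(n)))]

# every-to-all connectivity: composing all layers reaches every output from every input
reach = np.eye(n, dtype=int)
for A in layers:
    reach = (reach @ A > 0).astype(int)
assert reach.all(), "some input/output pair is disconnected"

# equal out-degree of 2 in every layer, and 2 * n * log2(n) edges in total
assert all((A.sum(axis=1) == 2).all() for A in layers)
assert sum(A.sum() for A in layers) == 2 * n * int(np.log2(n))
print("butterfly graph checks passed")
```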

We use the BFT architecture as a replacement for the point-wise convolution layers ($1\times 1$-convs) in different CNN architectures, including MobileNet [18] and ShuffleNet [28]. Our experimental results show that, under the same number of FLOPs, replacing the point-wise convolutions with BFT is more effective in terms of accuracy than shrinking the channel rate of the original model. We observed consistent accuracy improvements across several architecture settings.

Fusing channels using BFT instead of point-wise convolutions reduces the size of the computational bottleneck by a large margin. Figure 2 illustrates the percentage of operations consumed by each block type during a forward pass through the network. Note that when BFT is applied, the depth-wise convolutions account for a much larger share of the total operations.

4 Experiment

In this section, we demonstrate the performance of the proposed butterfly transform on the image classification task. To showcase the strength of our method for designing very small networks, we compare the performance of the Butterfly Transform with point-wise convolutions in two state-of-the-art efficient architectures: (1) MobileNet and (2) ShuffleNet V2. We also compare our results with other types of transforms that have $\mathcal{O}(n\log n)$ computation (e.g., the low-rank transform and the circulant transform). We also show that, at a small number of FLOPs (~14M), our method works better than state-of-the-art architecture search methods.

4.1 Image Classification

4.1.1 Implementation and dataset details:

Following common practice, we evaluate the performance of the Butterfly Transform on the ImageNet dataset at different levels of complexity, ranging from 14 MFLOPs to 41 MFLOPs. ImageNet contains 1.2M training samples and 50K validation samples.

For each architecture, we substitute point-wise convolutions with butterfly transforms. To have a fair comparison between BFT and point-wise convolutions, we adjust the channel numbers in the architectures (MobileNet and ShuffleNet) such that the number of FLOPs in both methods (BFT and point-wise) remains equal. We optimize our network by minimizing a cross-entropy loss using SGD. We found that using no weight decay gives us better accuracy. We use base $k=4$, because it has the same number of FLOPs as base $k=2$ while the corresponding BFT has fewer BFLayers ($\log_4 n$ instead of $\log_2 n$). Note that if $k=n$, then the complexity of BFT is the same as a point-wise convolution.
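To make the FLOP argument explicit, the per-position cost from equation 7 gives (a short check derived from the complexity formula above, not an excerpt from the paper):

$$
\left. kn\log_k n \right|_{k=4} = 4n\,\frac{\log_2 n}{2} = 2n\log_2 n = \left. kn\log_k n \right|_{k=2},
\qquad
\left. kn\log_k n \right|_{k=n} = n^2 ,
$$

so base $4$ matches the cost of base $2$ with half as many BFLayers, while $k=n$ recovers the dense point-wise convolution.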

Weight initialization:

A common heuristic for weight initialization randomly initializes values in the range $[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$, where $n$ is the fan-in. We instead initialize each edge in such a way that the product of all the edges on a path from an input node to an output node is on the order of $\frac{1}{\sqrt{n}}$. Since each path consists of $\log n$ edges, we initialize each edge in the range $[-(\frac{1}{\sqrt{n}})^{\frac{1}{\log n}}, (\frac{1}{\sqrt{n}})^{\frac{1}{\log n}}]$.
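A minimal sketch of this path-product initialization, assuming the per-edge range reconstructed above and the flat weight layout used in Algorithm 1 (the function name and layout are hypothetical):

```python
import numpy as np

def init_bft_weights(n, rng=np.random.default_rng()):
    """Initialize 2 * n * log2(n) butterfly weights so that the product of the
    log2(n) edges along any input-to-output path is on the order of 1/sqrt(n)."""
    num_layers = int(np.log2(n))
    per_edge = (1.0 / np.sqrt(n)) ** (1.0 / num_layers)   # |w_1 * ... * w_logn| ~ 1/sqrt(n)
    return rng.uniform(-per_edge, per_edge, size=2 * n * num_layers)

W = init_bft_weights(256)
print(W.shape, np.abs(W).max())   # (4096,) and max magnitude <= (1/16)**(1/8)
```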

Figure 2: Distribution of FLOPs: This figure shows that replacing the point-wise convolution with BFT reduces the size of the computational bottleneck.

4.1.2 Mobilenet + BFT

In this model, we replace the point-wise convolutions in MobileNetV1 with BFT. We compare MobileNet with width-multiplier 0.25 (using 128, 160, and 224 input resolutions) against MobileNet+BFT with width-multiplier 1.0 (using 96, 128, and 160 input resolutions). In Table 1(a) we see that, using BFT, we outperform point-wise convolutions by 5% in top-1 accuracy at 14 MFLOPs. Note that MobileNet+BFT at 24 MFLOPs has almost the same accuracy as MobileNet at 41 MFLOPs, which means it reaches the same accuracy with almost half the FLOPs. This is achieved without changing the architecture at all, only by replacing the point-wise convolutions.

4.1.3 Shufflenet V2 + BFT

Table 1(b) shows the results for ShuffleNet V2. We slightly adjusted the number of output channels to build ShuffleNet V2-1.25. We compare ShuffleNet V2-1.25+BFT with ShuffleNet V2-0.5 at different input resolutions (128, 160, 224), which gives FLOPs ranging from 14M to 41M. ShuffleNet V2-1.25+BFT achieves about 2.5% better accuracy than our implementation of ShuffleNet V2-0.5, which uses point-wise convolutions, and 1% better accuracy than the reported number for ShuffleNet V2 at 41 MFLOPs.

FLOPs      MobileNet                      MobileNet+BFT
10-15 M    41.5  (14 M)                   46.58 (14 M)
20-25 M    45.5  (21 M)                   49.88 (24 M)
34-42 M    47.7  (34 M) / 50.6 (41 M)     53.62 (37 M)
(a) MobileNet-V1

FLOPs      ShuffleNet                                ShuffleNet+BFT
10-15 M    50.86 (14 M)                              54.9  (14 M)
20-25 M    55.21 (21 M)                              57.83 (21 M)
34-42 M    58.9  (41 M, our impl.) / 60.3 (41 M)     61.33 (41 M)
(b) ShuffleNet-V2

Table 1: This table compares the top-1 accuracy on ImageNet for MobileNet-V1 and ShuffleNet-V2 when replacing the point-wise convolutions with the butterfly transform.

4.1.4 Comparison with neural architecture search

Our model applied to ShuffleNet V2 achieves a higher accuracy than the state-of-the-art architecture search methods MNasNet [36] and FBNet [38] in the efficient-network setting (~14M FLOPs). This is because those methods only search among a set of predefined blocks, and the most efficient channel fusion block in those methods is the point-wise convolution, which is more expensive than BFT. In Table 2(a), we show the top-1 accuracy comparison on the ImageNet dataset. One could further improve the accuracy of the architecture search models by including BFT as a searchable block in those methods.

4.1.5 Comparison with other architectures

To further illustrate the benefits of BFLayers, we compare them with other architectures that reduce the complexity of channel fusion to $\mathcal{O}(n\log n)$. Here we study the circulant [9] and low-rank [22] architectures.

Circulant architecture: In this architecture, the matrix that represents the point-wise convolution is a circulant matrix, in which each row is a cyclic shift of the previous row [9]. The product of this circulant matrix with a column vector can be computed efficiently in $\mathcal{O}(n\log n)$ using the Fast Fourier Transform (FFT).

Low-rank matrix: In this architecture, the matrix that represents the point-wise convolution is the product of an $n\times r$ matrix and an $r\times n$ matrix. Therefore, the point-wise convolution can be performed by two consecutive small matrix products, and the total complexity is $\mathcal{O}(rn)$ per spatial position, with $r$ chosen so that the FLOPs match those of BFT.
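For reference, here is a minimal sketch of this low-rank baseline (an illustrative factorization with hypothetical shapes, not the authors' implementation):

```python
import numpy as np

n, r, hw = 256, 8, 14 * 14           # channels, rank, flattened spatial size (hypothetical)
U = np.random.randn(n, r)            # n x r factor
V = np.random.randn(r, n)            # r x n factor
X_hat = np.random.randn(n, hw)       # reshaped input, as in equation (2)

# Two consecutive small matrix products instead of one dense n x n product:
Y_hat = U @ (V @ X_hat)              # O(r * n * hw) instead of O(n^2 * hw)
print(Y_hat.shape)                   # (256, 196)
```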

Figure 3: Effect of weight decay.
Figure 4: Effect of activations.
In BFT we should not apply weight decay, because it significantly reduces the effect of the input channels on the output channels. Similarly, we should not apply the common non-linear activations, because these functions zero out almost half of the values in the intermediate BFLayers, which leads to a significant drop in the information flow from the input channels to the output channels.
Model Accuracy
Shufflenet-V2 + BFT (14 M) 54.9
FBNet-96-0.35-1 (12.9 M) 50.2
FBNet-96-0.35-2 (13.7 M) 51.9
MNasNet (12.7 M) 49.3
(a) BFT vs. Architecture Search
Model Accuracy
Mobilenet (42 M) 50.6
Mobilenet+Circulant (42 M) 35.68
Mobilenet+low-rank (37 M) 43.78
Mobilenet+BFT (37 M) 53.6
(b) BFT vs. Low-rank & Circulant
Table 2: These tables compare BFT with other efficient network design approaches. Our BFT model augmented with ShuffleNetV2 outperforms the state-of-the-art neural architecture search methods (MNasNet [36], FBNet [38]). BFT also achieves higher accuracy than efficient network designs based on the low-rank transform and the circulant matrix transform.

As can be seen in Table 2(b), the butterfly transform outperforms both the circulant and the low-rank structures under the same number of FLOPs.

4.2 Ablation Study

We now study the different elements of our BFT model. As mentioned earlier, residual connections and non-linear activations can be added within our BFLayers. Here we show the performance of these elements in isolation on the CIFAR-10 dataset, using MobileNetV1 as the base network.

With/without non-linearity: As studied in [32], adding a non-linearity function such as ReLU or Sigmoid to a narrow layer (with few channels) reduces the accuracy, because it cuts off half of the values of an internal layer to zero. In BFT, the effect of an input channel $x_i$ on an output channel $y_j$ is determined by the product of all the edges on the path between $x_i$ and $y_j$. Dropping a value to zero destroys all the information transferred between the two channels, and dropping half of the values of each internal layer destroys almost all the information in the entire layer. Figure 4 compares the learning curves of BFT models with and without non-linear activation functions.

With/without weight decay: We found that BFT is very sensitive to weight decay, because in BFT there is only one path from an input channel $x_i$ to an output channel $y_j$, and the effect of $x_i$ on $y_j$ is determined by the product of all the intermediate edges along that path. Pushing all weight values toward zero therefore significantly reduces the effect of $x_i$ on $y_j$, which makes weight decay very destructive in BFT. Figure 3 illustrates the learning curves with and without weight decay on BFT.

Residual connections: The graphs obtained by replacing point-wise convolutions with BFT are very deep. A good practice for training this kind of network is to add residual connections. We experimented with three different types of residual connections: (1) First-to-Last, which connects the input of the first BFLayer to the output of the last BFLayer; (2) Every-other-Layer, which connects every other BFLayer; and (3) No-residual, which has no residual connections. We found that First-to-Last is the most effective type of residual connection; Table 3 compares these types of connections.

Model                 Accuracy
No residual           79.2
Every-other-Layer     81.12
First-to-Last         81.75
Table 3: Residual connections

5 Conclusion

In this paper, we introduced BFT, an efficient transformation that can simply replace point-wise convolutions in various neural architectures to significantly improve the accuracy of the networks while maintaining the same number of FLOPs. The structure employed by BFT has interesting properties that ensure the efficacy of the proposed solution. The main focus of this paper is on small networks that are designed for resource-constrained devices; BFT shows bigger gains in smaller networks, and its utility diminishes for larger ones. The butterfly operations in BFT are extremely hardware friendly, and there exist several optimized hardware/software platforms for these operations.

References

  • [1] Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. Yodann: An architecture for ultralow power binary-weight cnn acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
  • [2] Alexandre Araujo, Benjamin Negrevergne, Yann Chevaleyre, and Jamal Atif. On the expressive power of deep fully circulant neural networks. arXiv preprint arXiv:1901.10255, 2019.
  • [3] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
  • [4] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training neural networks with weights and activations constrained to+ 1 or- 1. arXiv preprint arXiv:1602.02830, 2016.
  • [5] Kenneth Czechowski, Casey Battaglino, Chris McClanahan, Kartik Iyer, P-K Yeung, and Richard Vuduc. On the communication complexity of 3d ffts and its implications for exascale. In Proceedings of the 26th ACM international conference on Supercomputing, pages 205–214. ACM, 2012.
  • [6] Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, and Christopher Ré. Learning fast algorithms for linear transforms using butterfly factorizations. arXiv preprint arXiv:1903.05895, 2019.
  • [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [8] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
  • [9] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, et al. CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 395–408. ACM, 2017.
  • [10] Yuri Dotsenko, Sara S Baghsorkhi, Brandon Lloyd, and Naga K Govindaraju. Auto-tuning of fast fourier transform on graphics processors. In ACM SIGPLAN Notices, pages 257–266. ACM, 2011.
  • [11] Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. Dynamic channel pruning: Feature boosting and suppression. arXiv preprint arXiv:1810.05331, 2018.
  • [12] Naga K Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High performance discrete fourier transforms on graphics processors. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, page 2. IEEE Press, 2008.
  • [13] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • [14] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [16] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [17] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
  • [18] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [19] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In CVPR, 2018.
  • [20] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [21] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
  • [22] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
  • [23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [24] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
  • [25] Chong Li and CJ Richard Shi. Constrained optimization based low-rank approximation of deep neural networks. In ECCV, 2018.
  • [26] Yingzhou Li, Xiuyuan Cheng, and Jianfeng Lu. Butterfly-net: Optimal function representation based on convolutional neural networks. arXiv preprint arXiv:1805.07451, 2018.
  • [27] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
  • [28] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
  • [29] Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi. Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network. In CVPR, 2019.
  • [30] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
  • [31] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2902–2911. JMLR. org, 2017.
  • [32] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
  • [34] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, 2014.
  • [35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [36] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
  • [37] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
  • [38] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.
  • [39] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
  • [40] Lingxi Xie and Alan Yuille. Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1379–1388, 2017.
  • [41] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
  • [42] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
  • [43] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 875–886, 2018.
  • [44] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  • [45] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.