Exploring the Regularity of Sparse Structure in Convolutional Neural Networks
Abstract
Sparsity helps reduce the computational complexity of deep neural networks by skipping zeros. Taking advantage of sparsity is listed as a high priority in next-generation DNN accelerators such as the TPU [1]. The structure of sparsity, i.e., the granularity of pruning, affects the efficiency of hardware accelerator design as well as the prediction accuracy. Coarse-grained pruning creates more regular sparsity patterns, making it more amenable to hardware acceleration but more challenging to maintain the same accuracy. In this paper we quantitatively measure the trade-off between sparsity regularity and prediction accuracy, providing insights into how to maintain accuracy while having a more structured sparsity pattern. Our experimental results show that coarse-grained pruning can achieve a sparsity ratio similar to unstructured pruning with no loss of accuracy. Moreover, due to the index saving effect, coarse-grained pruning obtains a better compression ratio than fine-grained sparsity at the same accuracy threshold. Based on the recent sparse convolutional neural network accelerator (SCNN), our experiments further demonstrate that coarse-grained sparsity saves about 30% of the memory references compared with fine-grained sparsity. Since a memory reference is more than two orders of magnitude more expensive than an arithmetic operation, the regularity of the sparse structure leads to more efficient hardware design.
Huizi Mao¹, Song Han¹, Jeff Pool², Wenshuo Li³, Xingyu Liu¹, Yu Wang³, William J. Dally¹,²
¹Stanford University  ²NVIDIA  ³Tsinghua University
{huizi,songhan,dally}@stanford.edu
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
1 Introduction
Deep Neural Networks (DNNs) have many parameters, which leads to problems with storage, computation and energy cost. State-of-the-art Convolutional Neural Network (CNN) models have hundreds of millions of parameters and take tens of billions of operations [2, 3, 4]. That makes DNN models difficult to deploy on embedded systems with limited resources.
To deal with this problem, various methods have been proposed to compress DNN models and reduce the amount of computation. Some methods are based on decomposition and factorization [5, 6]. These methods preserve the regular dense computation structure of the original models, and are thus able to achieve both compression and acceleration on general-purpose processors. Pruning serves as another effective method to greatly reduce the number of parameters with no loss of accuracy [7, 8].
Pruning-based methods are better at preserving accuracy and at achieving higher compression rates [7]. However, these improvements come at the cost of an irregular sparse computation pattern. On the other hand, structured pruning, such as pruning entire filters, causes a larger accuracy loss than pruning individual weights [9]. These observations pose several questions: What is the trade-off between regularity and accuracy? Is it possible to find a sweet spot in the range of regularity? How does the sweet spot improve the efficiency of hardware implementations?
We attempt to answer these questions by looking into pruning at different granularities, as shown in Figure 1. There are existing works that prune filters or channels instead of individual weights [10, 11, 12]. However, they are individual points in the design space; because of the various methods they use, we cannot directly evaluate the relationship between pruning granularity and final accuracy from them. We therefore apply the exact same method and experimental setting for an effective comparison. We also want to explore a consistent range of granularity, including intermediate grain sizes such as 2-D kernels and 1-D sub-kernel vectors. Based on a thorough exploration of this space, we are able to analyze the storage saving and hardware efficiency at different granularities of sparsity.
In this work, we make the following contributions:
- We explore a complete range of pruning granularities and evaluate the trade-off between the model's regularity and accuracy.
- We demonstrate that coarse-grained pruning is able to reach compression rates similar to or even better than fine-grained pruning, even though it obtains less sparsity.
- We show that coarse-grained sparsity is able to skip computations and reduce memory references in a structured manner, which leads to more efficient hardware accelerator implementations.
2 Related Work
Methods of pruning. Sparsity has been proven an effective approach to saving parameters of deep neural network models [7, 8]. A number of works have investigated how to select the important connections and effectively recover accuracy. Second-order derivatives [13], absolute values [7], loss-approximating Taylor expansions [10], and output sensitivity [14] are examples of importance metrics used for pruning. There are also methods that better integrate pruning and training, such as iterative pruning [7] and dynamic pruning [8].
Granularity of sparsity. Among all types of sparsity, fine-grained sparsity (vanilla sparsity) and filter-wise sparsity (very coarse-grained sparsity) are two extreme cases that have been studied [7, 9]. Fine-grained sparsity, in which individual weights are deleted, was first proposed in 1989 by LeCun et al. [13]. It has been proven to work well on a wide range of popular CNN and RNN models [7, 8, 15, 16]. There is also channel reduction and filter reduction, which reduce the dimensions of input/output features as well as layers. Channel reduction can be viewed as very coarse-grained sparsity that removes 3-dimensional sub-tensors in convolutional layers. Such coarse-grained sparsity is beneficial for acceleration due to its regularity [11, 17]. However, it usually causes a notable accuracy drop compared with fine-grained sparsity, as indicated by Li et al. [9].
There is a large range of granularities between vanilla sparsity and channel reduction. Some literature explores one or a few possibilities among all the choices. Intra-kernel strided pruning is one case investigated in the work of Anwar et al. [12].
Accelerating sparse models. For very coarse-grained sparsity like filter sparsity and channel sparsity, it is simple to achieve acceleration on general-purpose processors because it is equivalent to obtaining a smaller dense model [11]. For fine-grained sparsity, custom accelerators [18, 19] have been proposed to exploit the reduction in computation.
3 Granularity of Sparsity
3.1 Notations
To simplify descriptions, we use the following notation for CNNs. In a single convolutional layer, the weights compose a 4-dimensional tensor of shape N × C × K_h × K_w. N is the output dimension, i.e., the number of output feature maps. C is the input dimension. K_h and K_w are the shape of the convolution kernels (K_h = 3, K_w = 3 for a 3x3 kernel).

One layer's weights consist of multiple filters (3-dimensional tensors of shape C × K_h × K_w), each one associated with an output feature map. The weights can also be viewed as multiple channels (3-dimensional tensors of shape N × K_h × K_w), each one associated with an input feature map. Filters and channels are both composed of kernels (2-dimensional tensors of shape K_h × K_w), which are the key elements in the 2-D convolution operation. Sub-kernel vectors (1-dimensional tensors of size K_h or K_w) and scalar weights (0-dimensional tensors) are lower-level elements in a convolutional layer. Figure 3 illustrates these concepts.
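These views can be written down directly as numpy slices (a minimal sketch; the shape values and variable names are ours, chosen for illustration):

```python
import numpy as np

# Illustrative shapes: N output maps, C input maps, Kh x Kw kernels.
N, C, Kh, Kw = 4, 3, 3, 3
W = np.zeros((N, C, Kh, Kw), dtype=np.float32)  # one layer's weights

filt   = W[0]           # a filter: 3-D tensor of shape (C, Kh, Kw)
chan   = W[:, 0]        # a channel: 3-D tensor of shape (N, Kh, Kw)
kernel = W[0, 0]        # a kernel: 2-D tensor of shape (Kh, Kw)
vec    = W[0, 0, 0]     # a sub-kernel vector: 1-D tensor of size Kw
w      = W[0, 0, 0, 0]  # a scalar weight: 0-D
```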
3.2 Range of Granularity
It has been stated that fine-grained sparsity and channel reduction are two extreme cases of granularity. Among all possible grain sizes, there is a large variety of choices.

In this paper, we investigate granularities where the grain size increases with the number of dimensions. Specifically, we study 4 cases where the atomic elements (grains) during pruning are scalar weights (0-D), sub-kernel vectors (1-D), kernels (2-D) and filters (3-D). We explain these cases with numpy-like pseudo-code below. Figure 3 also illustrates the different granularities of sparsity.
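The four cases can be expressed in numpy-like pseudo-code (a hedged reconstruction in the spirit of the description above; the function and all names are ours). Each case only changes which axes lie inside a grain:

```python
import numpy as np

def prune_mask(W, grain_dims, sparsity):
    """Zero out the fraction `sparsity` of grains with the smallest L1 norm.

    grain_dims: axes that lie *inside* a grain, e.g. () for fine-grained,
    (3,) for sub-kernel vectors, (2, 3) for kernels, (1, 2, 3) for filters.
    """
    salience = np.abs(W).sum(axis=grain_dims)      # L1 norm of each grain
    k = int(sparsity * salience.size)              # number of grains to delete
    thresh = np.sort(salience, axis=None)[k]       # k-th smallest salience
    keep = salience >= thresh                      # grain-level keep mask
    for ax in sorted(grain_dims):                  # broadcast back to W's shape
        keep = np.expand_dims(keep, ax)
    return np.broadcast_to(keep, W.shape)

np.random.seed(0)
W = np.random.randn(8, 4, 3, 3)                    # (N, C, Kh, Kw)
fine   = prune_mask(W, (), 0.5)                    # 0-D: individual weights
vector = prune_mask(W, (3,), 0.5)                  # 1-D: sub-kernel vectors
kernel = prune_mask(W, (2, 3), 0.5)                # 2-D: kernels
filt   = prune_mask(W, (1, 2, 3), 0.5)             # 3-D: filters
```

At equal sparsity all four masks keep the same number of weights; only the positions of the surviving weights become more constrained as the grain grows.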
We select these four cases because they can be, and already have been, mapped onto simple computation logic. Fine-grained sparsity breaks the tensor into disjoint weights, so fine-grained multiply-accumulate operations are required; implementations are provided in EIE [18] and SCNN [19]. Sub-kernel vector sparsity can be mapped onto 1-D convolution; though the sparse case has yet to be studied, Eyeriss [20] handles the dense case and treats 1-D convolution as the primitive operation. Kernel sparsity relates to 2-D convolution, which is the primitive operation in a variety of algorithms and platforms, such as Winograd convolution in cuDNN¹ and some FPGA implementations [21]. Channel sparsity is equivalent to model reduction and is thus easy to accelerate on all types of platforms.

¹Nvidia developer's blog
3.3 Coarsegrained Pruning Method
Coarse-grained pruning deletes multiple weights together instead of individual weights. Typically the grain can be filters, kernels, sub-kernel vectors or anything else. Because we are interested in the effects of the grain size rather than the pruning method, we adopt the simple magnitude-based pruning criterion of [7]. For a grain G that consists of multiple weights, the salience is defined as the sum of their absolute values, i.e., the L1 norm in Equation 1. Given a targeted sparsity, say 30% of the weights to be zero, we sort the grains according to the L1 norm defined in Equation 1 and delete the grains with the smallest 30% of L1 norms.
Salience(G) = Σ_{w ∈ G} |w|    (1)
We also adopt the iterative pruning method proposed by Han et al. [7], which is able to reach higher sparsity than direct pruning. The sparsity at each pruning stage is determined by sensitivity analysis, which requires individually pruning every layer and measuring the accuracy loss on the training dataset.
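A minimal sketch of the iterative schedule, assuming fine-grained magnitude pruning and a purely illustrative stage schedule (the retraining and sensitivity-analysis steps are placeholders, not the paper's actual procedure):

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Fine-grained magnitude pruning: zero the smallest-|w| fraction."""
    k = int(sparsity * W.size)
    thresh = np.sort(np.abs(W), axis=None)[k]
    return W * (np.abs(W) >= thresh)

np.random.seed(0)
W = np.random.randn(64, 64)

# Iterative pruning: raise sparsity in stages, retraining between stages
# so the surviving weights can recover the accuracy lost at each step.
for stage_sparsity in (0.3, 0.5, 0.7):   # illustrative schedule, not the paper's
    W = magnitude_prune(W, stage_sparsity)
    # retrain(W)  # placeholder: fine-tune the surviving weights, mask fixed
```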
4 SparsityAccuracy Relation with Different Grain Sizes
Our goal is to study how the granularity of pruning influences prediction accuracy. Specifically, we want to compare the accuracy of different pruning granularities at the same sparsity; the result is shown in Figure 4. Sparsity serves as a regularizer because it lowers the model capacity by reducing the number of parameters. Coarse-grained sparsity not only reduces the number of parameters but also constrains their positions, which is an even stronger regularizer. That is why we observe accuracy improvements at low sparsity in Figure 4.
To ensure a fair comparison, we enforce the same sparsity setting and training schedule for the same model. All experiments were performed on the ImageNet dataset [22] with Caffe [23].
For CNN models, we only count the overall sparsity of convolutional layers, since convolutional layers take up most of the computation in a typical CNN model [24]. However, we still prune the fully-connected layers together with the convolutional layers to obtain consistent comparisons with previous works [7, 8]. For fully-connected layers, we only use fine-grained pruning because there is no such hierarchy of pruning granularity in these layers.
We experimented on AlexNet for detailed accuracy-sparsity curves. We also experimented with modern CNNs, including VGG-16 [4], GoogLeNet [25], ResNet-50 [2] and DenseNet-121 [26], and compared their accuracies at the same sparsity point. The results are reported and compared in Table 1.
Figure 4 shows the accuracy curves against density (1 − sparsity) under various settings. The figure covers four types of granularity of sparsity; in each case the atomic element for pruning is:

- Fine-grained (0-D): individual weights.
- Vector (1-D): sub-kernel vectors of size K_w.
- Kernel (2-D): kernels of shape K_h × K_w.
- Filter (3-D): filters of shape C × K_h × K_w.
When the grain size of pruning is very large, say filters, we observed a huge accuracy loss during iterative pruning. AlexNet loses nearly 1% validation accuracy at the very first pruning stage, which implies filter pruning is unsuitable for lossless model compression. For finer-grained pruning, the accuracy loss is much smaller; we even noticed small accuracy increases during the first several pruning stages. Note that our results for AlexNet are better than the original work by Han et al. [7] due to a smoother pruning process; we give a detailed description in Section 7.
Model         Density   Granularity                   Top-5
AlexNet       24.8%     Kernel Pruning (2-D)          79.20%
                        Vector Pruning (1-D)          79.94%
                        Fine-grained Pruning (0-D)    80.41%
VGG-16        23.5%     Kernel Pruning (2-D)          89.70%
                        Vector Pruning (1-D)          90.48%
                        Fine-grained Pruning (0-D)    90.56%
GoogLeNet     38.4%     Kernel Pruning (2-D)          88.83%
                        Vector Pruning (1-D)          89.11%
                        Fine-grained Pruning (0-D)    89.40%
ResNet-50     40.0%     Kernel Pruning (2-D)          92.07%
                        Vector Pruning (1-D)          92.26%
                        Fine-grained Pruning (0-D)    92.34%
DenseNet-121  30.1%     Kernel Pruning (2-D)          91.56%
                        Vector Pruning (1-D)          91.89%
                        Fine-grained Pruning (0-D)    92.21%
The results in Table 1 and Figure 4 support the hypothesis that coarse-grained sparsity causes a greater accuracy loss than fine-grained sparsity. Pruning with a large grain size like filters greatly hurts accuracy. On the other hand, pruning with a smaller grain size leads to accuracy-sparsity curves similar to those of fine-grained pruning. Notice that in Figure 4 some curves appear to rise smoothly at first, which suggests coarse-grained pruning can still reach compression rates similar to fine-grained pruning, with the additional advantages described in the following section.
5 Comparison of Storage
Model size is an important factor for real-world mobile applications. On the one hand, it constrains applications on memory-bounded devices. On the other hand, memory access is more than two orders of magnitude more expensive in energy than arithmetic during the execution of a deep neural network [7]. Sparsity serves as an effective approach to compressing neural network models. Sparse neural networks are usually stored in a format similar to Compressed Row Storage (CRS), where both values and indices are stored. Coarse-grained sparsity, due to its regularity, reduces the number of indices, as illustrated in Figure 6. Therefore coarse-grained sparse models take up less storage than fine-grained models at the same sparsity.
We want to investigate how prediction accuracy changes with different grain sizes of pruning at the same level of storage (instead of sparsity). We do not use full-precision 32-bit weights but 8-bit weights instead, as 8-bit weights, either true 8-bit integer formats or 8-bit indices into a table of shared fp32 weights, have been proven sufficient in a lot of literature [1, 18, 27]. We use 4-bit indices to store the distances between adjacent non-zeros, following the method of Deep Compression [28]. Moreover, as indicated in Deep Compression, the quantization method works independently of sparsity. To check whether it still works with coarse-grained sparsity, we plot the accuracy-bits curves of the different types of pruned models in Figure 6. The results show that the sparsity structure has negligible influence on quantization.
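Under this storage model (8-bit weights plus one 4-bit relative index per grain), the index-sharing saving can be estimated with a back-of-envelope sketch (the function and the numbers are ours, for illustration only):

```python
def storage_bits(n_weights, density, grain_size, weight_bits=8, index_bits=4):
    """Bits to store a sparse layer: a value for every surviving weight,
    but only one relative index per surviving *grain* (index sharing)."""
    kept = int(n_weights * density)
    grains = kept // grain_size
    return kept * weight_bits + grains * index_bits

n = 1_000_000                                  # illustrative layer size
fine   = storage_bits(n, 0.25, grain_size=1)   # one index per weight
kernel = storage_bits(n, 0.25, grain_size=9)   # a 3x3 kernel shares one index
```

At equal density, the kernel-grain layer needs roughly 9x fewer index bits, so its total storage is smaller even though it keeps the same number of weight values.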
Figure 7 shows the accuracy-storage relationship for AlexNet. We find that the first three curves (Fine-grained, Vector and Kernel) are closer together than in Figure 4. This figure shows the effect of index saving for coarse-grained pruning.
To better compare the compression ratios of modern deep neural nets, we list the results of AlexNet, VGG-16, GoogLeNet, ResNet-50 and DenseNet-121 in Table 2. Here the storage ratio is defined as the model size of the pruned 8-bit model (with 4-bit indices) relative to that of the dense 8-bit model. Note that it is almost impossible to prune a model that exactly matches the baseline accuracy, so we use linear interpolation to obtain the estimated density and storage ratio at a given accuracy.
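The interpolation step can be sketched with np.interp; the (accuracy, density) pairs below are made up for illustration and are not the paper's measurements:

```python
import numpy as np

# Suppose pruning produced these (top-5 accuracy, density) pairs for one model,
# sorted by ascending accuracy.
acc     = np.array([79.0, 80.0, 80.5])
density = np.array([0.20, 0.28, 0.40])

# Estimated density at a hypothetical baseline accuracy of 80.3%:
est = np.interp(80.3, acc, density)
```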
Model         Top-5 Accuracy   Granularity                   Density   Storage Ratio
AlexNet       80.3%            Kernel Pruning (2-D)          37.8%     39.7%
                               Vector Pruning (1-D)          29.9%     34.5%
                               Fine-grained Pruning (0-D)    22.1%     33.0%
VGG-16        90.6%            Kernel Pruning (2-D)          44.4%     46.9%
                               Vector Pruning (1-D)          30.7%     35.8%
                               Fine-grained Pruning (0-D)    27.0%     40.6%
GoogLeNet     89.0%            Kernel Pruning (2-D)          43.7%     51.6%
                               Vector Pruning (1-D)          36.9%     47.4%
                               Fine-grained Pruning (0-D)    32.3%     48.5%
ResNet-50     92.3%            Kernel Pruning (2-D)          61.3%     77.0%
                               Vector Pruning (1-D)          40.0%     52.7%
                               Fine-grained Pruning (0-D)    37.1%     55.7%
DenseNet-121  91.9%            Kernel Pruning (2-D)          35.5%     48.9%
                               Vector Pruning (1-D)          31.1%     43.8%
                               Fine-grained Pruning (0-D)    26.6%     39.8%
For a sparse network, the larger the grain size, the less storage is needed. This is due to index sharing among the weights of a grain, as shown in Figure 6. However, AlexNet and VGG-16 in particular have much closer density/storage results for kernel pruning than GoogLeNet, ResNet-50 and DenseNet-121. This is caused by the small size of the convolutional kernels being pruned: the latter networks have many layers of 1x1 convolutions, which do not benefit from sharing index values. AlexNet and VGG-16, on the other hand, have a multitude of larger convolutions.
6 Regular Sparsity Helps Hardware Implementation
It was mentioned in previous sections that filter pruning can obtain acceleration on general-purpose processors like CPUs and GPUs. For intermediate grain sizes like kernels or sub-kernel vectors, acceleration on general-purpose processors is still difficult, but there are several advantages over fine-grained sparsity on custom hardware. These advantages enable a simpler circuit design and higher energy efficiency. We analyze them qualitatively and quantitatively as follows.
Qualitative analysis. In convolutional layers, 2-D convolution is usually the primitive operation. Kernel pruning (2-D pruning) easily leads to computation reduction, because the 2-D convolutions of deleted kernels can be skipped. Recent custom hardware designs for CNNs also use 1-D convolution as the primitive operation [20]; in this case, sub-kernel vector pruning is beneficial. Compared with fine-grained sparsity, coarse-grained sparsity preserves the low-level computation logic, thereby simplifying the hardware design.
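A toy sketch of how kernel sparsity maps onto 2-D convolution as the primitive operation: every all-zero (pruned) kernel skips one whole 2-D convolution. All names and shapes here are ours:

```python
import numpy as np

def conv2d(x, k):
    """Dense 'valid' 2-D cross-correlation: the primitive operation."""
    Kh, Kw = k.shape
    Ho, Wo = x.shape[0] - Kh + 1, x.shape[1] - Kw + 1
    out = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            out[i, j] = (x[i:i+Kh, j:j+Kw] * k).sum()
    return out

def conv_layer_skip_pruned(x, W):
    """x: (C, H, W) input; W: (N, C, Kh, Kw) weights with pruned kernels
    set to zero. Each all-zero kernel skips its 2-D convolution entirely."""
    N, C, Kh, Kw = W.shape
    Ho, Wo = x.shape[1] - Kh + 1, x.shape[2] - Kw + 1
    out = np.zeros((N, Ho, Wo))
    for n in range(N):
        for c in range(C):
            if not W[n, c].any():   # kernel pruned: whole 2-D conv saved
                continue
            out[n] += conv2d(x[c], W[n, c])
    return out

np.random.seed(0)
x = np.random.randn(2, 6, 6)
W = np.random.randn(3, 2, 3, 3)
W[0, 1] = 0.0                       # prune one kernel
y = conv_layer_skip_pruned(x, W)
```

The skipped kernel changes neither the output shape nor the values; it only removes one primitive operation per pruned kernel, which is the regularity a hardware pipeline can exploit.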
Quantitative analysis. Memory references are a major factor in energy consumption [7]. Recent work on custom hardware exploits the sparsity of both the weights and the activations of CNNs [19]. In that implementation, the weights and input activations are both stored in a sparse format, while output activations are stored in a dense format. The indices of weights and activations are used to calculate the output address, to which the product of weight and activation performs a scatter-add operation. This process is illustrated in Figure 8. After one layer is finished, the output activations are compressed into the sparse format for the next layer.
If the same output address is referenced again, a data shortcut can be used to avoid the expensive read/write. For example, two adjacent weights and two adjacent activations reference 3 addresses instead of 4. Due to this locality, coarse-grained sparse weights have a larger probability of output address collision. We simulated VGG-16 on ImageNet's validation set to compare the number of memory references, with the results listed in Table 3. At the same density, coarse-grained sparsity saves 30-35% of the total output memory references.
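The adjacency example can be checked with a toy 1-D model of the scatter-add, in which a weight at position w multiplied with an activation at position a writes to output address a − w (the function and the positions are ours):

```python
def output_refs(weight_pos, act_pos):
    """Distinct output addresses touched by the all-pairs scatter-add."""
    return len({a - w for w in weight_pos for a in act_pos})

# Two adjacent weights x two adjacent activations collide on one address:
# 3 distinct output references instead of 4.
adjacent  = output_refs([4, 5], [10, 11])
# The same count with scattered weights/activations: no collision, 4 addresses.
scattered = output_refs([2, 7], [10, 13])
```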
Density   Fine-grained (0-D)   Vector Pruning (1-D)   Relative # of memory references
40.1%     1.77B                1.23B                  69.5%
33.1%     1.53B                1.03B                  67.2%
27.5%     1.33B                0.87B                  65.3%
7 Summary
In this section we compare our results with previous works on pruning [7, 8]. We select AlexNet, as its layer-wise sparsity is published in previous papers. By using a smoother pruning process, we find that the results of Han et al. [7] can be further improved without any algorithmic change.
Table 4 gives an overall comparison of key statistics for AlexNet. Apart from the number of parameters, there are other factors affecting the efficiency of DNN models. Here FLOPs is the total number of floating-point operations. Storage is measured with 8-bit weights and 4-bit indices, as indicated in Section 5. Because the storage of convolutional layers is much smaller, yet reused much more frequently, than that of fully-connected layers, we add an additional row for the storage of convolutional layers. The number of memory references is calculated by simulating the process of Figure 8. The baseline number of memory references is obtained from the dense model with sparse activations.
The results show that our fine-grained pruned model already has advantages over the previous state-of-the-art in terms of FLOPs, storage of convolutional layers and number of memory references. Moreover, compared with our fine-grained baseline, the vector pruning method further reduces the storage of convolutional layers by 23% and the number of memory references by 43%.
Layer / Metric    Dense    NIPS'15 [7]   NIPS'16 [8]   Fine-grained (ours)   Vector (ours)   Kernel (ours)
conv1             35K      84%           54%           83%                   83%             83%
conv2             307K     38%           41%           26%                   26%             26%
conv3             885K     35%           28%           23%                   23%             23%
conv4             664K     37%           32%           23%                   23%             23%
conv5             443K     37%           33%           23%                   23%             23%
fc6               38M      9%            3.7%          7%                    7%              7%
fc7               17M      9%            6.6%          7%                    7%              7%
fc8               4M       25%           4.6%          18%                   18%             18%
Total             61M      11%           5.7%          8.4%                  8.4%            8.4%
FLOPs             1.5B     30%           25.4%         24.1%                 24.1%           24.1%
Storage (conv)    2.3MB    55.6%         48.3%         36.4%                 28.0%           25.5%
Storage (total)   61MB     16.7%         8.5%          12.6%                 12.3%           12.2%
#Mem References   99M      74.4%         71.7%         60.5%                 34.6%           35.2%
Top-5 Accuracy    -        80.23%        80.01%        80.41%                79.94%          79.20%
8 Conclusion
We thoroughly explored the granularity of sparsity with experiments on detailed accuracy-density relationships. Thanks to index saving, coarse-grained pruning is able to achieve a higher model compression ratio, which is desirable for mobile implementations. We also analyzed the advantages for hardware implementation and showed that coarse-grained sparsity saves output memory accesses compared with both fine-grained sparsity and dense implementations. Given its simplicity and efficiency from a hardware perspective, coarse-grained sparsity enables more efficient hardware architecture designs for deep neural networks.
References
 [1] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, et al. In-datacenter performance analysis of a tensor processing unit. In 44th International Symposium on Computer Architecture, 2017.
 [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [4] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [5] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016.
 [6] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 [7] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [8] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
 [9] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
 [10] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. International Conference on Learning Representations, 2017.
 [11] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [12] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18, February 2017.
 [13] Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In NIPS, volume 2, pages 598–605, 1989.
 [14] Andries Petrus Engelbrecht. A new pruning heuristic based on variance analysis of sensitivity information. IEEE transactions on Neural Networks, 12(6):1386–1399, 2001.
 [15] C Lee Giles and Christian W Omlin. Pruning recurrent neural networks for improved generalization performance. IEEE transactions on neural networks, 5(5):848–851, 1994.
 [16] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 75–84. ACM, 2017.
 [17] Vadim Lebedev and Victor Lempitsky. Fast convnets using groupwise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564, 2016.
 [18] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254. IEEE Press, 2016.
 [19] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joe Emer, Stephen Keckler, and William J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 44th International Symposium on Computer Architecture, 2017.
 [20] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 2016.
 [21] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 26–35. ACM, 2016.
 [22] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [23] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
 [24] Hessam Bagherinezhad, Mohammad Rastegari, and Ali Farhadi. LCNN: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016.
 [25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [26] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
 [27] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4. Citeseer, 2011.
 [28] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.