Fast On-the-fly Retraining-free Sparsification
of Convolutional Neural Networks
Abstract
Modern Convolutional Neural Networks (CNNs) are complex, encompassing millions of parameters. Their deployment exerts computational, storage, and energy demands, particularly on embedded platforms. Existing approaches to prune or sparsify CNNs require retraining to maintain inference accuracy. Such retraining is not feasible in some contexts. In this paper, we explore the sparsification of CNNs by proposing three model-independent methods. Our methods are applied on-the-fly and require no retraining. We show that state-of-the-art models' weights can be reduced by up to 73% (a compression factor of 3.7) without incurring more than a 5% loss in Top-5 accuracy. Additional fine-tuning gains only 8% in sparsity, which indicates that our fast on-the-fly methods are effective.
Amir H. Ashouri University of Toronto Canada aashouri@ece.utoronto.ca and Tarek S. Abdelrahman University of Toronto Canada tsa@ece.utoronto.ca and Alwyn Dos Remedios Qualcomm Inc. Canada adosreme@qti.qualcomm.com
Conference on Neural Information Processing Systems (NIPS), CDNNRIA Workshop, 2018, Montréal, Canada.
1 Introduction
There has been significant growth in the number of parameters (i.e., layer weights), and the corresponding number of multiply-accumulate operations (MACs), in state-of-the-art CNNs [15, 14, 20, 24, 9, 12, 25, 23]. Thus, it is no surprise that several techniques exist for "pruning" or "sparsifying" CNNs (i.e., forcing some model weights to 0) to both compress the model and save computations during inference. Examples of these techniques include: iterative pruning and retraining [3, 8, 4, 21, 18], Huffman coding [6], exploiting granularity [16, 5], structural pruning of network connections [26, 17, 1, 19], and Knowledge Distillation (KD) [10].
A common theme of the aforementioned techniques is that they require retraining of the model to fine-tune the remaining non-zero weights and maintain inference accuracy. Such retraining, while feasible in some contexts, is not feasible in others, particularly industrial ones. For example, on mobile platforms, a machine learning model is typically embedded within an app that the user directly downloads. The app utilizes the vendor's platform runtime support (often in the form of a library) to load and use the model. Thus, the platform vendor must sparsify the model at runtime, i.e., on-the-fly, within the library, with no opportunity to retrain the model. Further, the vendor rarely has access to the labelled data used to train the model. While techniques such as Knowledge Distillation [10] can address this lack of access, they cannot be applied on-the-fly.
In this paper, we develop fast retraining-free sparsification methods that can be deployed for on-the-fly sparsification of CNNs in the contexts described above. There is an inherent trade-off between sparsity and inference accuracy. Our goal is to develop model-independent methods that yield large sparsity with little loss of inference accuracy. We develop three model-independent sparsification methods: flat, triangular, and relative. We implement these methods in TensorFlow and use the framework to evaluate the sparsification of several pre-trained models: Inception-v3, MobileNet-v1, ResNet, VGG, and AlexNet. Our evaluation shows that up to 81% of layer weights in some models may be forced to 0, incurring only a 5% loss in inference accuracy. While the relative method appears to be more effective for some models, the triangular method is more effective for others. Thus, predictive-modeling-based autotuning [7, 2] is needed to identify, at runtime, the optimal choice of method and its hyperparameters.
2 Sparsification Methods
Sparsity in a CNN stems from three main sources: (1) weights within convolution (Conv) and fully-connected (FC) layers (some of these weights may be zero or may be forced to zero); (2) activations of layers, where the often-applied ReLU operation results in many zeros [22]; and (3) input data, which may be sparse. In this paper, we focus on the first source of sparsity, in both Conv and FC layers. This form of sparsity can be determined a priori, which alleviates the need for specialized hardware accelerators.
The input to our framework is a CNN that has $N$ layers, numbered $1 \dots N$. The weights of each layer $l$ are denoted by $W_l$. We sparsify these weights using a sparsification function $\Theta$, which takes as input $W_l$ and a threshold $t_l$ from a vector of thresholds $T = (t_1, \dots, t_N)$. Each weight $w$ of $W_l$ is modified by $\Theta$ as follows:
\[ \Theta(w, t_l) = \begin{cases} 0 & \text{if } -t_l \le w \le t_l \\ w & \text{otherwise} \end{cases} \qquad (1) \]
where $t_l$ is the threshold used for layer $l$. Thus, applying a single threshold $t_l$ forces weights between $-t_l$ and $t_l$ in value to become 0. Our use of thresholds to sparsify layers is motivated by the fact that recent CNNs' weights are distributed around the value 0.
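Equation (1) amounts to simple magnitude thresholding and can be sketched in a few lines of NumPy (the function name `sparsify` and the array-based interface are our illustration, not the paper's actual implementation):

```python
import numpy as np

def sparsify(weights, t):
    """Equation (1): zero out every weight whose magnitude is at most
    the layer threshold t; leave all other weights untouched."""
    w = np.asarray(weights, dtype=float)
    return np.where(np.abs(w) <= t, 0.0, w)
```

For example, `sparsify([-0.3, 0.05, -0.01, 0.4], 0.1)` zeros the two small weights and keeps `-0.3` and `0.4` untouched.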
The choice of the values of the elements of the vector $T$ defines a sparsification method. These values impact the resulting sparsity and inference accuracy. We define and compare three sparsification methods. The flat method defines a constant threshold for all layers, irrespective of the distribution of their corresponding weights. The triangular method is inspired by the size variation of layers in some state-of-the-art CNNs, where the early layers have a smaller number of parameters than later layers. Finally, the relative method defines a unique threshold for each layer that sparsifies a certain percentage of the weights in the layer. The three methods are depicted graphically in Figure (a). The high-level workflow of the sparsification framework is depicted in Figure (b).
Flat Method
This method defines a constant threshold for all layers, irrespective of the distribution of their corresponding weights. It is graphically depicted at the top of Figure (a). The weights of the layers are profiled to determine the span $s$, which corresponds to the layer having the smallest range of weights within the pre-trained model. This span is used as an upper-bound value for our flat threshold. Since using $s$ itself as a threshold eliminates all the weights in that layer and is likely to adversely affect the accuracy of the sparsified model, we use a fraction $\alpha$, $0 \le \alpha \le 1$, of the span $s$, where $\alpha$ is a parameter of the method that can be varied to achieve different degrees of model sparsity.
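Assuming each layer's weights are available as a NumPy array, computing the flat threshold can be sketched as follows (the helper name `flat_threshold` and the list-of-arrays interface are ours):

```python
import numpy as np

def flat_threshold(layer_weights, alpha):
    """Flat method: a single threshold shared by all layers.
    The span s is the smallest weight range over all layers; the
    threshold is the fraction alpha (0 <= alpha <= 1) of that span."""
    s = min(np.max(w) - np.min(w) for w in layer_weights)
    return alpha * s
```

The returned value would then be used as $t_l$ for every layer when applying Equation (1).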
Triangular Method
The triangular method is defined by two thresholds, $t_1$ and $t_N$, for, respectively, the first convolution layer (i.e., layer 1) and the last fully-connected layer (i.e., layer $N$). They represent the thresholds at the tip and the base of the triangle in the middle part of Figure (a). These thresholds are determined by the span of the weights in each of the two layers. Thus,
\[ t_1 = \alpha_1 \cdot s_1, \qquad t_N = \alpha_N \cdot s_N \qquad (2) \]
where $s_1$ is the span of the weights in the first convolution layer, defined in a similar way as for the flat method, and it represents an upper bound on $t_1$. Thus, $\alpha_1$ is a fraction that ranges between 0 and 1. Similarly, $s_N$ is the span of the weights in the last fully-connected layer and it represents an upper bound on $t_N$. Thus, $\alpha_N$ is a fraction that ranges between 0 and 1. The thresholds of the remaining layers are dictated by the position of these layers in the network.
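Reading "dictated by the position of these layers" as a linear interpolation between the two endpoint thresholds (consistent with the triangular shape), the method can be sketched as follows; the names `triangular_thresholds`, `alpha_1`, and `alpha_n` are ours:

```python
import numpy as np

def triangular_thresholds(layer_weights, alpha_1, alpha_n):
    """Triangular method (Equation 2): the thresholds for the first and
    last layers are fractions of those layers' weight spans; thresholds
    for layers in between grow linearly with layer position.
    Assumes the model has at least two layers."""
    n = len(layer_weights)
    s_1 = np.max(layer_weights[0]) - np.min(layer_weights[0])
    s_n = np.max(layer_weights[-1]) - np.min(layer_weights[-1])
    t_1, t_n = alpha_1 * s_1, alpha_n * s_n
    return [t_1 + (t_n - t_1) * i / (n - 1) for i in range(n)]
```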
Relative Method
This method defines a unique threshold for each layer based on the distribution of that layer's weights. In particular, it uses the $p$-th percentile of the distribution of weight magnitudes in layer $l$, denoted by $P_p(|W_l|)$. Thus, each element $t_l$ of the vector $T$ is defined as:
\[ t_l = P_p(|W_l|) \qquad (3) \]
where $p$ defines the desired percentage of zero weights in each layer.
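Equation (3) maps directly onto NumPy's percentile routine; a minimal sketch (the helper name is ours):

```python
import numpy as np

def relative_thresholds(layer_weights, p):
    """Relative method (Equation 3): each layer's threshold is the p-th
    percentile of that layer's absolute weight values, so roughly p
    percent of the weights in every layer fall below it and are forced
    to zero by the thresholding of Equation (1)."""
    return [np.percentile(np.abs(w), p) for w in layer_weights]
```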
3 Experimental Evaluation and Comparison
We evaluate our sparsification methods using TensorFlow v1.4 with CUDA runtime and driver v8. The evaluation of Top-5 accuracy is done on an NVIDIA GeForce GTX 1060, with a host running Ubuntu 14.04, kernel v3.19, using ImageNet [13]. Figures (a) through (e) show the inference accuracy as a function of the sparsity introduced by each method. They reflect that significant sparsity can be obtained with a small reduction in inference accuracy. With less than a 5% reduction in accuracy, we gain 51% sparsity (a compression factor of 2.04), 50% (2.0), 62% (2.63), 70% (3.33), and 73% (3.7) for the five models, respectively. This validates our approach.
Further, the figures reflect that the relative method outperforms the other two methods for Inception-v3 [25], VGG [20], and ResNet [9], but the triangular method outperforms the other two for MobileNet-v1 [11] and AlexNet [14]. This is likely due to the structure of the models. MobileNet-v1 and AlexNet have a gradual linear increase in the size of the convolution layers, making the triangular method more effective. In contrast, the other models have no such increase, making the relative method more effective. As a case in point, ResNet has 152 convolution layers of variable sizes, which makes the triangular method less effective, as seen by the drop in accuracy in Figure (c).
An interesting observation is that for AlexNet, introducing the first 50% sparsity incurs little drop in accuracy. This value is 35%, 30%, 41%, and 42% for the other models, which shows that significant redundancy exists within CNNs. Han et al. [6] make the same observation with their Caffe implementation of AlexNet. Their work mainly focused on iteratively pruning and retraining CNNs to compensate for the loss of accuracy. The authors' method without retraining is not specified, and it is unclear whether it applies to other CNNs. However, they report a gain of around 80% sparsity by pruning (i.e., without retraining) AlexNet with L2 regularization. Our evaluation validates their result across other models using the proposed on-the-fly methods.

Fine-tuning. We explore what can be achieved by some fine-tuning of our methods, still with no retraining, in order to gain more sparsity. We do so to determine the effectiveness of our on-the-fly methods, since such fine-tuning is not likely feasible in our context. We focus on the relative method and start with a baseline sparsity. We then vary the degree of sparsity of each layer in turn around the base sparsity, attempting to maintain no more than a 5% drop in inference accuracy. The results for only AlexNet (due to space limitations) are shown in the table in Figure (f). The baseline sparsity is selected as 70%. It is possible for some layers, particularly larger ones, to have higher sparsity, while smaller/earlier layers are more sensitive to sparsification and must have lower sparsity. Nonetheless, there is a gain of 8% in overall model sparsity. This value is 4%, 3%, 2%, and 5% for Inception-v3, MobileNet-v1, ResNet, and VGG, respectively. Since this gain comes at the expense of exploring different sparsity ratios for the layers, and thus more computation, it is not feasible in the contexts we target. However, the gain is not significant enough to render our on-the-fly methods ineffective on their own, without further tuning.
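The per-layer exploration described above can be sketched as a greedy loop. Here `evaluate` stands in for a full Top-5 accuracy measurement of the sparsified model on a validation set, and all names and defaults are our illustration, not the authors' exact procedure:

```python
def tune_per_layer(n_layers, evaluate, base=70.0, step=5.0, max_drop=5.0):
    """Greedy per-layer tuning sketch: start every layer at the
    baseline sparsity percentile `base` and raise each layer's sparsity
    in turn, keeping the accuracy drop (in percentage points relative
    to the baseline configuration) within max_drop.
    evaluate(percentiles) must return the model's Top-5 accuracy."""
    percentiles = [base] * n_layers
    baseline = evaluate(percentiles)
    for l in range(n_layers):
        while percentiles[l] + step <= 100.0:
            trial = list(percentiles)
            trial[l] += step
            if baseline - evaluate(trial) > max_drop:
                break  # this layer is too sensitive to sparsify further
            percentiles = trial
    return percentiles
```

Each accepted step requires a full accuracy evaluation, which is exactly the exploration cost that makes this tuning infeasible on-the-fly.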
4 Concluding Remarks
In this paper, we proposed three model-independent methods to explore sparsification of CNNs without retraining. We experimentally evaluated these methods and showed that they can result in up to 73% sparsity with less than a 5% drop in inference accuracy. However, there is no single method that works best for all models. Further, our evaluation showed that it is possible to fine-tune the methods to gain more sparsity with no significant drop in inference accuracy. However, such tuning of the methods cannot be employed on-the-fly. There are two key directions for future work. The first is to explore heuristics for selecting a sparsification method based on the CNN model, and possibly to fine-tune the parameters of the methods using predictive modeling. The second is to realize the benefit of the sparsity in the model's implementation on the NNlib library, which offloads neural network operations from TensorFlow to Qualcomm's Hexagon DSP.
References
 [1] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE Int. Conf. Acoust. Speech Signal Process., pages 1131–1135. IEEE, April 2015.
 [2] Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. A survey on compiler autotuning using machine learning. ACM Comput. Surv., 51(5):96:1–96:42, September 2018.
 [3] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal Brain Damage. Adv. Neural Inf. Process. Syst., 2(1):598–605, 1990.
 [4] Xin Dong, Shangyu Chen, and Sinno Jialin Pan. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. In NIPS, pages 4860–4874, May 2017.
 [5] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In Proc. 2016 43rd Int. Symp. Comput. Archit. (ISCA 2016), pages 243–254. IEEE, June 2016.
 [6] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149, October 2015.
 [7] Tianyi David Han and Tarek S Abdelrahman. Automatic tuning of local memory use on gpgpus. arXiv preprint arXiv:1412.6986, 2014.
 [8] Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, pages 164–171, 1993.
 [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pages 770–778, 2016.
 [10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [11] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint, April 2017.
 [12] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint, February 2017.
 [13] Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conf. Comput. Vis. Pattern Recognit., pages 248–255, 2009.
 [14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst., pages 1–9, 2012.
 [15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [16] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. Exploring the Granularity of Sparsity in Convolutional Neural Networks. In IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, pages 1927–1934, May 2017.
 [17] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. Exploring the Regularity of Sparse Structure in Convolutional Neural Networks. arXiv preprint arXiv:1705.08922, May 2017.
 [18] Manu Mathew, Kumar Desappan, Pramod Kumar Swami, and Soyeb Nagori. Sparse, Quantized, Full Frame CNN for Low Power Embedded Devices. In IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, pages 328–336, 2017.
 [19] Vlad Niculae and Mathieu Blondel. A Regularized Framework for Sparse and Structured Neural Attention. In Advances in Neural Information Processing Systems, 2017.
 [20] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Int. Conf. Learn. Represent., pages 1–14, September 2015.
 [21] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Sparsifying Neural Network Connections for Face Recognition. In 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pages 4856–4864. IEEE, June 2016.
 [22] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12):2295–2329, December 2017.
 [23] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
 [24] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pages 1–9. IEEE, June 2015.
 [25] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. arXiv preprint, December 2015.
 [26] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning Structured Sparsity in Deep Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 2016.