Network Pruning for Low-Rank Binary Indexing
Abstract
Pruning is an efficient model compression technique for removing redundancy in the connectivity of deep neural networks (DNNs). Computations using sparse matrices obtained by pruning parameters, however, exhibit vastly different parallelism depending on the index representation scheme. As a result, fine-grained pruning has not gained much attention: its irregular index form leads to a large memory footprint and low parallelism for convolutions and matrix multiplications. In this paper, we propose a new network pruning technique that generates a low-rank binary index matrix to compress index data, while the index data can be decompressed by simple binary matrix multiplication. The proposed compression method finds a particular fine-grained pruning mask that can be decomposed into two binary matrices. We also propose a tile-based factorization technique that not only lowers memory requirements but also enhances the compression ratio. Various DNN models can be pruned with far less index data compared to previous sparse matrix formats while maintaining the same pruning rate.
Dongsoo Lee^{1}, Se Jung Kwon^{1}, Byeongwook Kim^{1}, Parichay Kapoor^{1}, Gu-Yeon Wei^{1,2}
^{1} Samsung Research, Seoul, Republic of Korea   ^{2} Harvard University, Cambridge, MA
{dongsoo3.lee, sejung0.kwon, byeonguk.kim, pk.kapoor, gy.wei}@samsung.com
Preprint. Under review.
1 Introduction
Numerous parameter pruning techniques have been introduced based on the observation that significant amounts of connectivity can be removed without sacrificing model accuracy. Current active research strives to enhance the pruning rate at the cost of additional computations Guo et al. [2016], Molchanov et al. [2017], to reduce the computational complexity of pruning procedures Zhu and Gupta [2017], Han et al. [2015], and to find the underlying principles of pruning Liu et al. [2019], Frankle and Carbin [2019], to name a few.
Unfortunately, conventional pruning techniques present serious implementation challenges when deployed on highly parallel computing systems because of their sparse matrix formats. If pruning is performed in a fine-grained manner (i.e., each parameter is individually evaluated for pruning) to improve the overall pruning rate, then each row/column or block exhibits a different sparsity. Consequently, row/column-wise or block-wise computations require vastly different computation latencies, leading to significantly degraded parallelism Han et al. [2016a]. On the other hand, successful DNN deployment depends on accelerating matrix multiplications and convolutions using thread-oriented computational resources such as GPUs and multi-threaded CPUs Dean et al. [2012]. Consequently, fine-grained pruning techniques have not been adopted by industry. Note that a few attempts have been made to utilize sparsity to expedite inference for a batch size of 1 Han et al. [2016a]. However, if the parameter reuse rate increases due to convolutions or a large batch size, sparse matrices can be even slower than the dense matrices before pruning Han et al. [2016a], Zhang et al. [2016]. Recent pruning techniques therefore suggest removing connectivity in a well-structured form Wen et al. [2016], He et al. [2017], or at the block level Yu et al. [2017], Narang et al. [2017].
In this paper, we propose a new fine-grained pruning method to find an efficient sparse matrix representation based on binary index-matrix factorization. Figure 1 shows a dense matrix after pruning redundant parameters along with various masking-index representations. The binary index form is a regular structure that can utilize full memory bandwidth, even though sparsity does not reduce its memory footprint. Compressed sparse row (CSR) indexing is an irregular structure whose storage requirement shrinks with sparsity. In contrast, our proposed index compression scheme decomposes a binary index matrix into two small binary matrices in order to maintain a regular structure. Binary matrix multiplication is inherently parallelizable, and sparse quantized networks can be decoded efficiently using recent approaches Ahn et al. [2019]. In order to obtain this indexing form, we propose an algorithm that finds a particular fine-grained pruning result by generating a low-rank binary index matrix. Since factorization may not exactly reproduce the original binary index matrix, we investigate whether a low-rank binary index matrix can produce a pruning result that maintains acceptable model accuracy. Our proposed binary matrix factorization technique significantly reduces the amount of index data compared to CSR, as demonstrated in the following sections. We also introduce a tiling technique to alleviate the burden of binary matrix multiplication and further improve the compression ratio.
Recently, a regular-structured index compression method has been proposed based on the Viterbi algorithm Lee et al. [2018a], which searches for sequences to be used as pruning mask bits that can be decompressed by Viterbi encoders. Even though its compression rate improves over CSR, a large number of XOR gates and delay units is required for every input bit. In comparison, our proposed decompression relies on simple binary matrix multiplications and achieves even higher compression.
2 Binary Matrix Factorization for Pruning Index
Suppose that a weight matrix W is given as
(1) 
Following the magnitude-based pruning method Han et al. [2015], all weights whose magnitude is smaller than a certain threshold value are pruned to zero. For example, for a threshold of 0.7, we obtain the following pruning-index (i.e., binary masking layer) matrix I as
(2) 
It is important to note that magnitude-based pruning is not an optimal solution Lee et al. [2018a], LeCun et al. [1990], Han et al. [2016b] for maximizing the pruning rate; i.e., a variety of masking layers exist for the same pruning rate.
Binary matrix factorization (BMF) Zhang et al. [2007] is a reasonable approach to compress the binary pruning-index matrix I of size m×n. For a given rank r, BMF tries to find the best B_u (m×r) and B_v (r×n) to approximate I as I ≈ B_u ⊗ B_v, where r corresponds to the binary matrix factorization rank and ⊗ stands for binary matrix multiplication. A binary product of B_u and B_v is then defined as
$(B_u \otimes B_v)_{ij} = \bigvee_{k=1}^{r} \big( (B_u)_{ik} \wedge (B_v)_{kj} \big)$   (3)
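As a concrete illustration (not taken from the paper's code), the binary product above can be computed with an integer matrix product followed by thresholding; the function and variable names below are our own:

```python
import numpy as np

def binary_matmul(b_u: np.ndarray, b_v: np.ndarray) -> np.ndarray:
    """Boolean product: result[i, j] = OR over k of (b_u[i, k] AND b_v[k, j])."""
    # For 0/1 matrices, the Boolean product equals the integer product thresholded at > 0.
    return (b_u.astype(np.int32) @ b_v.astype(np.int32) > 0).astype(np.uint8)

# Tiny usage example with an assumed rank-2 factorization of a 3x4 index matrix.
b_u = np.array([[1, 0], [1, 1], [0, 1]], dtype=np.uint8)
b_v = np.array([[1, 0, 1, 0], [0, 1, 1, 0]], dtype=np.uint8)
print(binary_matmul(b_u, b_v))
```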
While BMF should minimize the number of mismatched bits between I and B_u ⊗ B_v, such an optimization is NP-hard. Hence, several heuristic methods for BMF have been proposed in the literature Zhang et al. [2007]. Moreover, BMF applied to I alone (without referring to W) lacks weight-magnitude information. Yet, in the context of magnitude-based pruning, weights of small magnitude should be pruned with higher probability (i.e., they are of lower importance). Therefore, we explore a method that preserves the importance of each weight when it is not possible to exactly reproduce I through matrix factorization.
2.1 Binary Matrix Factorization based on Non-Negative Matrix Factorization
Non-negative matrix factorization (NMF) factorizes a real-valued matrix V into two real-valued matrices M_u and M_v under the constraint that all three matrices consist of non-negative elements. Similar to singular-value decomposition (SVD), NMF attempts to minimize the Frobenius norm of the approximation error (V − M_u M_v). The non-negativity property of NMF is useful for the analysis of various multivariate data. Numerous numerical approximations of NMF have been suggested since an exact solution of NMF is not generally available Lee and Seung [1999], Zhang et al. [2007].
In our proposed technique, we take the magnitude of each element of W to generate |W| (i.e., |W|_{ij} = |W_{ij}|), and |W| is factorized by an NMF library (e.g., Zitnik and Zupan [2012]) into two matrices M_u and M_v. For example, the magnitude matrix |W| of the matrix W in Eq. (1) can be factorized into
(4) 
where the rank r is 2.
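To make the NMF step concrete, the sketch below factorizes the magnitude matrix |W| with scikit-learn's NMF as a stand-in for the nimfa library cited above; all names and hyperparameters (init, max_iter) are illustrative choices, not the authors' settings:

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_of_magnitudes(w: np.ndarray, rank: int, seed: int = 0):
    """Factorize |W| ~= M_u @ M_v with non-negative factors of the given rank."""
    v = np.abs(w)                                   # magnitude matrix |W|
    model = NMF(n_components=rank, init="nndsvda", max_iter=500, random_state=seed)
    m_u = model.fit_transform(v)                    # shape (m, rank), non-negative
    m_v = model.components_                         # shape (rank, n), non-negative
    return m_u, m_v

# Example: factorize a random 8x10 weight matrix with rank 2.
rng = np.random.default_rng(0)
m_u, m_v = nmf_of_magnitudes(rng.standard_normal((8, 10)), rank=2)
```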
The next step is to convert M_u and M_v into two binary matrices B_u and B_v using threshold values θ_u and θ_v (i.e., (B_u)_{ij} = 1 if (M_u)_{ij} ≥ θ_u, and 0 otherwise; similarly for B_v). The sparsity of B_u and B_v can be controlled by θ_u and θ_v, respectively. Our goal is to make the sparsity of B_u ⊗ B_v similar to that of I. Suppose that θ_u and θ_v are carefully chosen to produce sparsity similar to that of I. We obtain
(5) 
and the binary product of B_u and B_v becomes
(6) 
Compared with the pruning-index matrix I in Eq. (2), there are 2 mismatched elements (underlined in Eq. (6)).
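A minimal sketch of the binary-conversion step and of counting such mismatches, with θ_u and θ_v as free parameters (function names assumed here, not from the paper):

```python
import numpy as np

def binarize_factors(m_u, m_v, theta_u, theta_v):
    """Convert non-negative NMF factors into binary matrices B_u and B_v."""
    return (m_u >= theta_u).astype(np.uint8), (m_v >= theta_v).astype(np.uint8)

def index_mismatch(index, b_u, b_v):
    """Count entries where the reconstructed index B_u (x) B_v differs from the original."""
    reconstructed = (b_u.astype(np.int32) @ b_v.astype(np.int32) > 0).astype(np.uint8)
    return int(np.sum(reconstructed != index)), reconstructed
```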
The rationale behind this approach is as follows: 1) if |W_{ij}| is large, then the corresponding components of M_u and M_v (i.e., the entries of row i of M_u and column j of M_v) will also be large with high probability, and 2) binary conversion using θ_u and θ_v would then yield a high probability of a '1' at the corresponding position of B_u ⊗ B_v when |W_{ij}| is large. Let S_I, S_u, S_v, and S_{I'} be the sparsity of I, B_u, B_v, and I' = B_u ⊗ B_v, respectively. From the dot-product operation in Eq. (6), the expression for the pruning rate becomes
$S_{I'} = \big( 1 - (1 - S_u)(1 - S_v) \big)^{r}$   (7)
assuming that the probability of a bit being '0' in B_u and B_v is S_u and S_v, respectively. Since this independence assumption is only approximate, matching S_{I'} to the target pruning rate S_I needs to be fine-tuned in practice. If S_u and the associated θ_u are given, then S_v and θ_v are automatically determined by the target pruning rate. Subsequently, given S_I and rank r, it is necessary to find a certain S_u that produces the best masking layer for pruning.
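Assuming the relation reconstructed in Eq. (7), the sparsity of B_v required to reach a target pruning rate for a given S_u and rank r can be computed directly; the helper below is illustrative and not part of Algorithm 1:

```python
def required_s_v(target_pruning_rate: float, s_u: float, rank: int) -> float:
    """Sparsity S_v of B_v so that B_u (x) B_v reaches the target pruning rate,
    assuming bits of B_u and B_v are zero independently with probabilities S_u and S_v."""
    # From S = (1 - (1 - S_u) * (1 - S_v)) ** r, solved for S_v.
    return 1.0 - (1.0 - target_pruning_rate ** (1.0 / rank)) / (1.0 - s_u)

# Example: a 95% pruning rate with rank 16 and S_u = 0.6 (illustrative numbers).
print(round(required_s_v(0.95, 0.6, 16), 3))   # ~0.992
```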
In order to optimize S_u, we define the cost function for pruning-index compression to be the sum of |W_{ij}| over all positions (i, j) with I_{ij} = 1 and I'_{ij} = 0 (i.e., the total magnitude of all weights unintentionally pruned by the binary matrix decomposition), denoted C. Algorithm 1 describes a method to find the best S_u by sweeping S_u and monitoring C. Since the sparsity of a binarized factor changes monotonically with its threshold, the algorithm can use binary search to expedite the adjustment of θ_u and θ_v.
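A simplified sketch of this search (a plain grid sweep in place of the binary search, with quantile-based thresholds), under the notation introduced above; it is a sketch of the idea, not the authors' Algorithm 1:

```python
import numpy as np

def pruning_index_cost(w, index, b_u, b_v):
    """Cost C: total magnitude of weights kept by I but dropped by B_u (x) B_v."""
    reconstructed = b_u.astype(np.int32) @ b_v.astype(np.int32) > 0
    unintended = np.logical_and(index.astype(bool), ~reconstructed)
    return float(np.abs(w[unintended]).sum())

def sweep_thresholds(w, index, m_u, m_v, target_rate, rank, s_u_grid):
    """Sweep candidate sparsity levels for B_u; for each, derive S_v from the target
    pruning rate, turn both sparsities into thresholds (quantiles of M_u and M_v),
    and keep the pair that minimizes the cost C."""
    best = (None, None, np.inf)
    for s_u in s_u_grid:
        s_v = 1.0 - (1.0 - target_rate ** (1.0 / rank)) / (1.0 - s_u)
        if not 0.0 < s_v < 1.0:
            continue
        theta_u = np.quantile(m_u, s_u)     # threshold yielding roughly s_u zeros in B_u
        theta_v = np.quantile(m_v, s_v)
        b_u = (m_u >= theta_u).astype(np.uint8)
        b_v = (m_v >= theta_v).astype(np.uint8)
        cost = pruning_index_cost(w, index, b_u, b_v)
        if cost < best[2]:
            best = (theta_u, theta_v, cost)
    return best
```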
2.2 MNIST Case Study
We applied our proposed pruning-index compression technique to the LeNet-5 model on the MNIST dataset. LeNet-5 consists of two convolutional layers and two fully-connected layers Han et al. [2015], Lee et al. [2018b]. Since the FC1 layer dominates the memory footprint (93%), we focus on FC1 to investigate the effectiveness of our proposed method, but all layers are pruned with the same rates as in Han et al. [2015]. FC1 involves an 800×500 weight matrix, whose full rank is 500. Figure 2 plots the sparsity of the binary factors, the cost function, and the test accuracy over the swept range of S_u, for a 95% FC1 pruning rate and rank r of 16, 64, or 256. With a larger rank, both B_u and B_v become more sparse, and both the cost function and the test accuracy improve.
After pretraining for 20K iterations, pruning via BMF is performed as described in Algorithm 1. Then, retraining runs until the 60K^{th} iteration using the masking layer I' = B_u ⊗ B_v. The test accuracy (averaged over 20 runs) is measured at the 20K^{th} (right after pruning), 40K^{th}, 50K^{th}, and 60K^{th} iterations, as shown in Table 1, along with the compression ratio, computed as (m·n)/(r·(m+n)) since an m×n binary index is replaced by r(m+n) index bits. Compared with other binary pruning index formats (such as ternary quantization), our proposed binary index factorization yields much higher index compression ratios while maintaining reasonable test accuracy.
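For reference, a small helper reproducing the compression-ratio formula inferred above (an m×n binary index replaced by r(m+n) bits); the rank-16 value is consistent with the 19.2× entry in Table 1:

```python
def index_compression_ratio(m: int, n: int, rank: int) -> float:
    """Dense binary index bits (m * n) divided by factorized index bits r * (m + n)."""
    return (m * n) / (rank * (m + n))

# LeNet-5 FC1 (800 x 500): rank 16 gives ~19.2x.
print(round(index_compression_ratio(800, 500, 16), 1))
```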


Table 1: LeNet-5 FC1 layer pruned at a 95% rate. Top: test accuracy (averaged over 20 runs) at several training iterations and the index compression ratio for various ranks. Bottom: index size of the FC1 pruning mask under different representation schemes.

Rank (r)    Accuracy (%)                                Comp.
            20K^{th}   40K^{th}   50K^{th}   60K^{th}   Ratio
4           33.69      98.96      99.03      99.07      76.9
8           34.52      98.98      99.05      99.09      38.5
16          30.28      99.01      99.09      99.13      19.2
32          32.52      99.01      99.08      99.13      9.6
64          36.02      99.04      99.10      99.14      4.8
128         34.22      99.04      99.10      99.16      2.4
256         42.86      99.07      99.14      99.19      1.2

Method          Index Size    Comment
Binary          50.0KB        1 bit/weight
CSR (16-bit)    45.8KB
CSR (5-bit)     14.3KB        Relative indexing
Viterbi         10.0KB        5X encoder
Proposed        2.6KB         r = 16
In general, the histogram of a weight matrix of a pretrained model follows a Gaussian distribution Goodfellow et al. [2016]. For magnitude-based pruning, the count of near-zero values significantly decreases after pruning. To investigate which weights are pruned at different ranks, Figure 3 presents the histogram of unpruned weight values of the FC1 layer of the LeNet-5 model at the 20K^{th} iteration, right after performing Algorithm 1 with the same pruning rate of 95%. The results show that the count of near-zero weights decreases as the rank r increases. Since the total count in Figure 3 is the same for every rank r, a low count of near-zero weights in the histogram implies that fewer weights of large magnitude are pruned. As such, there is a trade-off between the rank r (and hence the compression ratio) and accuracy, even though the accuracy drop is reasonable for a wide range of rank values, as shown in Table 1.
Table 1 also compares the index size for the FC1 layer using different formats while keeping the pruning rate at 95.0%. The reduction in index size for CSR can be achieved using a relative indexing scheme as described in Han et al. [2016b]. In the case of Viterbi-based pruning, we assume the Viterbi decompressor has 40 outputs, 10-bit comparators, and a skip state of 1 Lee et al. [2018a]. Our proposed pruning method using binary matrix factorization yields the best compression ratio.
3 Index Matrix Tiling and Weight Manipulation to Lower Rank
If the rank r is increased in order to drop more near-zero weights, the compression ratio must be sacrificed. To decrease the count of near-zero weights remaining after pruning at a fixed rank r, we propose two techniques.
3.1 Tile-Based Binary Matrix Factorization
Weight matrix size increases quadratically with the number of neurons. If the whole binary matrix multiplication were to be performed within a chip, the required on-chip memory would be prohibitively large. In order to enhance the scalability of our proposed factorization method, we propose tile-based factorization as illustrated in Figure 4. An m×n binary index matrix is first tiled into multiple blocks; each block is then factorized independently with the same rank. Note that the tile size and/or a reshaping of the original binary index matrix can be varied to match the underlying parallel computing resources. Because NMF is generally solved by an iterative method, tiling not only reduces the required on-chip memory size but also reduces NMF computation time.
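The sketch below illustrates tile-based factorization under the assumption that the matrix dimensions are divisible by the tile size; the tile size, thresholds, and NMF solver (scikit-learn again, as a stand-in) are illustrative choices rather than the paper's configuration:

```python
import numpy as np
from sklearn.decomposition import NMF

def tiled_binary_factorization(v, tile_rows, tile_cols, rank, theta_u, theta_v):
    """Factorize each (tile_rows x tile_cols) block of the magnitude matrix v
    independently with the same rank and binarize the factors per block.
    Assumes v.shape is divisible by the tile size (an illustrative restriction)."""
    m, n = v.shape
    assert m % tile_rows == 0 and n % tile_cols == 0
    blocks = {}
    for i in range(0, m, tile_rows):
        for j in range(0, n, tile_cols):
            block = v[i:i + tile_rows, j:j + tile_cols]
            model = NMF(n_components=rank, init="nndsvda", max_iter=500, random_state=0)
            m_u = model.fit_transform(block)
            m_v = model.components_
            blocks[(i, j)] = ((m_u >= theta_u).astype(np.uint8),
                              (m_v >= theta_v).astype(np.uint8))
    return blocks
```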
Tiling also affects the quality of the binary matrix factorization. Recall that when X_1, X_2, ..., X_n are random samples of size n from a distribution with mean μ and variance σ², the sample mean X̄ satisfies E[X̄] = μ and Var(X̄) = σ²/n. Indeed, Figure 4 depicts a larger variance of weight values after NMF when the number of tiles increases (a random Gaussian weight distribution is assumed). Correspondingly, M_u and M_v also show larger variance, with longer tails in their distributions, as the number of blocks grows, as shown in Figure 5. Such increased variance of M_u and M_v with smaller sample size is desirable for binary conversion of the NMF result. For example, compared with {0.98, 1.0, 1.2}, the set {0.5, 1.0, 1.5} offers a larger spectrum of binary-conversion threshold values for θ_u and θ_v (and hence increases the chance to further optimize the cost function).
Figure 6 plots histograms of unpruned FC1-layer weights of the LeNet-5 model on MNIST for different tiling schemes. The FC1 weight matrix (of 800×500 size) is tiled into 1, 4, or 16 submatrices, while the index compression ratio is kept the same for all three tiling cases. Since the size of a submatrix differs for each partitioning plan, the rank for factorizing a submatrix is adjusted accordingly in Figure 6. Notice that increasing the number of blocks yields deeper drops in the count of near-zero weights. Moreover, if tiling produces submatrices with different properties (e.g., the embedding matrix in a natural language model), each submatrix can be assigned its own optimal rank.
3.2 Weight Magnitude Manipulation
If C is defined as the magnitude-sum of unintentionally pruned weights, then it is still possible that a weight of large magnitude is pruned. To prevent large weights from being pruned while keeping the definition of C, the magnitude of the weights can be preprocessed. For example, in the context of magnitude-based pruning, artificially increasing already-large weight values can further lower their probability of being pruned. Note that such weight-magnitude manipulation is used only temporarily for pruning-index data compression, not for the normal training or inference steps.
Figure 7 plots histograms of unpruned weights of the FC1 layer of LeNet-5 for different weight-magnitude manipulation methods. In Method 3, weights larger than a threshold are multiplied by a constant factor. We observe the sharpest drop of weights around the threshold value and the highest count of large weights with Method 3. Finding the best weight-manipulation method remains an open problem.
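A minimal sketch of such a manipulation in the style of Method 3; the boost factor is a placeholder, since the source does not state the multiplier used:

```python
import numpy as np

def manipulate_magnitudes(w, prune_threshold, boost=2.0):
    """Temporarily boost the magnitudes of weights above the pruning threshold before
    running BMF, so they are less likely to be pruned unintentionally. The boost
    factor is an assumed placeholder; the result is used only for index compression."""
    v = np.abs(w).astype(np.float64)
    v[v > prune_threshold] *= boost   # never fed back into training or inference
    return v
```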
4 Experimental Results
We verify our index-data compression technique using ResNet-32 He et al. [2016] on CIFAR-10, AlexNet Krizhevsky et al. [2012] on ImageNet, and a recurrent neural network (RNN) model on the PTB dataset Marcus et al. [1994], as shown in Table 2. We do not apply binary matrix factorization to layers whose parameter size is relatively small. Such layers usually present small pruning rates Han et al. [2015], Zhu and Gupta [2017], and their index compression ratio is also small Yu et al. [2017]. For example, Viterbi-based index compression techniques yield compression ratios close to 1 for the index data of the convolutional layers in AlexNet Lee et al. [2018a].
For ResNet-32, our pruning-index compression achieves a 70% pruning rate with 91.8% accuracy, providing over 3× compression, as shown in Table 2 (experimental data for various rank selections is provided in the Appendix). Note that the binary pruning-index matrix decomposition is performed after multiplying the weights exceeding the pruning threshold by a constant factor, as a method to drop more near-zero weights.
For AlexNet, we focus on compressing the index data of the FC5 and FC6 layers, which occupy 90% of the total model size. Both layers are pruned to a pruning rate of 91% Han et al. [2015] using Algorithm 1. We achieve over 8× compression of the pruning-index data while maintaining full-precision accuracy. Given their large matrix sizes, the FC5 and FC6 weights are tiled into small blocks, and the BMF of Algorithm 1 is performed on those blocks with ranks 32 and 64, which speeds up factorization and reduces the cost function.
An RNN model with one long short-term memory (LSTM) layer of size 300 Xu et al. [2018], trained on the PTB dataset and evaluated by perplexity per word (PPW), is pruned by our pruning-index compression method. Note that the embedding and softmax matrices usually take up a major portion of the memory footprint in a neural language model because of the growing vocabulary size, while these two matrices have several distinct properties compared with general weight matrices Chen et al. [2018].
To compare the compression ratios of various sparse matrix representation schemes, we choose the AlexNet FC5 and FC6 layers. Table 3 shows the index size of a binary matrix representation, the CSR format with 16-bit indexing, the CSR format with 5-bit indexing (relative indexing as introduced in Han et al. [2016b]), the Viterbi-based representation Lee et al. [2018a], and our proposed representation. Our proposed network pruning algorithm and index representation using binary matrix factorization significantly reduce the amount of indexing data while maintaining the same sparsity as fine-grained pruning. Lastly, while Viterbi-based compression only allows integer-valued compression ratios, our proposed technique enables a much wider range of compression ratios (as rational numbers) by controlling the rank of the index factorization.
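For rough intuition (not a reproduction of Table 3), the helper below compares approximate index storage for a single m×n layer, ignoring CSR row pointers and relative-indexing padding; it is an illustration of the bookkeeping only:

```python
def index_sizes_kb(m, n, rank, density, bits_per_csr_index=16):
    """Approximate index storage (KB) for one m x n layer: dense binary mask,
    CSR-style per-nonzero column indices (row pointers ignored), and the proposed
    factorized binary index of r * (m + n) bits. For illustration only."""
    to_kb = lambda bits: bits / 8.0 / 1024.0
    return {
        "binary": to_kb(m * n),
        "csr_indices_only": to_kb(bits_per_csr_index * density * m * n),
        "factorized": to_kb(rank * (m + n)),
    }
```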


Table 2: Pretrained models (model, size, accuracy) compared against the models pruned with the proposed method (rank, index compression ratio, accuracy) for ResNet-32 on CIFAR-10, AlexNet on ImageNet, and the PTB LSTM model.
Different ranks are applied to 3 groups of layers according to the number of input channels (16, 32, or 64).

The FC5 and FC6 layers are tiled into 16×8 blocks (of 576×512 size) and 8×8 blocks (of 512×512 size), respectively.


Table 3: Index size of the AlexNet FC5 and FC6 layers (91% pruning rate) under different sparse-index representation schemes.

Method          FC5 Index Size   FC6 Index Size   Sum        Comment
Binary          4608KB           2048KB           6656KB     1 bit/weight
CSR (16-bit)    6962KB           3099KB           10061KB
CSR (5-bit)     2176KB           968KB            3144KB     Relative indexing
Viterbi         922KB            410KB            1331KB     5X encoder
Proposed        556KB            256KB            812KB      r = 32, tiled

5 Conclusion
This paper proposes a fine-grained network pruning technique that produces low-rank binary index matrices. We confirm that various DNN models can be pruned with our low-rank indexing scheme while preserving sparsity. The resulting binary matrix multiplication enables not only high compression ratios but also highly parallel operation of sparse networks. Our proposed tiling method and weight-magnitude manipulation schemes further lower the rank. Unlike previous studies, this work demonstrates that fine-grained pruning can be represented in a highly regular, structured format.
References
 Ahn et al. [2019] D. Ahn, D. Lee, T. Kim, and J.J. Kim. Double Viterbi: Weight encoding for high compression ratio and fast onchip reconstruction for deep neural network. In International Conference on Learning Representations (ICLR), 2019.
 Chen et al. [2018] P. Chen, S. Si, Y. Li, C. Chelba, and C.J. Hsieh. GroupReduce: Blockwise lowrank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, 2018.
 Dean et al. [2012] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
 Frankle and Carbin [2019] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
 Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 Guo et al. [2016] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, 2016.
 Han et al. [2015] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 Han et al. [2016a] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254, 2016a.
 Han et al. [2016b] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016b.
 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 He et al. [2017] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 LeCun et al. [1990] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
 Lee et al. [2018a] D. Lee, D. Ahn, T. Kim, P. I. Chuang, and J.J. Kim. Viterbibased pruning for sparse matrix with fixed and high index compression ratio. In International Conference on Learning Representations (ICLR), 2018a.
 Lee et al. [2018b] D. Lee, P. Kapoor, and B. Kim. DeepTwist: Learning model compression via occasional weight distortion. arXiv:1810.12823, 2018b.
 Lee and Seung [1999] D. D. Lee and H. S. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755):788–791, 1999.
 Liu et al. [2019] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2019.
 Marcus et al. [1994] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, pages 114–119, 1994.
 Molchanov et al. [2017] D. Molchanov, A. Ashukha, and D. P. Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML), pages 2498–2507, 2017.
 Narang et al. [2017] S. Narang, E. Undersander, and G. F. Diamos. Blocksparse recurrent neural networks. arXiv:1711.02782, 2017.
 Wen et al. [2016] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2082–2090, 2016.
 Xu et al. [2018] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha. Alternating multibit quantization for recurrent neural networks. In International Conference on Learning Representations (ICLR), 2018.
 Yu et al. [2017] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 548–560, 2017.
 Zhang et al. [2016] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. Cambricon-X: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 20:1–20:12, 2016.
 Zhang et al. [2007] Z. Zhang, T. Li, C. H. Q. Ding, and X. Zhang. Binary matrix factorization with applications. In IEEE International Conference on Data Mining, 2007.
 Zhu and Gupta [2017] M. Zhu and S. Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017.
 Zitnik and Zupan [2012] M. Zitnik and B. Zupan. Nimfa: A python library for nonnegative matrix factorization. Journal of Machine Learning Research, 13:849–853, 2012.
Appendix A  Supplementary Experimental Results on ResNet-32 using CIFAR-10
Table 4 lists the accuracy of ResNet-32 for various ranks and pruning rates. The pruned models using our proposed binary matrix decomposition show slightly degraded accuracy compared to the baseline pruning method (the bottom row of Table 4). It can also be observed that, in general, there is a trade-off between the compression ratio (determined by the ranks) and accuracy.
Table 4: Accuracy of ResNet-32 on CIFAR-10 for various ranks and pruning rates.

                           Pruning Rate
Rank^{1}     Comp. Ratio   0.60     0.70     0.80
4/4/4        10.29         92.0%    91.5%    90.5%
4/8/16       6.74          92.2%    91.5%    90.9%
8/8/8        5.12          92.0%    91.5%    90.9%
8/16/32      3.09          92.4%    91.8%    91.1%
16/16/16     2.56          92.2%    91.8%    91.0%
16/32/64     1.55          92.3%    91.7%    91.2%
w/o BMF      1             92.4%    92.2%    92.0%

Different ranks are applied to 3 groups of layers according to the number of input channels (16, 32, or 64).

Retraining is performed for 65K iterations.