Network Pruning for Low-Rank Binary Indexing

Dongsoo Lee1, Se Jung Kwon1, Byeongwook Kim1, Parichay Kapoor1, Gu-Yeon Wei1,2
1 Samsung Research, Seoul, Republic of Korea
2 Harvard University, Cambridge, MA
{dongsoo3.lee, sejung0.kwon, byeonguk.kim, pk.kapoor, gy.wei}@samsung.com
Abstract

Pruning is an efficient model compression technique that removes redundancy in the connectivity of deep neural networks (DNNs). Computations using sparse matrices obtained by pruning parameters, however, exhibit vastly different parallelism depending on the index representation scheme. As a result, fine-grained pruning has not gained much attention due to its irregular index form, which leads to a large memory footprint and low parallelism for convolutions and matrix multiplications. In this paper, we propose a new network pruning technique that generates a low-rank binary index matrix to compress index data, while decompression of the index data is performed by simple binary matrix multiplication. The proposed compression method finds a particular fine-grained pruning mask that can be decomposed into two binary matrices. We also propose a tile-based factorization technique that not only lowers memory requirements but also enhances the compression ratio. Various DNN models can be pruned with much less index data than previous sparse matrix formats require while maintaining the same pruning rate.

 


Preprint. Under review.

1 Introduction

Numerous parameter pruning techniques have been introduced based on the observation that significant amounts of connectivity can be removed without sacrificing model accuracy. Current active research strives to enhance pruning rate at the cost of additional computations Guo et al. [2016], Molchanov et al. [2017], to reduce computational complexity of pruning procedures Zhu and Gupta [2017], Han et al. [2015], and to find the underlying principles of pruning Liu et al. [2019], Frankle and Carbin [2019], to name a few.

Unfortunately, conventional pruning techniques present serious implementation challenges when deployed on highly parallel computing systems because of their sparse matrix formats. If pruning is performed in a fine-grained manner (i.e., each parameter is individually evaluated for pruning) to improve the overall pruning rate, then each row/column or block exhibits different sparsity. Consequently, row/column-wise or block-wise computations require vastly different computation latencies, leading to significantly degraded parallelism Han et al. [2016a]. On the other hand, successful DNN deployment depends on accelerating matrix multiplications and convolutions using thread-oriented computational resources such as GPUs and multi-threaded CPUs Dean et al. [2012]. Consequently, fine-grained pruning techniques have not been adopted by industry. Note that a few attempts have been made to utilize sparsity to expedite inference for a batch size of 1 Han et al. [2016a]. However, if the parameter reuse rate increases due to convolutions or a large batch size, then sparse matrices show even slower performance than dense matrices before pruning Han et al. [2016a], Zhang et al. [2016]. Recent pruning techniques therefore suggest removing connectivity in a well-structured form Wen et al. [2016], He et al. [2017] or at the block level Yu et al. [2017], Narang et al. [2017].

In this paper, we propose a new fine-grained pruning method to find an efficient sparse matrix representation based on binary index-matrix factorization. Figure 1 shows a dense matrix after pruning redundant parameters and various masking index representations. The binary index form is a regular structure that can utilize full memory bandwidth even though sparsity does not reduce memory footprint. Compressed sparse row (CSR) indexing presents an irregular structure that reduces storage requirements according to sparsity. In contrast, our proposed index compression scheme decomposes a binary index matrix into two small binary matrices in order to maintain regular structure. Binary matrix multiplication is inherently parallelizable and sparse quantized networks can be decoded efficiently using recent approaches Ahn et al. [2019]. In order to accomplish this indexing form, we propose an algorithm that finds a particular fine-grained pruning result by generating a low-rank binary index matrix. Since factorization may not exactly reproduce the original binary index matrix, we investigate whether a low-rank binary index matrix produces a pruning result while maintaining allowable model accuracy. Our proposed binary matrix factorization technique significantly reduces the amount of index data compared to using CSR, as demonstrated in the next sections. We also introduce a tiling technique to alleviate the burden of binary matrix multiplication and further improve compression ratio.
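To make the decoding step concrete, the following minimal NumPy sketch reconstructs a dense binary mask from two small binary factors via a Boolean matrix product; the matrix sizes and factor values are illustrative only.

```python
import numpy as np

def decode_mask(Bp: np.ndarray, Bq: np.ndarray) -> np.ndarray:
    """Reconstruct a dense binary pruning mask from two binary factors.

    Bp has shape (m, r) and Bq has shape (r, n); the result is the Boolean
    matrix product, i.e., mask[i, j] = OR_k (Bp[i, k] AND Bq[k, j]).
    """
    return ((Bp.astype(np.int32) @ Bq.astype(np.int32)) > 0).astype(np.uint8)

# Example: a 4x6 mask stored as (4x2) and (2x6) binary factors.
Bp = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=np.uint8)
Bq = np.array([[1, 0, 0, 1, 0, 1], [0, 1, 0, 0, 1, 0]], dtype=np.uint8)
mask = decode_mask(Bp, Bq)
print(mask.size, Bp.size + Bq.size)  # 24 mask bits represented by 20 factor bits
```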

Recently, a regular-structured index compression method has been proposed using the Viterbi algorithm Lee et al. [2018a]; it explores sequences to be used as pruning mask bits, which are decompressed by Viterbi encoders. Even though the compression rate improves over CSR, a large number of XOR gates and delay units is required for each input bit. In comparison, our proposed decompression relies on simple binary matrix multiplications and achieves even higher compression.

Figure 1: Comparison of various sparse matrix representation formats.

2 Binary Matrix Factorization for Pruning Index

Suppose that a weight matrix W is given as

(1)

Following the magnitude-based pruning method Han et al. [2015], all weights with magnitude smaller than a certain threshold value are pruned to zero. For example, for a threshold of 0.7, we obtain the following pruning-index (i.e., binary masking layer) matrix I as

(2)

It is important to note that magnitude-based pruning is not an optimal solution Lee et al. [2018a], LeCun et al. [1990], Han et al. [2016b] to maximize the pruning rate, i.e., a variety of masking layers exist for the same pruning rate.
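As a reference point for the factorization that follows, a small sketch of plain magnitude-based pruning is shown below; the matrix size, random weights, and the 70% pruning rate are illustrative choices.

```python
import numpy as np

def magnitude_prune_mask(W: np.ndarray, threshold: float) -> np.ndarray:
    """Binary pruning index: 1 keeps a weight, 0 prunes it."""
    return (np.abs(W) >= threshold).astype(np.uint8)

def threshold_for_rate(W: np.ndarray, pruning_rate: float) -> float:
    """Pick the magnitude threshold that prunes the given fraction of weights."""
    return float(np.quantile(np.abs(W), pruning_rate))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
I = magnitude_prune_mask(W, threshold_for_rate(W, 0.7))
print(I.mean())  # fraction of surviving weights, about 0.3
```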

Binary matrix factorization (BMF) Zhang et al. [2007] is a reasonable approach to compress I. For I ∈ {0,1}^{m×n}, BMF tries to find the best B_p ∈ {0,1}^{m×r} and B_q ∈ {0,1}^{r×n} to approximate I as I ≈ B_p ∘ B_q, where r corresponds to the binary matrix factorization rank and ∘ stands for binary matrix multiplication. The binary product of B_p and B_q is then defined as

(B_p \circ B_q)_{ij} = \bigvee_{k=1}^{r} \left( (B_p)_{ik} \wedge (B_q)_{kj} \right) \qquad (3)

While BMF should minimize the number of mismatched bits between I and B_p ∘ B_q, such an optimization is NP-hard. Hence, several heuristic methods for BMF have been proposed in the literature Zhang et al. [2007]. Moreover, BMF applied to I alone (without referring to W) lacks weight magnitude information. Yet, in the context of magnitude-based pruning, weights of small magnitude should be pruned with higher probability (i.e., they have lower importance). Therefore, we explore a method that preserves the importance of each weight when it is not possible to exactly reproduce I through matrix factorization.

2.1 Binary Matrix Factorization based on Non-Negative Matrix Factorization

Non-negative matrix factorization (NMF) factorizes a real-valued matrix M into two real-valued matrices W_p and W_q under the constraint that all three matrices consist of non-negative elements. Similar to singular-value decomposition (SVD), NMF attempts to minimize ||M - W_p W_q||_F, where ||·||_F denotes the Frobenius norm of a matrix. The non-negativity property of NMF is useful for the analysis of various multivariate data. Numerous numerical approximations of NMF have been suggested since an exact solution of NMF is not generally available Lee and Seung [1999], Zhang et al. [2007].

In our proposed technique, we take the magnitude of each element of W to generate M (i.e., M_ij = |W_ij|), and M is factorized by an NMF library (e.g., Zitnik and Zupan [2012]) into two matrices W_p and W_q. For example, the magnitude matrix M of the matrix W of Eq. (1) can be factorized into

(4)

where the rank r is 2.
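As an illustration of this step (the paper cites the Nimfa library; scikit-learn's NMF is used here only as a stand-in), the sketch below factorizes a magnitude matrix M = |W| with an arbitrary size and rank:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
W = rng.normal(size=(800, 500))   # dense weight matrix (size chosen arbitrarily)
M = np.abs(W)                     # magnitude matrix, all entries non-negative

r = 16                            # factorization rank
nmf = NMF(n_components=r, init="nndsvd", max_iter=500)
W_p = nmf.fit_transform(M)        # shape (800, r), non-negative
W_q = nmf.components_             # shape (r, 500), non-negative

# Relative Frobenius reconstruction error of the rank-r approximation.
print(np.linalg.norm(M - W_p @ W_q) / np.linalg.norm(M))
```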

The next step is to convert W_p and W_q into two binary matrices B_p and B_q using threshold values T_p and T_q (i.e., an element becomes 1 if it is not smaller than the corresponding threshold, and 0 otherwise). The sparsity of B_p and B_q can each be controlled by T_p and T_q. Our goal is to achieve similar sparsity between B_p and B_q. Suppose that T_p and T_q are carefully chosen such that B_p ∘ B_q produces similar sparsity as I. We obtain

(5)

and the binary product of B_p and B_q becomes

(6)

Compared with the pruning-index matrix in Eq. (2), there are 2 mismatched elements (underlined in Eq. (6)).
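The conversion and the mismatch count can be sketched as follows; T_p and T_q are placeholder thresholds, and W_p, W_q, and I are assumed to come from the previous steps.

```python
import numpy as np

def binarize(X: np.ndarray, threshold: float) -> np.ndarray:
    """Element-wise conversion: 1 where the NMF factor reaches the threshold."""
    return (X >= threshold).astype(np.uint8)

def binary_product(Bp: np.ndarray, Bq: np.ndarray) -> np.ndarray:
    """Boolean matrix product of two binary factors."""
    return ((Bp.astype(np.int32) @ Bq.astype(np.int32)) > 0).astype(np.uint8)

def mismatches(I: np.ndarray, Bp: np.ndarray, Bq: np.ndarray) -> int:
    """Number of entries where the approximated mask disagrees with I."""
    return int(np.sum(I != binary_product(Bp, Bq)))

# Illustrative use with NMF factors W_p, W_q and the original mask I:
# B_p = binarize(W_p, T_p); B_q = binarize(W_q, T_q); print(mismatches(I, B_p, B_q))
```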

The rationale behind this approach is as follows: 1) if M_ij is large, then the corresponding components of W_p and W_q (i.e., the i-th row of W_p and the j-th column of W_q) will also be large with high probability, and, correspondingly, 2) binary conversion using T_p and T_q would yield a high probability of being ‘1’ within B_p and B_q wherever the corresponding M_ij is large. Let S_I, S', S_p, and S_q be the sparsity of I, B_p ∘ B_q, B_p, and B_q, respectively. From the dot-product operation in Eq. (3), the expression for the pruning rate becomes

S' = \left( 1 - (1 - S_p)(1 - S_q) \right)^{r} \qquad (7)

assuming that the probability of a bit being ‘0’ in B_p and B_q follows S_p and S_q, respectively. Then, S_p = S_q = 1 - (1 - S'^{1/r})^{1/2}, which needs to be fine-tuned in practice. If T_p and the associated S_p are given, then T_q and S_q are automatically determined by the target pruning rate. Subsequently, given the target sparsity S and rank r, it is necessary to find a certain T_p that produces the best masking layer for pruning.
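Under this independence assumption, Eq. (7) can be inverted to obtain the starting point for S_p = S_q; the short check below uses illustrative values of the target sparsity and rank.

```python
import math

def product_sparsity(s_p: float, s_q: float, r: int) -> float:
    """Sparsity of B_p ∘ B_q when bits are independently '0' with prob. s_p, s_q (Eq. 7)."""
    return (1.0 - (1.0 - s_p) * (1.0 - s_q)) ** r

def factor_sparsity(target: float, r: int) -> float:
    """Solve Eq. (7) for s_p = s_q given a target sparsity of the product."""
    return 1.0 - math.sqrt(1.0 - target ** (1.0 / r))

r, target = 16, 0.95
s = factor_sparsity(target, r)
print(round(s, 4), round(product_sparsity(s, s, r), 4))  # the second value recovers ~0.95
```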

input : W, rank r, target sparsity S
output : B_p, B_q
1:  Generate the magnitude matrix M using W
2:  W_p, W_q = NMF(M, r)
3:  C_best ← ∞,   T_p* ← 0,   T_q* ← 0
4:  for T_p ← min(W_p) to max(W_p) do
5:     Compute S_q using Eq. (7)
6:     repeat
7:        Convert W_p, W_q into B_p, B_q w/ T_p, T_q
8:        Adjust T_q depending on sparsity(B_p ∘ B_q)
9:     until sparsity(B_p ∘ B_q) ≈ S
10:     Compute the cost C
11:     if C < C_best then
12:        C_best ← C,   T_p* ← T_p,   T_q* ← T_q
13:     end if
14:  end for
15:  Convert W_p, W_q into B_p, B_q w/ T_p*, T_q*
16:  Return B_p, B_q
Algorithm 1 Binary pruning-index-data matrix factorization

In order to optimize T_p, we define the cost function for pruning-index compression to be C = Σ_{i,j} M_ij · E_ij, where E = max(I - B_p ∘ B_q, 0) element-wise (i.e., C is the sum of the magnitudes of all weights unintentionally pruned by the binary matrix decomposition). Algorithm 1 describes a method to find the best T_p by sweeping T_p and monitoring C. Since the sparsity of B_p ∘ B_q changes monotonically with T_q, the algorithm can use binary search to expedite the adjustment of T_q.
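A compact Python sketch of this search is given below. It follows Algorithm 1 only in spirit: the sweep granularity, the binary-search tolerance, and the iteration caps are illustrative choices, and M, I, W_p, and W_q are assumed to be the magnitude matrix, the fine-grained pruning mask, and the NMF factors from the previous steps.

```python
import numpy as np

def binary_product(Bp, Bq):
    return ((Bp.astype(np.int32) @ Bq.astype(np.int32)) > 0).astype(np.uint8)

def factorize_index(M, I, W_p, W_q, target_sparsity, n_steps=50, tol=1e-3):
    """Sweep T_p; for each T_p, binary-search T_q so that B_p ∘ B_q reaches the
    target sparsity, and keep the pair that minimizes the magnitude-sum of
    weights kept by I but dropped by B_p ∘ B_q (the cost C)."""
    best_cost, best = np.inf, (None, None)
    for t_p in np.linspace(W_p.min(), W_p.max(), n_steps)[1:-1]:
        B_p = (W_p >= t_p).astype(np.uint8)
        lo, hi = W_q.min(), W_q.max()
        for _ in range(40):                       # binary search on T_q
            t_q = 0.5 * (lo + hi)
            B_q = (W_q >= t_q).astype(np.uint8)
            I_hat = binary_product(B_p, B_q)
            sparsity = 1.0 - I_hat.mean()
            if abs(sparsity - target_sparsity) < tol:
                break
            if sparsity > target_sparsity:
                hi = t_q                          # too sparse: lower the threshold
            else:
                lo = t_q
        cost = float(np.sum(M * ((I == 1) & (I_hat == 0))))
        if cost < best_cost:
            best_cost, best = cost, (B_p, B_q)
    return best
```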

2.2 MNIST Case Study

We applied our proposed pruning-index compression technique to the LeNet-5 model on the MNIST dataset. LeNet-5 consists of two convolutional layers and two fully-connected layers Han et al. [2015], Lee et al. [2018b]. Since the FC1 layer dominates the memory footprint (93%), we focus on the FC1 layer to investigate the effectiveness of our proposed method, but all layers are pruned with the same rates as in Han et al. [2015]. FC1 involves an 800×500 weight matrix, where the full rank is 500. Figure 2 plots the factor sparsity, the cost C, and test accuracy across a range of threshold values. The pruning rate for FC1 is 95% and the rank r is 16, 64, or 256. As r increases, both B_p and B_q become more sparse, and both the cost and the test accuracy improve.

Figure 2: Factor sparsity, cost C, and test accuracy for various thresholds when the FC1 layer’s pruning-index data of the LeNet-5 model is factorized by Algorithm 1 using S = 0.95 and r = 16, 64, or 256.

After pre-training for 20K iterations, pruning via BMF is performed as described in Algorithm 1. Then, retraining runs until the 60Kth iteration using the masking layer B_p ∘ B_q. The test accuracy (the average of 20 runs) is measured at the 20Kth (right after pruning), 40Kth, 50Kth, and 60Kth iterations, as shown in Table 1, along with the compression ratio computed as mn / (r(m + n)) for an m×n index matrix. Compared with other binary pruning-index formats (such as ternary quantization), our proposed binary index factorization yields much higher index compression ratios while maintaining reasonable test accuracy.
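For reference, the compression ratio column follows from storing r·(m + n) factor bits instead of m·n mask bits; assuming the 800×500 FC1 mask implied by the full rank of 500, the quick check below reproduces the values in Table 1.

```python
def index_compression_ratio(m: int, n: int, r: int) -> float:
    """Binary mask bits (m*n) divided by factored index bits r*(m+n)."""
    return (m * n) / (r * (m + n))

for r in (4, 8, 16, 32, 64, 128, 256):
    print(r, round(index_compression_ratio(800, 500, r), 1))
# 4 -> 76.9, 8 -> 38.5, 16 -> 19.2, ..., 256 -> 1.2
```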

 

Rank (r)   Accuracy (%)                          Comp. Ratio
           20Kth    40Kth    50Kth    60Kth
4          33.69    98.96    99.03    99.07      76.9
8          34.52    98.98    99.05    99.09      38.5
16         30.28    99.01    99.09    99.13      19.2
32         32.52    99.01    99.08    99.13      9.6
64         36.02    99.04    99.10    99.14      4.8
128        34.22    99.04    99.10    99.16      2.4
256        42.86    99.07    99.14    99.19      1.2

 

 

Method         Index Size   Comment
Binary         50.0KB       1 bit/weight
CSR (16-bit)   45.8KB
CSR (5-bit)    14.3KB       Relative indexing
Viterbi        10.0KB       5X encoder
Proposed       2.6KB        r = 16

 

Table 1: (Left): MNIST LeNet-5 accuracy using different ranks r. At the 20Kth iteration, we prune all layers using the magnitude-based pruning method Han et al. [2015], except the FC1 layer, where pruning is performed using Algorithm 1. Retraining is completed at the 60Kth iteration. The test accuracy of the pre-trained model is 99.2%. (Right): The index size of FC1 is compared with various index formats. Accuracy is higher than 99.0% for all formats.
Figure 3: Histogram of unpruned weights of the FC1 layer at the 20Kth iteration using the LeNet-5 model. A higher rank r results in more near-zero weights being pruned.

In general, the histogram of a weight matrix of a pre-trained model follows a Gaussian distribution Goodfellow et al. [2016]. For magnitude-based pruning, the count of near-zero values significantly reduces after pruning. To investigate which weights are pruned at different ranks, Figure 3 presents the histogram of the unpruned weight values of the FC1 layer of the LeNet-5 model at the 20Kth iteration, right after performing Algorithm 1 with the same pruning rate of 95%. The results show that the count of near-zero weights decreases as the rank r increases. Since the total count is the same in Figure 3 for different ranks r, a low count of near-zero weights in the histogram implies that fewer weights of large magnitude are pruned. As such, there is a trade-off between the rank and accuracy, even though the accuracy drop is reasonable for a wide range of rank values, as shown in Table 1.

Table 1 also compares index size for the FC1 layer using different formats while keeping pruning rate at 95.0%. Reduction in index size for CSR can be achieved using a relative index scheme as described in Han et al. [2016b]. In the case of Viterbi-based pruning, we assume the Viterbi decompressor has 40 outputs, 10-bit comparators, and a skip state of 1 Lee et al. [2018a]. Our proposed pruning method using binary matrix factorization yields the best compression ratio.

3 Index Matrix Tiling and Weight Manipulation to Lower Rank

If the rank r is increased to drop more near-zero weights, the compression ratio must be sacrificed. To decrease the count of near-zero weights surviving after pruning for a fixed rank r, we propose two techniques.

3.1 Tile-Based Binary Matrix Factorization

Weight matrix size increases quadratically with the number of neurons. If the whole binary matrix multiplication were to be performed within a chip, then the on-chip memory size would be prohibitively large. In order to enhance the scalability of our proposed factorization method, we propose tile-based factorization as illustrated in Figure 4. A binary index matrix is first tiled into multiple blocks. Then, each block is factorized independently with the same rank. Note that the tile size and/or the reshaping of the original binary index matrix can be varied to match the underlying parallel computing resources. Because NMF is solved by an iterative method in general, tiling not only reduces the required on-chip memory size but also reduces the NMF computation time.

Tiling also affects the quality of binary matrix factorization. Note that when X_1, ..., X_n are random samples of size n from a distribution with mean μ and variance σ², the sample mean has expectation μ and variance σ²/n. Indeed, Figure 4 depicts a larger variance of weight values after NMF when the number of tiles increases (a random Gaussian weight distribution is first assumed). Correspondingly, W_p and W_q also show larger variance, with longer tails in the distribution, as the number of blocks grows, as shown in Figure 5. Such increased variance of W_p and W_q with smaller sample size is desirable for binary conversion of the NMF result. For example, compared with {0.98, 1.0, 1.2}, the set {0.5, 1.0, 1.5} presents a larger spectrum of binary conversion threshold values for T_p and T_q (and hence increases the chance to further optimize the cost function).
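A minimal sketch of tile-based factorization is shown below: the magnitude matrix is split into equal blocks and each block is factorized independently with the same rank. The block grid, rank, and the scikit-learn NMF call are illustrative; the paper's tiling may also reshape the matrix to match the hardware.

```python
import numpy as np
from sklearn.decomposition import NMF

def tiled_nmf(M: np.ndarray, tiles_rows: int, tiles_cols: int, r: int) -> dict:
    """Factorize each (m/tiles_rows) x (n/tiles_cols) block of M independently."""
    m, n = M.shape
    bm, bn = m // tiles_rows, n // tiles_cols
    factors = {}
    for i in range(tiles_rows):
        for j in range(tiles_cols):
            block = M[i * bm:(i + 1) * bm, j * bn:(j + 1) * bn]
            nmf = NMF(n_components=r, init="nndsvd", max_iter=300)
            factors[(i, j)] = (nmf.fit_transform(block), nmf.components_)
    return factors

rng = np.random.default_rng(0)
M = np.abs(rng.normal(size=(800, 500)))
factors = tiled_nmf(M, 2, 2, 64)   # four 400x250 blocks, rank 64 each
```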

Figure 4: (Left): Binary matrix factorization using one or four tiles when the compression ratio is the same. (Right): Weight histogram before and after NMF using different numbers of tiles.
Figure 5: Histogram of W_p and W_q values using different numbers of tiles.
Figure 6: Histogram of unpruned FC1 layer weights of LeNet-5 with different block configurations. Each submatrix is factorized by NMF using a different rank depending on the tiling plan so as to present the same overall compression ratio. Rank 128, 64, or 32 is used for 1 (800×500), 4 (400×250), or 16 (200×125) tiling, respectively.

Figure 6 plots histograms of unpruned FC1 layer weights of the LeNet-5 model on MNIST across different tiling methods. The FC1 weight matrix (of 800×500 size) is tiled into 1 (800×500), 4 (400×250), or 16 (200×125) submatrices, while the index compression ratio is the same for all three tiling cases. Since the size of a submatrix differs for each partitioning plan, the rank for factorizing a submatrix is adjusted accordingly in Figure 6. Notice that increasing the number of blocks yields sharper drops of near-zero weights. Furthermore, if tiling produces submatrices with different properties (e.g., an embedding matrix in natural language models), each submatrix can optimally choose a different rank.
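The constant-compression relationship behind these rank choices can be checked directly: for a b×b tiling of an m×n mask with per-tile rank r, the index storage is b²·r·(m/b + n/b) bits, so halving the tile dimensions while halving the rank leaves the compression ratio unchanged. A quick check under the 800×500 FC1 assumption:

```python
def tiled_index_bits(m: int, n: int, b: int, r: int) -> int:
    """Index bits when an m x n mask is split into b x b equal tiles,
    each factorized with rank r."""
    return b * b * r * (m // b + n // b)

for b, r in ((1, 128), (2, 64), (4, 32)):
    bits = tiled_index_bits(800, 500, b, r)
    print(b * b, r, round(800 * 500 / bits, 2))   # 1/4/16 tiles, ratio 2.4 each
```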

3.2 Weight Magnitude Manipulation

If C is given as the magnitude-sum of unintentionally pruned weights, then it is still possible that a weight of large magnitude is pruned. To prevent large weights from being pruned while keeping the definition of C, the magnitude of weights can be pre-processed. For example, in the context of magnitude-based pruning, artificially increasing already-large weight values can further lower their probability of being pruned. Note that such weight-magnitude manipulation is only used temporarily for pruning-index data compression, and not for normal training or inference steps.

Figure 7 plots histograms of unpruned weights of the FC1 layer of LeNet-5 using different weight-magnitude manipulation methods. In Method 3, weights larger than a threshold are multiplied by an amplification factor. We observe the sharpest drop of weights around the threshold value and the highest count of large weights with Method 3. Finding the best weight manipulation method is an open problem.
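A sketch of this kind of manipulation is given below; the amplification factor gamma is a hypothetical choice (the exact factor used for Method 3 is not reproduced here), and the manipulated magnitudes would be fed to the NMF step only, never used for training or inference.

```python
import numpy as np

def amplify_large_weights(W: np.ndarray, pruning_rate: float, gamma: float = 2.0) -> np.ndarray:
    """Return a manipulated magnitude matrix for pruning-index factorization only.

    Magnitudes above the magnitude-pruning threshold are multiplied by gamma
    (a hypothetical factor) so that BMF is less likely to prune them.
    """
    M = np.abs(W)
    threshold = float(np.quantile(M, pruning_rate))
    M_manip = M.copy()
    M_manip[M > threshold] *= gamma
    return M_manip

rng = np.random.default_rng(0)
W = rng.normal(size=(800, 500))
M_manip = amplify_large_weights(W, pruning_rate=0.95)  # feed this to NMF instead of |W|
```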

Figure 7: Histogram of unpruned FC1 layer weights of LeNet-5 with different weight-magnitude manipulation methods (Method 1: no manipulation; Methods 2 and 3: a weight is amplified if its magnitude is larger than a threshold value given by magnitude-based pruning, using different amplification functions).

4 Experimental Results

We verify our index data compression technique using ResNet32 He et al. [2016] on CIFAR10, AlexNet Krizhevsky et al. [2012] on ImageNet, and a recurrent neural network (RNN) model on the PTB dataset Marcus et al. [1994], as shown in Table 2. We do not apply binary matrix factorization to layers whose parameter size is relatively small. Such layers usually present small pruning rates Han et al. [2015], Zhu and Gupta [2017], and their index compression ratio is also small Yu et al. [2017]. For example, Viterbi-based index compression techniques yield compression ratios of only about 1× for the index data of convolutional layers in AlexNet Lee et al. [2018a].

For ResNet32, our pruning-index compression achieves a 70% pruning rate with 91.8% accuracy, providing over 3× index compression, as shown in Table 2 (experimental data using various rank selections is provided in the Appendix). Note that binary pruning-index matrix decomposition is performed after multiplying the weights exceeding the pruning threshold by an amplification factor, as a method to drop more near-zero weights (Section 3.2).

For AlexNet, we focus on compressing the index data of the FC5 and FC6 layers, which occupy 90% of the total model size. Both layers are pruned to a pruning rate of 91% Han et al. [2015] using Algorithm 1. We achieve over 8× compression of the pruning-index data while maintaining full-precision accuracy. FC5 and FC6 weights are tiled into small blocks, given their large matrix sizes. The BMF of Algorithm 1 is performed on those blocks with ranks 32 and 64, resulting in faster factorization and a reduced cost C.

An RNN model with one long short-term memory (LSTM) layer of size 300 Xu et al. [2018] on the PTB dataset, with performance measured in perplexity per word (PPW), is also pruned by our pruning-index-compression method. Note that the embedding and softmax matrices usually account for a major portion of the memory footprint because of the large vocabulary size in neural language models, while these two matrices have several distinct properties compared with general weight matrices Chen et al. [2018].

To compare the compression ratio of various sparse matrix representation schemes, we choose the AlexNet FC5 and FC6 layers. Table 3 shows the index size of a binary matrix representation, the CSR format with 16-bit indexing, the CSR format with 5-bit indexing (relative indexing as introduced in Han et al. [2016b]), the Viterbi-based representation Lee et al. [2018a], and our proposed representation. Our proposed network pruning algorithm and index representation using binary matrix factorization significantly reduce the amount of indexing data while maintaining sparsity at the same level as fine-grained pruning. Lastly, while Viterbi-based compression only allows integer-valued compression ratios, our proposed compression technique enables a much wider range of compression ratios (as rational numbers) by controlling the rank used for index pruning.


 

Model                 Pre-trained Model                                          Pruned Model using the Proposed Method
                      Size                        Accuracy                       Pruning Rate   Rank          Comp. Ratio   Accuracy
ResNet32 on CIFAR10   460.76K                     92.5%                          0.70           8/16/32¹      3.09          91.8%
                                                                                                8/8/8¹        5.12          91.5%
AlexNet on ImageNet   9K×4K (FC5), 4K×4K (FC6)    80.3% (top-5), 57.6% (top-1)   0.91           32, tiled²    8.20          80.4% (top-5), 57.1% (top-1)
(FC5, FC6)                                                                                      64, tiled²    4.14          80.3% (top-5), 57.2% (top-1)
LSTM on PTB           6.41M                       89.6 PPW                       0.60           145           1.82          89.0 PPW

¹ Different ranks are applied to 3 groups of layers according to the number of input channels (16, 32, or 64).

² The FC5 and FC6 layers are tiled into 16×8 blocks (of size 576×512) and 8×8 blocks (of size 512×512), respectively.

Table 2: Compression ratio and accuracy of the proposed pruning-index compression method on various DNN models.

 

Method         FC5 Index Size   FC6 Index Size   Sum       Comment
Binary         4608KB           2048KB           6656KB    1 bit/weight
CSR (16-bit)   6962KB           3099KB           10061KB
CSR (5-bit)    2176KB           968KB            3144KB    Relative indexing
Viterbi        922KB            410KB            1331KB    5X encoder
Proposed       556KB            256KB            812KB     r = 32, tiled

 

Table 3: The index size of the FC5 and FC6 layers of AlexNet is compared with various index formats when the pruning rate is the same (0.91) for both FC5 and FC6. Top-1 accuracy is higher than 57.0% for all formats.

5 Conclusion

This paper proposes a fine-grained network pruning technique that produces low-rank binary index matrices. We confirm that various DNN models can be pruned by our low-rank indexing scheme while preserving sparsity. Such binary matrix multiplication enables not only high compression ratios but also highly parallel operations on sparse networks. Our proposed tiling method and weight-magnitude manipulation schemes further lower the rank. Unlike previous studies, this work demonstrates that fine-grained pruning can be represented by a highly regular structured format.

References

  • Ahn et al. [2019] D. Ahn, D. Lee, T. Kim, and J.-J. Kim. Double Viterbi: Weight encoding for high compression ratio and fast on-chip reconstruction for deep neural network. In International Conference on Learning Representations (ICLR), 2019.
  • Chen et al. [2018] P. Chen, S. Si, Y. Li, C. Chelba, and C.-J. Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, 2018.
  • Dean et al. [2012] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
  • Frankle and Carbin [2019] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
  • Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • Guo et al. [2016] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, 2016.
  • Han et al. [2015] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • Han et al. [2016a] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254, 2016a.
  • Han et al. [2016b] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016b.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • He et al. [2017] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • LeCun et al. [1990] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
  • Lee et al. [2018a] D. Lee, D. Ahn, T. Kim, P. I. Chuang, and J.-J. Kim. Viterbi-based pruning for sparse matrix with fixed and high index compression ratio. In International Conference on Learning Representations (ICLR), 2018a.
  • Lee et al. [2018b] D. Lee, P. Kapoor, and B. Kim. Deeptwist: Learning model compression via occasional weight distortion. arXiv:1810.12823, 2018b.
  • Lee and Seung [1999] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
  • Liu et al. [2019] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2019.
  • Marcus et al. [1994] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, pages 114–119, 1994.
  • Molchanov et al. [2017] D. Molchanov, A. Ashukha, and D. P. Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML), pages 2498–2507, 2017.
  • Narang et al. [2017] S. Narang, E. Undersander, and G. F. Diamos. Block-sparse recurrent neural networks. arXiv:1711.02782, 2017.
  • Wen et al. [2016] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2082–2090, 2016.
  • Xu et al. [2018] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha. Alternating multi-bit quantization for recurrent neural networks. In International Conference on Learning Representations (ICLR), 2018.
  • Yu et al. [2017] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 548–560, 2017.
  • Zhang et al. [2016] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. Cambricon-x: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 20:1–20:12, 2016.
  • Zhang et al. [2007] Z. Zhang, T. Li, C. H. Q. Ding, and X. Zhang. Binary matrix factorization with applications. In IEEE International Conference on Data Mining, 2007.
  • Zhu and Gupta [2017] M. Zhu and S. Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017.
  • Zitnik and Zupan [2012] M. Zitnik and B. Zupan. Nimfa: A python library for nonnegative matrix factorization. Journal of Machine Learning Research, 13:849–853, 2012.

Appendix A Supplementary Experimental Results on ResNet32 using CIFAR-10

Table 4 lists the accuracy of ResNet32 using various ranks and pruning rates. The pruned models using our proposed binary matrix decomposition show slightly degraded accuracy compared to the baseline pruning method (the bottom row of Table 4). It can be observed that, in general, there exists a trade-off between the compression ratio and accuracy.


                           Pruning Rate
Rank¹        Comp. Ratio   0.60     0.70     0.80
4/4/4        10.29         92.0%    91.5%    90.5%
4/8/16       6.74          92.2%    91.5%    90.9%
8/8/8        5.12          92.0%    91.5%    90.9%
8/16/32      3.09          92.4%    91.8%    91.1%
16/16/16     2.56          92.2%    91.8%    91.0%
16/32/64     1.55          92.3%    91.7%    91.2%
w/o BMF      1             92.4%    92.2%    92.0%
  • Different ranks are applied to 3 groups of layers according to the number of input channels (16, 32, or 64).

  • Retraining is performed for 65K iterations.

Table 4: Model accuracy of ResNet32 with CIFAR-10 using different ranks and pruning rates. The bottom row presents the results of the baseline pruning method without binary matrix factorization.