# Pushing the limits of RNN Compression

Urmish Thakker, Igor Fedorov, Jesse Beu, Dibakar Gope, Chu Zhou, Ganesh Dasika, Matthew Mattina
Arm ML Research Lab
Currently at AMD Research
###### Abstract

Recurrent Neural Networks (RNNs) can be difficult to deploy on resource-constrained devices due to their size. As a result, there is a need for compression techniques that can significantly compress RNNs without negatively impacting task accuracy. This paper introduces a method to compress RNNs for resource-constrained environments using the Kronecker product (KP). KPs can compress RNN layers by with minimal accuracy loss. We show that KP can beat the task accuracy achieved by other state-of-the-art compression techniques across 4 benchmarks spanning 3 different applications, while simultaneously improving inference run-time.

## 1 Introduction

Recurrent Neural Networks (RNNs) achieve state-of-the-art (SOTA) accuracy for many applications that use time-series data. As a result, RNNs can benefit important Internet-of-Things (IoT) applications like wake-word detection Zhang et al. (2017), human activity recognition Hammerla et al. (2016); Roggen et al. (2010), and predictive maintenance. IoT applications typically run on highly constrained devices. Due to their energy, power, and cost constraints, IoT devices frequently use low-bandwidth memory technologies and smaller caches compared to desktop and server processors. For example, some IoT devices have 2 KB of RAM and 32 KB of flash memory. The size of typical RNN layers can prohibit their deployment on IoT devices or reduce execution efficiency Thakker et al. (2019b). Thus, there is a need for a compression technique that can drastically compress RNN layers without sacrificing task accuracy.

First, we study the efficacy of traditional compression techniques like pruning Zhu and Gupta (2017) and low-rank matrix factorization (LMF) Kuchaiev and Ginsburg (2017); Grachev et al. (2017). We set a compression target of or more and observe that neither pruning nor LMF can achieve the target compression without significant loss in accuracy. We then investigate why traditional techniques fail, focusing on their influence on the rank and condition number of the compressed RNN matrices. We observe that pruning and LMF tend to either decrease matrix rank or lead to ill-conditioned matrices and matrices with large singular values.

To remedy the drawbacks of existing compression methods, we propose to use Kronecker Products (KPs) to compress RNN layers. We refer to the resulting models as KPRNNs. We show that our approach achieves SOTA compression on IoT-targeted benchmarks without sacrificing wall-clock inference time or accuracy.

## 2 Related work

KPs have been used in the deep learning community in the past Jose et al. (2017); Zhou and Wu (2015). For example, Zhou and Wu (2015) use KPs to compress fully connected (FC) layers in AlexNet. We deviate from Zhou and Wu (2015) by using KPs to compress RNNs and, instead of learning the decomposition for fixed RNN layers, we learn the KP factors directly. Additionally, Zhou and Wu (2015) do not examine the impact of compression on inference run-time. In Jose et al. (2017), KPs are used to stabilize RNN training through a unitary constraint. A detailed discussion of how the present work differs from Jose et al. (2017) can be found in Section 3.

The research in neural network (NN) compression can be roughly categorized into 4 topics: pruning Zhu and Gupta (2017), structured matrix based techniques Cheng et al. (2015), quantization Hubara et al. (2016); Gope et al. (2019) and tensor decomposition Kuchaiev and Ginsburg (2017); Thakker et al. (2019a). Compression using structured matrices translates into inference speed-up, but only for matrices of size and larger Thomas et al. (2018) on CPUs or when using specialized hardware Cheng et al. (2015). As such, we restrict our comparisons to pruning and tensor decomposition.

## 3 Kronecker Product Recurrent Neural Networks

### 3.1 Background

Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{m_1 \times n_1}$, and $C \in \mathbb{R}^{m_2 \times n_2}$. Then, the KP between $B$ and $C$ is given by

$$A = B \otimes C \tag{1}$$

$$A = \begin{bmatrix}
b_{1,1} \circ C & b_{1,2} \circ C & \dots & b_{1,n_1} \circ C \\
b_{2,1} \circ C & b_{2,2} \circ C & \dots & b_{2,n_1} \circ C \\
\vdots & \vdots & \ddots & \vdots \\
b_{m_1,1} \circ C & b_{m_1,2} \circ C & \dots & b_{m_1,n_1} \circ C
\end{bmatrix}$$

where $m = m_1 m_2$, $n = n_1 n_2$, and $\circ$ is the Hadamard product, which here scales every element of $C$ by the scalar $b_{i,j}$. The variables $B$ and $C$ are referred to as the Kronecker factors of $A$. The number of such Kronecker factors can be 2 or more. If the number of factors is more than 2, we can use (1) recursively to calculate the resultant larger matrix. For example, in the following equation:

$$W = W_1 \otimes W_2 \otimes W_3 \tag{2}$$

$W$ can be evaluated by first evaluating $W_2 \otimes W_3$ to a partial result, say $R$, and then evaluating $W_1 \otimes R$.
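
The definition in (1) and the recursive evaluation of (2) can be sketched in a few lines of pure Python. This is an illustrative sketch only: the function names `kron` and `kron_chain` are hypothetical, matrices are assumed to be lists of rows, and the chain is folded left to right, which gives the same result as any other grouping because the Kronecker product is associative.

```python
from functools import reduce


def kron(b, c):
    """Kronecker product of two matrices stored as lists of rows.

    Row (i * m2 + k) of the result is row i of b with every entry
    multiplied into row k of c, matching the block layout in (1).
    """
    return [
        [b_ij * c_kl for b_ij in b_row for c_kl in c_row]
        for b_row in b
        for c_row in c
    ]


def kron_chain(factors):
    """Kronecker product of two or more factors, applied pairwise."""
    return reduce(kron, factors)


if __name__ == "__main__":
    b = [[1, 2], [3, 4]]
    c = [[0, 1], [1, 0]]
    for row in kron(b, c):
        print(row)
    # prints:
    # [0, 1, 0, 2]
    # [1, 0, 2, 0]
    # [0, 3, 0, 4]
    # [3, 0, 4, 0]
```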

Expressing a large matrix A as a KP of two or more smaller Kronecker factors can lead to significant compression. For example, can be decomposed into Kronecker factors and . The result is a reduction in the number of parameters required to store . Of course, compression can lead to accuracy degradation, which motivates the present work.
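
The parameter arithmetic behind this kind of compression is easy to check by hand. The sketch below counts parameters for a dense matrix and for its two Kronecker factors. The shapes used in the example (a 256×256 matrix split into two 16×16 factors) are illustrative assumptions chosen for round numbers, not sizes taken from the text above, and the function names are hypothetical.

```python
def kp_param_counts(m1, n1, m2, n2):
    """Return (dense_params, factor_params) for A = B kron C.

    B has shape (m1, n1), C has shape (m2, n2), so A has shape
    (m1 * m2, n1 * n2).
    """
    dense_params = (m1 * m2) * (n1 * n2)
    factor_params = m1 * n1 + m2 * n2
    return dense_params, factor_params


def kp_compression_factor(m1, n1, m2, n2):
    """Ratio of dense parameters to Kronecker-factor parameters."""
    dense_params, factor_params = kp_param_counts(m1, n1, m2, n2)
    return dense_params / factor_params


if __name__ == "__main__":
    # Assumed example: 256 x 256 matrix as a KP of two 16 x 16 factors.
    print(kp_param_counts(16, 16, 16, 16))        # (65536, 512)
    print(kp_compression_factor(16, 16, 16, 16))  # 128.0
```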

### 3.2 Prior work on using KP to stabilize RNN training flow

Jose et al. (2017) used KP to stabilize the training of vanilla RNNs. An RNN layer has two sets of weight matrices: input-hidden and hidden-hidden (also known as recurrent). Jose et al. (2017) use Kronecker factors of size to replace the hidden-hidden matrices of every RNN layer. Thus a traditional RNN cell, represented by:

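$$h_t = f\left([W_x \;\; W_h] * [x_t; h_{t-1}]\right) \tag{3}$$

is replaced by,

$$h_t = f\left([W_x \;\; W_0 \otimes W_1 \otimes \dots \otimes W_{F-1}] * [x_t; h_{t-1}]\right) \tag{4}$$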

where (input-hidden matrix) , (hidden-hidden or recurrent matrix) , for , , , and . Thus a sized matrix is expressed as a KP of 8 matrices of size . For an RNN layer with input and hidden vectors of size 256, this can potentially lead to compression (as we only compress the matrix). The aim of Jose et al. (2017) was to stabilize RNN training to avoid vanishing and exploding gradients. They add a unitary constraint to these matrices, stabilizing RNN training. However, in order to regain baseline accuracy, they needed to increase the size of the RNN layers significantly, leading to more parameters being accumulated in the matrix in (4). Thus, while they achieve their objective of stabilizing vanilla RNN training, they achieve only minor compression (). In this paper, we show how to use KP to compress both the input-hidden and hidden-hidden matrices of vanilla RNN, LSTM and GRU cells and achieve significant compression (). We show how to choose the size and the number of Kronecker factor matrices to ensure high compression rates , minor impact on accuracy, and inference speed-up over baseline on an embedded CPU.

### 3.3 KPRNN Layer

#### Choosing the number of Kronecker factors:

A matrix expressed as a KP of multiple Kronecker factors can lead to significant compression. However, deciding the number of factors is not obvious. We started by exploring the framework of Jose et al. (2017). We used Kronecker factor matrices for hidden-hidden/recurrent matrices of LSTM layers of the key-word spotting network Zhang et al. (2017). This resulted in an approximately reduction in the number of parameters. However, the accuracy dropped by 4% relative to the baseline. When we examined the matrices, we observed that, during training, the values of some of the matrices hardly changed after initialization. This behavior may be explained by the fact that the gradient flowing back into the Kronecker factors vanishes as it gets multiplied with the chain of matrices during back-propagation. In general, our observations indicated that as the number of Kronecker factors increased, training became harder, leading to significant accuracy loss when compared to baseline.

Additionally, using a chain of matrices leads to significant slow-down during inference on a CPU. For inference on IoT devices, it is safe to assume that the batch size will be one. When the batch size is one, the RNN cells compute matrix-vector products during inference. To calculate the matrix-vector product, we need to multiply and expand all of the Kronecker factors to calculate the resultant larger matrix, before executing the matrix-vector multiplication. Referring to (4), we need to multiply $W_0 \otimes W_1 \otimes \dots \otimes W_{F-1}$ to create $W_h$ before executing the operation $[W_x \;\; W_h] * [x_t; h_{t-1}]$. The process of expanding the Kronecker factors to a larger matrix, followed by matrix-vector products, leads to slower inference than the original uncompressed baseline. Thus, inference for RNNs represented using (3) is faster than for the compressed RNN represented using (4). The same observation applies whenever the number of Kronecker factors is greater than 2. The slowdown with respect to baseline increases with the number of factors and can be anywhere between .

However, if the number of Kronecker factors is restricted to two, we can avoid expanding the Kronecker factors into the larger matrix and achieve speed-up during inference. Algorithm 1 shows how to calculate the matrix vector product when the matrix is expressed as a KP of two Kronecker factors. The derivation of this algorithm can be found in Nagy (2010).
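
The idea of multiplying by a two-factor KP without expanding it can be illustrated with the standard identity $(B \otimes C)\,\mathrm{vec}(X) = \mathrm{vec}(B X C^{\top})$, where $\mathrm{vec}$ flattens row by row. The pure-Python sketch below is an assumption-labeled illustration of that identity, not a transcription of Algorithm 1 (which is not reproduced in this text); the names `kp_matvec`, `dense_matvec`, and `kron` are hypothetical, and `kron` is included only to check the result against the expanded matrix.

```python
def kron(b, c):
    """Reference only: explicitly expand b kron c (lists of rows)."""
    return [
        [bv * cv for bv in b_row for cv in c_row]
        for b_row in b
        for c_row in c
    ]


def dense_matvec(a, x):
    """Plain matrix-vector product for a matrix stored as lists of rows."""
    return [sum(av * xv for av, xv in zip(row, x)) for row in a]


def kp_matvec(b, c, x):
    """Compute (b kron c) @ x without forming the Kronecker product.

    b has shape (m1, n1), c has shape (m2, n2), x has length n1 * n2.
    """
    m2 = len(c)
    n1, n2 = len(b[0]), len(c[0])
    if len(x) != n1 * n2:
        raise ValueError("x must have length n1 * n2")
    # Reshape x row by row into an (n1, n2) matrix X.
    x_mat = [x[j * n2:(j + 1) * n2] for j in range(n1)]
    # t = X @ C^T, shape (n1, m2): each row of X is multiplied by C.
    t = [dense_matvec(c, x_row) for x_row in x_mat]
    # y = B @ t, shape (m1, m2), flattened row by row.
    return [
        sum(b_row[j] * t[j][k] for j in range(n1))
        for b_row in b
        for k in range(m2)
    ]


if __name__ == "__main__":
    b = [[1, 2], [3, 4]]
    c = [[0, 1], [1, 0]]
    x = [1, 2, 3, 4]
    print(kp_matvec(b, c, x))               # [10, 7, 22, 15]
    print(dense_matvec(kron(b, c), x))      # [10, 7, 22, 15]
```

In this sketch the two small products cost $n_1 n_2 m_2 + m_1 n_1 m_2$ multiplications, compared with $m_1 m_2 n_1 n_2$ for a dense matrix-vector product of the same shape, which is where the inference speed-up for two factors comes from.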

#### Choosing the dimensions of Kronecker factors:

A matrix can be expressed as a KP of two Kronecker factors of varying sizes. The compression factor is a function of the size of the Kronecker factors. For example, a matrix can be expressed as a KP of and matrices, leading to a reduction in the number of parameters used to store the matrix. However, if we use Kronecker factors of size and , we achieve a compression factor of . In this paper, we choose the dimensions of the factors to achieve maximum compression using Algorithm 2.
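
Algorithm 2 is not reproduced in this text, so the following is only one plausible way to search for factor shapes, written as a hedged sketch. It enumerates every pair of factor shapes whose KP has the target shape and keeps the pair with the fewest parameters. The full-rank filter (skipping shapes whose product could never reach rank $\min(m, n)$) is an assumption added here so that the search does not collapse to a rank-one outer product; the function names are hypothetical.

```python
def divisors(n):
    """All positive divisors of n in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def best_factor_shapes(m, n):
    """Search factor shapes (m1, n1), (m2, n2) with m1*m2 == m, n1*n2 == n.

    Returns (param_count, (m1, n1), (m2, n2)) for the pair with the fewest
    parameters among shapes that can still produce a full-rank m x n matrix.
    """
    best = None
    for m1 in divisors(m):
        for n1 in divisors(n):
            m2, n2 = m // m1, n // n1
            # Assumed constraint: rank(B kron C) = rank(B) * rank(C), so the
            # product can be full rank only if this bound reaches min(m, n).
            if min(m1, n1) * min(m2, n2) < min(m, n):
                continue
            params = m1 * n1 + m2 * n2
            if best is None or params < best[0]:
                best = (params, (m1, n1), (m2, n2))
    return best


if __name__ == "__main__":
    # Assumed example shape, chosen for illustration only.
    print(best_factor_shapes(256, 256))  # (512, (16, 16), (16, 16))
```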

#### Compressing LSTMs, GRUs and RNNs using the KP:

KPRNN cells are RNN, LSTM and GRU cells with all of the matrices compressed by replacing them with KPs of two smaller matrices. For example, the RNN cell depicted in (3) is replaced by:

$$\text{KPRNN cell:} \quad h_t = f\left((W_1 \otimes W_2) * [x_t; h_{t-1}]\right) \tag{5}$$

where , , , , and . LSTM, GRU and FastRNN cells are compressed in a similar fashion. Instead of starting with a trained network and decomposing its matrices into Kronecker factors, we replace the RNN/LSTM/GRU cells in a NN with their KP equivalents and train the entire model from the beginning.
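
A single forward step of the cell in (5) can be sketched as follows. This is a minimal illustration under stated assumptions: `tanh` is assumed for the non-linearity $f$, biases are omitted as in (5), the names `kprnn_step` and `kp_matvec` are hypothetical, and the shapes in the example (hidden size 4, input size 2, so that $W_1 \otimes W_2$ is $4 \times 6$) are chosen only for illustration. Training, which the text says starts from scratch on the factors, is not shown.

```python
import math


def kp_matvec(w1, w2, v):
    """Compute (w1 kron w2) @ v without forming the Kronecker product."""
    m2 = len(w2)
    n1, n2 = len(w1[0]), len(w2[0])
    if len(v) != n1 * n2:
        raise ValueError("v must have length n1 * n2")
    # Reshape v row by row into an (n1, n2) matrix, then form V @ W2^T.
    v_mat = [v[j * n2:(j + 1) * n2] for j in range(n1)]
    t = [
        [sum(a * b for a, b in zip(v_row, w2_row)) for w2_row in w2]
        for v_row in v_mat
    ]
    # Multiply by W1 and flatten row by row.
    return [
        sum(w1_row[j] * t[j][k] for j in range(n1))
        for w1_row in w1
        for k in range(m2)
    ]


def kprnn_step(w1, w2, x_t, h_prev, f=math.tanh):
    """One step of h_t = f((W1 kron W2) * [x_t; h_{t-1}])."""
    z = kp_matvec(w1, w2, list(x_t) + list(h_prev))
    return [f(value) for value in z]


if __name__ == "__main__":
    w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # shape (2, 3)
    w2 = [[1.0, 0.0], [0.0, 1.0]]            # shape (2, 2)
    x_t = [0.5, -0.5]                        # input of size 2
    h_prev = [1.0, 0.0, 0.0, 0.0]            # hidden state of size 4
    print(kprnn_step(w1, w2, x_t, h_prev))
```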

## 4 Results

#### Other compression techniques evaluated:

We compare networks compressed using KPRNN with three techniques - pruning, LMF and Small Baseline.

#### Training platform, infrastructure, and inference run-time measurement:

We use TensorFlow 1.12 as the training platform and 4 Nvidia RTX 2080 GPUs to train our benchmarks. To measure the inference run-time, we implement the baseline and the compressed cells in C++ using the Eigen library and run them on the Arm Cortex-A73 core of a HiKey 960 development board.

#### Dataset and benchmarks:

We evaluate the impact of compression using the techniques discussed in Section 3 on a wide variety of benchmarks. Table 1 shows the benchmarks used in this work.

### 4.1 KPRNN networks

Table 2 shows the results of applying the KP compression technique across a wide variety of applications and RNN cells. As mentioned in Section 3, we target the point of maximum compression using two matrix factors.

### 4.2 Possible explanation for the accuracy difference between KPRNN, pruning, and LMF

In general, the poor accuracy of LMF can be attributed to significant reduction in the rank of the matrix (generally ). KPs, on the other hand, will create a full-rank matrix if the Kronecker factors are full rank Laub (2005). We observe that the Kronecker factors of all the compressed benchmarks are full rank. A full-rank matrix can also lead to poor accuracy if it is ill-conditioned. However, the condition numbers of the matrices of the best-performing KP compressed networks discussed in this paper are in the range of to .

To prune a network to the same compression factor as KP, networks need to be pruned to 94% sparsity or above. Pruning FastRNN cells to the required compression factor leads to an ill-conditioned matrix. This may explain the poor accuracy of sparse FastRNN networks. However, for other pruned networks, the resultant sparse matrices have a condition number less than and are full rank. Thus, condition number does not explain the loss in accuracy for these benchmarks.

To further understand the loss in accuracy of pruned LSTM networks, we looked at the singular values of the resultant sparse matrices in the KWS-LSTM network. Let $y = Ax$. The largest singular value of $A$ upper-bounds $\|y\|_2 / \|x\|_2$, i.e. the amplification applied by $A$. Thus, a matrix with a larger singular value can lead to an output with a larger norm Trefethen and Bau (1997). Since RNNs execute a matrix-vector product followed by a non-linear sigmoid or tanh layer, the output will saturate if the value is large. The matrix in the LSTM layer of the best-performing pruned KWS-LSTM network has its largest singular value in the range of to while the baseline KWS-LSTM network learns an LSTM layer matrix with largest singular value of and the Kronecker product compressed KWS-LSTM network learns LSTM layers with singular values less than . This might explain the especially poor results achieved after pruning this benchmark. Similar observations can be made for the pruned HAR1 network.
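
The rank and conditioning argument rests on standard textbook properties of the Kronecker product, summarized below as a reminder. These are general linear-algebra facts rather than results specific to this paper. Here $\sigma_i(\cdot)$ denotes a singular value and $\kappa(\cdot)$ the 2-norm condition number; the condition-number identity is stated for square, invertible factors to keep the assumptions simple.

```latex
\begin{aligned}
\operatorname{rank}(B \otimes C) &= \operatorname{rank}(B)\,\operatorname{rank}(C), \\
\{\text{nonzero singular values of } B \otimes C\} &= \{\sigma_i(B)\,\sigma_j(C)\ :\ \sigma_i(B) > 0,\ \sigma_j(C) > 0\}, \\
\sigma_{\max}(B \otimes C) &= \sigma_{\max}(B)\,\sigma_{\max}(C), \\
\kappa(B \otimes C) &= \kappa(B)\,\kappa(C), \\
\|A x\|_2 &\le \sigma_{\max}(A)\,\|x\|_2 .
\end{aligned}
```

Read together, these say that well-conditioned, full-rank factors with moderate singular values yield a product matrix with the same properties, which is consistent with the explanation offered above.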

## 5 Conclusion

We show how to compress RNN cells by to using Kronecker products. We refer to cells compressed using Kronecker products as KPRNNs. KPRNNs can act as a drop-in replacement for most RNN layers and provide the benefit of significant compression with marginal impact on accuracy. None of the other compression techniques (pruning, LMF) match the accuracy of the Kronecker compressed networks. We show that this compression technique works across 5 benchmarks that represent key applications in the IoT domain.

## References

• [1] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S. Chang (2015-12) An exploration of parameter redundancy in deep networks with circulant projections. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2857–2865. ISSN 2380-7504.
• [2] D. Gope, G. Dasika, and M. Mattina (2019) Ternary hybrid neural-tree networks for highly constrained IoT applications. CoRR abs/1903.01531.
• [3] A. M. Grachev, D. I. Ignatov, and A. V. Savchenko (2017) Neural networks compression for language modeling. In Pattern Recognition and Machine Intelligence, B. U. Shankar, K. Ghosh, D. P. Mandal, S. S. Ray, D. Zhang, and S. K. Pal (Eds.), Cham, pp. 351–357. ISBN 978-3-319-69900-4.
• [4] N. Y. Hammerla, S. Halloran, and T. Ploetz (2016) Deep, convolutional, and recurrent models for human activity recognition using wearables. IJCAI 2016.
• [5] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Quantized neural networks: training neural networks with low precision weights and activations. CoRR abs/1609.07061.
• [6] J. J. Hull (1994-05) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (5), pp. 550–554. ISSN 0162-8828.
• [7] C. Jose, M. Cissé, and F. Fleuret (2017) Kronecker recurrent units. CoRR abs/1705.10142.
• [8] O. Kuchaiev and B. Ginsburg (2017) Factorization tricks for LSTM networks. CoRR abs/1703.10722.
• [9] A. Kusupati, M. Singh, K. Bhatia, A. Kumar, P. Jain, and M. Varma (2019) FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. CoRR abs/1901.02358.
• [10] A. J. Laub (2005) Matrix analysis for scientists and engineers. Vol. 91, SIAM.
• [11] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998-11) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. ISSN 0018-9219.
• [12] J. Nagy (2010) Introduction to Kronecker products. http://www.mathcs.emory.edu/~nagy/courses/fall10/515/KroneckerIntro.pdf. Accessed: 2019-05-20.
• [13] D. Roggen, A. Calatroni, M. Rossi, T. Holleczek, K. Förster, G. Tröster, P. Lukowicz, D. Bannach, G. Pirkl, A. Ferscha, J. Doppler, C. Holzmann, M. Kurz, G. Holl, R. Chavarriaga, H. Sagha, H. Bayati, M. Creatura, and J. d. R. Millàn (2010-06) Collecting complex activity datasets in highly rich networked sensor environments. In 2010 Seventh International Conference on Networked Sensing Systems (INSS), pp. 233–240.
• [14] U. Thakker, J. G. Beu, D. Gope, G. Dasika, and M. Mattina (2019a) Run-time efficient RNN compression for inference on edge devices. CoRR abs/1906.04886.
• [15] U. Thakker, G. Dasika, J. G. Beu, and M. Mattina (2019b) Measuring scheduling efficiency of RNNs for NLP applications. CoRR abs/1904.03302.
• [16] A. Thomas, A. Gu, T. Dao, A. Rudra, and C. Ré (2018) Learning compressed transforms with low displacement rank. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9066–9078.
• [17] L. N. Trefethen and D. Bau (1997) Numerical linear algebra. SIAM: Society for Industrial and Applied Mathematics.
• [18] P. Warden (2018) Speech commands: A dataset for limited-vocabulary speech recognition. CoRR abs/1804.03209.
• [19] Y. Zhang, N. Suda, L. Lai, and V. Chandra (2017) Hello edge: keyword spotting on microcontrollers. CoRR abs/1711.07128.
• [20] S. Zhou and J. Wu (2015) Compression of fully-connected layer in neural network by Kronecker product. CoRR abs/1507.05775.
• [21] M. Zhu and S. Gupta (2017-10) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv e-prints, arXiv:1710.01878.