# Weight-Importance Sparse Training in Keyword Spotting

###### Abstract

Large models are used in recent ASR systems to deal with complex speech recognition problems. The number of parameters in these models makes them hard to deploy, especially on resource-constrained devices such as car tablets. In addition, ASR systems are most often used for real-time problems such as keyword spotting (KWS), which conflicts with the long computation time that large models require.

To deal with this problem, we apply several sparse algorithms to reduce the number of parameters in a widely used model, Deep Neural Network (DNN) KWS, which requires very short computation time. We can prune more than 90%, or even 95%, of the parameters in the model with only a tiny decline in performance. The sparse model also performs better than baseline models with the same order of number of parameters. In addition, the sparse algorithm can lead us to a rational model size for a given problem automatically, without having to choose an original model size carefully.

Sihao Xue, Zhenyi Ying, Fan Mo, Min Wang, Jue Sun

NIO Co., Ltd

{sihao.xue, zhenyi.ying, fan.mo, min.wang, jue.sun}@nio.com

Index Terms— Speech Recognition, Sparse Model, Keyword Spotting

## 1 Introduction

In recent years, great progress has been achieved in speech recognition. The main reason is the use of large neural networks trained on large-scale datasets. People usually design networks with a large number of parameters to build a recognition model. However, this requires massive computation and memory capacity, which are limited on machines without strong computation ability or enough storage space, such as car tablets.

Most of the time, people use ASR systems to deal with real-time problems, where long computation time is unacceptable. This conflicts with the fact that a large neural network usually requires more time to run. Small-footprint keyword spotting (KWS) in particular requires small storage space, short computation time and low CPU usage.

Besides these, how to design an appropriate neural network architecture is a classic question in deep learning. The usual practice is to select an empirical architecture and adjust its parameters. This method usually leads to a network that is bigger than the true optimum, which wastes resources and invites overfitting.

Motivated by this, we apply sparse models to our KWS model. We implement several algorithms to prune the model. The final model performs similarly to the original one but has a thinner structure and lower computational complexity.

## 2 Related Work

There have been several studies on reducing network size by pruning the model. Li et al. [1] use a sparse shrink model to prune a CNN model. Han et al. [2][3] prune small-weight connections in a CNN model. Narang et al. [4][5] apply sparsity algorithms to prune RNNs. Hassibi et al. [6] and Le Cun et al. [7] use Hessian-based approaches to prune weights.

There is also a body of literature on KWS. Offline Large Vocabulary Continuous Speech Recognition (LVCSR) systems can be used for detecting the keywords of interest [8][9]. Hidden Markov Models (HMM) are commonly used for online KWS systems [10][11]. Traditionally, Gaussian Mixture Models (GMM) were used for acoustic modeling under the HMM framework. Over time they have been replaced by Deep Neural Networks (DNN) [12], and several architectures have been applied [13][14].

## 3 Keyword Spotting

KWS is the entrance to ASR. It provides the interactive intention for the subsequent recognition problem. In general, KWS runs on local devices and processes voice data collected by the microphone, so short delay and low memory usage are required to ensure a good user experience and acceptable resource consumption. Early KWS was based on offline continuous speech recognition with GMM-HMM [10][11]. With the great success of Deep Neural Networks in continuous speech recognition, the traditional GMM-HMM has been replaced by DNN [12]. Recently, Chen et al. [15] designed a KWS strategy without HMM.

In our research, we use a finite state transducer (FST) to realize KWS with word units. An FST consists of a finite number of states. States are connected by transitions labeled with input/output pairs, and the state transition depends on the input and the transition rules. Take “happy” as an example. First, look up “happy” in the dictionary to get its phone units, which are “HH AE1 P IY0”. Then find its tri-phones, such as h-ay1-p, which may occur in the voice data, and cluster them to generate each state. During clustering, we treat the tri-phones whose central phone is “HH” or “AE1” as the first word, and those whose central phone is “P” or “IY0” as the second word. The “happy” FST is shown in Figure 1. Each expression is an input/output pair, and each arrow denotes a state transition. The device wakes up when the output equals 1, so it wakes up if and only if “happy” occurs.
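The wake-up logic described above can be sketched as a small transducer. Figure 1 is not reproduced here, so the state layout, the label IDs (`FILLER`, `WORD1`, `WORD2`) and the function name `run_fst` are assumptions made for illustration, not the authors' exact FST.

```python
# Hypothetical frame-level labels (assumed, not taken from the paper).
FILLER, WORD1, WORD2 = 0, 1, 2

# TRANSITIONS[state][input_label] = (next_state, output)
# State 0: waiting for the keyword.
# State 1: inside the first unit (tri-phones centred on "HH" or "AE1").
# State 2: inside the second unit (tri-phones centred on "P" or "IY0").
TRANSITIONS = {
    0: {FILLER: (0, 0), WORD1: (1, 0), WORD2: (0, 0)},
    1: {FILLER: (0, 0), WORD1: (1, 0), WORD2: (2, 1)},
    2: {FILLER: (0, 0), WORD1: (1, 0), WORD2: (2, 0)},
}


def run_fst(labels):
    """Return the output sequence; a 1 marks a wake-up event."""
    state, outputs = 0, []
    for label in labels:
        state, out = TRANSITIONS[state][label]
        outputs.append(out)
    return outputs
```

In this sketch the output is 1 only on the transition from the first unit into the second, so a second unit that is not preceded by the first never triggers a wake-up.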

## 4 Sparse Model Algorithm

In this section, we describe how we apply the sparse algorithms to the KWS problem.

### 4.1 Pruning Algorithm

#### 4.1.1 Pruning Based on Weight Magnitude

The simplest and most naive method is pruning the network based on weight magnitude. This algorithm assumes that small magnitude means little importance, so we can delete weights with small magnitude. There are several options for implementing it:

1. Delete a certain proportion of the remaining weights after every several iterations.

2. Delete a certain number of weights after every several iterations.

3. Delete weights whose magnitude is less than a certain threshold.

4. Delete weights whose magnitude is less than a certain threshold, but keep the proportion of deleted weights below a certain percentage.
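The four options above can be sketched with boolean masks over a weight matrix. This is a minimal illustration, not the authors' implementation; the function names and the `max_fraction` parameter are hypothetical.

```python
import numpy as np


def prune_count(w, mask, k):
    """Option 2: drop the k smallest-magnitude weights that are still alive."""
    alive = np.flatnonzero(mask)
    order = np.argsort(np.abs(w).ravel()[alive])
    new_mask = mask.copy()
    new_mask.flat[alive[order[:k]]] = False
    return new_mask


def prune_fraction(w, mask, fraction):
    """Option 1: drop a fixed fraction of the weights that are still alive."""
    alive = np.flatnonzero(mask)
    return prune_count(w, mask, int(fraction * alive.size))


def prune_threshold(w, mask, threshold, max_fraction=1.0):
    """Options 3 and 4: drop weights below a threshold, optionally capped."""
    alive = np.flatnonzero(mask)
    mags = np.abs(w).ravel()[alive]
    n_below = int(np.count_nonzero(mags < threshold))
    k = min(n_below, int(max_fraction * alive.size))
    return prune_count(w, mask, k)
```

The pruned layer is then `w * mask`. With `max_fraction=1.0` the last function behaves as option 3, and a smaller value gives the capped variant of option 4.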

Training continues after each pruning operation to let the network re-converge. Besides the pruning algorithm, the choice of learning rate after pruning influences the final model performance. The learning rate has to be set relatively large, because the model changes significantly after one pruning iteration.
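As an illustration of the prune-then-retrain cycle, the sketch below alternates magnitude pruning and gradient descent on a toy linear regression problem. The data, the 25% pruning rate, the learning rate and the number of steps are all assumptions chosen for the example and are unrelated to the KWS models in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ true_w                                   # noise-free toy targets


def loss(w):
    return float(np.mean((X @ w - y) ** 2))


def retrain(w, mask, lr=0.1, steps=200):
    """Gradient descent that keeps pruned weights at exactly zero."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w


mask = np.ones(8, dtype=bool)
w = retrain(np.zeros(8), mask)                   # train the dense model first
history = []
for _ in range(3):                               # three prune/retrain rounds
    alive = np.flatnonzero(mask)
    k = int(0.25 * alive.size)                   # drop 25% of surviving weights
    drop = alive[np.argsort(np.abs(w[alive]))[:k]]
    mask[drop] = False
    loss_after_pruning = loss(w * mask)
    w = retrain(w * mask, mask)                  # let the model re-converge
    history.append((int(mask.sum()), loss_after_pruning, loss(w)))
```

Each entry of `history` records the number of surviving weights, the loss right after pruning and the loss after retraining, which makes it easy to see how much each round of retraining recovers.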

This algorithm is in fact partly similar to regularization: by eliminating some weight parameters, it reduces the value of the regularization term.

But the assumption behind this algorithm may cause problems. It is not convincing that smaller magnitude means less importance. A weight may have small magnitude because its input from the previous layer is large. So this simple algorithm may damage the neural network.

#### 4.1.2 Pruning Based on Affine Transformation Value

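For each node, the input is the affine transformation of the outputs of the previous layer's nodes:

$$ z_i = \sum_j w_{ij} x_j + b_i \tag{1} $$

where $x_j$ is the output of node $j$ in the previous layer, $w_{ij}$ is the weight connecting it to node $i$, and $b_i$ is the bias.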

Compared with inputs of large magnitude, inputs of small magnitude can be neglected, which gives an obvious way of pruning the network. The computing process is shown in Figure 3. This algorithm has the same implementation options as pruning based on weight magnitude (Section 4.1.1); the difference is that it is based on the affine transformation value. It also requires continued training after each pruning operation to let the network re-converge.
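One way to make this concrete is to score each connection by the average magnitude of the term it contributes to the affine transformation over a batch of frames. The exact scoring rule is not recoverable from the text, so the mean absolute value used here, and the function name `prune_by_contribution`, are assumptions.

```python
import numpy as np


def prune_by_contribution(w, x, fraction):
    """Mask out the connections with the smallest mean |w_ij * x_j|.

    w: weights of shape (out_nodes, in_nodes)
    x: previous-layer outputs of shape (frames, in_nodes)
    """
    # mean over frames of |w_ij * x_j| equals |w_ij| * mean_k |x_j|
    scores = np.abs(w) * np.abs(x).mean(axis=0)
    k = int(fraction * w.size)
    drop = np.argsort(scores, axis=None)[:k]
    mask = np.ones(w.shape, dtype=bool)
    mask.flat[drop] = False
    return mask
```

For example, with `w = [[1.0, 0.1], [0.5, 0.2]]` and inputs whose second dimension is a hundred times larger than the first, this rule removes the two weights in the first column, whereas pure magnitude pruning would remove the two smaller weights in the second column.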

#### 4.1.3 Pruning Based on Dictionary Algorithm

The dictionary algorithm aims to keep only the important weights in the network. It assumes that the importance of a weight is its statistical contribution to the whole network. If the input dataset is $D$, it has $N$ frames. The importance $I$ of $w$ is:

(2) |

where $k$ is the index of the input data.

After pruning the less important weights, the dictionary algorithm revises the network by using the important weights to account for the value of the unimportant weights. For ease of computation, it revises the network based only on the unimportant weight that has the largest importance among all unimportant weights of a node. For example, for a node $i$, the importances of its weights are $I_{ij}$, where $j$ is the index of the previous-layer node. These values are divided into two sets: the important set $\{I^{im}_{i1}, I^{im}_{i2}, \dots\}$ and the unimportant set $\{I^{unim}_{i1}, I^{unim}_{i2}, \dots\}$, where $im$ means important and $unim$ means unimportant. Then find the unimportant weight with maximum importance:

$$ I^{unim}_{i,\max} = \max_j I^{unim}_{ij} \tag{3} $$

Then revise the other weights according to their importance:

(4) |

where the revision value is assigned equally to every important weight. The computing process is shown in Figure 4.
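The selection step of the dictionary algorithm can be sketched as follows: for each node, split the incoming weights into an important and an unimportant set, then find the largest importance among the unimportant ones. The importance matrix is taken as given, the `keep` parameter and the function name are hypothetical, and the revision step is left out because its formula is not recoverable from the text.

```python
import numpy as np


def split_by_importance(importance, keep):
    """Keep the `keep` most important incoming weights of each node (row).

    Returns the boolean mask of important weights and, per node, the largest
    importance found among the unimportant weights.
    """
    order = np.argsort(-importance, axis=1)        # descending importance
    mask = np.zeros(importance.shape, dtype=bool)
    np.put_along_axis(mask, order[:, :keep], True, axis=1)
    unimportant = np.where(mask, -np.inf, importance)
    return mask, unimportant.max(axis=1)
```

If `keep` equals the number of incoming weights, no weight is unimportant and the second return value is `-inf` for every node.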

#### 4.1.4 Pruning Based on Optimal Brain Damage

This algorithm was proposed by Le Cun et al. [7]. Its main idea is based on the second-order derivative of the loss. It introduces the assumption that the change in loss $\delta E$ caused by deleting several parameters is the sum of the changes caused by deleting each parameter individually. This assumption reduces the computation notably, because computing the full Hessian matrix and its inverse requires large resources.
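Under the diagonal approximation used by Optimal Brain Damage, the saliency of parameter $w_k$ is $s_k = h_{kk} w_k^2 / 2$, where $h_{kk}$ is the corresponding diagonal entry of the Hessian. The sketch below assumes the diagonal has already been computed; the function names are illustrative.

```python
import numpy as np


def obd_saliency(w, hessian_diag):
    """Saliency s_k = h_kk * w_k^2 / 2 under the diagonal approximation."""
    return 0.5 * hessian_diag * w ** 2


def obd_prune(w, hessian_diag, k):
    """Return a copy of w with the k lowest-saliency parameters set to zero."""
    drop = np.argsort(obd_saliency(w, hessian_diag))[:k]
    pruned = w.copy()
    pruned[drop] = 0.0
    return pruned
```

Note that the lowest saliency does not have to belong to the smallest weight: a large weight in a flat direction of the loss (small $h_{kk}$) can be cheaper to delete than a small weight in a steep direction.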

#### 4.1.5 Pruning Based on Optimal Brain Surgeon

This algorithm was proposed by Hassibi et al. [6]. The main difference between Optimal Brain Surgeon (OBS) and Optimal Brain Damage (OBD) is that OBS uses the whole Hessian matrix to prune the network. It requires more resources to compute and may not be suitable for today's big models.

It is noteworthy that the methods of Sections 4.1.2, 4.1.3, 4.1.4 and 4.1.5 rely on training data: 4.1.2 and 4.1.3 require the affine transformation value, while 4.1.4 and 4.1.5 require the loss value. It is therefore important to pay attention to the distribution of the input data. If the training data is biased, the pruning decision will be biased.
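For reference, one OBS step on a small problem can be sketched as below. With the full inverse Hessian $H^{-1}$, the saliency of parameter $q$ is $w_q^2 / (2 [H^{-1}]_{qq})$ and the remaining weights are adjusted by $\delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1} e_q$. The explicit matrix inverse is only practical for tiny examples, and the function name is illustrative.

```python
import numpy as np


def obs_step(w, hessian):
    """Delete the lowest-saliency weight and adjust the remaining ones."""
    h_inv = np.linalg.inv(hessian)
    saliency = w ** 2 / (2.0 * np.diag(h_inv))
    q = int(np.argmin(saliency))
    # The update drives w[q] to zero and compensates through the other weights.
    delta = -(w[q] / h_inv[q, q]) * h_inv[:, q]
    return q, w + delta
```

Unlike OBD, the surviving weights change after the deletion, which is what the off-diagonal Hessian terms buy at the cost of computing and storing $H^{-1}$.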

### 4.2 Matrix Decomposition

After pruning the network, the question is how to implement the sparse network in practice. The easiest idea is to set the pruned weights to zero, but this reduces neither storage nor computation. Another idea is to design a map that connects the related nodes, but this may be difficult to realize and is relatively error-prone. We use a compromise method, matrix decomposition, to deal with this awkward situation.

If we have two matrices of sizes $m \times r$ and $r \times n$ (with $r(m + n)$ parameters in total), their product has size $m \times n$ (with $mn$ parameters). When we use the $m \times r$ and $r \times n$ matrices to replace an $m \times n$ matrix, the storage and computation requirements are reduced if $r$ is relatively small.
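The saving can be checked with simple arithmetic. In the sketch below an $m \times n$ matrix is replaced by an $m \times r$ and an $r \times n$ factor; the letters $m$ and $n$, the function names and the 512-by-512 example size are assumptions for illustration, since the layer sizes are not stated here.

```python
def dense_params(m, n):
    """Parameters in a full m-by-n matrix."""
    return m * n


def factored_params(m, n, r):
    """Parameters in an m-by-r factor plus an r-by-n factor."""
    return r * (m + n)


def max_useful_rank(m, n):
    """Largest r for which the two factors hold strictly fewer parameters."""
    return (m * n - 1) // (m + n)


saving = dense_params(512, 512) / factored_params(512, 512, 32)
```

For a hypothetical 512-by-512 layer and $r = 32$, the factors hold 32,768 parameters instead of 262,144, and any rank of 256 or more no longer saves anything.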

To decompose a matrix, we design two networks, shown in Figure 5.

Network (1) takes the original matrix as input and produces the two decomposed matrices as output. The loss function can be defined as:

(5) |

Network (2) uses the two decomposed matrices as the affine transform components of the network. The input is a one-hot vector and the output is the corresponding column of the original matrix.

(6) |

The index of the 1 in the one-hot input corresponds to the column index of the destination matrix. Network (1) requires large computation resources because of the large size of its input and output.
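A minimal sketch of the Network (2) idea is given below: two factor matrices are trained so that a one-hot input reproduces the matching column of the original matrix. The squared-error loss, the plain per-column gradient descent, the hyperparameters and the names `A`, `B`, `W` and `fit_factors` are all assumptions, not the authors' setup.

```python
import numpy as np


def fit_factors(W, rank, lr=0.02, epochs=2000, seed=0):
    """Fit A (m x rank) and B (rank x n) so that A @ B approximates W."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    A = 0.1 * rng.normal(size=(m, rank))
    B = 0.1 * rng.normal(size=(rank, n))
    for _ in range(epochs):
        for i in range(n):
            one_hot = np.zeros(n)
            one_hot[i] = 1.0
            hidden = B @ one_hot            # selects column i of B
            err = A @ hidden - W[:, i]      # error on column i of W
            grad_A = np.outer(err, hidden)  # gradients use the current A and B
            grad_B_col = A.T @ err
            A -= lr * grad_A
            B[:, i] -= lr * grad_B_col
    return A, B


def reconstruction_loss(W, A, B):
    """Half the squared Frobenius error between A @ B and W."""
    return 0.5 * float(np.sum((A @ B - W) ** 2))
```

Because each training example touches only one column of `B`, the input and output of this network stay small, in contrast to a network that has to take the whole matrix as input.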

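$$ \mathrm{INPARNUM} = \mathrm{NR} \times \mathrm{NL} \tag{7} $$

$$ \mathrm{OUTPARNUM} = \mathrm{RANK} \times (\mathrm{NR} + \mathrm{NL}) \tag{8} $$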

where INPARNUM, OUTPARNUM, NR, NL and RANK are the input dimension, the output dimension, the number of rows of the original matrix, the number of columns of the original matrix and the destination rank, respectively. Network (1) needs huge resources to compute, so we decided to implement Network (2).

In fact, this operation is similar to a bottleneck layer, and the sparse operation is equivalent to a pre-training process.

## 5 Experiment

We apply our sparse algorithms to the KWS problem. We use the Google Speech Commands dataset [16] as training and test data. We choose “happy” (1742 utterances) as the keyword and use 23 hours of environment data to test false alarms (FA). To get a robust model, we mix noise into the original voice data. We use the Kaldi toolkit [17] to train each model in our experiments.

We use 3-layer and 4-layer DNNs as baseline models. Each model has an input dimension of 858 and an output dimension of 3.

### 5.1 Sparse Algorithm

We apply sparse algorithms 4.1.1, 4.1.2 and 4.1.3 to each baseline model of each keyword.

Then we use the environment and test datasets to evaluate the True Alarm (TA) and False Alarm (FA) of each model. The ROC curves are shown in Figure 6.
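A threshold sweep of this kind can be sketched as follows. The sketch assumes one detection score per keyword utterance and per background segment, and reports raw false-alarm counts; how FA is normalized in the experiments is not stated, so dividing by hours of background audio would be a further assumption.

```python
import numpy as np


def roc_points(keyword_scores, background_scores, thresholds):
    """Return (false_alarm_count, recall) for each detection threshold."""
    keyword_scores = np.asarray(keyword_scores)
    background_scores = np.asarray(background_scores)
    points = []
    for t in thresholds:
        recall = float(np.mean(keyword_scores >= t))            # true alarms
        false_alarms = int(np.sum(background_scores >= t))      # false alarms
        points.append((false_alarms, recall))
    return points
```

Raising the threshold trades recall for fewer false alarms, which is the trade-off an ROC curve plots.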

From (1)-(6) in Figure 6, the performance of the sparse models does not decline much. In some situations the sparse model even performs better, especially at large parameter remain rates. All sparse models perform much better than a fully connected model with the same total number of parameters. The reason may be the difference in topology between a sparse model and a fully connected network: although the sparse model has few parameters, its number of nodes is still large. This property preserves more information and emphasizes the important parts.

### 5.2 Matrix Decomposition

We decomposed the matrices in the affine components of the “happy” A1 model. The performance declines markedly: its FA is almost 10 when the recall rate is 0.93.

We assume two main reasons for this result. As mentioned before, we use two smaller matrices to replace each sparse matrix. We calculated the rank of the sparse matrices, and most of them are full rank, but the rank of the product of the two replacement matrices cannot be greater than $r$. So this operation introduces loss into the model: the lower the rank, the higher the loss. The other reason is insufficient training data. The model has tens of thousands of parameters, and the training data may be insufficient to train them. We tested the frame accuracy of each model and it does not decrease much, so the overfitting is serious. The frame accuracy of each model is shown in Table 1.
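The rank argument can be illustrated numerically. The sketch below uses a truncated SVD, which is a stand-in technique and not the network-based decomposition used in this paper, to build the best rank-$r$ factors of a random full-rank matrix and measure what is lost. The matrix size and the names `A`, `B` and `W` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))                   # stand-in for a full-rank weight matrix
r = 2

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]                          # 8 x r factor
B = Vt[:r, :]                                 # r x 6 factor

rank_W = np.linalg.matrix_rank(W)
rank_AB = np.linalg.matrix_rank(A @ B)
error = np.linalg.norm(W - A @ B)             # Frobenius norm of the residual
best_possible = np.sqrt(np.sum(s[r:] ** 2))   # energy in the discarded singular values
```

Here `rank_AB` cannot exceed `r`, and even this optimal rank-`r` approximation leaves a residual equal to the energy of the discarded singular values, so a lower rank always means a larger unavoidable error for a full-rank matrix.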

Model | Frame Acc. |
---|---|

Then we repeat this experiment on our own keyword. We use more than 50K utterances to train the model and test the result.

It performs clearly better than “happy”, even though the decomposed sparse models were not trained to re-convergence because of the long computation time. The performance is shown in Figure 7. Model sizes are shown in Table 2.

Model | Model Size |
---|---|

When the remain rate is high enough, DCS performs relatively well. The model size is much smaller than that of the original baseline model. It is worth considering when the model has to be deployed on a resource-constrained device such as a car tablet.

But when the remain rate is lower, DCS performs really badly. Besides the lack of re-convergence, the bottleneck topology causes a big loss in the network.

## 6 Conclusion

We have applied sparse models to the KWS problem, and the performance is acceptable. The result of the sparse model can also be used to guide model selection. But how to deploy it in a real project is the next challenge. Matrix decomposition may introduce a big loss into the model, so its application scenarios are restricted. In the future, we will try to find a method to realize sparse models in general ASR systems. We will also do more research on applying sparse algorithms to other models such as CNN and LSTM.

## 7 References

[1] Xin Li and Changsong Liu, “Prune the Convolutional Neural Networks With Sparse Shrink”, arXiv: 1708.02439.

[2] Song Han, Jeff Pool, John Tran and William J. Dally, “Learning Both Weights and Connections for Efficient Neural Networks”, Advances in Neural Information Processing Systems (NIPS), December 2015.

[3] Song Han, Huizi Mao and William J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, International Conference on Learning Representations (ICLR), May 2016.

[4] Sharan Narang, Erich Elsen, Greg Diamos and Shubho Sengupta, “Exploring Sparsity in Recurrent Neural Networks”, arXiv: 1704.05119.

[5] Sharan Narang, Eric Undersander and Gregory Diamos, “Block-Sparse Recurrent Neural Networks”, arXiv: 1711.02782.

[6] Babak Hassibi and David G. Stork, “Second Order Derivatives for Network Pruning: Optimal Brain Surgeon”, Morgan Kaufmann, 1993.

[7] Yann Le Cun, John S. Denker and Sara A. Solla, “Optimal Brain Damage”.

[8] David RH Miller, Michael Kleber, Chia-Lin Kao, Owen Kimball, Thomas Colthurst, Stephen A Lowe, Richard M Schwartz and Herbert Gish, “Rapid and Accurate Spoken Term Detection”, in Eighth Annual Conference of the International Speech Communication Association, 2007.

[9] Siddika Parlak and Murat Saraclar, “Spoken Term Detection for Turkish Broadcast News”, in Acoustics, Speech and Signal Processing, 2008, pp. 5244-5247.

[10] Richard C Rose and Douglas B Paul, “A Hidden Markov Model Based Keyword Recognition System”, in Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on. IEEE, 1990, pp. 129-132.

[11] Jay G Wilpon, Lawrence R Rabiner, C-H Lee, and ER Goldman, “Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Model”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 11, pp. 1870-1878, 1990.

[12] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups”, IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.

[13] Sankaran Panchapagesan, Ming Sun, Aparna Khare, Spyros Matsoukas, Arindam Mandal, Björn Hoffmeister and Shiv Vitaladevuni, “Multi-task Learning and Weighted Cross-entropy for DNN-based Keyword Spotting”, Interspeech 2016, pp.760-764, 2016.

[14] Ming Sun, David Snyder, Yixin Gao, Varun Nagaraja, Mike Rodehorst, Sankaran Panchapagesan, Nikko Strom, Spyros Matsoukas and Shiv Vitaladevuni, “Compressed Time Delay Neural Network for Small Footprint Keyword Spotting”, Proc. Interspeech 2017, pp. 3607-3611, 2017.

[15] Guoguo Chen, Sanjeev Khudanpur, Daniel Povey, Jan Trmal, David Yarowsky and Oguz Yilmaz, “Quantifying the Value of Pronunciation Lexicons for Keyword Search In Low Resource Languages”, IEEE international Conference on Acoustics, 2013.

[16] Pete Warden, “Speech Commands: A Data Set for Limited-Vocabulary Speech Recognition”, arXiv: 1804.03209.

[17] Daniel Povey et al., “The Kaldi Speech Recognition Toolkit”, in IEEE ASRU. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.