# Trace norm regularization and faster inference for embedded speech recognition RNNs

## Abstract

We propose and evaluate new techniques for compressing and speeding up dense matrix multiplications as found in the fully connected and recurrent layers of neural networks for embedded large vocabulary continuous speech recognition (LVCSR). For compression, we introduce and study a trace norm regularization technique for training low rank factored versions of matrix multiplications. Compared to standard low rank training, we show that our method more consistently leads to good accuracy versus number of parameter trade-offs and can be used to speed up training of large models. For speedup, we enable faster inference on ARM processors through new open sourced kernels optimized for small batch sizes, resulting in 3x to 7x speed ups over the widely used gemmlowp library. Beyond LVCSR, we expect our techniques and kernels to be more generally applicable to embedded neural networks with large fully connected or recurrent layers.

## 1 Introduction

For embedded applications of machine learning, we seek models that are as accurate as possible given constraints on size and on latency at inference time. For many neural networks, the parameters and computation are concentrated in two basic building blocks:

- **Convolutions.** These tend to dominate in, for example, image processing applications.
- **Dense matrix multiplications (GEMMs)**, as found, for example, inside fully connected layers or recurrent layers such as GRU and LSTM. These are common in speech and natural language processing applications.

These two building blocks are the natural targets for efforts to reduce parameters and speed up models for embedded applications. Much work on this topic already exists in the literature. For a brief overview, see Section 2.

In this paper, we focus only on dense matrix multiplications and not on convolutions. Our two main contributions are:

- **Trace norm regularization:** We describe a trace norm regularization technique and an accompanying training methodology that enables the practical training of models with competitive accuracy versus number of parameter trade-offs. It automatically selects the rank and eliminates the need for any prior knowledge on suitable matrix rank.
- **Efficient kernels for inference:** We explore the importance of optimizing for low batch sizes in on-device inference, and we introduce kernels for ARM processors that vastly outperform publicly available kernels in the low batch size regime.^{1}

These two topics are discussed in Sections 3 and 4, respectively. Although we conducted our experiments and report results in the context of large vocabulary continuous speech recognition (LVCSR) on embedded devices, the ideas and techniques are broadly applicable to other deep learning networks. Work on compressing any neural network for which large GEMMs dominate the parameters or computation time could benefit from the insights presented in this paper.

## 2 Related work

Our work is most closely related to that of [21], where low rank factored acoustic speech models are similarly trained by warmstarting from a truncated singular value decomposition (SVD) of pretrained weight matrices. This technique was also applied to speech recognition on mobile devices [18]. We build on this method by adding a variational form of trace norm regularization that was first proposed for collaborative prediction [24] and also applied to recommender systems [13]. The use of this technique with gradient descent was recently justified theoretically [6]. Furthermore, [20] argue that trace norm regularization could provide a sensible inductive bias for neural networks. To the best of our knowledge, we are the first to combine the training technique of [21] with variational trace norm regularization.

Low rank factorization of neural network weights in general has been the subject of many other works [7]. Some other approaches for dense matrix compression include sparsity [15], hash-based parameter sharing [3], and other parameter-sharing schemes such as circulant, Toeplitz, or more generally low-displacement-rank matrices [23]. [14] explore splitting activations into independent groups. Doing so is akin to using block-diagonal matrices.

The techniques for compressing convolutional models are different and beyond the scope of this paper. We refer the interested reader to, e.g., [8] and references therein.

## 3 Training low rank models

Low rank factorization is a well studied and effective technique for compressing large matrices. In [21], low rank models are trained by first training a model with unfactored weight matrices (we refer to this as stage 1), and then warmstarting a model with factored weight matrices from the truncated SVD of the unfactored model (we refer to this as stage 2). The truncation is done by retaining only as many singular values as required to explain a specified percentage of the variance.

If the weight matrices from stage 1 had only a few nonzero singular values, then the truncated SVD used for warmstarting stage 2 would yield a much better or even error-free approximation of the stage 1 matrix. This suggests applying a sparsity-inducing penalty on the vector of singular values during stage 1 training. This is known as trace norm regularization in the literature. Unfortunately, there is no known way of directly computing the trace norm and its gradients that would be computationally feasible in the context of large deep learning models. Instead, we propose to combine the two-stage training method of [21] with an indirect variational trace norm regularization technique [24]. We describe this technique in more detail in Section 3.1 and report experimental results in Section 3.2.

### 3.1 Trace norm regularization

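First we introduce some notation. Let us denote by $\|\cdot\|_T$ the *trace norm* of a matrix, that is, the sum of the singular values of the matrix. The trace norm is also referred to as the *nuclear norm* or the *Schatten 1-norm* in the literature. Furthermore, let us denote by $\|\cdot\|_F$ the Frobenius norm of a matrix, defined as

$$\|A\|_F = \sqrt{\operatorname{Tr}(A A^*)} = \sqrt{\sum_{i,j} |A_{ij}|^2}.$$

The Frobenius norm is identical to the *Schatten 2-norm* of a matrix, i.e. the $\ell^2$ norm of the singular value vector of the matrix. The following lemma provides a variational characterization of the trace norm in terms of the Frobenius norm.

**Lemma 1** ([12, 6])**.** Let $W$ be an $m \times n$ matrix and denote by $\sigma$ its vector of singular values. Then

$$\|W\|_T := \sum_{i=1}^{\min(m,n)} \sigma_i(W) = \min \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right),$$

where the minimum is taken over all $U$ of size $m \times \min(m,n)$ and $V$ of size $\min(m,n) \times n$ such that $W = UV$. Furthermore, if $W = \tilde{U} \Sigma \tilde{V}^*$ is a singular value decomposition of $W$, then the minimum is attained for the choice $U = \tilde{U}\sqrt{\Sigma}$ and $V = \sqrt{\Sigma}\,\tilde{V}^*$.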
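The variational characterization can be checked numerically on a small example. The following is a minimal NumPy sketch, not the authors' code: the matrix sizes, the random seed, and the rescaling vector `a` are arbitrary choices made here for illustration. It compares the sum of singular values with half the sum of squared Frobenius norms for a balanced factorization built from the SVD, and for a deliberately unbalanced factorization of the same matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))           # m = 5, n = 3, so min(m, n) = 3

# Trace norm: the sum of the singular values of W.
U_tilde, s, Vh_tilde = np.linalg.svd(W, full_matrices=False)
trace_norm = s.sum()

def half_squared_frobenius(U, V):
    """0.5 * (||U||_F^2 + ||V||_F^2), the quantity minimized over W = U V."""
    return 0.5 * (np.sum(U**2) + np.sum(V**2))

# Balanced factorization built from the SVD: U = U~ sqrt(S), V = sqrt(S) V~*.
U = U_tilde * np.sqrt(s)                  # scales column j of U~ by sqrt(s_j)
V = np.sqrt(s)[:, None] * Vh_tilde        # scales row j of V~* by sqrt(s_j)
balanced = half_squared_frobenius(U, V)

# Rescaling the factors leaves the product U V unchanged but unbalances them,
# so the objective can only stay the same or grow.
a = np.array([2.0, 0.5, 3.0])
U_scaled, V_scaled = U * a, V / a[:, None]
unbalanced = half_squared_frobenius(U_scaled, V_scaled)

print(f"trace norm: {trace_norm:.6f}")
print(f"balanced:   {balanced:.6f}")
print(f"unbalanced: {unbalanced:.6f}")
```

The balanced value should agree with the trace norm up to floating point error, while the unbalanced factorization yields a strictly larger value.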

The procedure to take advantage of this characterization is as follows. First, for each large GEMM in the model, replace the $m \times n$ weight matrix $W$ by the product $W = UV$, where $U$ has size $m \times \min(m,n)$ and $V$ has size $\min(m,n) \times n$. Second, replace the original loss function $\ell(W)$ by

$$\ell(UV) + \frac{1}{2}\lambda\left(\|U\|_F^2 + \|V\|_F^2\right),$$

where $\lambda$ is a hyperparameter controlling the strength of the approximate trace norm regularization. Proposition 1 in [6] guarantees that minimizing this modified loss is equivalent to minimizing the actual trace norm regularized loss:

$$\ell(W) + \lambda \|W\|_T.$$

In Section ? we show empirically that use of the modified loss is indeed highly effective at reducing the trace norm of the weight matrices.
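As an illustration of the modified loss, here is a minimal NumPy sketch on a toy least-squares problem. The data loss, the shapes, the regularization strength, and the learning rate are assumptions made for this example only; the paper's models use a speech recognition loss rather than least squares. The point is that the penalty contributes the simple terms `lam * U` and `lam * V` to the gradients, so no SVD is needed during training.

```python
import numpy as np

def modified_loss(U, V, X, Y, lam):
    """Toy data loss 0.5 * ||X U V - Y||_F^2 plus the variational penalty."""
    residual = X @ U @ V - Y
    data_loss = 0.5 * np.sum(residual**2)
    penalty = 0.5 * lam * (np.sum(U**2) + np.sum(V**2))
    return data_loss + penalty

def modified_loss_grads(U, V, X, Y, lam):
    """Gradients of modified_loss with respect to U and V."""
    residual = X @ U @ V - Y
    grad_U = X.T @ residual @ V.T + lam * U
    grad_V = (X @ U).T @ residual + lam * V
    return grad_U, grad_V

# Hypothetical toy problem: 64 examples, an 8 x 6 weight matrix W = U V.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
Y = rng.standard_normal((64, 6))
r = min(8, 6)                             # full rank factorization in stage 1
U = 0.1 * rng.standard_normal((8, r))
V = 0.1 * rng.standard_normal((r, 6))

lam, learning_rate = 5.0, 1e-3
initial = modified_loss(U, V, X, Y, lam)
for _ in range(500):
    grad_U, grad_V = modified_loss_grads(U, V, X, Y, lam)
    U -= learning_rate * grad_U
    V -= learning_rate * grad_V
final = modified_loss(U, V, X, Y, lam)
print(f"modified loss: {initial:.3f} -> {final:.3f}")
```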

To summarize, we propose the following basic training scheme:

**Stage 1:**

- For each large GEMM in the model, replace the $m \times n$ weight matrix $W$ by the product $W = UV$, where $U$ has size $m \times r$, $V$ has size $r \times n$, and $r = \min(m,n)$.
- Replace the original loss function $\ell(W)$ by $\ell(UV) + \frac{1}{2}\lambda\left(\|U\|_F^2 + \|V\|_F^2\right)$, where $\lambda$ is a hyperparameter controlling the strength of the trace norm regularization.
- Train the model to convergence.

**Stage 2:**

- For the trained model from stage 1, recover $W = UV$ by multiplying the two trained matrices $U$ and $V$.
- Train low rank models warmstarted from the truncated SVD of $W$. By varying the number of singular values retained, we can control the parameter versus accuracy trade-off.

One modification to this is described in Section ?, where we show that it is actually not necessary to train the stage 1 model to convergence before switching to stage 2. By making the transition earlier, training time can be substantially reduced.
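The stage 2 warmstart can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions, not the authors' code: the function name `truncated_svd_warmstart` is hypothetical, and "variance explained" is taken here to mean the cumulative share of squared singular values, which the text does not spell out. The singular values are split evenly between the two factors, mirroring the balanced factorization in the variational characterization.

```python
import numpy as np

def truncated_svd_warmstart(W, variance_threshold):
    """Return low rank factors (U, V) with U @ V approximating W.

    The rank is the smallest number of singular values whose squared sum
    reaches `variance_threshold` of the total (an assumed definition).
    """
    U_tilde, s, Vh_tilde = np.linalg.svd(W, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # First index at which the explained share reaches the threshold.
    rank = int(np.searchsorted(explained, variance_threshold)) + 1
    rank = min(rank, len(s))              # guard against rounding at 1.0
    sqrt_s = np.sqrt(s[:rank])
    U = U_tilde[:, :rank] * sqrt_s        # m x rank
    V = sqrt_s[:, None] * Vh_tilde[:rank] # rank x n
    return U, V

# Usage: a 4 x 4 matrix with singular values 4, 2, 1, 1.
W = np.diag([4.0, 2.0, 1.0, 1.0])
U, V = truncated_svd_warmstart(W, variance_threshold=0.9)
print(U.shape, V.shape)                   # factor shapes at the chosen rank
print(U.size + V.size, "parameters instead of", W.size)
```

Raising the threshold retains more singular values, which is how a single stage 1 model yields several stage 2 models of different sizes.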

### 3.2 Experiments and results

We report here the results of our experiments related to trace norm regularization. Our baseline model is a forward-only Deep Speech 2 [1] model with some small modifications such as growing GRU dimensions. More details are given in Appendix B. We train and evaluate on the widely used Wall Street Journal (WSJ) speech corpus.

Since our work focuses only on compressing acoustic models and not language models, the error metric we report is the character error rate (CER) rather than word error rate (WER). As the size and latency constraints vary widely across devices, whenever possible we compare techniques by comparing their accuracy versus number of parameter trade-off curves.

#### Stage 1 experiments

In this section, we investigate the effects of training with the modified loss function $\ell(UV) + \frac{1}{2}\lambda\left(\|U\|_F^2 + \|V\|_F^2\right)$ introduced in Section 3.1. For simplicity, we refer to this as *trace norm regularization*.

As the WSJ corpus is relatively small at around 80 hours of speech, models tend to benefit substantially from regularization. To make comparisons more fair, we also trained unfactored models with an $\ell^2$ regularization term and searched the hyperparameter space just as exhaustively.

For both trace norm and $\ell^2$ regularization, we found it beneficial to introduce separate $\lambda_{rec}$ and $\lambda_{nonrec}$ parameters for determining the strength of regularization for the recurrent and non-recurrent weight matrices, respectively. In addition to $\lambda_{rec}$ and $\lambda_{nonrec}$, in initial experiments we also roughly tuned the learning rate. Since the same learning rate was found to be optimal for nearly all experiments, we used it for all the experiments reported in this section. The dependence of final CER on $\lambda_{rec}$ and $\lambda_{nonrec}$ is shown in Figure 1. Separate $\lambda_{rec}$ and $\lambda_{nonrec}$ values are seen to help for both trace norm and $\ell^2$ regularization. However, for trace norm regularization, it appears better to fix $\lambda_{rec}$ as a multiple of $\lambda_{nonrec}$ rather than tuning the two parameters independently.

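The first question we are interested in is whether our modified loss is really effective at reducing the trace norm. As we are interested in the relative concentration of singular values rather than their absolute magnitudes, we introduce the following nondimensional metric.

**Definition 1.** Let $W$ be a nonzero $m \times n$ matrix with $d = \min(m,n) \ge 2$. Denote by $\sigma$ the $d$-dimensional vector of singular values of $W$. Then we define the *nondimensional trace norm coefficient* of $W$ as

$$\nu(W) := \frac{\|\sigma\|_{\ell^1} / \|\sigma\|_{\ell^2} - 1}{\sqrt{d} - 1}.$$

We show in Appendix A that $\nu$ is scale-invariant and ranges from 0 for rank 1 matrices to 1 for maximal-rank matrices with all singular values equal. Intuitively, the smaller $\nu(W)$, the better $W$ can be approximated by a low rank matrix.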
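A short NumPy sketch of this metric is given below. It assumes the coefficient takes the form $(\|\sigma\|_{\ell^1} / \|\sigma\|_{\ell^2} - 1) / (\sqrt{d} - 1)$, which is consistent with the properties stated above (0 for rank 1, 1 for equal singular values, scale invariance); the function name is a hypothetical one chosen for this example.

```python
import numpy as np

def nondimensional_trace_norm_coefficient(W):
    """(||s||_1 / ||s||_2 - 1) / (sqrt(d) - 1) for the singular values s of W."""
    d = min(W.shape)
    s = np.linalg.svd(W, compute_uv=False)
    if d < 2 or not np.any(s):
        raise ValueError("W must be nonzero with min(m, n) >= 2")
    ratio = s.sum() / np.linalg.norm(s)
    return (ratio - 1.0) / (np.sqrt(d) - 1.0)

rng = np.random.default_rng(0)
rank_one = np.outer(rng.standard_normal(6), rng.standard_normal(4))
equal_singular_values = np.eye(4)
generic = rng.standard_normal((6, 4))

nu_rank_one = nondimensional_trace_norm_coefficient(rank_one)
nu_equal = nondimensional_trace_norm_coefficient(equal_singular_values)
nu_generic = nondimensional_trace_norm_coefficient(generic)
nu_generic_scaled = nondimensional_trace_norm_coefficient(10.0 * generic)
print(nu_rank_one, nu_equal, nu_generic, nu_generic_scaled)
```

The rank 1 and identity matrices land at the two ends of the range, and rescaling a matrix leaves the coefficient unchanged.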

As shown in Figure 2, trace norm regularization is indeed highly effective at reducing the nondimensional trace norm coefficient compared to $\ell^2$ regularization. At very high regularization strengths, $\ell^2$ regularization also leads to small $\nu$ values. However, from Figure 1 it is apparent that this comes at the expense of relatively high CERs. As shown in Figure 3, this translates into requiring a much lower rank for the truncated SVD to explain, say, 90% of the variance of the weight matrix for a given CER. Although a few $\ell^2$-regularized models occasionally achieve low rank, we observe this only at relatively high CERs and only for some of the weights.

#### Stage 2 experiments

In this section, we report the results of stage 2 experiments warmstarted from either trace norm or $\ell^2$ regularized stage 1 models.

For each regularization type, we took the three best stage 1 models (in terms of final CER) and used the truncated SVD of their weights to initialize the weights of stage 2 models. By varying the threshold of variance explained for the SVD truncation, each stage 1 model resulted in multiple stage 2 models. The stage 2 models were trained without regularization (i.e., $\lambda_{rec} = \lambda_{nonrec} = 0$) and with the initial learning rate set to three times the final learning rate of the stage 1 model.

As shown in Figure 4, the best models from either trace norm or $\ell^2$ regularization exhibit similar accuracy versus number of parameter trade-offs, but the trace norm models more consistently achieve better trade-off points. For comparison, we also warmstarted some stage 2 models from an unregularized stage 1 model. These models are seen to have significantly lower accuracies, accentuating the need for regularization on the WSJ corpus.

#### Reducing training time

In the previous sections, we trained the stage 1 models for 40 epochs to full convergence and then trained the stage 2 models for another 40 epochs, again to full convergence. Since the stage 2 models are drastically smaller than the stage 1 models, it takes less time to train them. Hence, shifting the stage 1 to stage 2 transition point to an earlier epoch could substantially reduce training time. In this section, we show that it is indeed possible to do so without hurting final accuracy.

Specifically, we took the stage 1 trace norm and $\ell^2$ models from Section ? that resulted in the best stage 2 models in Section ?. In that section, we were interested in the parameters versus accuracy trade-off and used each stage 1 model to warmstart a number of stage 2 models of different sizes. In this section, we instead set a fixed target of 3 M parameters and a fixed overall training budget of 80 epochs but vary the stage 1 to stage 2 transition epoch. For each of the stage 2 runs, we initialize the learning rate with the learning rate of the stage 1 model at the transition epoch, so the learning rate follows the same schedule as if we had trained a single model for 80 epochs. As before, we disable all regularization for stage 2.

The $\ell^2$ stage 1 model has 21.7 M parameters, whereas the trace norm stage 1 model at 29.8 M parameters is slightly larger due to the factorization. Since the stage 2 models have roughly 3 M parameters and the training time is approximately proportional to the number of parameters, stage 2 models train about 7x and 10x faster, respectively, than the $\ell^2$ and trace norm stage 1 models. Consequently, large overall training time reductions can be achieved by reducing the number of epochs spent in stage 1 for both $\ell^2$ and trace norm.
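The arithmetic behind these speedups can be made explicit. The sketch below relies only on the stated approximation that training time is proportional to the number of parameters; the helper names and the choice of "one stage 1 epoch" as the cost unit are assumptions made for this illustration.

```python
def stage2_speedup(stage1_params_m, stage2_params_m):
    """Per-epoch speedup of stage 2 over stage 1, assuming time ~ parameters."""
    return stage1_params_m / stage2_params_m

def training_cost(stage1_params_m, stage2_params_m, transition_epoch, total_epochs=80):
    """Total cost of a run, measured in units of one stage 1 epoch."""
    stage2_epochs = total_epochs - transition_epoch
    return transition_epoch + stage2_epochs * stage2_params_m / stage1_params_m

# Parameter counts (in millions) quoted in the text.
for name, stage1_params in [("l2", 21.7), ("trace norm", 29.8)]:
    speedup = stage2_speedup(stage1_params, 3.0)
    late = training_cost(stage1_params, 3.0, transition_epoch=40)
    early = training_cost(stage1_params, 3.0, transition_epoch=15)
    print(f"{name}: stage 2 is {speedup:.1f}x faster per epoch; "
          f"cost {late:.1f} -> {early:.1f} stage 1 epochs when moving the transition from 40 to 15")
```

Under this cost model, the stage 1 epochs dominate the total, which is why moving the transition earlier matters much more than trimming stage 2.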

The results are shown in Figure 5. Based on the left panel, it is evident that we can lower the transition epoch number without hurting the final CER. In some cases, we even see marginal CER improvements. For transition epochs of at least 15, we also see slightly better results for trace norm than for $\ell^2$. In the right panel, we plot the convergence of CER when the transition epoch is 15. We find that the trace norm model's CER is barely impacted by the transition, whereas the $\ell^2$ models see a huge jump in CER at the transition epoch. Furthermore, the plot suggests that a total of 60 epochs may have sufficed. However, the savings from reducing stage 2 epochs are negligible compared to the savings from reducing the transition epoch.

## 4 Application to production-grade embedded speech recognition

With low rank factorization techniques similar^{2}

Model | Parameters (M) | WER | % Relative |
---|---|---|---|
baseline | 115.5 | 8.78 | 0.0 |
tier-1 | 14.9 | 9.25 | -5.4 |
tier-2 | 10.9 | 9.80 | -11.6 |
tier-3* | 14.7 | 9.92 | -13.0 |

Although low rank factorization significantly reduces the overall computational complexity of our LVCSR system, we still require further optimization to achieve real-time inference on mobile or embedded devices. One approach to speeding up the network is to use low-precision 8-bit integer representations for weight matrices and matrix multiplications (the GEMM operation in BLAS terminology). This type of quantization reduces both the memory and the computation requirements of the network while introducing only a small relative increase in WER.
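To make the idea of an 8-bit GEMM concrete, here is a minimal NumPy sketch of affine quantization to unsigned 8-bit values with 32-bit integer accumulation. It is an illustration under simplifying assumptions (one scale and zero point per matrix, dequantization in floating point) and does not reproduce the API or the fixed-point arithmetic of any particular library; the function names are hypothetical.

```python
import numpy as np

def quantize_uint8(x):
    """Affine quantization: x is approximated by scale * (q - zero_point)."""
    lo = min(float(x.min()), 0.0)         # make sure 0.0 is representable
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def quantized_matmul(qa, a_scale, a_zp, qb, b_scale, b_zp):
    """Integer GEMM with int32 accumulation, rescaled back to floating point."""
    acc = (qa.astype(np.int32) - a_zp) @ (qb.astype(np.int32) - b_zp)
    return a_scale * b_scale * acc

rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 32))
activations = rng.standard_normal((32, 4))   # batch size 4

qw, w_scale, w_zp = quantize_uint8(weights)
qx, x_scale, x_zp = quantize_uint8(activations)
approx = quantized_matmul(qw, w_scale, w_zp, qx, x_scale, x_zp)
exact = weights @ activations
rel_error = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error of the 8-bit GEMM: {rel_error:.4f}")
```

Each weight occupies one byte instead of four, and the inner products run entirely on integers, which is where the memory and compute savings come from.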

To perform low precision matrix multiplications, we originally used the *gemmlowp* library, which provides state-of-the-art low precision GEMMs using unsigned 8-bit integer values [11]. However, gemmlowp's approach is not efficient for small batch sizes. Our application, LVCSR on embedded devices with a single user, is dominated by low batch size GEMMs due to the sequential nature of recurrent layers and latency constraints. This can be demonstrated by looking at a simple RNN cell, which has the form:

$$h_t = f(W x_t + U h_{t-1}).$$

This cell contains two main GEMMs. The first, $U h_{t-1}$, is sequential and requires a GEMM with batch size 1. The second, $W x_t$, can in principle be performed at higher batch sizes by batching across time. However, choosing too large a batch size can significantly delay the output, as the system needs to wait for more future context. In practice, we found that batch sizes higher than around 4 resulted in too high latencies, negatively impacting user experience.
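The difference between the two GEMMs can be seen in a small NumPy sketch of a simple RNN cell. The layer sizes, the random weights, and the choice of `tanh` as the nonlinearity are assumptions made for this illustration. The non-recurrent product is computed for four time steps in a single GEMM, while the recurrent product has to be computed one step at a time because each step depends on the previous hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, features, steps = 8, 5, 4
W = rng.standard_normal((hidden, features))      # non-recurrent weights
U = 0.1 * rng.standard_normal((hidden, hidden))  # recurrent weights
x = rng.standard_normal((features, steps))       # one column per time step

# Non-recurrent part: one GEMM with batch size `steps` (batched across time).
Wx = W @ x

# Recurrent part: inherently sequential, one batch-size-1 GEMM per step.
h = np.zeros(hidden)
outputs = []
for t in range(steps):
    h = np.tanh(Wx[:, t] + U @ h)
    outputs.append(h)
batched = np.stack(outputs, axis=1)              # hidden x steps
print(batched.shape)
```

Batching across time requires waiting until all `steps` input frames have arrived, which is the latency cost described above.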

This motivated us to implement custom assembly kernels for the 64-bit ARM architecture (AArch64, also known as ARMv8 or ARM64) to further improve the performance of the GEMM operations. We do not go through the methodological details in this paper. Instead, we are making the kernels and implementation details available at https://github.com/paddlepaddle/farm.

Figure ? compares the performance of our implementation (denoted by *farm*) with the gemmlowp library for matrix multiplication on iPhone 7, iPhone 6, and Raspberry Pi 3 Model B. The farm kernels are significantly faster than their gemmlowp counterparts for batch sizes 1 to 4. The peak single-core theoretical performance for iPhone 7, iPhone 6, and Raspberry Pi 3 are , and Giga Operations per Second, respectively. The gap between the theoretical and achieved values is mostly due to the kernels being limited by memory bandwidth. For a more detailed analysis, we refer to the farm website.

In addition to low precision representation and customized ARM kernels, we explored other approaches to speed up our LVCSR system. These techniques are described in Appendix B.

Finally, by combining low rank factorization, some techniques from Appendix B, int8 quantization and the farm kernels, as well as using smaller language models, we could create a range of speech recognition models suitably tailored to various devices. These are shown in Table 2.

Device | Acoustic model | Language model size (MB) | WER | % Relative | Speedup over real-time | % time spent in acoustic model |
---|---|---|---|---|---|---|
GPU server | baseline | 13764 | 8.78 | 0.0 | 10.39x | 70.8 |
iPhone 7 | tier-1 | 56 | 10.50 | -19.6 | 2.21x | 65.2 |
iPhone 6 | tier-2 | 32 | 11.19 | -27.4 | 1.13x | 75.5 |
Raspberry Pi 3 | tier-3 | 14 | 12.08 | -37.6 | 1.08x | 86.3 |
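The "% Relative" column appears to be the WER change relative to the baseline, with a negative sign indicating a degradation. The short sketch below recomputes it from the WER values in the table under that assumed definition; the function name is hypothetical.

```python
def relative_wer_change(baseline_wer, model_wer):
    """Relative WER change in percent; negative values mean a higher WER."""
    return 100.0 * (baseline_wer - model_wer) / baseline_wer

baseline_wer = 8.78
device_wers = {"iPhone 7": 10.50, "iPhone 6": 11.19, "Raspberry Pi 3": 12.08}
for device, wer in device_wers.items():
    print(f"{device}: {relative_wer_change(baseline_wer, wer):.1f}%")
```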

## 5 Conclusion

We worked on compressing and reducing the inference latency of LVCSR speech recognition models. To better compress models, we introduced a trace norm regularization technique and demonstrated its potential for faster and more consistent training of low rank models on the WSJ speech corpus. To reduce latency at inference time, we demonstrated the importance of optimizing for low batch sizes and released optimized kernels for the ARM64 platform. Finally, by combining the various techniques in this paper, we demonstrated an effective path towards production-grade on-device speech recognition on a range of embedded devices.

#### Acknowledgments

We would like to thank Gregory Diamos, Christopher Fougner, Atul Kumar, Julia Li, Sharan Narang, Thuan Nguyen, Sanjeev Satheesh, Richard Wang, Yi Wang, and Zhenyao Zhu for their helpful comments and assistance with various parts of this paper.

## A Nondimensional trace norm coefficient

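In this section, we describe some of the properties of the nondimensional trace norm coefficient defined in Section 3.1.

**Proposition 1.** Let $W$ be a nonzero $m \times n$ matrix with $d = \min(m,n) \ge 2$ and singular value vector $\sigma$, and let $\nu(W)$ be its nondimensional trace norm coefficient. Then

- (i) $\nu(cW) = \nu(W)$ for all scalars $c \in \mathbb{R} \setminus \{0\}$.
- (ii) $0 \le \nu(W) \le 1$.
- (iii) $\nu(W) = 0$ if and only if $W$ has rank 1.
- (iv) $\nu(W) = 1$ if and only if $W$ has maximal rank and all singular values are equal.

*Proof.* Since we are assuming $W$ is nonzero, at least one singular value is nonzero and hence $\|\sigma\|_{\ell^2} \neq 0$. Property (i) is immediate from the scaling property $\|c\sigma\| = |c| \cdot \|\sigma\|$ satisfied by all norms.

To establish the other properties, observe that we have

$$(\sigma_i + \sigma_j)^2 \ge \sigma_i^2 + \sigma_j^2 \ge 2\left(\tfrac{1}{2}\sigma_i + \tfrac{1}{2}\sigma_j\right)^2 = \tfrac{1}{2}(\sigma_i + \sigma_j)^2.$$

The first inequality holds since singular values are nonnegative, and the inequality is strict unless $\sigma_i$ or $\sigma_j$ vanishes. The second inequality comes from an application of Jensen's inequality and is strict unless $\sigma_i = \sigma_j$. Thus, replacing $(\sigma_i, \sigma_j)$ by $(\sigma_i + \sigma_j, 0)$ preserves $\|\sigma\|_{\ell^1}$ while increasing $\|\sigma\|_{\ell^2}$ unless one of $\sigma_i$ or $\sigma_j$ is zero. Similarly, replacing $(\sigma_i, \sigma_j)$ by $\left(\tfrac{1}{2}(\sigma_i + \sigma_j), \tfrac{1}{2}(\sigma_i + \sigma_j)\right)$ preserves $\|\sigma\|_{\ell^1}$ while decreasing $\|\sigma\|_{\ell^2}$ unless $\sigma_i = \sigma_j$. By a simple argument by contradiction, it follows that the minima occur for $\sigma = (\sigma_1, 0, \ldots, 0)$, in which case $\nu(W) = 0$, and the maxima occur for $\sigma = (\sigma_1, \ldots, \sigma_1)$, in which case $\nu(W) = 1$.

We can also obtain a better intuition about the minimum and maximum of $\nu(W)$ by looking at the 2D case visualized in Figure 6. For a fixed $\|\sigma\|_{\ell^2} = c$, $\|\sigma\|_{\ell^1}$ can vary from $c$ to $\sqrt{2}\,c$. The minimum of $\|\sigma\|_{\ell^1}$ occurs when either $\sigma_1$ or $\sigma_2$ is zero. For these values $\|\sigma\|_{\ell^2} = \|\sigma\|_{\ell^1}$ and as a result $\nu(W) = 0$. Similarly, the maximum of $\|\sigma\|_{\ell^1}$ occurs for $\sigma_1 = \sigma_2$, resulting in $\nu(W) = 1$.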
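As a worked example of the 2D case, assume the coefficient has the form $\nu(W) = (\|\sigma\|_{\ell^1} / \|\sigma\|_{\ell^2} - 1) / (\sqrt{d} - 1)$ with $d = 2$, and take the illustrative singular values $\sigma = (3, 4)$, which are chosen here only for convenient arithmetic:

$$\|\sigma\|_{\ell^1} = 7, \qquad \|\sigma\|_{\ell^2} = \sqrt{9 + 16} = 5, \qquad \nu = \frac{7/5 - 1}{\sqrt{2} - 1} = \frac{0.4}{0.4142\ldots} \approx 0.966.$$

Applying the two replacements from the proof, both of which preserve $\|\sigma\|_{\ell^1} = 7$, gives the two extremes:

$$\sigma = (7, 0): \quad \|\sigma\|_{\ell^2} = 7, \quad \nu = \frac{7/7 - 1}{\sqrt{2} - 1} = 0; \qquad \sigma = (3.5, 3.5): \quad \|\sigma\|_{\ell^2} = 3.5\sqrt{2}, \quad \nu = \frac{\sqrt{2} - 1}{\sqrt{2} - 1} = 1.$$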

## B Model design considerations

We describe here a few preliminary insights that informed our choice of baseline model for the experiments reported in Sections 3 and 4.

Since the target domain is on-device streaming speech recognition with low latency, we chose to focus on Deep Speech 2 like models with forward-only GRU layers [1].

### B.1 Growing recurrent layer sizes

Across several data sets and model architectures, we consistently found that the sizes of the recurrent layers closer to the input could be shrunk without affecting accuracy much. A related phenomenon was observed in [21]: When doing low rank approximations of the acoustic model layers using SVD, the rank required to explain a fixed threshold of explained variance grows with distance from the input layer.

To reduce the number of parameters of the baseline model and speed up experiments, we thus chose to adopt growing GRU dimensions. Since the hope is that the compression techniques studied in this paper will automatically reduce layers to a near-optimal size, we chose to not tune these dimensions, but simply picked a reasonable affine increasing scheme of 768, 1024, 1280 for the GRU dimensions, and dimension 1536 for the final fully connected layer.

### B.2 Parameter sharing in the low rank factorization

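For the recurrent layers, we employ the Gated Recurrent Unit (GRU) architecture proposed in [4], where the hidden state $h_t$ is computed as follows:

$$\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= f(W_h x_t + r_t \cdot U_h h_{t-1} + b_h) \\
h_t &= (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t
\end{aligned}$$

where $\sigma$ is the sigmoid function, $z$ and $r$ are update and reset gates respectively, $U_z, U_r, U_h$ are the three recurrent weight matrices, and $W_z, W_r, W_h$ are the three non-recurrent weight matrices.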

We consider here three ways of performing weight sharing when doing low rank factorization of the 6 weight matrices.

- **Completely joint factorization.** Here we concatenate the 6 weight matrices along the first dimension and apply low rank factorization to this single combined matrix.
- **Partially joint factorization.** Here we concatenate the 3 recurrent matrices into a single matrix $U$ and likewise concatenate the 3 non-recurrent matrices into a single matrix $W$. We then apply low rank factorization to each of $U$ and $W$ separately.
- **Completely split factorization.** Here we apply low rank factorization to each of the 6 weight matrices separately.
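The parameter sharing implied by the three schemes can be compared with a short calculation. The sketch below is illustrative only: it assumes each gate's matrices have `hidden` rows, that the input dimension equals the hidden dimension so that all 6 matrices can be concatenated along the first dimension, and that the same rank is used everywhere. The function names and the example sizes are hypothetical.

```python
def factored_params(rows, cols, rank):
    """Parameters of a rank-`rank` factorization of a rows x cols matrix."""
    return rank * (rows + cols)

def completely_joint(hidden, rank):
    # One factorization of the 6 matrices stacked into a (6 * hidden) x hidden matrix.
    return factored_params(6 * hidden, hidden, rank)

def partially_joint(hidden, inputs, rank_rec, rank_nonrec):
    # One factorization for the stacked recurrent matrices, one for the non-recurrent ones.
    return (factored_params(3 * hidden, hidden, rank_rec)
            + factored_params(3 * hidden, inputs, rank_nonrec))

def completely_split(hidden, inputs, rank_rec, rank_nonrec):
    # Six independent factorizations.
    return (3 * factored_params(hidden, hidden, rank_rec)
            + 3 * factored_params(hidden, inputs, rank_nonrec))

hidden = inputs = 1024
rank = 128
unfactored = 3 * hidden * hidden + 3 * hidden * inputs
print("unfactored:      ", unfactored)
print("completely joint:", completely_joint(hidden, rank))
print("partially joint: ", partially_joint(hidden, inputs, rank, rank))
print("completely split:", completely_split(hidden, inputs, rank, rank))
```

At equal rank, more sharing means fewer parameters. Equal rank does not imply equal accuracy, however, so the schemes still have to be compared empirically.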

In [21], the authors opted for the LSTM analog of *completely joint factorization*, as this choice has the most parameter sharing and thus the highest potential for compression of the model. However, we decided to go with *partially joint factorization* instead, largely for two reasons. First, in pilot experiments, we found that the $U$ and $W$ matrices behave qualitatively quite differently during training. For example, on large data sets the $W$ matrices may be trained from scratch in factored form, whereas factored $U$ matrices need to be either warmstarted via SVD from a trained unfactored model or trained with a significantly lowered learning rate. Second, the $U$ and $W$ split is advantageous in terms of computational efficiency. For the non-recurrent $W$ GEMM, there is no sequential time dependency and thus its inputs $x$ may be batched across time.

Finally, we compared the partially joint factorization to the completely split factorization and found that the former indeed led to better accuracy versus number of parameters trade-offs. Some results from this experiment are shown in Table 3.

SVD threshold | Completely split: Parameters (M) | Completely split: CER | Partially joint: Parameters (M) | Partially joint: CER |
---|---|---|---|---|
0.50 | 6.3 | 10.3 | 5.5 | 10.3 |
0.60 | 8.7 | 10.5 | 7.5 | 10.2 |
0.70 | 12.0 | 10.3 | 10.2 | 9.9 |
0.80 | 16.4 | 10.1 | 13.7 | 9.7 |

### B.3 Mel and smaller convolution filters

Switching from 161-dimensional linear spectrograms to 80-dimensional mel spectrograms reduces the per-timestep feature dimension by roughly a factor of 2. Furthermore, and likely owing to this switch, we could reduce the frequency-dimension size of the convolution filters by a factor of 2. In combination, this means about a 4x reduction in compute for the first and second convolution layers, and a 2x reduction in compute for the first GRU layer.

On the WSJ corpus as well as an internal dataset of around 1,000 hours of speech, we saw little impact on accuracy from making this change, and hence we adopted it for all experiments in Section 3.

### B.4 Gram-CTC and increased stride in convolutions

Gram-CTC is a recently proposed extension to CTC for training models that output variable-size grams as opposed to single characters [16]. Using Gram-CTC, we were able to increase the time stride in the second convolution layer by a factor of 2 with little to no loss in CER, though we did have to double the number of filters in that same convolution layer to compensate. The net effect is a roughly 2x speedup for the second and third GRU layers, which are the largest. This speedup more than makes up for the size increase in the softmax layer and the slightly more complex language model decoding when using Gram-CTC. However, for a given target accuracy, we found that Gram-CTC models could not be shrunk as much as CTC models by means of low rank factorization. That is, the net effect of this technique is to increase model size in exchange for reduced latency.

### B.5 Low rank factorization versus learned sparsity

Shown in Figure 7 is the parameter reduction versus relative CER increase trade-off for various techniques on an internal data set of around 1,000 hours of speech.

The baseline model is a Deep Speech 2 model with three forward-GRU layers of dimension 2560, as described in [1]. This is the same baseline model used in the experiments of [19], from which we also obtained the sparse data points in the plot. Shown also are versions of the baseline model but with the GRU dimension scaled down to 1536 and 1024. Overall, models with low rank factorizations on all non-recurrent and recurrent weight matrices are seen to provide the best CER versus parameters trade-off. All the low rank models use growing GRU dimensions and the partially joint form of low rank factorization, as discussed in Sections B.1 and B.2. The models labeled *fast* in addition use Gram-CTC as described in Section B.4, and mel features and reduced convolution filter sizes as described in Section B.3.

As this was more of a preliminary comparison to some past experiments, the setup was not perfectly controlled and some models were, for example, trained for more epochs than others. We suspect that, given more effort and similar adjustments like growing GRU dimensions, the sparse models could be made competitive with the low rank models. Even so, given the computational advantage of the low rank approach over unstructured sparsity, we chose to focus only on the former going forward. This does not, of course, rule out the potential usefulness of other, more structured forms of sparsity in the embedded setting.

### Footnotes

1. Available at https://github.com/paddlepaddle/farm.
2. This work was done prior to the development of our trace norm regularization. Due to long training cycles for the 10,000+ hours of speech used in this section, we started from pretrained models. However, the techniques in this section are entirely agnostic to such differences.

### References

1. **Deep Speech 2: End-to-end speech recognition in English and Mandarin.** Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. In *International Conference on Machine Learning*, pp. 173–182, 2016.
2. **Do deep nets really need to be deep?** Jimmy Ba and Rich Caruana. In *Advances in neural information processing systems*, pp. 2654–2662, 2014.
3. **Compressing neural networks with the hashing trick.** Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. In *International Conference on Machine Learning*, pp. 2285–2294, 2015.
4. **On the properties of neural machine translation: Encoder-decoder approaches.** Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. Syntax, Semantics and Structure in Statistical Translation.
5. **Empirical evaluation of gated recurrent neural networks on sequence modeling.** Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. arXiv preprint arXiv:1412.3555.
6. **Reexamining low rank matrix factorization for trace norm regularization.** Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. arXiv preprint arXiv:1706.08934.
7. **Predicting parameters in deep learning.** Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. In *Advances in Neural Information Processing Systems*, pp. 2148–2156, 2013.
8. **Exploiting linear structure within convolutional networks for efficient evaluation.** Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. In *Advances in Neural Information Processing Systems*, pp. 1269–1277, 2014.
9. **Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.** Song Han, Huizi Mao, and William J Dally. In *International Conference on Learning Representations (ICLR)*, 2016.
10. **SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size.** Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. arXiv preprint arXiv:1602.07360.
11. **gemmlowp: a small self-contained low-precision GEMM library.** Benoit Jacob and Pete Warden. https://github.com/google/gemmlowp
12. **Summing and nuclear norms in Banach space theory**, volume 8. Graham James Oscar Jameson.
13. **Matrix factorization techniques for recommender systems.** Yehuda Koren, Robert Bell, and Chris Volinsky. Computer.
14. **Factorization tricks for LSTM networks.** Oleksii Kuchaiev and Boris Ginsburg. arXiv preprint arXiv:1703.10722.
15. **Optimal brain damage.** Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. In *Advances in Neural Information Processing Systems*, pp. 598–605, 1989.
16. **Gram-CTC: Automatic unit selection and target decomposition for sequence labelling.** Hairong Liu, Zhenyao Zhu, Xiangang Li, and Sanjeev Satheesh. arXiv preprint arXiv:1703.00096.
17. **Learning compact recurrent neural networks.** Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. In *Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on*, pp. 5960–5964. IEEE, 2016.
18. **Personalized speech recognition on mobile devices.** Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Haşim Sak, Alexander Gruenstein, Françoise Beaufays, et al. In *Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on*, pp. 5955–5959. IEEE, 2016.
19. **Exploring sparsity in recurrent neural networks.** Sharan Narang, Gregory Diamos, Shubho Sengupta, and Erich Elsen. In *International Conference on Learning Representations (ICLR)*, 2017.
20. **In search of the real inductive bias: On the role of implicit regularization in deep learning.** Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In *Workshop track ICLR*, 2015.
21. **On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition.** Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, and Ian McGraw. In *Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on*, pp. 5970–5974. IEEE, 2016.
22. **Low-rank matrix factorization for deep neural network training with high-dimensional output targets.** Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. In *Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on*, pp. 6655–6659. IEEE, 2013.
23. **Structured transforms for small-footprint deep learning.** Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. In *Advances in Neural Information Processing Systems*, pp. 3088–3096, 2015.
24. **Maximum-margin matrix factorization.** Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. In *Advances in neural information processing systems*, pp. 1329–1336, 2005.
25. **Restructuring of deep neural network acoustic models with singular value decomposition.** Jian Xue, Jinyu Li, and Yifan Gong. In *Interspeech*, pp. 2365–2369, 2013.