Trace norm regularization and faster inference for embedded speech recognition RNNs
Abstract
We propose and evaluate new techniques for compressing and speeding up dense matrix multiplications as found in the fully connected and recurrent layers of neural networks for embedded large vocabulary continuous speech recognition (LVCSR). For compression, we introduce and study a trace norm regularization technique for training low rank factored versions of matrix multiplications. Compared to standard low rank training, we show that our method more consistently leads to good accuracy versus number of parameter tradeoffs and can be used to speed up training of large models. For speedup, we enable faster inference on ARM processors through new open sourced kernels optimized for small batch sizes, resulting in 3x to 7x speed ups over the widely used gemmlowp library. Beyond LVCSR, we expect our techniques and kernels to be more generally applicable to embedded neural networks with large fully connected or recurrent layers.
Markus Kliegl, Siddharth Goyal, Kexin Zhao, Kavya Srinet & Mohammad Shoeybi 

Baidu Silicon Valley Artificial Intelligence Lab 
{klieglmarkus,goyalsiddharth,zhaokexin01,srinetkavya,mohammad}@baidu.com
1 Introduction
For embedded applications of machine learning, we seek models that are as accurate as possible given constraints on size and on latency at inference time. For many neural networks, the parameters and computation are concentrated in two basic building blocks:

Convolutions. These tend to dominate in, for example, image processing applications.

Dense matrix multiplications (GEMMs) as found, for example, inside fully connected layers or recurrent layers such as GRU and LSTM. These are common in speech and natural language processing applications.
These two building blocks are the natural targets for efforts to reduce parameters and speed up models for embedded applications. Much work on this topic already exists in the literature. For a brief overview, see Section 2.
In this paper, we focus only on dense matrix multiplications and not on convolutions. Our two main contributions are:

Trace norm regularization: We describe a trace norm regularization technique and an accompanying training methodology that enables the practical training of models with competitive accuracy versus number of parameter tradeoffs. It automatically selects the rank and eliminates the need for any prior knowledge on suitable matrix rank.

Efficient kernels for inference: We explore the importance of optimizing for low batch sizes in on-device inference, and we introduce kernels (available at https://github.com/paddlepaddle/farm) for ARM processors that vastly outperform publicly available kernels in the low batch size regime (3x to 7x faster than the widely used gemmlowp library).
These two topics are discussed in Sections 3 and 4, respectively. Although we conducted our experiments and report results in the context of large vocabulary continuous speech recognition (LVCSR) on embedded devices, the ideas and techniques are broadly applicable to other deep learning networks. Work on compressing any neural network for which large GEMMs dominate the parameters or computation time could benefit from the insights presented in this paper.
2 Related work
Our work is most closely related to that of Prabhavalkar et al. (2016), where low rank factored acoustic speech models are similarly trained by warm-starting from a truncated singular value decomposition (SVD) of pretrained weight matrices. This technique was also applied to speech recognition on mobile devices (McGraw et al., 2016; Xue et al., 2013). We build on this method by adding a variational form of trace norm regularization that was first proposed for collaborative prediction (Srebro et al., 2005) and also applied to recommender systems (Koren et al., 2009). The use of this technique with gradient descent was recently justified theoretically (Ciliberto et al., 2017). Furthermore, Neyshabur et al. (2015) argue that trace norm regularization could provide a sensible inductive bias for neural networks. To the best of our knowledge, we are the first to combine the training technique of Prabhavalkar et al. (2016) with variational trace norm regularization.
Low rank factorization of neural network weights in general has been the subject of many other works (Denil et al., 2013; Sainath et al., 2013; Ba & Caruana, 2014; Kuchaiev & Ginsburg, 2017). Some other approaches for dense matrix compression include sparsity (LeCun et al., 1989; Narang et al., 2017), hash-based parameter sharing (Chen et al., 2015), and other parameter-sharing schemes such as circulant, Toeplitz, or more generally low-displacement-rank matrices (Sindhwani et al., 2015; Lu et al., 2016). Kuchaiev & Ginsburg (2017) explore splitting activations into independent groups. Doing so is akin to using block-diagonal matrices.
3 Training low rank models
Low rank factorization is a well-studied and effective technique for compressing large matrices. In Prabhavalkar et al. (2016), low rank models are trained by first training a model with unfactored weight matrices (we refer to this as stage 1), and then warm-starting a model with factored weight matrices from the truncated SVD of the unfactored model (we refer to this as stage 2). The truncation is done by retaining only as many singular values as required to explain a specified percentage of the variance.
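The truncation step can be sketched in NumPy as follows. This is a minimal illustration, not the code used in the experiments: the function name `truncated_svd_factors` is hypothetical, and the sketch assumes that "variance explained" by the leading singular values means their share of the sum of squared singular values.

```python
import numpy as np

def truncated_svd_factors(W, threshold=0.9):
    """Return (U_r, V_r, r) such that U_r @ V_r is a rank-r approximation of W.

    Assumption: the variance explained by the first k singular values is
    sum(s[:k]**2) / sum(s**2).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # Smallest r whose cumulative explained variance reaches the threshold.
    r = min(int(np.searchsorted(explained, threshold)) + 1, len(s))
    root = np.sqrt(s[:r])
    # Split each retained singular value evenly between the two factors.
    return U[:, :r] * root, root[:, None] * Vt[:r], r

# Usage: factor a random matrix, keeping 90% of the variance.
rng = np.random.default_rng(0)
W_demo = rng.standard_normal((64, 32))
U_r, V_r, r = truncated_svd_factors(W_demo, threshold=0.9)
print(r, U_r.shape, V_r.shape)
```

The two returned factors can then serve as the initial weights of a factored (stage 2) layer.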
If the weight matrices from stage 1 had only a few nonzero singular values, then the truncated SVD used for warm-starting stage 2 would yield a much better or even error-free approximation of the stage 1 matrix. This suggests applying a sparsity-inducing penalty on the vector of singular values during stage 1 training. This is known as trace norm regularization in the literature. Unfortunately, there is no known way of directly computing the trace norm and its gradients that would be computationally feasible in the context of large deep learning models. Instead, we propose to combine the two-stage training method of Prabhavalkar et al. (2016) with an indirect variational trace norm regularization technique (Srebro et al., 2005; Ciliberto et al., 2017). We describe this technique in more detail in Section 3.1 and report experimental results in Section 3.2.
3.1 Trace norm regularization
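First we introduce some notation. Let us denote by $\|\cdot\|_T$ the trace norm of a matrix, that is, the sum of the singular values of the matrix. The trace norm is also referred to as the nuclear norm or the Schatten 1-norm in the literature. Furthermore, let us denote by $\|\cdot\|_F$ the Frobenius norm of a matrix, defined as
$$\|A\|_F = \sqrt{\operatorname{Tr}(A A^*)} = \sqrt{\sum_{i,j} |A_{ij}|^2}. \tag{1}$$
The Frobenius norm is identical to the Schatten 2-norm of a matrix, i.e. the $\ell^2$ norm of the singular value vector of the matrix. The following lemma provides a variational characterization of the trace norm in terms of the Frobenius norm.
Lemma 1 (Jameson (1987); Ciliberto et al. (2017)).
Let $W$ be an $m \times n$ matrix and denote by $\sigma$ its vector of singular values. Then
$$\|W\|_T := \sum_{i=1}^{\min(m,n)} \sigma_i = \min \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right), \tag{2}$$
where the minimum is taken over all $U$ of size $m \times \min(m,n)$ and $V$ of size $\min(m,n) \times n$ such that $W = UV$. Furthermore, if $W = \tilde{U} \Sigma \tilde{V}^*$ is a singular value decomposition of $W$, then equality holds in (2) for the choice $U = \tilde{U}\sqrt{\Sigma}$ and $V = \sqrt{\Sigma}\tilde{V}^*$.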
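The lemma can be checked numerically on a small example. The sketch below is illustrative only: the matrix sizes, the random seed, and the mixing matrix `G` are arbitrary choices made here, not values from the paper. It builds the balanced factors from the SVD and compares them with another valid factorization of the same matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))

# Trace norm: sum of singular values.
U_tilde, s, Vt_tilde = np.linalg.svd(W, full_matrices=False)
trace_norm = s.sum()

# Balanced factorization from the SVD: U = U~ sqrt(S), V = sqrt(S) V~*.
U = U_tilde * np.sqrt(s)
V = np.sqrt(s)[:, None] * Vt_tilde
balanced = 0.5 * (np.sum(U**2) + np.sum(V**2))

# Any other factorization W = (U G)(G^{-1} V) should not do better.
G = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5]])
U2, V2 = U @ G, np.linalg.solve(G, V)
other = 0.5 * (np.sum(U2**2) + np.sum(V2**2))

print(trace_norm, balanced, other)
```

The first two printed values should agree up to rounding, and the third should be at least as large, since it corresponds to a factorization that is not a minimizer.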
The procedure to take advantage of this characterization is as follows. First, for each large GEMM in the model, replace the $m \times n$ weight matrix $W$ by the product $W = UV$ where $U$ is of size $m \times \min(m,n)$ and $V$ is of size $\min(m,n) \times n$. Second, replace the original loss function $\ell(W)$ by
$$\ell(UV) + \frac{1}{2}\lambda\left(\|U\|_F^2 + \|V\|_F^2\right), \tag{3}$$
where $\lambda$ is a hyperparameter controlling the strength of the approximate trace norm regularization. Proposition 1 in Ciliberto et al. (2017) guarantees that minimizing the modified loss (3) is equivalent to minimizing the actual trace norm regularized loss:
$$\ell(W) + \lambda\|W\|_T. \tag{4}$$
In Section 3.2.1 we show empirically that use of the modified loss (3) is indeed highly effective at reducing the trace norm of the weight matrices.
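To make the modified loss (3) concrete, the following sketch evaluates it and its gradients for a single factored linear layer. The squared-error data loss, the function name `factored_loss_and_grads`, and the shapes are assumptions chosen for illustration; the models in this paper use a speech recognition loss and an automatic differentiation framework instead.

```python
import numpy as np

def factored_loss_and_grads(U, V, X, Y, lam):
    """Loss (3) for a toy data loss l(W) = 0.5 * ||W X - Y||_F^2 with W = U V.

    Shapes (illustrative): U is m x r, V is r x n, X is n x b, Y is m x b.
    Returns the scalar loss and the gradients with respect to U and V.
    """
    R = U @ V @ X - Y                        # residual of the toy data loss
    penalty = 0.5 * lam * (np.sum(U**2) + np.sum(V**2))
    loss = 0.5 * np.sum(R**2) + penalty
    G = R @ X.T                              # gradient of l with respect to W
    grad_U = G @ V.T + lam * U               # chain rule through W = U V
    grad_V = U.T @ G + lam * V
    return loss, grad_U, grad_V

# Usage: one plain gradient descent step on random data.
rng = np.random.default_rng(1)
U0, V0 = rng.standard_normal((4, 3)), rng.standard_normal((3, 5))
X0, Y0 = rng.standard_normal((5, 6)), rng.standard_normal((4, 6))
loss0, gU0, gV0 = factored_loss_and_grads(U0, V0, X0, Y0, lam=0.1)
U1, V1 = U0 - 1e-3 * gU0, V0 - 1e-3 * gV0
```

Note that the penalty only involves Frobenius norms of the factors, so no SVD is needed during training.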
To summarize, we propose the following basic training scheme:

Stage 1:

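For each large GEMM in the model, replace the $m \times n$ weight matrix $W$ by the product $W = UV$ where $U$ is of size $m \times r$, $V$ is of size $r \times n$, and $r = \min(m,n)$.

Replace the original loss function $\ell(W)$ by
$$\ell(UV) + \frac{1}{2}\lambda\left(\|U\|_F^2 + \|V\|_F^2\right), \tag{5}$$
where $\lambda$ is a hyperparameter controlling the strength of the trace norm regularization.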

Train the model to convergence.


Stage 2:

For the trained model from stage 1, recover $W = UV$ by multiplying the two trained matrices $U$ and $V$.

Train low rank models warm-started from the truncated SVD of $W$. By varying the number of singular values retained, we can control the parameter versus accuracy tradeoff.

One modification to this is described in Section 3.2.3, where we show that it is actually not necessary to train the stage 1 model to convergence before switching to stage 2. By making the transition earlier, training time can be substantially reduced.
3.2 Experiments and results
We report here the results of our experiments related to trace norm regularization. Our baseline model is a forward-only Deep Speech 2 (Amodei et al., 2016) model with some small modifications such as growing GRU dimensions. More details are given in Appendix B. We train and evaluate on the widely used Wall Street Journal (WSJ) speech corpus.
Since our work focuses only on compressing acoustic models and not language models, the error metric we report is the character error rate (CER) rather than word error rate (WER). As the size and latency constraints vary widely across devices, whenever possible we compare techniques by comparing their accuracy versus number of parameter tradeoff curves.
3.2.1 Stage 1 experiments
In this section, we investigate the effects of training with the modified loss function in (3). For simplicity, we refer to this as trace norm regularization.
As the WSJ corpus is relatively small at around 80 hours of speech, models tend to benefit substantially from regularization. To make comparisons fairer, we also trained unfactored models with an $\ell^2$ regularization term and searched the hyperparameter space just as exhaustively.
For both trace norm and $\ell^2$ regularization, we found it beneficial to introduce separate $\lambda_{rec}$ and $\lambda_{nonrec}$ parameters for determining the strength of regularization for the recurrent and non-recurrent weight matrices, respectively. In addition to $\lambda_{rec}$ and $\lambda_{nonrec}$, in initial experiments we also roughly tuned the learning rate. Since the same learning rate was found to be optimal for nearly all experiments, we just used that for all the experiments reported in this section. The dependence of final CER on $\lambda_{rec}$ and $\lambda_{nonrec}$ is shown in Figure 1. Separate $\lambda_{rec}$ and $\lambda_{nonrec}$ values are seen to help for both trace norm and $\ell^2$ regularization. However, for trace norm regularization, it appears better to fix $\lambda_{rec}$ as a multiple of $\lambda_{nonrec}$ rather than tuning the two parameters independently.
The first question we are interested in is whether our modified loss (3) is really effective at reducing the trace norm. As we are interested in the relative concentration of singular values rather than their absolute magnitudes, we introduce the following nondimensional metric.
Definition 1.
Let $W$ be a nonzero $m \times n$ matrix with $d = \min(m,n) \ge 2$. Denote by $\sigma$ the $d$-dimensional vector of singular values of $W$. Then we define the nondimensional trace norm coefficient of $W$ as follows:
$$\nu(W) := \frac{\dfrac{\|\sigma\|_{\ell^1}}{\|\sigma\|_{\ell^2}} - 1}{\sqrt{d} - 1}. \tag{6}$$
We show in Appendix A that $\nu$ is scale-invariant and ranges from 0 for rank 1 matrices to 1 for maximal-rank matrices with all singular values equal. Intuitively, the smaller $\nu(W)$, the better $W$ can be approximated by a low rank matrix.
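As an illustration, the coefficient can be computed directly from the singular values, assuming it is the ratio of the $\ell^1$ to the $\ell^2$ norm of the singular value vector, shifted and rescaled to lie in $[0, 1]$ as the stated properties require. The function name `trace_norm_coefficient` is a hypothetical helper introduced here, not part of any released code.

```python
import numpy as np

def trace_norm_coefficient(W):
    """Nondimensional trace norm coefficient of a nonzero matrix W.

    Computed as (||s||_1 / ||s||_2 - 1) / (sqrt(d) - 1), where s is the
    vector of singular values and d = min(W.shape) must be at least 2.
    """
    s = np.linalg.svd(W, compute_uv=False)
    d = min(W.shape)
    if d < 2:
        raise ValueError("need min(m, n) >= 2")
    if not np.any(s > 0):
        raise ValueError("W must be nonzero")
    return (s.sum() / np.linalg.norm(s) - 1.0) / (np.sqrt(d) - 1.0)

# Usage: a rank 1 matrix, the identity, and a random dense matrix.
rng = np.random.default_rng(0)
rank_one = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])
dense = rng.standard_normal((6, 4))
print(trace_norm_coefficient(rank_one))
print(trace_norm_coefficient(np.eye(4)))
print(trace_norm_coefficient(dense))
```

The rank 1 matrix should give a value at (or within rounding error of) 0, the identity a value of 1, and the random matrix something in between.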
As shown in Figure 2, trace norm regularization is indeed highly effective at reducing the nondimensional trace norm coefficient compared to $\ell^2$ regularization. At very high regularization strengths, $\ell^2$ regularization also leads to small $\nu$ values. However, from Figure 1 it is apparent that this comes at the expense of relatively high CERs. As shown in Figure 3, this translates into requiring a much lower rank for the truncated SVD to explain, say, 90% of the variance of the weight matrix for a given CER. Although a few $\ell^2$-regularized models occasionally achieve low rank, we observe this only at relatively high CERs and only for some of the weights.
3.2.2 Stage 2 experiments
In this section, we report the results of stage 2 experiments warm-started from either trace norm or $\ell^2$ regularized stage 1 models.
For each regularization type, we took the three best stage 1 models (in terms of final CER) and used the truncated SVD of their weights to initialize the weights of stage 2 models. By varying the threshold of variance explained for the SVD truncation, each stage 1 model resulted in multiple stage 2 models. The stage 2 models were trained without regularization (i.e., $\lambda_{rec} = \lambda_{nonrec} = 0$) and with the initial learning rate set to three times the final learning rate of the stage 1 model.
As shown in Figure 4, the best models from either trace norm or $\ell^2$ regularization exhibit similar accuracy versus number of parameter tradeoffs, but the trace norm models more consistently achieve better tradeoff points. For comparison, we also warm-started some stage 2 models from an unregularized stage 1 model. These models are seen to have significantly lower accuracies, accentuating the need for regularization on the WSJ corpus.
3.2.3 Reducing training time
In the previous sections, we trained the stage 1 models for 40 epochs to full convergence and then trained the stage 2 models for another 40 epochs, again to full convergence. Since the stage 2 models are drastically smaller than the stage 1 models, it takes less time to train them. Hence, shifting the stage 1 to stage 2 transition point to an earlier epoch could substantially reduce training time. In this section, we show that it is indeed possible to do so without hurting final accuracy.
Specifically, we took the stage 1 trace norm and $\ell^2$ models from Section 3.2.1 that resulted in the best stage 2 models in Section 3.2.2. In that section, we were interested in the parameters vs accuracy tradeoff and used each stage 1 model to warm-start a number of stage 2 models of different sizes. In this section, we instead set a fixed target of 3 M parameters and a fixed overall training budget of 80 epochs but vary the stage 1 to stage 2 transition epoch. For each of the stage 2 runs, we initialize the learning rate with the learning rate of the stage 1 model at the transition epoch. So the learning rate follows the same schedule as if we had trained a single model for 80 epochs. As before, we disable all regularization for stage 2.
The $\ell^2$ stage 1 model has 21.7 M parameters, whereas the trace norm stage 1 model at 29.8 M parameters is slightly larger due to the factorization. Since the stage 2 models have roughly 3 M parameters and the training time is approximately proportional to the number of parameters, stage 2 models train about 7x and 10x faster, respectively, than the $\ell^2$ and trace norm stage 1 models. Consequently, large overall training time reductions can be achieved by reducing the number of epochs spent in stage 1 for both $\ell^2$ and trace norm.
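The arithmetic behind these estimates can be spelled out in a few lines. The sketch below relies only on the stated approximation that training time per epoch is proportional to the number of parameters; the function name and the choice of "one stage 2 epoch" as the unit of cost are illustrative conventions introduced here.

```python
def relative_training_cost(transition_epoch, stage1_params_m,
                           stage2_params_m=3.0, total_epochs=80):
    """Total cost in units of one stage 2 epoch.

    Assumes time per epoch is proportional to the parameter count, and that
    stage 2 runs for the remainder of the fixed epoch budget.
    """
    stage1_cost = transition_epoch * stage1_params_m / stage2_params_m
    stage2_cost = total_epochs - transition_epoch
    return stage1_cost + stage2_cost

for name, params_m in [("l2", 21.7), ("trace norm", 29.8)]:
    speed_ratio = params_m / 3.0          # about 7x and 10x, respectively
    late = relative_training_cost(40, params_m)
    early = relative_training_cost(15, params_m)
    print(f"{name}: stage 2 is {speed_ratio:.1f}x faster per epoch; "
          f"transition at 15 costs {early / late:.2f} of transition at 40")
```

Under this simple cost model, moving the transition from epoch 40 to epoch 15 roughly halves the total training cost for both model types.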
The results are shown in Figure 5. Based on the left panel, it is evident that we can lower the transition epoch number without hurting the final CER. In some cases, we even see marginal CER improvements. For transition epochs of at least 15, we also see slightly better results for trace norm than $\ell^2$. In the right panel, we plot the convergence of CER when the transition epoch is 15. We find that the trace norm model’s CER is barely impacted by the transition whereas the $\ell^2$ models see a huge jump in CER at the transition epoch. Furthermore, the plot suggests that a total of 60 epochs may have sufficed. However, the savings from reducing stage 2 epochs are negligible compared to the savings from reducing the transition epoch.
4 Application to production-grade embedded speech recognition
With low rank factorization techniques similar to those described in Section 3, we were able to train large vocabulary continuous speech recognition (LVCSR) models with acceptable numbers of parameters and acceptable loss of accuracy compared to a production server model (baseline). (This work was done prior to the development of our trace norm regularization. Due to long training cycles for the 10,000+ hours of speech used in this section, we started from pretrained models. However, the techniques in this section are entirely agnostic to such differences.) Table 1 shows the baseline along with three different compressed models with far fewer parameters. The tier3 model employs the techniques of Sections B.4 and B.3. Consequently, it runs significantly faster than the tier1 model, even though they have a similar number of parameters. Unfortunately, this comes at the expense of some loss in accuracy.
Model  Parameters (M)  WER  % Relative 

baseline  
tier1  
tier2  
tier3*  
* The tier3 model is larger but faster than the tier2 model. See main text for details. 
Although low rank factorization significantly reduces the overall computational complexity of our LVCSR system, we still require further optimization to achieve real-time inference on mobile or embedded devices. One approach to speeding up the network is to use low-precision 8-bit integer representations for weight matrices and matrix multiplications (the GEMM operation in BLAS terminology). This type of quantization reduces both the memory and the computation requirements of the network while introducing only a small relative increase in WER.
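The general idea of an 8-bit GEMM can be illustrated with a simple affine quantization scheme. The sketch below is a generic NumPy illustration under assumptions made here (per-matrix scale and zero point, 32-bit integer accumulation); it is neither the gemmlowp API nor the farm kernels, and the function names are hypothetical.

```python
import numpy as np

def quantize_uint8(x):
    """Map a float array to uint8 with a per-array scale and zero point."""
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    span = hi - lo
    scale = span / 255.0 if span > 0 else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def quantized_matmul(a, b):
    """Approximate a @ b using uint8 operands and int32 accumulation."""
    qa, sa, za = quantize_uint8(a)
    qb, sb, zb = quantize_uint8(b)
    # Subtract zero points, multiply and accumulate in 32-bit integers.
    acc = (qa.astype(np.int32) - za) @ (qb.astype(np.int32) - zb)
    # Rescale the integer result back to floating point.
    return acc * (sa * sb)

# Usage: compare against the float result for a batch size 4 GEMM.
rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 16))
inputs = rng.standard_normal((16, 4))
exact = weights @ inputs
approx = quantized_matmul(weights, inputs)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(rel_err)
```

Storing `uint8` values instead of 32-bit floats cuts weight memory by a factor of 4, which is the memory saving referred to above.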
To perform low precision matrix multiplications, we originally used the gemmlowp library, which provides state-of-the-art low precision GEMMs using unsigned 8-bit integer values (Jacob & Warden, 2015–2017). However, gemmlowp’s approach is not efficient for small batch sizes. Our application, LVCSR on embedded devices with a single user, is dominated by low batch size GEMMs due to the sequential nature of recurrent layers and latency constraints. This can be demonstrated by looking at a simple RNN cell which has the form:
$$h_t = f(W x_t + U h_{t-1}). \tag{7}$$
This cell contains two main GEMMs: The first, $U h_{t-1}$, is sequential and requires a GEMM with batch size 1. The second, $W x_t$, can in principle be performed at higher batch sizes by batching across time. However, choosing too large a batch size can significantly delay the output, as the system needs to wait for more future context. In practice, we found that batch sizes higher than around 4 resulted in too high latencies, negatively impacting user experience.
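The difference between the two GEMMs can be seen in a small NumPy sketch of a simple RNN layer. The names `W` (non-recurrent weights), `U` (recurrent weights), the choice of `tanh` as the nonlinearity, and the `block` parameter are assumptions made for this illustration.

```python
import numpy as np

def rnn_forward(W, U, xs, h0, block=4):
    """Run a simple RNN over xs (shape T x n_in).

    The non-recurrent product is computed for `block` time steps at once
    (a GEMM with batch size up to `block`), which requires waiting for that
    many input frames. The recurrent product has batch size 1 because each
    step needs the previous hidden state.
    """
    h = h0
    outputs = []
    for start in range(0, len(xs), block):
        Wx = xs[start:start + block] @ W.T      # batched across time
        for wx_t in Wx:
            h = np.tanh(wx_t + U @ h)           # strictly sequential
            outputs.append(h)
    return np.stack(outputs)

# Usage: the result does not depend on the block size, only the latency does.
rng = np.random.default_rng(0)
W_demo = 0.5 * rng.standard_normal((3, 2))
U_demo = 0.5 * rng.standard_normal((3, 3))
xs_demo = rng.standard_normal((10, 2))
h_batched = rnn_forward(W_demo, U_demo, xs_demo, np.zeros(3), block=4)
h_stepwise = rnn_forward(W_demo, U_demo, xs_demo, np.zeros(3), block=1)
```

A larger `block` makes the non-recurrent GEMM more efficient but delays the first output of each block until all of its input frames have arrived.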
This motivated us to implement custom assembly kernels for the 64-bit ARM architecture (AArch64, also known as ARMv8 or ARM64) to further improve the performance of the GEMM operations. We do not go through the methodological details in this paper. Instead, we are making the kernels and implementation details available at https://github.com/paddlepaddle/farm.
Figure 6 compares the performance of our implementation (denoted by farm) with the gemmlowp library for matrix multiplication on iPhone 7, iPhone 6, and Raspberry Pi 3 Model B. The farm kernels are significantly faster than their gemmlowp counterparts for batch sizes 1 to 4. The peak single-core theoretical performance for iPhone 7, iPhone 6, and Raspberry Pi 3 are , and Giga Operations per Second, respectively. The gap between the theoretical and achieved values is mostly due to kernels being limited by memory bandwidth. For a more detailed analysis, we refer to the farm website.
In addition to low precision representation and customized ARM kernels, we explored other approaches to speed up our LVCSR system. These techniques are described in Appendix B.
Finally, by combining low rank factorization, some techniques from Appendix B, int8 quantization and the farm kernels, as well as using smaller language models, we could create a range of speech recognition models suitably tailored to various devices. These are shown in Table 2.
Device  Acoustic model  Language model size (MB)  WER  % Relative  Speedup over realtime  % time spent in acoustic model

GPU server  baseline  
iPhone 7  tier1  
iPhone 6  tier2  
Raspberry Pi 3  tier3 
5 Conclusion
We worked on compressing and reducing the inference latency of LVCSR speech recognition models. To better compress models, we introduced a trace norm regularization technique and demonstrated its potential for faster and more consistent training of low rank models on the WSJ speech corpus. To reduce latency at inference time, we demonstrated the importance of optimizing for low batch sizes and released optimized kernels for the ARM64 platform. Finally, by combining the various techniques in this paper, we demonstrated an effective path towards production-grade on-device speech recognition on a range of embedded devices.
Acknowledgments
We would like to thank Gregory Diamos, Christopher Fougner, Atul Kumar, Julia Li, Sharan Narang, Thuan Nguyen, Sanjeev Satheesh, Richard Wang, Yi Wang, and Zhenyao Zhu for their helpful comments and assistance with various parts of this paper.
References
 Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182, 2016.
 Ba & Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pp. 2654–2662, 2014.
 Chen et al. (2015) Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294, 2015.
 Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. Syntax, Semantics and Structure in Statistical Translation, pp. 103, 2014.
 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 Ciliberto et al. (2017) Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Reexamining low rank matrix factorization for trace norm regularization. arXiv preprint arXiv:1706.08934, 2017.
 Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
 Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277, 2014.
 Han et al. (2016) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016.
 Iandola et al. (2016) Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
 Jacob & Warden (2015–2017) Benoit Jacob and Pete Warden. gemmlowp: a small self-contained low-precision GEMM library. https://github.com/google/gemmlowp, 2015–2017.
 Jameson (1987) Graham James Oscar Jameson. Summing and nuclear norms in Banach space theory, volume 8. Cambridge University Press, 1987.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
 Kuchaiev & Ginsburg (2017) Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017.
 LeCun et al. (1989) Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605, 1989.
 Liu et al. (2017) Hairong Liu, Zhenyao Zhu, Xiangang Li, and Sanjeev Satheesh. Gram-CTC: Automatic unit selection and target decomposition for sequence labelling. arXiv preprint arXiv:1703.00096, 2017.
 Lu et al. (2016) Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. Learning compact recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5960–5964. IEEE, 2016.
 McGraw et al. (2016) Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Haşim Sak, Alexander Gruenstein, Françoise Beaufays, et al. Personalized speech recognition on mobile devices. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5955–5959. IEEE, 2016.
 Narang et al. (2017) Sharan Narang, Gregory Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. In International Conference on Learning Representations (ICLR), 2017.
 Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In Workshop track ICLR, 2015. arXiv preprint arXiv:1412.6614.
 Prabhavalkar et al. (2016) Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, and Ian McGraw. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5970–5974. IEEE, 2016.
 Sainath et al. (2013) Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6655–6659. IEEE, 2013.
 Sindhwani et al. (2015) Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems, pp. 3088–3096, 2015.
 Srebro et al. (2005) Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in neural information processing systems, pp. 1329–1336, 2005.
 Xue et al. (2013) Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pp. 2365–2369, 2013.
Appendix A Nondimensional trace norm coefficient
In this section, we describe some of the properties of the nondimensional trace norm coefficient introduced in Definition 1 (Section 3.2.1).
Proposition 1.
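Let $W$, $d$, and $\sigma$ be as in Definition 1. Then

(i) $\nu(cW) = \nu(W)$ for all scalars $c \neq 0$.

(ii) $0 \le \nu(W) \le 1$.

(iii) $\nu(W) = 0$ if and only if $W$ has rank 1.

(iv) $\nu(W) = 1$ if and only if $W$ has maximal rank and all singular values are equal.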
Proof.
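Since we are assuming $W$ is nonzero, at least one singular value is nonzero and hence $\|\sigma\|_{\ell^2} \neq 0$. Property (i) is immediate from the scaling property $\|c\sigma\| = |c| \cdot \|\sigma\|$ satisfied by all norms.
To establish the other properties, observe that we have
$$(\sigma_i + \sigma_j)^2 \ge \sigma_i^2 + \sigma_j^2 \ge 2\left(\tfrac{1}{2}\sigma_i + \tfrac{1}{2}\sigma_j\right)^2. \tag{8}$$
The first inequality holds since singular values are nonnegative, and the inequality is strict unless $\sigma_i$ or $\sigma_j$ vanishes. The second inequality comes from an application of Jensen’s inequality and is strict unless $\sigma_i = \sigma_j$. Thus, replacing $(\sigma_i, \sigma_j)$ by $(\sigma_i + \sigma_j, 0)$ preserves $\|\sigma\|_{\ell^1}$ while increasing $\|\sigma\|_{\ell^2}$ unless one of $\sigma_i$ or $\sigma_j$ is zero. Similarly, replacing $(\sigma_i, \sigma_j)$ by $(\tfrac{1}{2}\sigma_i + \tfrac{1}{2}\sigma_j, \tfrac{1}{2}\sigma_i + \tfrac{1}{2}\sigma_j)$ preserves $\|\sigma\|_{\ell^1}$ while decreasing $\|\sigma\|_{\ell^2}$ unless $\sigma_i = \sigma_j$. By a simple argument by contradiction, it follows that the minima occur for $\sigma = (\sigma_1, 0, \ldots, 0)$, in which case $\nu(W) = 0$, and the maxima occur for $\sigma = (\sigma_1, \ldots, \sigma_1)$, in which case $\nu(W) = 1$. ∎
We can also obtain a better intuition about the minimum and maximum of $\nu(W)$ by looking at the 2D case visualized in Figure 7. For a fixed $\|\sigma\|_{\ell^2} = \rho$, $\|\sigma\|_{\ell^1} = \sigma_1 + \sigma_2$ can vary from $\rho$ to $\sqrt{2}\rho$. The minimum $\|\sigma\|_{\ell^1}$ happens when either $\sigma_1$ or $\sigma_2$ is zero. For these values $\|\sigma\|_{\ell^2} = \|\sigma\|_{\ell^1}$ and as a result $\nu(W) = 0$. Similarly, the maximum $\|\sigma\|_{\ell^1}$ happens for $\sigma_1 = \sigma_2$, resulting in $\nu(W) = 1$.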
Appendix B Model design considerations
We describe here a few preliminary insights that informed our choice of baseline model for the experiments reported in Sections 3 and 4.
Since the target domain is on-device streaming speech recognition with low latency, we chose to focus on Deep Speech 2-like models with forward-only GRU layers (Amodei et al., 2016).
B.1 Growing recurrent layer sizes
Across several data sets and model architectures, we consistently found that the sizes of the recurrent layers closer to the input could be shrunk without affecting accuracy much. A related phenomenon was observed in Prabhavalkar et al. (2016): When doing low rank approximations of the acoustic model layers using SVD, the rank required to explain a fixed threshold of explained variance grows with distance from the input layer.
To reduce the number of parameters of the baseline model and speed up experiments, we thus chose to adopt growing GRU dimensions. Since the hope is that the compression techniques studied in this paper will automatically reduce layers to a near-optimal size, we chose to not tune these dimensions, but simply picked a reasonable affine increasing scheme of 768, 1024, 1280 for the GRU dimensions, and dimension 1536 for the final fully connected layer.
B.2 Parameter sharing in the low rank factorization
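For the recurrent layers, we employ the Gated Recurrent Unit (GRU) architecture proposed in Cho et al. (2014) and Chung et al. (2014), where the hidden state $h_t$ is computed as follows:
$$\begin{aligned} z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\ r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\ \tilde{h}_t &= f(W_h x_t + r_t \cdot U_h h_{t-1} + b_h) \\ h_t &= (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t \end{aligned} \tag{9}$$
where $\sigma$ is the sigmoid function, $z$ and $r$ are update and reset gates respectively, $U_z, U_r, U_h$ are the three recurrent weight matrices, and $W_z, W_r, W_h$ are the three non-recurrent weight matrices.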
We consider here three ways of performing weight sharing when doing low rank factorization of the 6 weight matrices.

Completely joint factorization. Here we concatenate the 6 weight matrices along the first dimension and apply low rank factorization to this single combined matrix.

Partially joint factorization. Here we concatenate the 3 recurrent matrices into a single matrix $U$ and likewise concatenate the 3 non-recurrent matrices into a single matrix $W$. We then apply low rank factorization to each of $U$ and $W$ separately.

Completely split factorization. Here we apply low rank factorization to each of the 6 weight matrices separately.
In Prabhavalkar et al. (2016) and Kuchaiev & Ginsburg (2017), the authors opted for the LSTM analog of completely joint factorization, as this choice has the most parameter sharing and thus the highest potential for compression of the model. However, we decided to go with partially joint factorization instead, largely for two reasons. First, in pilot experiments, we found that the $U$ and $W$ matrices behave qualitatively quite differently during training. For example, on large data sets the $W$ matrices may be trained from scratch in factored form, whereas factored $U$ matrices need to be either warm-started via SVD from a trained unfactored model or trained with a significantly lowered learning rate. Second, the $U$ and $W$ split is advantageous in terms of computational efficiency. For the non-recurrent $W$ GEMM, there is no sequential time dependency and thus its inputs may be batched across time.
Finally, we compared the partially joint factorization to the completely split factorization and found that the former indeed led to better accuracy versus number of parameters tradeoffs. Some results from this experiment are shown in Table 3.
Completely split  Partially joint  

SVD threshold  Parameters (M)  CER  Parameters (M)  CER 
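To make the parameter sharing in the partially joint and completely split schemes concrete, the following sketch counts parameters for one GRU layer. It assumes each weight matrix is stored as (output dimension) x (input dimension), so that stacking three matrices along the first dimension triples the row count, and it uses the same rank for every factorization purely for illustration. In practice the rank needed to reach a given accuracy differs between schemes, which is what Table 3 examines. The function names and the example dimensions are hypothetical.

```python
def low_rank_params(rows, cols, rank):
    """Parameter count of a rank-`rank` factorization of a rows x cols matrix."""
    return rank * (rows + cols)

def gru_factored_params(hidden, n_in, rank, scheme):
    """Parameters in the factored GRU weight matrices (biases ignored)."""
    if scheme == "partially_joint":
        # One factorization of the stacked recurrent matrices (3*hidden x hidden)
        # and one of the stacked non-recurrent matrices (3*hidden x n_in).
        return (low_rank_params(3 * hidden, hidden, rank)
                + low_rank_params(3 * hidden, n_in, rank))
    if scheme == "completely_split":
        # Six separate factorizations, one per weight matrix.
        return (3 * low_rank_params(hidden, hidden, rank)
                + 3 * low_rank_params(hidden, n_in, rank))
    raise ValueError(f"unknown scheme: {scheme}")

# Usage with illustrative sizes: hidden size 1024, input size 768, rank 128.
unfactored = 3 * 1024 * 1024 + 3 * 1024 * 768
joint = gru_factored_params(1024, 768, 128, "partially_joint")
split = gru_factored_params(1024, 768, 128, "completely_split")
print(unfactored, joint, split)
```

At equal rank, the partially joint scheme shares the input-side factor across the three gates and therefore needs fewer parameters than the completely split scheme.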
B.3 Mel and smaller convolution filters
Switching from 161-dimensional linear spectrograms to 80-dimensional mel spectrograms reduces the per-timestep feature dimension by roughly a factor of 2. Furthermore, and likely owing to this switch, we could reduce the frequency-dimension size of the convolution filters by a factor of 2. In combination, this means about a 4x reduction in compute for the first and second convolution layers, and a 2x reduction in compute for the first GRU layer.
On the WSJ corpus as well as an internal dataset of around 1,000 hours of speech, we saw little impact on accuracy from making this change, and hence we adopted it for all experiments in Section 3.
B.4 Gram-CTC and increased stride in convolutions
Gram-CTC is a recently proposed extension to CTC for training models that output variable-size grams as opposed to single characters (Liu et al., 2017). Using Gram-CTC, we were able to increase the time stride in the second convolution layer by a factor of 2 with little to no loss in CER, though we did have to double the number of filters in that same convolution layer to compensate. The net effect is a roughly 2x speedup for the second and third GRU layers, which are the largest. This speedup more than makes up for the size increase in the softmax layer and the slightly more complex language model decoding when using Gram-CTC. However, for a given target accuracy, we found that Gram-CTC models could not be shrunk as much as CTC models by means of low rank factorization. That is, the net effect of this technique is to increase model size in exchange for reduced latency.
B.5 Low rank factorization versus learned sparsity
Shown in Figure 8 is the parameter reduction versus relative CER increase tradeoff for various techniques on an internal data set of around 1,000 hours of speech.
The baseline model is a Deep Speech 2 model with three forward-GRU layers of dimension 2560, as described in Amodei et al. (2016). This is the same baseline model used in the experiments of Narang et al. (2017), from which we also obtained the sparse data points in the plot. Shown also are versions of the baseline model but with the GRU dimension scaled down to 1536 and 1024. Overall, models with low rank factorizations on all non-recurrent and recurrent weight matrices are seen to provide the best CER vs parameters tradeoff. All the low rank models use growing GRU dimensions and the partially joint form of low rank factorization, as discussed in Sections B.1 and B.2. The models labeled fast in addition use Gram-CTC as described in Section B.4 and mel features and reduced convolution filter sizes as described in Section B.3.
As this was more of a preliminary comparison to some past experiments, the setup was not perfectly controlled and some models were, for example, trained for more epochs than others. We suspect that, given more effort and similar adjustments like growing GRU dimensions, the sparse models could be made competitive with the low rank models. Even so, given the computational advantage of the low rank approach over unstructured sparsity, we chose to focus only on the former going forward. This does not, of course, rule out the potential usefulness of other, more structured forms of sparsity in the embedded setting.