Blending LSTMs into CNNs
Abstract
We consider whether deep convolutional networks (CNNs) can represent decision functions with similar accuracy as recurrent networks such as LSTMs. First, we show that a deep CNN with an architecture inspired by the models recently introduced in image recognition can yield better accuracy than previous convolutional and LSTM networks on the standard 309h Switchboard automatic speech recognition task. Then we show that even more accurate CNNs can be trained under the guidance of LSTMs using a variant of model compression, which we call model blending because the teacher and student models are similar in complexity but different in inductive bias. Blending further improves the accuracy of our CNN, yielding a computationally efficient model of accuracy higher than any of the other individual models. Examining the effect of “dark knowledge” in this model compression task, we find that less than 1% of the highest probability labels are needed for accurate model compression.
Krzysztof J. Geras, Abdelrahman Mohamed, Rich Caruana, Gregor Urban, Shengjie Wang, Özlem Aslan, Matthai Philipose, Matthew Richardson & Charles Sutton
University of Edinburgh 
Microsoft Research 
UC Irvine 
University of Washington 
University of Alberta 
1 Introduction
There is evidence that feedforward neural networks trained with current training algorithms use their large capacity inefficiently (Le Cun et al., 1990; Denil et al., 2013; Dauphin & Bengio, 2013; Ba & Caruana, 2014; Hinton et al., 2015; Han et al., 2016). Although this excess capacity may be necessary for accurate learning and generalization at training time, the function once learned often can be represented much more compactly. As deep neural net models become larger, their accuracy often increases, but the difficulty of deploying them also rises. Methods such as model compression sometimes allow the accurate functions learned by large, complex models to be compressed into smaller models that are computationally more efficient at runtime.
There are a number of different kinds of deep neural networks such as deep fully-connected neural networks (DNNs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Different domains typically benefit from deep models of different types. For example, CNNs usually yield highest accuracy in domains such as image recognition where the input forms a regular 1, 2, or 3D image plane with structure that is partially invariant to shifts in position or scale. On the other hand, recurrent network models such as LSTMs appear to be better suited to applications such as speech recognition or language modeling where inputs form sequences of varying lengths with short and long-range interactions of different scales.
The differences between these deep learning architectures raise interesting questions about what is learnable by different kinds of deep models, and when it is possible for a deep model of one kind to represent and learn the function learned by a different kind of deep model. The success of model compression on feedforward networks raises the question of whether other neural network architectures that embody different inductive biases can also be compressed. For example, can a classification function learnt by an LSTM be represented by a CNN with a wide enough window to span the important long-range interactions?
In this paper we demonstrate that, for a speech recognition task, it is possible to train very accurate CNN models, outperforming LSTMs. This is thanks to CNN architectures inspired by recent developments in computer vision (Simonyan & Zisserman, 2014), which were not previously considered for speech recognition. Moreover, the experiments suggest that LSTMs and CNNs learn different functions when trained on the same data. This difference creates an opportunity: by merging the functions learned by CNNs and LSTMs into a single model we can obtain better accuracy than either model class achieved independently. One way to perform this merge is through a variant of model compression that we call model blending because the student and teacher models are of comparable size and have complementary inductive biases. Although at this point model compression is a well established technique for models of different capacity, the question of whether compression is also effective for models of similar capacity has not been explored. For example, by blending an LSTM teacher with a CNN student, we are able to train a CNN that is more accurate because it benefits from what the LSTM learned and also computationally more efficient than an LSTM would be at runtime. This blending process is somewhat analogous to forming an ensemble of LSTMs and CNNs and then training a student CNN to mimic the ensemble, but in blending no explicit ensemble of LSTMs and CNNs need be formed. The blended model is 6.8 times more efficient at testing time than the analogous ensemble. Blending further improves the accuracy of our convolutional model, yielding a computationally efficient model of accuracy higher than any of the other individual models. Examining the effect of “dark knowledge” (Hinton et al., 2015) in this model compression task, we find that only 0.3% of the highest probability labels are needed during the model compression procedure. 
Intriguingly, we also find that model blending is effective even in the self-teaching setting, where the student and the teacher are of the same architecture. We show results of an experiment in which we use a CNN to teach another CNN of the same architecture. Such a student CNN is weaker than a CNN student of the LSTMs but still significantly stronger than a baseline trained only with the hard labels in the data set.
2 Background
In model compression (Bucila et al., 2006), one model (a student) is trained to mimic another model (a teacher). Typically, the student model is small and the teacher is a larger, more powerful model, which has high accuracy but is computationally too expensive to use at test time. For classification, this mimicry can be performed in two ways. One way is to train the student model to match logits (i.e. the values in the output layer of the network, before applying the softmax to compute the output class probabilities) predicted by the teacher on the training data, penalising the difference between logits of the two models with a squared loss. Alternatively, compression can be done by training the student model to match class probabilities predicted by the teacher, by penalising cross-entropy between predictions of the teacher and predictions of the student, i.e. by minimising $-\sum_{k} p^{\mathrm{teacher}}(k \mid x) \log p^{\mathrm{student}}(k \mid x)$ averaged over training examples $x$, where $k$ ranges over the classes. We will refer to predictions made by the teacher as soft labels. In the context of deep neural networks, this approach to model compression is also known as knowledge distillation (Hinton et al., 2015). Additionally, the training loss can also include the loss on the original 0-1 hard labels.
The main advantage of training the student using model compression is that a student trained with knowledge provided by the teacher gets a richer supervision signal than just the hard 0-1 labels in the training data: for each training example, it gets information not only about the correct class but also about uncertainty, i.e., how similar the current training example is to those of other classes. Model compression can be viewed as a way to transfer inductive biases between models. For example, in the case of compressing deep models into shallow ones (Ba & Caruana, 2014; Urban et al., 2016), the student is benefiting from the hierarchical representation learned in the deep model, despite not being able to learn it on its own from hard labels.
While model compression can be applied to arbitrary classifiers producing probabilistic predictions, with the recent success of deep neural networks, work on model compression has focused on compressing large deep neural networks or ensembles thereof into smaller ones, i.e., with fewer layers, fewer hidden units or fewer parameters. Pursuing that direction, Ba & Caruana (2014) showed that an ensemble of deep neural networks with few convolutional layers can be compressed into a single layer network as accurate as a deep one. In a complementary work, Hinton et al. (2015) focused on compressing ensembles of deep networks into deep networks of the same architecture. They also experimented with softening predictions of the teacher by dividing the logits by a constant greater than one called temperature. Using the techniques developed in prior work and adding an extra mimic layer in the middle of the student network, Romero et al. (2014) demonstrated that a moderately deep and wide convolutional network can be compressed into a deeper and narrower convolutional network with far fewer parameters than the teacher network while also increasing accuracy.
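The two mimicry losses described above can be sketched for a single training example as follows. This is a minimal illustration in plain Python, not the training code used in the paper; the function names and the toy logits are assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert a list of logits into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_matching_loss(student_logits, teacher_logits):
    """Squared difference between student and teacher logits for one example."""
    return sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits))

def soft_label_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy between the teacher's soft labels and the student's predictions."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# Toy logits for a three-class problem (made up for illustration).
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
print(logit_matching_loss(student, teacher))
print(soft_label_cross_entropy(student, teacher))
```

In practice either loss would be averaged over a minibatch and minimised with respect to the student's parameters.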
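To illustrate the temperature softening mentioned above, the sketch below divides the logits by a temperature before applying the softmax. It is a plain-Python illustration under assumed toy values; the function name and the temperatures are not taken from the cited work.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]                       # toy teacher logits
sharp = softmax_with_temperature(logits, 1.0)  # ordinary softmax
soft = softmax_with_temperature(logits, 4.0)   # softened predictions
print(sharp)
print(soft)
```

With a higher temperature the most likely class receives less probability mass and the remaining classes receive more, which exposes more of the teacher's relative preferences among the unlikely classes.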
2.1 Bidirectional LSTM
One example of a very powerful neural network architecture yielding state-of-the-art performance on a range of tasks, yet expensive to run at test time, is the long short-term memory network (LSTM) (Hochreiter & Schmidhuber, 1997; Graves & Schmidhuber, 2005; Graves et al., 2013), which is a type of recurrent neural network (RNN). The focus of this work is to use this model as a teacher for model compression.
LSTMs exhibit superior performance not only in speech, but also in handwriting recognition and generation (Graves & Schmidhuber, 2009; Graves, 2014), machine translation (Sutskever et al., 2014) and parsing (Vinyals et al., 2015), thanks to their ability to learn longer-range interactions. For acoustic modeling though, the difference between a non-recurrent network and an LSTM using full-length sequences is twofold: the use of longer context while deciding on the current frame label, and the type of processing in each cell (the LSTM cell compared to a sigmoid or ReLU). The LSTM network used in our paper uses a fixed-size input sequence and only predicts the output for the middle item of the input sequence. In this respect, we follow the design proposed in the speech literature by Mohamed et al. (2015). We use acoustic models that use the same context window as a non-recurrent network (limited to about 0.5 s) while using the LSTM cells for processing each frame. The LSTM cells process frames in the same bidirectional manner that any other bidirectional LSTM would do, but they are limited by the size of the contextual window. Details of the LSTM used in this work can be found in the supplementary material. This style of modelling has important benefits. One motivation to use such an architecture is that acoustic modeling labels (i.e. target states) are local in nature with an average duration of about 450 ms. Therefore, the amount of information about the class label decays rapidly as we move away from the target. Long-term relations between labels, on the other hand, are handled using a language model (during testing) or a lattice of competing hypotheses in the case of lattice training (Veselý et al., 2013; Kingsbury, 2009). Another motivation to prefer models that utilise limited input windows is faster convergence due to the ability to randomise samples on the frame level rather than on the utterance level.
A practical benefit of using a fixed-length window is that using bidirectional architectures becomes possible in real-time setups when the delay in response cannot be long.
3 Vision-style CNNs for speech recognition
Convolutional neural networks (LeCun et al., 1998) were considered for speech for many years (LeCun & Bengio, 1998; Lee et al., 2009), but have only recently become very successful (Abdel-Hamid et al., 2012; Sainath et al., 2013; Abdel-Hamid et al., 2014; Sainath et al., 2015). These CNN architectures are quite different from those used in computer vision. They use only two or three convolutional layers with large filters followed by more fully connected layers. They also only use convolution or pooling over one dimension, either time or frequency. Looking at the spectrogram in Figure 2, it is apparent that, as in vision, similar patterns reoccur both across different points in time and across different frequencies. Using convolution or pooling across only one of these dimensions seems suboptimal. One of the reasons for the success of CNNs is their invariance to small translations and scaling. Intuitively, small translations (corresponding to the pitch of voice) or scaling (corresponding to speaking slowly or quickly) should not change the class assigned to a window of speech. We hypothesise that classification of windows of speech with CNNs can be done more effectively with architectures similar to ones used in object recognition.
Looking at this problem through the lens of computer vision, we use a convolutional network architecture inspired by the work of Simonyan & Zisserman (2014). We only use small convolutional filters of size 3×3 and non-overlapping 2×2 pooling regions, and our network also has more layers than networks previously considered for the purpose of speech recognition. The same architecture is shared between both baseline and student networks (described in detail in the network configuration figure, contrasted to a widely applied architecture proposed by Sainath et al. (2013)).
4 Combining bidirectional LSTMs with vision-style CNNs
Both LSTMs and CNNs are powerful models, but the mechanisms that guide their learning are quite different. That creates an opportunity to combine their predictions, implicitly averaging their inductive biases. A classic way to perform this is ensembling, that is, to mix posterior predictions of the two models in the following manner:
$$p(k \mid x) = \gamma \, p^{\mathrm{LSTM}}(k \mid x) + (1 - \gamma) \, p^{\mathrm{CNN}}(k \mid x),$$
where $\gamma \in [0, 1]$. The notation $p^{\mathrm{LSTM}}(k \mid x)$ and $p^{\mathrm{CNN}}(k \mid x)$ denotes probabilities of class $k$ given a feature vector $x$, respectively for the LSTM and the baseline CNN. It is interesting to combine these two types of models because they seem to meet the conditions of Dietterich (2000): “a necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse”. Although ensembling is known to be very successful, it comes at the cost of executing all models at test time.
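The posterior mixing used for ensembling can be sketched as a convex combination of the two models' class probabilities. The function name, the name `gamma` for the mixing weight, and the toy posteriors are assumptions made for this illustration.

```python
def ensemble_posteriors(p_lstm, p_cnn, gamma=0.5):
    """Mix two posterior distributions; gamma is the weight given to the LSTM."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(p_lstm, p_cnn)]

# Toy posteriors over three classes (made up for illustration).
p_lstm = [0.6, 0.3, 0.1]
p_cnn = [0.2, 0.7, 0.1]
mixed = ensemble_posteriors(p_lstm, p_cnn, gamma=0.5)
predicted_class = max(range(len(mixed)), key=lambda k: mixed[k])
print(mixed, predicted_class)
```

Because both inputs sum to one and the weights sum to one, the mixture is again a valid distribution. Note that both models must be evaluated for every test example, which is the cost the text refers to.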
We propose an alternative which is to use model compression. Because the capacity of the two models is similar we call it “model blending” rather than “model compression”. To combine the inductive biases of both the LSTM and the CNN, we can use a training objective that combines the loss function on the hard labels from the training data with a loss function which penalises deviation from predictions of the LSTM teacher. That is, we optimise
$$\mathcal{L}(\lambda) = -\sum_{i} \Big[ \lambda \sum_{k} p^{\mathrm{teacher}}(k \mid x_i) \log p^{\mathrm{student}}(k \mid x_i) + (1 - \lambda) \log p^{\mathrm{student}}(y_i \mid x_i) \Big] \tag{1}$$
where $p^{\mathrm{teacher}}(k \mid x_i)$ is the probability of class $k$ for training example $x_i$ estimated by the teacher, $p^{\mathrm{student}}(k \mid x_i)$ is the probability of class $k$ assigned to training example $x_i$ by the student and $y_i$ is the correct class for $x_i$. The coefficient $\lambda \in [0, 1]$ controls the weight of the errors on soft and hard labels in the objective. When $\lambda = 0$ the network is only learning using the hard labels, ignoring the teacher, while $\lambda = 1$ means that the network is only learning from the soft labels provided by the teacher, ignoring the hard labels. When $0 < \lambda < 1$, optimising the objective in Equation 1 yields a form of hybrid model which is learning using the guidance of the teacher, although not depending on it alone. With a symmetric objective, we could train an LSTM using the guidance of the CNN. Instead, we blend into the CNN for efficiency at test time.
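The blending objective in Equation 1 can be sketched for a single training example as a weighted sum of a soft-label term and a hard-label term. This is an illustrative plain-Python version, not the authors' implementation; the name `lam` for the weighting coefficient and the toy probabilities are assumptions.

```python
import math

def blending_loss(student_probs, teacher_probs, hard_label, lam):
    """Per-example loss: lam weights the soft (teacher) term, 1 - lam the hard-label term."""
    # Cross-entropy between the teacher's soft labels and the student's predictions.
    soft_term = -sum(pt * math.log(ps)
                     for pt, ps in zip(teacher_probs, student_probs) if pt > 0.0)
    # Standard negative log-likelihood of the correct class.
    hard_term = -math.log(student_probs[hard_label])
    return lam * soft_term + (1.0 - lam) * hard_term

# Toy values for a three-class problem (made up for illustration).
student_probs = [0.5, 0.3, 0.2]
teacher_probs = [0.7, 0.2, 0.1]
for lam in (0.0, 0.5, 1.0):
    print(lam, blending_loss(student_probs, teacher_probs, hard_label=0, lam=lam))
```

Setting `lam` to 0 recovers ordinary training on hard labels, and setting it to 1 recovers pure model compression on soft labels, matching the two limiting cases described in the text.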
We motivate the choice of working directly with probabilities instead of logits (cf. section 2) in two ways. First, it is more direct to interpret retaining a subset of predictions of a network when considering probabilities (cf. Equation 2). We can simply look at the fraction of probability mass a subset of outputs covers. This will be necessary in our work (see section 5). Secondly, when using both soft and hard targets, it is easier to find an appropriate $\lambda$ and learning rate when the two objectives we are weighting together are of similar magnitudes and optimisation landscapes.
5 Experiments
In our experiments we use the Switchboard data set (Godfrey et al., 1992), which consists of 309 hours of transcribed speech. We used 308 hours as training set and kept one hour as a validation set. The data set is segmented into 248k utterances, i.e. continuous pieces of speech beginning and ending with a clear pause. Each utterance consists of a number of frames, i.e. 25 ms intervals of speech, with a constant shift of 10 ms. For every frame, the extracted features are 31-channel Mel-filterbank parameters passed through a 10th root nonlinearity. Features for one utterance are visualised in Figure 2. To form our training and validation sets we extract windows of 41 frames, that is, the frame whose label we want to predict, 20 frames before it and 20 frames after it. As shown in Figure 1, the distribution of the lengths of utterances is highly non-uniform; therefore, to keep the sampling unbiased, we sample training examples by first sampling an utterance proportionally to its length and then sampling a window within that utterance uniformly. To form the validation set we simply extract all possible windows. In both cases, we pad each utterance with zeros at the beginning and at the end so that every frame in each utterance can be drawn as a middle frame. Every frame in the training and validation set has a label. The 9000 output classes represent tied triphone states that are generated by an HMM/GMM system (Young et al., 1994). Forced alignment is used to map each input frame to one output state using a baseline DNN model. The distribution of classes in the training data is visualised in Figure 1. We call the frame classification error on the validation set frame error rate (FER).
The test set is a part of the standard Switchboard benchmark (Hub5’00 SW). It was sampled from the same distribution as the training set and consists of 1831 utterances. There are no frame-level labels in the test set and the final evaluation is based on the ability to predict words in the test utterances. To obtain the words predicted by the model, frame label posteriors generated from the neural network are first divided by their prior probabilities, then passed to a finite state transducer-based decoder to be combined with 3-gram language model probabilities to generate the most probable word sequences. Hypothesized word sequences are aligned to the human reference transcription to count the number of word insertions ($I$), deletions ($D$), and substitutions ($S$). Word error rate (WER) is defined as $(I + D + S)/N$, where $N$ is the total number of words in the reference transcription; it is reported here as a percentage.
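The window extraction and sampling procedure described above can be sketched as follows. This is a plain-Python illustration under assumed data structures (an utterance is a list of frames, each frame a list of feature values); the function names are hypothetical and the toy utterances are made up.

```python
import random

CONTEXT = 20  # frames on each side of the middle frame, giving 41-frame windows

def extract_window(utterance, center, context=CONTEXT):
    """Return the frames around `center`, zero-padding beyond the utterance edges."""
    zero_frame = [0.0] * len(utterance[0])
    window = []
    for t in range(center - context, center + context + 1):
        window.append(utterance[t] if 0 <= t < len(utterance) else zero_frame)
    return window

def sample_training_window(utterances, rng):
    """Pick an utterance proportionally to its length, then a middle frame uniformly."""
    lengths = [len(u) for u in utterances]
    utterance = rng.choices(utterances, weights=lengths, k=1)[0]
    center = rng.randrange(len(utterance))
    return extract_window(utterance, center), center

# Two toy utterances with 2 features per frame (real frames would have 31 channels).
long_utt = [[float(t), float(t) + 0.5] for t in range(30)]
short_utt = long_utt[:5]
window, center = sample_training_window([long_utt, short_utt], random.Random(0))
print(len(window), center)
```

Sampling the utterance proportionally to its length and then the frame uniformly makes every frame in the data set equally likely to be chosen as a middle frame, which is what keeps the sampling unbiased.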
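As a worked illustration of the WER definition, the sketch below aligns a hypothesis to a reference with the standard word-level edit distance, whose minimum cost equals the combined number of insertions, deletions and substitutions. The function name and the example sentences are assumptions; the decoder and scoring tools used in the paper are not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: minimum (insertions + deletions + substitutions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")
    # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                  # deletion
                          d[i][j - 1] + 1,                  # insertion
                          d[i - 1][j - 1] + substitution)   # match or substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```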
Since the teacher model is very slow at prediction time, it is impractical to run a large number of experiments if the teacher must repeatedly be executed to train each student. Unfortunately, running the teacher once and saving its predictions to disk is also problematic: because the output space is large (9000 classes), storing the soft labels for all classes would require a large amount of space (approximately 3.6 TB). Moreover, to sample each minibatch in an unbiased manner, we would need constant random access to disk, which again would make training very slow. To deal with that problem we save predictions only for the small subset of classes with the highest predicted probabilities. To determine whether this is a viable solution, we checked what percentage of the total probability mass, averaged over the examples in the training set, is covered by the $c$ most likely classes according to the teacher model. We denote that set $C_c(x)$. That is, we compute
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{k \in C_c(x_i)} p(k \mid x_i) \tag{2}$$
where $p(k \mid x_i)$ denotes the posterior probability of class $k$ given a feature vector $x_i$ and $N$ is the number of training examples. This relationship for one of our LSTM models is shown in Figure 1. We found that, with very few exceptions, posteriors over classes are concentrated on very few values. Therefore, we decided to continue our experiments retaining, for each training example, at most the 90 most likely classes, cutting off once 99% of the probability mass is covered. This allows us to store soft labels for the entire data set in RAM, making unbiased sampling of training data efficient.
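The truncation of soft labels and the coverage statistic of Equation 2 can be sketched as below, together with a back-of-the-envelope estimate of the dense storage cost. This is an illustrative plain-Python version; the function names are hypothetical, and the storage estimate assumes 32-bit floats and 100 frames per second (from the 10 ms frame shift), neither of which is stated explicitly in the text.

```python
def truncate_soft_labels(probs, max_classes=90, mass_cutoff=0.99):
    """Keep the most likely classes until mass_cutoff is covered or max_classes is reached."""
    ranked = sorted(enumerate(probs), key=lambda kp: kp[1], reverse=True)
    kept, covered = [], 0.0
    for k, p in ranked[:max_classes]:
        kept.append((k, p))
        covered += p
        if covered >= mass_cutoff:
            break
    return kept  # list of (class index, probability) pairs

def average_mass_covered(dataset_probs, max_classes):
    """Mean over examples of the probability mass held by the top max_classes classes."""
    total = 0.0
    for probs in dataset_probs:
        total += sum(sorted(probs, reverse=True)[:max_classes])
    return total / len(dataset_probs)

def dense_soft_label_bytes(hours, n_classes=9000, frames_per_second=100, bytes_per_value=4):
    """Storage needed to keep a full posterior vector for every frame (assumed float32)."""
    return hours * 3600 * frames_per_second * n_classes * bytes_per_value

probs = [0.7, 0.2, 0.05, 0.03, 0.02]  # toy posterior over five classes
print(truncate_soft_labels(probs, max_classes=2))
print(average_mass_covered([probs, [0.5, 0.5, 0.0, 0.0, 0.0]], max_classes=1))
print(dense_soft_label_bytes(308) / 2**40)  # size in binary terabytes
```

Under these assumptions the dense estimate for 308 hours of training data comes to roughly 3.6 binary terabytes, in line with the figure quoted in the text.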
5.1 Baseline networks
We used Lasagne, which is based on Theano (Bergstra et al., 2010), for implementing CNNs. We used the architecture described in the network configuration figure. For training of the CNN baseline we used minibatches of size 256. Each epoch consisted of 2000 minibatches. Hyperparameters of the baseline networks were: initial learning rate (), momentum coefficient (0.9, we used Nesterov’s momentum) and a learning rate decay coefficient (0.7). Because the data set we used is very large (309 hours, 18 GB), the only form of regularisation we used was early stopping in the following form. After every epoch we measured the loss on the validation set. If the loss did not improve for five epochs, we multiplied the learning rate by the learning rate decay coefficient. We stopped the training after the learning rate was smaller than . It took about 200 epochs to finish. We repeated training with three different random seeds. For comparison, we also trained a CNN very similar to the one proposed by Sainath et al. (2013), adjusted to match the number of parameters in our network. We used the same training procedure and hyperparameters.
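The early-stopping schedule described above can be sketched as a small helper that is updated once per epoch. This is not the Lasagne code used for the experiments: the class name is hypothetical, the initial learning rate and stopping threshold are placeholders, and resetting the patience counter after each decay is an assumption.

```python
class PlateauDecaySchedule:
    """Decay the learning rate when validation loss stalls; stop once it gets too small."""

    def __init__(self, initial_lr, decay=0.7, patience=5, min_lr=1e-6):
        self.lr = initial_lr      # placeholder value, chosen by the caller
        self.decay = decay        # learning rate decay coefficient (0.7 in the text)
        self.patience = patience  # epochs without improvement before decaying (five in the text)
        self.min_lr = min_lr      # placeholder stopping threshold
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, val_loss):
        """Record one epoch's validation loss; return False when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.patience:
                self.lr *= self.decay
                self.epochs_without_improvement = 0  # assumption: counter restarts after a decay
        return self.lr >= self.min_lr

# Toy run: the validation loss improves once and then stays flat.
schedule = PlateauDecaySchedule(initial_lr=0.1, min_lr=0.05)
epoch = 0
while schedule.update(1.0):
    epoch += 1
print(epoch, schedule.lr)
```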
The bidirectional LSTM teacher networks we used are very similar to the one in the work of Mohamed et al. (2015). We trained three models, all with four hidden layers, two with 512 hidden units for each direction and one with 800 hidden units for each direction. The training starts with the learning rate equal to 0.05 for one of the smaller models and 0.08 for the other ones. We used the standard momentum with the coefficient of 0.9. After three epochs of no improvement of frame error rate on the validation set, the learning rate was multiplied by and training process was rolled back to the last epoch which improved validation error. Training stopped when learning rate was smaller than . It took about 75 epochs (2000 minibatches of 256 samples) to finish the training.
The results for these models are shown in Table 1. Our vision-style CNN achieved 14.1 WER (averaged over three random seeds), the larger LSTM achieved 14.4 WER, and a CNN of an architecture proposed by Sainath et al. (2013) achieved 15.5 WER. Interestingly, although our LSTM teachers outperform our vision-style CNN trained with hard labels in terms of FER, the results in WER, which is the metric of primary interest, are the opposite. This discrepancy between FER and WER has been observed in the speech community before, for example by Sak et al. (2014) for LSTMs and DNNs. FER and WER are not always perfectly correlated because FER is conditioned on a model that generated the frame alignment in the first place (which might not be correct for all the cases). FER also penalizes misclassifications of boundary frames which might not be of importance as long as the correct target state is recognized. On the other hand, WER is calculated taking into account information about neighbouring frames (i.e. smoothness) as well as external knowledge (e.g. a language model) which corrects many of the misclassifications made locally.
model                             FER      WER     model size   execution time (relative to vision-style CNN)
Sainath et al. (2013)-style CNN   37.93%   15.5    75M          0.75
vision-style CNN                  35.51%   14.1    75M          1.0
smaller LSTM                      34.27%   14.8    30M          3.3
bigger LSTM                       34.15%   14.4    65M          5.8
LSTM + CNN ensemble               32.4%    13.4    130M         6.8
LSTM → CNN blending               34.11%   13.83   75M          1.0
5.2 Ensembles of networks
The first approach we use to combine the two types of models is to create ensembles. The results in Figure 4 indicate that for the problem we consider it is beneficial to combine neural networks from different families, which have different inductive biases. Even though CNNs are much weaker in FER, combining an LSTM with a vision-style CNN achieves the same FER as an ensemble of two LSTMs (both of which are more accurate than the CNN), and actually yields better WER than ensembles of two CNNs or two (superior) LSTMs. Interestingly, ensembling two CNNs yields almost no benefit in WER. To complete the picture we tried ensembles with more than two models. An ensemble of three LSTMs achieved 31.98% FER and 13.7 WER, and an ensemble of three CNNs achieved 33.31% FER and 13.9 WER, thus, in both cases, yielding very little gain over ensembles of two models of these types. On the other hand, adding a CNN to the ensemble of two LSTMs yielded 31.57% FER and 13.2 WER. The benefits of adding more models to the ensemble appear to be negligible if the ensemble already contains at least one model of each type. Although the ensembles we trained are very effective in terms of WER, this comes at the cost of a large increase in computation at test time compared to the baseline CNN. Because of the cost of the LSTMs, our best two-model ensemble is about 7 times slower than our vision-style CNN (cf. Table 1).
We also compared the errors made by CNNs and LSTMs to see if the models are qualitatively different. We observe that a CNN tends to make similar errors as other CNNs, an LSTM tends to make similar errors as other LSTMs, but CNNs and LSTMs tend to make errors that are less similar to each other than the CNN-CNN and LSTM-LSTM comparisons. Overall, we conclude that the inductive biases of LSTM and CNN are complementary.
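One simple way to quantify how similar the errors of two models are is the overlap between the sets of frames they misclassify. The sketch below uses the Jaccard overlap for this purpose; this particular measure is an assumption chosen for illustration and is not necessarily the comparison used in the paper, and the predictions are toy values.

```python
def error_overlap(pred_a, pred_b, labels):
    """Jaccard overlap between the sets of examples misclassified by two models."""
    errors_a = {i for i, (p, y) in enumerate(zip(pred_a, labels)) if p != y}
    errors_b = {i for i, (p, y) in enumerate(zip(pred_b, labels)) if p != y}
    union = errors_a | errors_b
    if not union:
        return 0.0  # convention for this sketch: no errors means nothing to compare
    return len(errors_a & errors_b) / len(union)

# Toy frame labels and predictions (made up for illustration).
labels = [0, 1, 2, 3, 4, 5]
cnn_1 = [0, 1, 0, 0, 4, 5]  # errors on frames 2 and 3
cnn_2 = [0, 1, 0, 3, 4, 0]  # errors on frames 2 and 5
lstm = [1, 1, 2, 3, 4, 0]   # errors on frames 0 and 5
print(error_overlap(cnn_1, cnn_2, labels))
print(error_overlap(cnn_1, lstm, labels))
```

A lower overlap between models from different families than between models of the same family would be one indication that their errors, and hence their inductive biases, are complementary.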
5.3 Networks trained with model blending via model compression
maximum number of classes retained             1        3        10       30       90
average fraction of probability mass covered   73.20%   91.19%   96.68%   98.38%   99.03%
average number of classes retained             1.0      2.89     6.79     13.33    23.96
The next question we tackle is whether it is possible to achieve an effect similar to creating an ensemble without having to execute all models at prediction time. We attack this with model blending via model compression. To do this, we took predictions of the two best performing LSTMs in Table 1 and averaged their predictions to form a teacher model. Such a teacher model achieves 32.4% FER and 13.4 WER. As we mentioned earlier, it is infeasible to use all predictions during training. Therefore we only store a subset of classes predicted by the teacher for each frame. Table 2 shows what fraction of probability mass is covered when storing different maximum numbers of predictions. It is particularly interesting to understand how many predictions of the teacher model are sufficient to achieve good performance, in order to test the dark knowledge hypothesis (Hinton et al., 2015), which states that information about classes predicted with low probability is important to the success of model compression. We use . We also vary the coefficient $\lambda$ of Equation 1, which controls how much the student is learning from the teacher and how much it is learning from hard labels. The architecture and training procedure for the students is the same as for the baseline. For every combination of the maximum number of retained predictions and $\lambda$ we report an average over three random seeds.
The results are shown in Figure 6. The best model achieved lower FER (34.11%) and lower WER (13.83) than any of the individual models. Furthermore, the blended model has fewer parameters and is 6.8 times faster at test time than the ensemble of the LSTM and the CNN. For all numbers of teacher predictions retained the best performance was achieved for . That highlights the importance of blending the knowledge extracted from the teacher model with learning from hard labels within the architecture of the student.
Our experiments show that, at least for this task, it is not critical to use the teacher predictions for all classes. Just the 30 most likely predictions (about 0.3% of all classes!) are enough. Bringing the number up to 90 classes did not improve the student WER performance. However, performance deteriorates dramatically when too few predictions are used, suggesting that some dark knowledge is needed.
Finally, we experimented with using a CNN as a teacher for a CNN of the same architecture, i.e. we took predictions of a baseline vision-style CNN and used its predictions to train another CNN of the same architecture (with and ). Such a student achieves on average (over 3 random trials) 34.61% FER and 14.1 WER, which is, as we expected, worse than a student of the LSTMs since the two models are more similar, but still significantly better than the baseline in terms of FER. These results are consistent with the results for ensembles (cf. Figure 4). Clearly, blending dissimilar models like CNNs and LSTMs is stronger.
6 Related work
A few papers have applied model compression in speech recognition settings. The most similar is by Chan et al. (2015) who compressed LSTMs into small DNNs without convolutional layers. Using the soft labels from an LSTM, they were able to show an improvement in WER over the baseline trained with hard labels. The main difference between this work and ours is that their students are non-convolutional and tiny. While this allows for a decent improvement over the baseline, since the student network is much smaller, its performance is still much weaker than the performance of a single model of the same type as the teacher. Hence, this work addresses a different question than we do, i.e. whether a network without recurrent structure can perform as well as or better than an LSTM when using soft labels provided by the LSTM. Model compression was also successfully applied to speech recognition by Li et al. (2014) who used DNNs without convolutional layers both as a teacher and a student. The architecture of the two networks was the same except that the student had fewer hidden units in each layer. Finally, work in the opposite direction was done by Wang et al. (2015) and Tang et al. (2015). They demonstrated that when using a small data set for which an LSTM is overfitting, a deep non-convolutional network can provide useful guidance for the LSTM. It can come either in the form of pretraining the LSTM with soft labels from a DNN or training the LSTM optimising a loss mixing hard labels with soft labels from a DNN. We are not aware of previous work on model compression in the setting in which the student and the teacher are of similar capacity.
7 Discussion
The main contribution of this paper is introducing the use of model compression in an unexplored setting where both the teacher and student architectures are powerful ones, yet with different inductive biases. Thus, rather than calling it model compression we use the term “model blending”. We showed that the LSTM and the CNN learn different kinds of knowledge from the data which can be leveraged through simple ensembling or model blending via model compression. We provided experimental evidence that CNNs of appropriate vision-style architecture have the necessary capacity to learn accurate predictors on large speech data sets and gave a simple, practical recipe for improving the performance of CNN-based speech recognition models even further at no cost during test time. We hypothesise that the very recent advances in training even deeper convolutional networks for computer vision (Srivastava et al., 2015; He et al., 2015) will yield improved performance in speech recognition and would further improve our results. Finally, by using a CNN to teach a CNN, we have shown a very easy way of improving a neural network without training networks of more than one architecture or even forming ensembles.
Acknowledgments
We thank Stanisław Jastrzębski for suggesting the name of the paper. We also thank Steve Renals and Paweł Świętojański for insightful comments.
References
 Abdel-Hamid et al. (2012) Ossama Abdel-Hamid, Abdelrahman Mohamed, Hui Jiang, and Gerald Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP, 2012.
 Abdel-Hamid et al. (2014) Ossama Abdel-Hamid, Abdelrahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. TASLP, 22, 2014.
 Ba & Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014.
 Bergstra et al. (2010) James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In SciPy, 2010.
 Bucila et al. (2006) Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.
 Chan et al. (2015) William Chan, Nan Rosemary Ke, and Ian Lane. Transferring knowledge from a RNN to a DNN. arXiv:1504.01483, 2015.
 Dauphin & Bengio (2013) Yann Dauphin and Yoshua Bengio. Big neural networks waste capacity. arXiv:1301.3583, 2013.
 Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, and Nando de Freitas. Predicting parameters in deep learning. In NIPS, 2013.
 Dietterich (2000) Thomas G. Dietterich. Ensemble methods in machine learning. In MCS, 2000.
 Gers et al. (2003) Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. JMLR, 3, 2003.
 Godfrey et al. (1992) John J. Godfrey, Edward C. Holliman, and Jane McDaniel. Switchboard: telephone speech corpus for research and development. In ICASSP, 1992.
 Graves (2012) Alex Graves. Supervised sequence labelling with recurrent neural networks. 2012.
 Graves (2014) Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2014.
 Graves & Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6), 2005.
 Graves & Schmidhuber (2009) Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition with multidimensional RNNs. In NIPS, 2009.
 Graves et al. (2013) Alex Graves, Abdelrahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
 Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In ICLR, 2016.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
 Kingsbury (2009) Brian Kingsbury. Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In ICASSP, 2009.
 Le Cun et al. (1990) Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NIPS, 1990.
 LeCun & Bengio (1998) Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks. 1998.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
 Lee et al. (2009) Honglak Lee, Peter Pham, Yan Largman, and Andrew Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, 2009.
 Li et al. (2014) Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. Learning small-size DNN with output-distribution-based criteria. In INTERSPEECH, 2014.
 Mohamed et al. (2015) Abdelrahman Mohamed, Frank Seide, Dong Yu, Jasha Droppo, Andreas Stolcke, Geoffrey Zweig, and Gerald Penn. Deep bidirectional recurrent networks over spectral windows. In ASRU, 2015.
 Romero et al. (2014) Adriana Romero, Nicolas Ballas, Samira Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2014.
 Sainath et al. (2013) Tara Sainath, Abdelrahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep convolutional neural networks for LVCSR. In ICASSP, 2013.
 Sainath et al. (2015) Tara Sainath, Brian Kingsbury, George Saon, Hagen Soltau, Abdelrahman Mohamed, George Dahl, and Bhuvana Ramabhadran. Deep convolutional neural networks for large-scale speech tasks. Neural Networks, 64, 2015.
 Sak et al. (2014) Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv:1402.1128, 2014.
 Schuster & Paliwal (1997) Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. TSP, 45(11), 1997.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
 Srivastava et al. (2015) Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In NIPS, 2015.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
 Tang et al. (2015) Zhiyuan Tang, Dong Wang, Yiqiao Pan, and Zhiyong Zhang. Knowledge transfer pretraining. arXiv:1506.02256, 2015.
 Urban et al. (2016) Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Özlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, and Matthew Richardson. Do deep convolutional nets really need to be deep (or even convolutional)? In ICLR (workshop track), 2016.
 Veselý et al. (2013) Karel Veselý, Arnab Ghoshal, Lukáš Burget, and Daniel Povey. Sequence-discriminative training of deep neural networks. In INTERSPEECH, 2013.
 Vinyals et al. (2015) Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In NIPS, 2015.
 Wang et al. (2015) Dong Wang, Chao Liu, Zhiyuan Tang, Zhiyong Zhang, and Mengyuan Zhao. Recurrent neural network training with dark knowledge transfer. arXiv:1505.04630, 2015.
 Young et al. (1994) S. J. Young, J. J. Odell, and P. C. Woodland. Tree-based state tying for high accuracy acoustic modelling. In HLT, 1994.
Supplementary material
Details of the LSTM
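Given a sequence of input vectors $x = (x_1, \ldots, x_T)$, an RNN computes the hidden vector sequence $h = (h_1, \ldots, h_T)$ by iterating the following from $t = 1$ to $T$:
$$h_t = \mathcal{H}\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right)$$
The $W$ terms denote weight matrices (e.g. $W_{xh}$ is the input-hidden weight matrix), the $b$ terms denote bias vectors (e.g. $b_h$ is the hidden bias vector) and $\mathcal{H}$ is the hidden layer function.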
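While there are multiple possible choices for $\mathcal{H}$, prior work (Graves, 2012; Graves et al., 2013; Sak et al., 2014) has shown that the LSTM architecture, which uses purpose-built memory cells to store information, is better at finding and exploiting longer context. The left panel of Figure 7 illustrates a single LSTM memory cell. For the version of the LSTM cell used in this paper (Gers et al., 2003), $\mathcal{H}$ is implemented by the following composite function:
$$\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
where $\sigma$ is the logistic sigmoid function, $\odot$ denotes the element-wise product, and $i$, $f$, $o$ and $c$ are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector $h$. The weight matrices from the cell to gate vectors (e.g. $W_{ci}$) are diagonal, so element $m$ in each gate vector only receives input from element $m$ of the cell vector.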
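To make the role of the diagonal cell-to-gate ("peephole") weights concrete, here is a minimal NumPy sketch of one step of such an LSTM cell. It assumes the standard peephole formulation associated with Gers et al. (2003); the parameter names, the dictionary layout and the initialisation scale are hypothetical and are not taken from the authors' implementation. Because the cell-to-gate matrices are diagonal, they are stored as vectors and applied by element-wise multiplication.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a peephole LSTM cell; `p` maps (assumed) names to arrays."""
    # Input and forget gates look at the previous cell state c_{t-1}.
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    # New cell state: gated old state plus gated candidate.
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    # Output gate looks at the updated cell state c_t.
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(n_in, n_hid, rng):
    """Small random parameters, for illustration only."""
    p = {}
    for g in "ifco":
        p[f"W_x{g}"] = 0.1 * rng.standard_normal((n_hid, n_in))
        p[f"W_h{g}"] = 0.1 * rng.standard_normal((n_hid, n_hid))
        p[f"b_{g}"] = np.zeros(n_hid)
    for g in "ifo":
        # Diagonal cell-to-gate matrices, stored as vectors.
        p[f"w_c{g}"] = 0.1 * rng.standard_normal(n_hid)
    return p

# Run the cell over a short random sequence of 5 frames with 3 features each.
rng = np.random.default_rng(0)
p = init_params(n_in=3, n_hid=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.standard_normal((5, 3)):
    h, c = lstm_step(x_t, h, c, p)
```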

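One shortcoming of conventional RNNs is that they are only able to make use of previous context. Bidirectional RNNs (BRNNs) (Schuster & Paliwal, 1997) exploit past and future context by processing the data in both directions with two separate hidden layers, which are then fed forwards to the same output layer. A BRNN computes the forward hidden sequence $\overrightarrow{h}$ and the backward hidden sequence $\overleftarrow{h}$ by iterating the forward layer from $t = 1$ to $T$ and the backward layer from $t = T$ to $1$:
$$\begin{aligned}
\overrightarrow{h}_t &= \mathcal{H}\left(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right) \\
\overleftarrow{h}_t &= \mathcal{H}\left(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\right)
\end{aligned}$$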
Combining BRNNs with LSTMs gives the bidirectional LSTM, which can access the context in both directions.
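The two directions of a bidirectional layer can be sketched as follows. For brevity this sketch uses a plain tanh recurrence as the hidden layer function rather than an LSTM cell, which is a simplifying assumption; the function names and parameter tuples are hypothetical. The only difference between the two passes is the order in which the frames are visited.

```python
import numpy as np

def rnn_pass(xs, W_xh, W_hh, b_h, reverse=False):
    """Run a tanh RNN over xs of shape (T, n_in); returns states of shape (T, n_hid)."""
    T = xs.shape[0]
    n_hid = W_hh.shape[0]
    h = np.zeros(n_hid)
    hs = np.zeros((T, n_hid))
    # The backward layer visits frames from the last to the first.
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = np.tanh(W_xh @ xs[t] + W_hh @ h + b_h)
        hs[t] = h
    return hs

def brnn(xs, fwd_params, bwd_params):
    """Forward and backward hidden sequences of a bidirectional layer."""
    h_fwd = rnn_pass(xs, *fwd_params)
    h_bwd = rnn_pass(xs, *bwd_params, reverse=True)
    return h_fwd, h_bwd

def make_params(n_in, n_hid, rng):
    """Small random (W_xh, W_hh, b_h), for illustration only."""
    return (0.1 * rng.standard_normal((n_hid, n_in)),
            0.1 * rng.standard_normal((n_hid, n_hid)),
            np.zeros(n_hid))

rng = np.random.default_rng(0)
xs = rng.standard_normal((6, 3))
fwd_params = make_params(3, 4, rng)
bwd_params = make_params(3, 4, rng)
h_fwd, h_bwd = brnn(xs, fwd_params, bwd_params)
```

At frame `t`, `h_fwd[t]` summarises frames up to and including `t`, while `h_bwd[t]` summarises frame `t` and everything after it.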
Finally, deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence for the next. Assuming the same hidden layer function is used for all $N$ layers in the stack, the hidden vector sequences $h^n$ are iteratively computed from $n = 1$ to $N$ and $t = 1$ to $T$:
$$h^n_t = \mathcal{H}\left(W_{h^{n-1}h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h\right)$$
where we define $h^0 = x$.
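Stacking can be sketched in a few lines: each layer consumes the full output sequence of the layer below, and the input sequence plays the role of layer zero. As a simplifying assumption, the sketch below uses a tanh recurrence for every layer instead of an LSTM cell; the layer sizes and names are hypothetical.

```python
import numpy as np

def rnn_layer(inputs, W_in, W_rec, b):
    """One recurrent layer: maps a (T, n_in) sequence to a (T, n_hid) sequence."""
    h = np.zeros(W_rec.shape[0])
    out = []
    for v in inputs:
        h = np.tanh(W_in @ v + W_rec @ h + b)
        out.append(h)
    return np.stack(out)

def deep_rnn(xs, layers):
    """Apply the layers bottom-up; the input sequence acts as layer 0."""
    h = xs
    for W_in, W_rec, b in layers:
        h = rnn_layer(h, W_in, W_rec, b)
    return h

rng = np.random.default_rng(0)
sizes = [3, 5, 4]  # input size, then hidden sizes of layers 1 and 2 (illustrative)
layers = [
    (0.1 * rng.standard_normal((n_out, n_in)),
     0.1 * rng.standard_normal((n_out, n_out)),
     np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
xs = rng.standard_normal((6, 3))
top = deep_rnn(xs, layers)  # hidden sequence of the top layer
```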
Deep bidirectional RNNs can be implemented by replacing each hidden sequence $h^n$ with the forward and backward sequences $\overrightarrow{h}^n$ and $\overleftarrow{h}^n$, and ensuring that every hidden layer receives input from both the forward and backward layers at the level below. If bidirectional LSTMs are used for the hidden layers we get deep bidirectional LSTMs, the architecture we use as a teacher network in this paper.
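In this work, following Mohamed et al. (2015), we only predict the label of the middle frame, hence the network output $y$ is computed as
$$y = W_{\overrightarrow{h}^N y} \overrightarrow{h}^N_{\lceil T/2 \rceil} + W_{\overleftarrow{h}^N y} \overleftarrow{h}^N_{\lceil T/2 \rceil} + b_y.$$
This also implies that in the last hidden layer the forward sequence runs only from $1$ to $\lceil T/2 \rceil$ and the backward sequence runs only from $T$ to $\lceil T/2 \rceil$. This is illustrated in the right panel of Figure 7.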
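The consequence for the last hidden layer can be sketched as below: since only the middle frame is classified, the forward recurrence can stop at the middle frame and the backward recurrence can stop there too, coming from the other end. The sketch assumes that the middle frame is the one at 1-based index ⌈T/2⌉, uses a tanh recurrence in place of the LSTM cell, and returns pre-softmax scores; all of these, along with the function and parameter names, are assumptions made for illustration.

```python
import numpy as np

def step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a tanh recurrence (stand-in for the LSTM cell)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def middle_frame_output(inputs, fwd, bwd, W_fy, W_by, b_y):
    """Output scores for the middle frame of a (T, n_in) input sequence."""
    T = inputs.shape[0]
    mid = (T - 1) // 2  # 0-based index of the middle frame (assumed convention)
    n_hid = fwd[1].shape[0]
    # Forward direction: first frame up to and including the middle frame.
    h_f = np.zeros(n_hid)
    for t in range(0, mid + 1):
        h_f = step(inputs[t], h_f, *fwd)
    # Backward direction: last frame down to and including the middle frame.
    h_b = np.zeros(n_hid)
    for t in range(T - 1, mid - 1, -1):
        h_b = step(inputs[t], h_b, *bwd)
    return W_fy @ h_f + W_by @ h_b + b_y

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 5, 7
fwd = (0.1 * rng.standard_normal((n_hid, n_in)),
       0.1 * rng.standard_normal((n_hid, n_hid)),
       np.zeros(n_hid))
bwd = (0.1 * rng.standard_normal((n_hid, n_in)),
       0.1 * rng.standard_normal((n_hid, n_hid)),
       np.zeros(n_hid))
W_fy = 0.1 * rng.standard_normal((n_out, n_hid))
W_by = 0.1 * rng.standard_normal((n_out, n_hid))
b_y = np.zeros(n_out)
xs = rng.standard_normal((T, n_in))
y = middle_frame_output(xs, fwd, bwd, W_fy, W_by, b_y)
```

Each direction processes only about half of the window in the top layer, which is the saving that the truncated sequences provide.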