Recurrent Neural Networks With Limited Numerical Precision
Abstract
Recurrent Neural Networks (RNNs) produce state-of-the-art performance on many machine learning tasks, but their demand on resources in terms of memory and computational power is often high. Therefore, there is great interest in optimizing the computations performed with these models, especially when considering the development of specialized low-power hardware for deep networks. One way of reducing the computational needs is to limit the numerical precision of the network weights and biases. This has led to different proposed rounding methods which have so far been applied only to Convolutional Neural Networks and Fully-Connected Networks. This paper addresses the question of how to best reduce weight precision during training in the case of RNNs. We present results from the use of different stochastic and deterministic reduced-precision training methods applied to three major RNN types, which are then tested on several datasets. The results show that the weight binarization methods do not work with RNNs. However, the stochastic and deterministic ternarization and pow2-ternarization methods gave rise to low-precision RNNs that produce similar and even higher accuracy on certain datasets, therefore providing a path towards training more efficient implementations of RNNs in specialized hardware.
1 Introduction
A Recurrent Neural Network (RNN) is a specific type of neural network which is able to process input and output sequences of variable length. Because of this nature, RNNs are suitable for sequence modeling. Various RNN architectures have been proposed in recent years, based on different forms of nonlinearity, such as the Gated Recurrent Unit (GRU) (Cho et al., 2014) and Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). They have enabled new levels of performance in many tasks such as speech recognition (Amodei et al., 2015; Chan et al., 2015), machine translation (Devlin et al., 2014; Chung et al., 2016; Sutskever et al., 2014), or even video games (Mnih et al., 2015) and Go (Silver et al., 2016).
Compared to standard feedforward networks, RNNs often take longer to train and are more demanding in memory and computational power. For example, it can take weeks to train models for state-of-the-art machine translation and speech recognition. Thus it is of vital importance to accelerate computation and reduce the training time of such networks. On the other hand, even at runtime, these models require too many computational resources if we want to deploy them on low-power embedded hardware devices. Increasingly, dedicated deep learning hardware platforms, including FPGAs (Farabet et al., 2011) and custom chips (Sim et al., 2016), are reporting computational efficiencies of up to tera operations per second per watt (TOPS/W). These platforms are targeted at deep CNNs. If low-precision RNNs are able to deliver the same performance, then the savings in the reduction of multipliers (the circuits that take up the space and energy) and in memory storage of the weights would be even larger, as the bit precision of the multipliers needed for the 2 to 3 gates of the gated RNN units could be reduced, or the multipliers removed completely.
Previous work showed the successful application of stochastic rounding strategies to feedforward networks, including binarization (Courbariaux et al., 2015) and ternarization (Lin et al., 2015) of the weights of vanilla Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) (Rastegari et al., 2016), and in (Courbariaux and Bengio, 2016) even the quantization of their activations, during training and at runtime. Quantization of RNN weights has so far only been applied to pretrained models (Shin et al., 2016).
What remained an open question up to now was whether these weight quantization techniques could successfully be applied to RNNs during training.
In this paper, we use different methods to reduce the numerical precision of weights in RNNs, and test their performance on different benchmark datasets. We make the code for the rounding methods available at https://github.com/ottj/QuantizedRNN. We use three popular RNN models: vanilla RNNs, GRUs, and LSTMs. Section 2 covers the 4 ways of obtaining low-precision weights for the RNN models in this work, Section 3 describes the low-precision RNN architectures, and Section 4 elaborates on the test results of the low-precision RNN models on different datasets, including the large WSJ dataset. We find that ternary quantization works very well while binary quantization fails, and we analyze this result.
2 Rounding Network Weights
This work evaluates the use of 4 different rounding methods on the weights of various types of RNNs. These methods include the stochastic and deterministic binarization method (BinaryConnect) (Courbariaux et al., 2015) and ternarization method (TernaryConnect) (Lin et al., 2015), the pow2-ternarization method (Stromatias et al., 2015), and a new weight quantization method (Section 2.2). For all 4 methods, we keep a full-precision copy of the weights and biases during training to accumulate the small updates, while at test time we can either use the learned full-precision weights or their deterministic low-precision version. As the experimental results in Section 4 show, the network with learned full-precision weights usually yields better results than a baseline network trained entirely with full precision, due to the extra regularization effect brought by stochastic quantization. The deterministic low-precision version can still yield comparable performance while drastically reducing computation and required memory storage at test time. We briefly describe the first 3 low-precision methods, and then introduce a new fourth method called Exponential Quantization.
2.1 Binarization, Ternarization, and Pow2-Ternarization
BinaryConnect and TernaryConnect were first introduced in (Courbariaux et al., 2015) and (Lin et al., 2015) respectively. By limiting the weights to only 2 or 3 possible values, i.e., −1 or 1 for BinaryConnect and −1, 0, or 1 for TernaryConnect, these methods do not require the use of multiplications. In the stochastic versions of both methods, the low-precision weights are obtained by stochastic sampling, while in the deterministic versions, the weights are obtained by thresholding.
Let $W$ be a matrix or vector to be binarized. The stochastic BinaryConnect update works as follows:

$$ W_b = \begin{cases} +1 & \text{with probability } p = \sigma(W) \\ -1 & \text{with probability } 1 - p \end{cases} \qquad (1) $$

where $\sigma$ is the hard sigmoid function:

$$ \sigma(x) = \operatorname{clip}\!\left(\frac{x+1}{2},\, 0,\, 1\right) = \max\!\left(0,\, \min\!\left(1,\, \frac{x+1}{2}\right)\right) \qquad (2) $$

while in the deterministic BinaryConnect method, low-precision weights are obtained by thresholding the weight value at 0:

$$ W_b = \begin{cases} +1 & \text{if } W \ge 0 \\ -1 & \text{otherwise} \end{cases} \qquad (3) $$
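As a concrete illustration, here is a minimal numpy sketch of the two BinaryConnect variants above (function names are ours, not from the released code):

```python
import numpy as np

def hard_sigmoid(x):
    # Eq. (2): clip((x + 1) / 2, 0, 1)
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize(W, stochastic=True, rng=None):
    """BinaryConnect: round weights to {-1, +1}.

    Stochastic (Eq. 1): +1 with probability hard_sigmoid(W).
    Deterministic (Eq. 3): threshold at 0.
    """
    if stochastic:
        rng = rng or np.random.default_rng(0)
        return np.where(rng.random(np.shape(W)) < hard_sigmoid(W), 1.0, -1.0)
    return np.where(np.asarray(W) >= 0, 1.0, -1.0)
```

During training, the full-precision copy of `W` is kept and updated; `binarize(W)` is what the forward pass would see.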
TernaryConnect additionally allows weights to be set to zero. Formally, with weights clipped to $[-1, 1]$, the stochastic form can be expressed as

$$ W_t = \operatorname{sign}(W) \odot B, \qquad B_{ij} \sim \operatorname{Bernoulli}\left(|W_{ij}|\right) \qquad (4) $$

where $\odot$ is an elementwise multiplication. In the deterministic form, the weights are quantized depending on 2 thresholds:

$$ W_t = \begin{cases} +1 & \text{if } W > 0.5 \\ 0 & \text{if } -0.5 \le W \le 0.5 \\ -1 & \text{if } W < -0.5 \end{cases} \qquad (5) $$
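The two TernaryConnect variants can be sketched the same way (again with a function name of our choosing, assuming weights clipped to [−1, 1]):

```python
import numpy as np

def ternarize(W, stochastic=True, rng=None):
    """TernaryConnect: round weights (assumed in [-1, 1]) to {-1, 0, +1}.

    Stochastic (Eq. 4): keep sign(W) with probability |W|, else 0.
    Deterministic (Eq. 5): zero out weights with |W| <= 0.5.
    """
    W = np.clip(W, -1.0, 1.0)
    if stochastic:
        rng = rng or np.random.default_rng(0)
        return np.sign(W) * (rng.random(W.shape) < np.abs(W))
    return np.sign(W) * (np.abs(W) > 0.5)
```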
Pow2-ternarization is another fixed-point-oriented rounding method introduced in (Stromatias et al., 2015). The precision of fixed-point numbers is described by the Qm.f notation, where m denotes the number of integer bits including the sign bit, and f the number of fractional bits. For example, Q1.1 allows {−0.5, 0, 0.5} as values. The rounding procedure works as follows: we first clip the values to the range allowed by the number of integer bits:

$$ W' = \operatorname{clip}\left(W,\; -2^{\,m-1},\; 2^{\,m-1} - 2^{-f}\right) \qquad (6) $$

We subsequently round the fractional part of the values:

$$ W_q = 2^{-f} \operatorname{round}\left(2^{f}\, W'\right) \qquad (7) $$
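A minimal sketch of Qm.f rounding as just described (deterministic rounding only; the function name is ours):

```python
import numpy as np

def qmf_round(W, m=1, f=1):
    """Fixed-point Qm.f rounding (Eqs. 6-7): clip to the range the m
    integer bits allow, then round to the nearest multiple of 2**-f."""
    W = np.clip(W, -2.0 ** (m - 1), 2.0 ** (m - 1) - 2.0 ** (-f))  # Eq. (6)
    return np.round(W * 2.0 ** f) / 2.0 ** f                        # Eq. (7)
```

With the default `m=1, f=1` this reproduces the Q1.1 setting used in the experiments.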
2.2 Exponential Quantization
Quantizing the weight values to an integer power of 2 is another way of storing weights in low precision and eliminating multiplications. Since this quantization does not require a hard clipping of weight values, it scales well with the range of the weight values.
Similar to the methods introduced above, we have both a deterministic and a stochastic way of quantizing. For the stochastic quantization, we sample the logarithm of the weight value to one of its two nearest integers, with the probability of choosing either integer determined by the distance of the log weight value from that integer. For negative weight values, we take the logarithm of the absolute value, and add the sign back after quantization, i.e.:

$$ \log_2 |W_q| = \begin{cases} \lceil \log_2 |W| \rceil & \text{with probability } p = \log_2 |W| - \lfloor \log_2 |W| \rfloor \\ \lfloor \log_2 |W| \rfloor & \text{with probability } 1 - p \end{cases}, \qquad \operatorname{sign}(W_q) = \operatorname{sign}(W) \qquad (8) $$

For the deterministic version, we set $\log_2 |W_q| = \lceil \log_2 |W| \rceil$ if the probability $p$ in Eq. 8 is larger than 0.5.
Note that we just need to store the logarithm of quantized weight values. The actual instruction needed for multiplying a quantized number differs according to the numerical format. For fixed point representation, multiplying by a quantized value is equivalent to binary shifts, while for floating point representation, that is equivalent to adding the quantized number’s exponent to the exponent of the floating point number. In either case, no complex operation like multiplication would be needed.
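Putting Eq. 8 and the deterministic rule together, a sketch of exponential quantization (the function name is ours; a small floor on |W| guards against taking log2 of zero):

```python
import numpy as np

def exp_quantize(W, stochastic=True, rng=None):
    """Exponential quantization (Eq. 8): round |W| to a power of two and
    keep the sign. Only the integer exponent needs to be stored, so a
    multiplication becomes a binary shift / exponent addition."""
    W = np.asarray(W, dtype=float)
    logw = np.log2(np.maximum(np.abs(W), 1e-12))  # guard against log2(0)
    frac = logw - np.floor(logw)                  # distance to the lower exponent
    if stochastic:
        rng = rng or np.random.default_rng(0)
        up = rng.random(W.shape) < frac
    else:
        up = frac > 0.5
    return np.sign(W) * 2.0 ** (np.floor(logw) + up)
```

Multiplying an activation by such a weight then amounts to an exponent adjustment (e.g. via `np.ldexp`) rather than a full multiply, matching the shift-based implementation described above.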
3 Low-Precision Recurrent Architectures
3.1 Vanilla Recurrent Networks
As the most basic RNN structure, the vanilla RNN is a simple extension of feedforward networks. Its hidden state is updated from both the current input and the hidden state at the previous time step:

$$ h_t = f\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h\right) \qquad (9) $$

where $f$ denotes the nonlinear activation function. The hidden state can be followed by more layers to yield an output at each time step. For example, in character-level language modeling, the output at each timestep is set to be the probability of each character appearing at the next time step. Thus there is a softmax layer that transforms the hidden state representation into predictive probabilities:

$$ y_t = \operatorname{softmax}\left(W_{hy}\, h_t + b_y\right) \qquad (10) $$
In the low-precision version of the RNN, we just apply a quantization function $q(\cdot)$ to each of the weight matrices in the aforementioned RNN structure. Thus all multiplications in the forward pass (except for the softmax normalization) will be eliminated:

$$ h_t = f\left(q(W_{hh})\, h_{t-1} + q(W_{xh})\, x_t + b_h\right) \qquad (11) $$

$$ y_t = \operatorname{softmax}\left(q(W_{hy})\, h_t + b_y\right) \qquad (12) $$

where $q(\cdot)$ is applied elementwise to all weights in a given weight matrix. We should note that, because of the quantization process, the derivative of the cost with respect to the weights is no longer smooth in the low-precision RNN (it is 0 almost everywhere). We instead compute the derivative with respect to the quantized weights, and use that derivative to update the full-precision weights. In other words, the gradients are computed as if the quantization operation were not there. This makes sense because we can think of the quantization operation as adding noise:

$$ q(W) = W + \eta \qquad (13) $$

where $\eta$ is the quantization noise, which has zero mean for the stochastic rounding methods.
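The update scheme just described, a quantized forward pass with gradients applied to the full-precision copy, can be sketched on a toy one-layer example (the layer, loss, and all names are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def ternarize(W):
    # stochastic TernaryConnect (Eq. 4)
    W = np.clip(W, -1.0, 1.0)
    return np.sign(W) * (rng.random(W.shape) < np.abs(W))

# Full-precision master copy of the weights: it accumulates the small
# updates, while every forward pass sees only the quantized version.
W = rng.normal(scale=0.1, size=(4, 4))
lr = 0.01
for step in range(200):
    Wq = ternarize(W)                      # forward pass: quantized weights
    x = rng.normal(size=4)
    h = np.maximum(0.0, Wq @ x)            # toy ReLU layer
    grad_h = h - 1.0                       # gradient of toy loss 0.5*||h - 1||^2
    grad_Wq = np.outer(grad_h * (h > 0), x)
    # Straight-through trick: the gradient w.r.t. Wq is applied directly
    # to the full-precision W, as if the quantization were not there.
    W -= lr * grad_Wq
```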
3.1.1 Hidden-State Stability
We have observed for all three RNN architectures that BinaryConnect on the recurrent weights never worked. We should note that the quantization function applied in the recurrent direction has to allow 0 to be sampled. We conjecture that this is related to the stabilization of hidden states.
Consider the effect that BinaryConnect and TernaryConnect have on the Jacobians of the state-to-state transition. In BinaryConnect, all entries of the quantized matrix are sampled to be −1 or +1. In LSTMs and GRUs, there is a strong near-1 diagonal in the Jacobian because the gates tend to be turned on, i.e., letting information flow through them, while the off-diagonal entries of the Jacobian tend to be much smaller when the weights have not been quantized. However, when the true value of a weight is near zero, its quantized value is stochastically sampled to be −1 or +1 with nearly equal probability. When near-0 off-diagonal entries of a matrix of real values between −1 and 1 are randomly replaced by values near +1 or −1, the magnitude of the weights increases and the condition number of the matrix will tend to worsen due to the presence of more near-0 eigenvalues. This could mean that gradients tend to vanish faster, because a gradient vector could more often have strong components in the directions of some of the corresponding eigenvectors. With larger eigenvalues of the Jacobian (observed, Fig. 1(a)), i.e., larger derivatives, we could also see gradients explode.
In Fig. 1(a), where we use unbounded units (ReLU) as activation, if we look at the Jacobian of two neighboring hidden states ($\partial h_{t+1} / \partial h_t$), we can see that its maximum eigenvalue is around 2.5 across all time steps, much larger than 1. As a consequence, hidden states explode over time, while this is not the case for TernaryConnect and ExpQuantize (Fig. 1(b)).
On the other hand, if we allow 0 (or a sufficiently small value) to be chosen in the sampling process, the effect of stochastic sampling on the Jacobians will not be that devastating. The Jacobian remains a quasi-diagonal matrix, which is well-conditioned.
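This conditioning argument can be checked numerically on a synthetic quasi-diagonal matrix, a toy stand-in for a trained recurrent weight matrix (all sizes and scales here are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Quasi-diagonal, well-conditioned recurrent matrix: strong near-1
# diagonal with small off-diagonal entries, as in a trained gated RNN.
W = np.eye(n) + rng.normal(scale=0.05, size=(n, n))

# Stochastic binarization: every near-zero off-diagonal entry becomes +1 or -1.
binary = np.where(rng.random((n, n)) < np.clip((W + 1) / 2, 0, 1), 1.0, -1.0)
# Stochastic ternarization: near-zero entries are almost always sampled to 0,
# so the quasi-diagonal structure survives.
ternary = np.sign(W) * (rng.random((n, n)) < np.abs(np.clip(W, -1, 1)))

spectral_radius = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
```

On this toy example the spectral radius of the binarized matrix is far above 1, while the ternarized matrix stays close to the original well-conditioned one.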
In (Krueger and Memisevic, 2015) it was shown that in a trained model, hidden state norms change over the first several timesteps, but become stable afterwards. The model can work even better if, during training, we penalize changes of the hidden-state norm from one time step to the next.
3.2 Long ShortTerm Memory
LSTMs (Hochreiter and Schmidhuber, 1997) were first introduced to RNNs for sequence modeling. The gate mechanism makes the LSTM a good option for dealing with the vanishing gradient problem, since it can model long-term dependencies in the data. To limit the numerical precision, we apply a rounding method to all or a subset of the weights.
3.3 Gated Recurrent Unit
GRUs (Cho et al., 2014) can also be used in RNNs for modeling temporal sequences.
They involve less computation than LSTM units, since they do not have an output gate, and are therefore sometimes preferred in large models. At timestep $t$, the state $h_t$ of a single GRU unit is computed as follows:

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad (14) $$

where $\odot$ denotes an elementwise multiplication. The update gate $z_t$ is computed with

$$ z_t = \sigma\left(W_z\, x_t + U_z\, h_{t-1} + b_z\right) \qquad (15) $$

where $x_t$ is the input at timestep $t$, $U_z$ is the state-to-state recurrent weight matrix, $h_{t-1}$ is the state at timestep $t-1$, $W_z$ is the input-to-hidden weight matrix, and $b_z$ is the bias.

The reset gate $r_t$ is computed as follows:

$$ r_t = \sigma\left(W_r\, x_t + U_r\, h_{t-1} + b_r\right) \qquad (16) $$

where the candidate state $\tilde{h}_t$ is

$$ \tilde{h}_t = \tanh\left(W\, x_t + U \left(r_t \odot h_{t-1}\right) + b\right) \qquad (17) $$

In our experiments, the weights are rounded in the same way as for the LSTMs. For example, for the update gate $z$, the input weight is rounded as follows: $z_t = \sigma\left(q(W_z)\, x_t + U_z\, h_{t-1} + b_z\right)$.
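A sketch of one GRU step with quantized weights, combining Eqs. 14-17 with a pow2-ternarization quantizer q (for illustration we quantize all weight matrices here; parameter names and the dict layout are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def q(W, m=1, f=1):
    # pow2-ternarization (Qm.f rounding, Eqs. 6-7) used as the quantizer
    W = np.clip(W, -2.0 ** (m - 1), 2.0 ** (m - 1) - 2.0 ** (-f))
    return np.round(W * 2.0 ** f) / 2.0 ** f

def gru_step(x, h, P):
    """One GRU step (Eqs. 14-17) with quantized weight matrices.
    P holds the parameters; biases are left at full precision. Note that
    the best TIDIGITS results in Section 4.2 quantize only the
    input-to-GRU weights, whereas this sketch quantizes everything."""
    z = sigmoid(q(P['Wz']) @ x + q(P['Uz']) @ h + P['bz'])       # Eq. 15
    r = sigmoid(q(P['Wr']) @ x + q(P['Ur']) @ h + P['br'])       # Eq. 16
    c = np.tanh(q(P['W']) @ x + q(P['U']) @ (r * h) + P['b'])    # Eq. 17
    return (1.0 - z) * h + z * c                                 # Eq. 14
```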
4 Experimental Results and Discussion
In the following experiments, we test the effectiveness of the different rounding methods on two different types of applications: character-level language modeling and speech recognition. The different RNN types (vanilla RNN, GRU, and LSTM) are evaluated in experiments on four different datasets.
4.1 Vanilla RNN
We validate the low-precision vanilla RNN on 2 datasets: text8 and the Penn Treebank Corpus (PTB).
The text8 dataset contains the first 100M characters from Wikipedia, excluding all punctuation. It does not discriminate between cases, so its alphabet has only 27 different characters: the 26 English letters and the space. We take the first 90M characters as the training set, and split them equally into sequences of length 50. The last 10M characters are split equally to form the validation and test sets.
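The split just described can be sketched as follows (generalized to fractions so it runs on any corpus; the function name is ours):

```python
def split_corpus(text, train_frac=0.9, seq_len=50):
    """Split a character corpus: the leading fraction becomes the training
    set, cut into fixed-length sequences; the remainder is halved into
    validation and test sets. With text8, train_frac=0.9 gives the 90M/5M/5M
    split used in the paper."""
    n_train = int(len(text) * train_frac)
    train = text[:n_train]
    train_seqs = [train[i:i + seq_len] for i in range(0, n_train, seq_len)]
    rest = text[n_train:]
    return train_seqs, rest[:len(rest) // 2], rest[len(rest) // 2:]
```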
The Penn Treebank Corpus (Taylor et al., 2003) contains 50 different characters, including English letters, numbers, and punctuation. We follow the settings in (Mikolov et al., 2012) to split the dataset, i.e., 5017k characters for the training set, and 393k and 442k characters for the validation and test sets respectively.
Model and Training
The models are built to predict the next character given the previous ones, and performance is evaluated with the bits-per-character (BPC) metric, which is the base-2 logarithm of the perplexity, i.e., the mean negative per-character log-likelihood (base 2). We use an RNN with ReLU activation and 2048 hidden units. We initialize hidden-to-hidden weights as identity matrices, while input-to-hidden and hidden-to-output matrices are initialized with uniform noise.
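Since BPC is just the mean negative log2-likelihood of the correct next characters, it can be computed directly (function names are ours):

```python
import numpy as np

def bits_per_character(p_correct):
    """BPC: mean negative log2-probability assigned to the correct next
    character; equivalently the log2 of the per-character perplexity."""
    return float(np.mean(-np.log2(p_correct)))

def perplexity(p_correct):
    return 2.0 ** bits_per_character(p_correct)
```

For example, a model that always assigns probability 0.5 to the correct character scores 1 BPC (perplexity 2).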
We can see the regularization effect of stochastic quantization from the results on the two datasets. On the PTB dataset, where this model size slightly overfits, the low-precision model trained with stochastic quantization yields a test set performance of 1.372 BPC, which surpasses its full-precision baseline (1.505 BPC) by around 0.133 BPC (Fig. 2, left). From the figure we can see that stochastic quantization does not significantly hurt training speed, and manages to achieve better generalization once the baseline model begins to overfit. On the other hand, we can also see from the results on the text8 dataset, where the same-sized model now underfits, that the low-precision model performs worse (1.639 BPC) than its baseline (1.588 BPC) (Fig. 2, right, and Table 1).
4.2 GRU RNNs
This section presents results from the various methods of limiting the numerical precision of the weights and biases of GRU RNNs, which are tested on the TIDIGITS dataset.
Dataset
TIDIGITS (Leonard and Doddington, 1993) is a speech dataset consisting of clean speech of spoken numbers from 326 speakers. We only use the single-digit samples (zero to nine) in our experiments, giving us 2464 training samples and 2486 validation samples. The labels for the spoken 'zero' and 'O' are combined into one label, so we have 10 possible labels. We compute MFCCs from the raw waveform and apply leading zero-padding to get samples of matrix size 39×200. The MFCC data is further whitened before use. Masking is used for processing the data with the RNN in only some of the experiments.
Model and Training
The model has a 200-unit GRU layer followed by a 200-unit fully-connected ReLU layer. The output is a 10-unit softmax layer. Weights are initialized using the Glorot & Bengio method (Glorot and Bengio, 2010). The network is trained using Adam (Kingma and Ba, 2014); BatchNorm (Ioffe and Szegedy, 2015) is not used. We train our model for up to 400 epochs with a patience setting of 100 (no early stopping before epoch 100). The GRU results in Table 1 are from 10 experiments, each starting with a different random seed, because previous experiments have shown that different random seeds can lead to up to a few percent difference in final accuracy. We report the average and maximum performance achieved on the validation set.
Binarization of Weights
To evaluate weight binarization on a GRU, we trained our model with the possible binary value pairs {−1, 1}, {−0.5, 0.5}, {0, 1}, and {−0.5, 0} for the weights. Binarization was applied only to the input weight matrices $W_z$, $W_r$, and $W$. We ran each experiment once with stochastic binarization and once with deterministic binarization. As shown in Table 1, none of the combinations resulted in accuracy above chance after 400 training epochs. Doubling the number of GRU units to 400 did not help either. We therefore conclude that GRUs do not function properly if all these weights are binarized. It has yet to be tested whether at least a subset of the aforementioned weight matrices, or some of the hidden-to-hidden weight matrices, could be binarized.
Effect of Pow2-Ternarization
To assess the impact of weight ternarization, we trained our model and quantized the weights during training using pow2-ternarization with Q1.1 precision.
Figure 3 (a) shows how pow2-ternarization rounding applied to the different sets of GRU weights affects convergence compared to the full-precision baseline.
If full-precision weights and biases are used, convergence starts after a few training epochs. As shown in Table 1, if pow2-ternarization is used on the input-to-GRU weights, the top-1 accuracy improves to 99.3%. Training takes 10 epochs longer before convergence starts, but then surpasses the baseline in terms of convergence speed; the variance between the different runs is also smaller compared to the baseline runs. Limiting the precision of both the input-to-GRU weights and the biases leads to a similar training curve, but the top-1 score increases to 99.42%. If pow2-ternarization is applied to all GRU weights and biases (now also to the recurrent weight matrices), the top-1 accuracy decreases (though it is still higher than the baseline) to 99.1%.
This shows that limiting the numerical precision of input-to-GRU weights and biases is beneficial for learning in this setup: although it slows down convergence, the final accuracy is higher than that of the baseline.
Table 1: Test performance of the low-precision RNNs with the different rounding methods (SB/DB: stochastic/deterministic binarization; DT/ST: deterministic/stochastic ternarization; PT: pow2-ternarization; EQ: exponential quantization).

Dataset  | RNN Type | Baseline   | SB   | DB   | DT     | ST     | PT         | EQ
---------|----------|------------|------|------|--------|--------|------------|----------
text8    | VRNN     | 1.588 BPC  | N/A  | N/A  | —      | —      | —          | 1.639 BPC
PTC      | VRNN     | 1.505 BPC  | N/A  | N/A  | —      | —      | —          | 1.372 BPC
TIDIGITS | GRU      | —          | 18.7 | 18.7 | 99.67% | 98.23% | 99.42%     | —
WSJ      | LSTM     | 11.16% WER | —    | —    | —      | —      | 10.49% WER | —
Effect of Ternarization
Figure 3 (b) shows that we see the same effects as with pow2-ternarization, except for the case where we ternarize all weights and biases. With pow2-ternarization we allow −0.5, 0, and 0.5 as values; with the default ternarization we allow −1, 0, and 1. This difference has a big impact on the hidden-to-hidden weights, because if we apply ternarization there, we end up with lower-than-baseline performance and much slower convergence. On the other hand, if we apply ternarization only to the input-to-GRU weights, we get 99.67%, the highest top-1 score of all our TIDIGITS experiments. This leads us to conclude that different GRU components need different sets of allowed values to function optimally. Indeed, if we change the ternarization of all weights and biases to use −0.5, 0, and 0.5 as allowed values, we see essentially the same result as with pow2-ternarization. Stochastic ternarization has not proven useful here: convergence starts after 100 training epochs, and the average and maximum top-1 accuracies of 97.72% and 98.23% are almost at baseline level.
4.3 LSTM RNNs
Previous work has shown that some forms of network binarization work on small datasets but do not scale well to big datasets (Rastegari et al., 2016). To determine whether low-precision networks still work on big datasets, we chose to train a large model on the WSJ dataset.
Dataset
The model is trained on the Wall Street Journal (WSJ) corpus (available at the LDC as LDC93S6B and LDC94S13B), where we use the 81-hour training set "si284". The development set "dev93" is used for early stopping, and the evaluation is performed on the test set "eval92". We use 40-dimensional filter bank features extended with deltas and delta-deltas, leading to 120-dimensional features per frame. Each dimension is normalized to have zero mean and unit variance over the training set. Following the text preprocessing in (Miao et al., 2015), we use 59 character labels for character-based acoustic modeling. Decoding with the language model is performed using a recently proposed approach (Miao et al., 2015) based on both Connectionist Temporal Classification (CTC) (Graves et al., 2006) and weighted finite-state transducers (WFSTs) (Mohri et al., 2008).
Model and Training
Both the limited-precision model and the baseline model have 4 bidirectional LSTM layers with 250 units in each direction of each layer. In order to get the unsegmented character labels directly, we use CTC on top of the model. Both models are trained using Adam (Kingma and Ba, 2014) with a fixed learning rate. The weights are initialized following the scheme of Glorot and Bengio (2010). Note that, for simplicity, we do not regularize the model (e.g., by injecting weight noise), so the baseline results shown here may be worse than recently published numbers on the same task (Graves and Jaitly, 2014; Miao et al., 2015).
Pow2-Ternarization on Weights
The baseline achieves a word error rate (WER) of 11.16% on the test set after training for 60 epochs, which took 8 days. The pow2-ternarization method shows considerably slower convergence, similar to the GRU experiments. The model was trained for 3 weeks up to epoch 87, where it reaches a WER of 10.49%.
5 Conclusion and Outlook
This paper shows for the first time that low-precision quantization of weights can be performed effectively for RNNs already during training. We presented 3 existing methods and introduced 1 new method of limiting the numerical precision. We applied the different methods to 3 major RNN types and determined how the limited numerical precision affects network performance across 4 datasets.
In the language modeling task, the low-precision model surpasses its full-precision baseline by a large gap (0.133 BPC) on the PTB dataset. We also show that the model works better when put in a slightly overfitting setting, so that the regularization effect of stochastic quantization can come into play. In the speech recognition task, we show that it is not possible to binarize the weights of GRUs while maintaining their functionality. We conjecture that the better performance of ternarization is due to a reduced variance of the weighted sums (when a near-zero real value is quantized to +1 or −1, this introduces substantial variance), which could be more harmful in RNNs because the same weight matrices are used over and over again along the temporal sequence. Furthermore, we show that weight and bias quantization using ternarization, pow2-ternarization, and exponential quantization can improve performance over the baseline on the TIDIGITS dataset. The successful outcome of these experiments means that custom implementations of RNN models will have lower resource requirements.
6 Acknowledgments
We are grateful to INI members Danny Neil, Stefan Braun, and Enea Ceolini, and MILA members Philemon Brakel, Mohammad Pezeshki, and Matthieu Courbariaux, for useful discussions and help with data preparation.
We thank the developers of Theano (Theano Development Team, 2016), Lasagne, Keras, Blocks (van Merrienboer et al., 2015), and Kaldi (Povey et al., 2011).
The authors acknowledge partial funding from the Samsung Advanced Institute of Technology, University of Zurich, NSERC,
CIFAR and Canada Research Chairs.
References
 Amodei et al. [2015] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. DeepSpeech 2: Endtoend speech recognition in English and Mandarin. arXiv, page 28, 2015. URL http://arxiv.org/abs/1512.02595.
 Chan et al. [2015] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint, pages 1–16, 2015. URL http://arxiv.org/abs/1508.01211.
 Cho et al. [2014] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoderdecoder approaches. arXiv, 2014. URL http://arxiv.org/abs/1409.1259.
 Chung et al. [2016] Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. A characterlevel decoder without explicit segmentation for neural machine translation. arXiv, 2016. URL http://arxiv.org/abs/1603.06147.
 Courbariaux and Bengio [2016] Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv, 2016. URL http://arxiv.org/abs/1602.02830.
 Courbariaux et al. [2015] Matthieu Courbariaux, Yoshua Bengio, and JeanPierre David. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. arXiv, pages 1–9, 2015. URL http://arxiv.org/abs/1511.00363.
 Devlin et al. [2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. ACL, 17:1370–1380, 2014. URL http://acl2014.org/acl2014/P141/pdf/P141129.pdf.
 Farabet et al. [2011] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun. Neuflow: A runtime reconfigurable dataflow processor for vision. In CVPR 2011 Workshop, pages 109–116, June 2011. doi: 10.1109/CVPRW.2011.5981829.
 Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th AISTATS, 9:249–256, 2010. URL http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_GlorotB10.pdf.
 Graves and Jaitly [2014] Alex Graves and Navdeep Jaitly. Towards endtoend speech recognition with recurrent neural networks. In Proc. 31st Int. Conf. Mach. Learn., pages 1764–1772, 2014.
 Graves et al. [2006] Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006. doi: 10.1145/1143844.1143891.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735. URL http://www.ncbi.nlm.nih.gov/pubmed/9377276.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015. URL http://arxiv.org/abs/1502.03167.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs], pages 1–15, 2014. URL http://arxiv.org/abs/1412.6980.
 Krueger and Memisevic [2015] David Krueger and Roland Memisevic. Regularizing RNNs by Stabilizing Activations. arXiv preprint, 2015. URL http://arxiv.org/abs/1511.08400.
 Leonard and Doddington [1993] R. G. Leonard and G. Doddington. TIDIGITS LDC93S10, 1993.
 Lin et al. [2015] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv, pages 1–8, 2015. URL http://arxiv.org/abs/1510.03009.
 Miao et al. [2015] Yajie Miao, Mohammad Gowayyed, and Florian Metze. EESEN: Endtoend speech recognition using deep RNN models and WFSTbased decoding. arXiv:1507.08240, (Cd), 2015. doi: 10.1109/ASRU.2015.7404790. URL http://arxiv.org/abs/1507.08240.
 Mikolov et al. [2012] Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, HaiSon Le, Stefan Kombrink, and J Cernocky. Subword language modeling with neural networks. preprint, 2012. URL http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. ISSN 00280836. doi: 10.1038/nature14236. URL http://dx.doi.org/10.1038/nature14236.
 Mohri et al. [2008] Mehryar Mohri, Fernando Pereira, and Michael Riley. Speech recognition with weighted finite-state transducers. Springer Handbook of Speech Processing, 16(1):1–31, 2008. doi: 10.1007/978-3-540-49127-9_28.
 Povey et al. [2011] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi Speech Recognition Toolkit, 2011. URL http://kaldi-asr.org.
 Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNORNet: ImageNet classification using binary convolutional neural networks. arXiv preprint, pages 1–17, 2016. URL http://arxiv.org/abs/1603.05279.
 Shin et al. [2016] Sungho Shin, Kyuyeon Hwang, and Wonyong Sung. Fixedpoint performance analysis of recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2016.
 Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. ISSN 00280836. doi: 10.1038/nature16961. URL http://www.nature.com/doifinder/10.1038/nature16961.
 Sim et al. [2016] J. Sim, J. S. Park, M. Kim, D. Bae, Y. Choi, and L. S. Kim. A 1.42TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems. In 2016 IEEE International SolidState Circuits Conference (ISSCC), pages 264–265, Jan 2016. doi: 10.1109/ISSCC.2016.7418008.
 Stromatias et al. [2015] Evangelos Stromatias, Daniel Neil, Michael Pfeiffer, Francesco Galluppi, Steve B. Furber, and Shih-Chii Liu. Robustness of spiking Deep Belief Networks to noise and reduced bit precision of neuro-inspired hardware platforms. Frontiers in Neuroscience, 9:1–14, 2015. doi: 10.3389/fnins.2015.00222.
 Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (NIPS), pages 3104–3112, 2014. URL http://papers.nips.cc/paper/5346sequencetosequencelearningwithneural.
 Taylor et al. [2003] Ann Taylor, Mitchell Marcus, and Beatrice Santorini. The Penn Treebank: An overview. Treebanks, pages 5–22, 2003. doi: 10.1007/978-94-010-0201-1_1. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.8216&rep=rep1&type=pdf.
 Theano Development Team [2016] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016. URL http://arxiv.org/abs/1605.02688.
 van Merrienboer et al. [2015] Bart van Merrienboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. Blocks and Fuel: Frameworks for deep learning. pages 1–5, 2015. URL http://arxiv.org/abs/1506.00619.