Recurrent Neural Networks with External Memory for
Language Understanding
Abstract
Recurrent Neural Networks (RNNs) have become increasingly popular for the task of language understanding. In this task, a semantic tagger is deployed to associate a semantic label with each word in an input sequence. The success of RNNs may be attributed to their ability to memorize long-term dependence that relates the current-time semantic label prediction to observations many time instances away. However, the memory capacity of simple RNNs is limited because of the gradient vanishing and exploding problem. We propose to use an external memory to improve the memorization capability of RNNs. We conducted experiments on the ATIS dataset and observed that the proposed model achieves state-of-the-art results. We compare our proposed model with alternative models and report analysis results that may provide insights for future research.
Baolin Peng, Kaisheng Yao 
The Chinese University of Hong Kong 
Microsoft Research 
blpeng@se.cuhk.edu.hk, kaisheny@microsoft.com 
Index Terms: Recurrent Neural Network, Language Understanding, Long Short-Term Memory, Neural Turing Machine
1 Introduction
Neural network based methods have recently demonstrated promising results on many natural language processing tasks [1, 2]. Specifically, recurrent neural network (RNN) based methods have shown strong performance, for example, in language modeling [3], language understanding [4], and machine translation [5, 6] tasks.
The main task of a language understanding (LU) system is to associate words with semantic meanings [7, 8, 9]. For example, in the sentence “Please book me a ticket from Hong Kong to Seattle”, an LU system should tag “Hong Kong” as the departure city of a trip and “Seattle” as its arrival city. Widely used approaches include conditional random fields (CRFs) [8, 10], support vector machines [11], and, more recently, RNNs [4, 12].
An RNN consists of an input layer, a recurrent hidden layer, and an output layer. The input layer reads each word and the output layer produces probabilities of semantic labels. The success of RNNs can be attributed to the fact that RNNs, if successfully trained, can relate the current prediction to input words that are several time steps away. However, RNNs are difficult to train because of the gradient vanishing and exploding problem [13]. The problem also limits the memory capacity of RNNs because error signals may not be backpropagated far enough.
There have been two lines of research to address this problem. One is to design learning algorithms that can avoid gradient exploding, e.g., using gradient clipping [14], and/or gradient vanishing, e.g., using second-order optimization methods [15]. Alternatively, researchers have proposed more advanced model architectures, in contrast to the simple RNN that uses, e.g., the Elman architecture [16]. Specifically, long short-term memory (LSTM) [17, 18] neural networks have three gates that control the flow of error signals. The recently proposed gated recurrent neural network (GRNN) [6] may be considered a simplified LSTM with fewer gates.
Along this line of research on developing more advanced architectures, this paper focuses on a novel neural network architecture. Inspired by the recent work in [19], we extend the simple RNN with the Elman architecture to use an external memory. The external memory stores the past hidden layer activities, not only from the current sentence but also from past sentences. To predict outputs, the model uses the input observation together with content retrieved from the external memory. The proposed model performs strongly on a common language understanding dataset and achieves new state-of-the-art results.
2 Background
2.1 Language understanding
A language understanding system predicts an output sequence of tags, such as named entities, given an input sequence of words. Often, the output and input sequences have been aligned. In these alignments, an input may correspond to a null tag or a single tag. An example is given in Table 1.
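Table 1: An example of input words aligned with semantic tags; words without a semantic tag carry the null tag.
Words  book  a  flight  from  Hong Kong  to  Seattle
Tags  null  null  null  null  Dptcity  null  Arvcity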
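Given a $T$-length input word sequence $x_1^T$, a corresponding output tag sequence $y_1^T$, and an alignment $A$, the posterior probability $p(y_1^T \mid A, x_1^T)$ is approximated by
$$p(y_1^T \mid A, x_1^T) \approx \prod_{t=1}^{T} p(y_t \mid x_{t-k}^{t+k}) \tag{1}$$
where $k$ is the size of a context window and $t$ indexes the positions in the alignment.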
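As a rough illustration of the context window in Eq. (1), the following sketch collects, for each position, the current word together with its neighbors on each side. The function name `context_windows`, the padding token `<pad>`, and the choice of one neighbor per side are assumptions made for illustration only; they are not taken from the paper.

```python
def context_windows(words, k, pad="<pad>"):
    """Return, for each position t, the words at positions t-k .. t+k."""
    # Pad both ends so that every position has a full window.
    padded = [pad] * k + list(words) + [pad] * k
    return [padded[t:t + 2 * k + 1] for t in range(len(words))]


sentence = ["book", "a", "flight", "from", "Hong Kong", "to", "Seattle"]
windows = context_windows(sentence, k=1)
for word, window in zip(sentence, windows):
    print(word, "->", window)
```

With one neighbor on each side, every window holds three words, so the tag for "Hong Kong" would be predicted from "from", "Hong Kong", and "to".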
2.2 Simple recurrent neural networks
The above posterior probability can be computed using an RNN. An RNN consists of an input layer $x_t$, a hidden layer $h_t$, and an output layer $y_t$. In the Elman architecture [16], the hidden layer activity $h_t$ depends on both the input $x_t$ and, recurrently, the past hidden layer activity $h_{t-1}$.
Because of the recurrence, the hidden layer activity $h_t$ depends on the observation sequence from its beginning. The posterior probability is therefore computed as follows
$$p(y_1^T \mid A, x_1^T) = \prod_{t=1}^{T} p(y_t \mid x_1^t) \tag{2}$$
where the output $y_t$ and hidden layer activity $h_t$ are computed as
$$y_t = g(W_{hy} h_t) \tag{3}$$
$$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1}) \tag{4}$$
In the above equations, $W_{xh}$, $W_{hh}$, and $W_{hy}$ are weight matrices, $g(\cdot)$ is the softmax function, and $\sigma(\cdot)$ is the sigmoid or tanh function. The above model is denoted as the simple RNN, to contrast it with the more advanced recurrent neural networks described below.
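The recurrence described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the weight names `W_xh`, `W_hh`, `W_hy`, the toy dimensions, and the weight values are all assumptions, tanh is used as the hidden nonlinearity, and bias terms are omitted.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(a - top) for a in z]
    total = sum(exps)
    return [e / total for e in exps]


def elman_step(x, h_prev, W_xh, W_hh, W_hy):
    """One time step: new hidden activity and label posterior."""
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]
    h = [math.tanh(a) for a in pre]      # hidden layer activity
    y = softmax(matvec(W_hy, h))         # posterior over labels
    return h, y


# Toy sizes (assumed): 2-dimensional input, 3 hidden units, 2 labels.
W_xh = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
W_hy = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]

h = [0.0, 0.0, 0.0]
posteriors = []
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h, y = elman_step(x, h, W_xh, W_hh, W_hy)
    posteriors.append(y)
print(posteriors)
```

Because `h` is carried from one step to the next, each posterior depends on every input seen so far, which is the property the posterior in Eq. (2) relies on.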
2.3 Recurrent neural networks using gating functions
The current hidden layer activity $h_t$ of a simple RNN is related to its past hidden layer activity $h_{t-1}$ via the nonlinear function in Eq. (4). The nonlinearity can cause errors backpropagated from $h_t$ to $h_{t-1}$ to explode or to vanish. This phenomenon prevents the simple RNN from learning patterns that span long time dependencies [14].
To tackle this problem, the long short-term memory (LSTM) neural network was proposed in [17], introducing memory cells that depend linearly on their past values. LSTM also introduces three gating functions, namely the input gate, the forget gate, and the output gate. We follow the variant of LSTM in [18].
More recently, a gated recurrent neural network (GRNN) [6] was proposed. Instead of the three gating functions in LSTM, it uses two gates.
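One is a reset gate $r_t$ that relates a candidate activation to the past hidden layer activity $h_{t-1}$; i.e.,
$$\hat{h}_t = \tanh\left(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1})\right) \tag{5}$$
where $\hat{h}_t$ is the candidate activation. $W_{xh}$ and $W_{hh}$ are the matrices that relate the current observation $x_t$ and the past hidden layer activity. $\odot$ is the element-wise product.
The second gate is an update gate $z_t$ that interpolates the candidate activation and the past hidden layer activity to update the current hidden layer activity; i.e.,
$$h_t = (1 - z_t) \odot \hat{h}_t + z_t \odot h_{t-1} \tag{6}$$
These gates are usually computed as functions of the current observation $x_t$ and the past hidden layer activity; i.e.,
$$r_t = \mathrm{sigmoid}(W_{xr} x_t + W_{hr} h_{t-1}) \tag{7}$$
$$z_t = \mathrm{sigmoid}(W_{xz} x_t + W_{hz} h_{t-1}) \tag{8}$$
where $W_{xr}$ and $W_{hr}$ are the weights to the observation and to the past hidden layer activity for the reset gate. $W_{xz}$ and $W_{hz}$ are similarly defined for the update gate.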
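The two gates can be sketched in plain Python as below. This is an illustrative reading of the reset and update gates, not the authors' code: the parameter names in `params`, the toy weights, and the convention that the update gate weights the past activity (rather than the candidate) are assumptions, and bias terms are omitted.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gate(W_x, W_h, x, h_prev):
    """sigmoid(W_x x + W_h h_prev); the same form serves both gates."""
    return [sigmoid(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h_prev))]


def grnn_step(x, h_prev, p):
    """One gated recurrent step; p is a dict of weight matrices."""
    r = gate(p["W_xr"], p["W_hr"], x, h_prev)         # reset gate
    z = gate(p["W_xz"], p["W_hz"], x, h_prev)         # update gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]  # element-wise product
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["W_xh"], x), matvec(p["W_hh"], reset_h))]
    # Interpolate between the candidate and the past activity (assumed convention).
    return [(1.0 - zi) * ci + zi * hi for zi, ci, hi in zip(z, cand, h_prev)]


# Toy example (assumed sizes): 2-dimensional input, 2 hidden units.
params = {
    "W_xr": [[0.1, 0.2], [0.3, -0.1]], "W_hr": [[0.2, 0.0], [0.0, 0.2]],
    "W_xz": [[-0.2, 0.1], [0.1, 0.1]], "W_hz": [[0.1, 0.1], [-0.1, 0.3]],
    "W_xh": [[0.5, -0.5], [0.2, 0.4]], "W_hh": [[0.3, 0.1], [-0.2, 0.2]],
}
h = [0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0]]:
    h = grnn_step(x, h, params)
print(h)
```

When the update gate is close to one, the past activity is copied forward almost unchanged, which gives error signals a nearly linear path through time.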
3 The RNN-EM architecture
We extend the simple RNN in this section to use an external memory. Figure 1 illustrates the proposed model, which we denote as RNN-EM. As with the simple RNN, it consists of an input layer, a hidden layer, and an output layer. However, instead of feeding the past hidden layer activity directly to the hidden layer as in the simple RNN, one input to the hidden layer is content read from an external memory. RNN-EM uses a weight vector to retrieve the content from the external memory for use at the next time instance. Each element of the weight vector is proportional to the similarity of the current hidden layer activity with the content in the corresponding memory slot. Therefore, content that is irrelevant to the current hidden layer activity receives a small weight. We describe RNN-EM in detail in the following sections. All of the equations below include bias terms, which we omit for simplicity of description. We implemented RNN-EM using Theano [20, 21].
3.1 Model input and output
The input to the model is a dense vector $x_t$. In the context of language understanding, $x_t$ is a projection of input words, also known as a word embedding.
The hidden layer reads both the input $x_t$ and a content vector $c_t$ from the memory. The hidden layer activity is computed as follows
$$h_t = \sigma(W_{ih} x_t + W_c c_t) \tag{9}$$
where $\sigma(\cdot)$ is the tanh function, $W_{ih}$ is the weight to the input vector, $c_t$ is the content from a read operation to be described in Eq. (15), and $W_c$ is the weight to the content vector.
The hidden layer activity is fed into the output layer as follows
$$y_t = g(W_{hy} h_t) \tag{10}$$
where $W_{hy}$ is the weight to the hidden layer activity and $g(\cdot)$ is the softmax function.
Notice that in the case of $c_t = h_{t-1}$, the above model is the simple RNN.
3.2 External memory read
RNN-EM has an external memory $M_t \in \mathbb{R}^{m \times n}$. It can be considered as a memory with $n$ slots, where each slot is a vector with $m$ elements. Similar to the external memory in computers, the memory capacity of RNN-EM may be increased by using a larger $n$.
The model generates a key vector $k_t$ to search for content in the external memory. Though there are many possible ways to generate the key vector, we choose a simple linear function of the hidden layer activity $h_t$ as follows
$$k_t = W_k h_t \tag{11}$$
where $W_k$ is a linear transformation matrix. Our intuition is that the memory should be in the same space as, or affine to, the hidden layer activity.
We use cosine similarity $K(\cdot, \cdot)$ to compare this key vector with the contents of the external memory. The weight for the $c$-th slot $M_{t-1}(:, c)$ in memory is computed as follows
$$\hat{w}_t(c) = \frac{\exp\left(\beta_t K(k_t, M_{t-1}(:, c))\right)}{\sum_{q=1}^{n} \exp\left(\beta_t K(k_t, M_{t-1}(:, q))\right)} \tag{12}$$
where the above weight is normalized and sums to 1.0. $\beta_t$ is a scalar larger than 0.0. It sharpens the weight vector when $\beta_t$ is larger than 1.0. Conversely, it smooths or dampens the weight vector when $\beta_t$ is between 0.0 and 1.0. We use the following function to obtain $\beta_t$; i.e.,
$$\beta_t = \log\left(1 + \exp(W_\beta h_t)\right) \tag{13}$$
where $W_\beta$ maps the hidden layer activity $h_t$ to a scalar.
Importantly, we also use a scalar coefficient $g_t$ to interpolate the above weight estimate with the past weight as follows:
$$w_t = (1 - g_t)\, w_{t-1} + g_t\, \hat{w}_t \tag{14}$$
This function is similar to Eq. (6) in the gated RNN, except that we use a scalar to interpolate the weight updates, whereas the gated RNN uses a vector to update its hidden layer activity.
The memory content is retrieved from the external memory at time $t-1$ using
$$c_t = M_{t-1} w_{t-1} \tag{15}$$
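The read path of this section can be sketched in plain Python as follows. This is a hedged illustration under several assumptions, not the authors' Theano code: the memory is stored as a list of slots, the weights are taken to be a softmax over cosine similarities scaled by the sharpening scalar, a softplus keeps that scalar positive, and the interpolation coefficient `g` is simply passed in. The names `address`, `read`, `W_k`, and `W_beta` are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cosine(a, b, eps=1e-8):
    """Cosine similarity; eps guards against all-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + eps)


def softplus(a):
    """log(1 + exp(a)), always positive."""
    return math.log1p(math.exp(a))


def address(h, slots, w_prev, g, W_k, W_beta):
    """Return interpolated read weights over the memory slots."""
    key = matvec(W_k, h)                                    # key vector
    beta = softplus(sum(w * a for w, a in zip(W_beta, h)))  # sharpening scalar
    scores = [beta * cosine(key, slot) for slot in slots]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    w_hat = [e / total for e in exps]                       # normalized weights
    # Interpolate the new estimate with the past weights.
    return [(1.0 - g) * wp + g * wh for wp, wh in zip(w_prev, w_hat)]


def read(slots, w):
    """Weighted sum of the slots, i.e. the content vector."""
    m = len(slots[0])
    return [sum(w[c] * slots[c][i] for c in range(len(slots))) for i in range(m)]


# Toy example (assumed sizes): 3 slots of 2 elements, 2 hidden units.
slots = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h = [1.0, 0.0]
W_k = [[1.0, 0.0], [0.0, 1.0]]   # the key equals h in this toy case
W_beta = [1.0, 1.0]
w_prev = [1.0 / 3] * 3
w = address(h, slots, w_prev, g=0.5, W_k=W_k, W_beta=W_beta)
c = read(slots, w)
print(w, c)
```

In this toy case the first slot points in the same direction as the key, so it receives the largest weight, while the slot pointing the opposite way receives the smallest.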
3.3 External memory update
RNN-EM generates a new content vector $v_t$ to be added to its memory; i.e.,
$$v_t = W_v h_t \tag{16}$$
where $W_v$ is an $m \times p$ matrix for a hidden layer of size $p$. We use the above linear function based on the same intuition as in Sec. 3.2, namely that the new content and the hidden layer activity are in the same space as, or affine to, each other.
RNN-EM has a forget gate as follows
$$f_t = 1 - w_t \odot e_t \tag{17}$$
where $e_t$ is an erase vector, generated as $e_t = \mathrm{sigmoid}(W_{he} h_t)$. Notice that the $c$-th element in the forget gate is zero only if both the read weight $w_t$ and the erase vector $e_t$ have their $c$-th element set to one. Therefore, memory cannot be forgotten if it is not to be read.
RNN-EM has an update gate $u_t$. It simply uses the weight $w_t$ as follows
$$u_t = w_t \tag{18}$$
Therefore, memory is only updated if it is to be read.
With the two gates described above, the memory is updated as follows
$$M_t = M_{t-1}\,\mathrm{diag}(f_t) + v_t u_t^{\top} \tag{19}$$
where $\mathrm{diag}(\cdot)$ transforms a vector to a diagonal matrix with diagonal elements from the vector.
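The erase-then-add update described in this section can be sketched slot by slot in plain Python. This is an illustration under assumptions, not the authors' implementation: the memory is a list of slots, the erase vector is taken to be a sigmoid of a linear map of the hidden activity with one entry per slot, and the names `update_memory`, `W_v`, and `W_e` are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def update_memory(slots, h, w, W_v, W_e):
    """Scale each slot by its forget gate, then add weighted new content."""
    v = matvec(W_v, h)                            # new content vector
    e = [sigmoid(a) for a in matvec(W_e, h)]      # erase vector, one entry per slot
    f = [1.0 - wc * ec for wc, ec in zip(w, e)]   # forget gate
    u = list(w)                                   # update gate reuses the read weights
    return [[f[c] * slot[i] + u[c] * v[i] for i in range(len(v))]
            for c, slot in enumerate(slots)]


# Toy example (assumed sizes): 2 slots of 2 elements, 2 hidden units.
slots = [[1.0, 2.0], [3.0, 4.0]]
h = [0.5, -0.5]
w = [1.0, 0.0]                        # only the first slot is being read
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_e = [[0.0, 0.0], [0.0, 0.0]]        # gives an erase value of 0.5 for every slot
new_slots = update_memory(slots, h, w, W_v, W_e)
print(new_slots)
```

The second slot has zero read weight, so it is neither erased nor written to, which mirrors the two observations in the text that memory is forgotten and updated only where it is read.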
4 Experiments
4.1 Dataset
In order to compare the proposed model with alternative modeling techniques, we conducted experiments on a well-studied language understanding dataset, the Air Travel Information System (ATIS) [22, 23, 24]. The training part of this dataset consists of 4978 sentences and 56590 words. The test part has 893 sentences and 9198 words. The number of semantic labels is 127, including the common null label. We use lexicon-only features in experiments.
4.2 Comparison with the past results
The input to RNN-EM has a window size of 3, consisting of the current input word and its two neighboring words. We use the AdaDelta method [25] for gradient updates. The maximum number of training iterations was 50. Hyperparameters for tuning included the hidden layer size $p$, the number of memory slots $n$, and the dimension of each memory slot $m$. The best performing RNN-EM had a 100-dimensional hidden layer and 8 memory slots, each of 40 dimensions.
Table 2: F1 scores (%) on ATIS for RNN-EM and alternative models.
Method  F1 score
CRF [26]  92.94
simple RNN [4]  94.11
CNN [27]  94.35
LSTM [28]  94.85
GRNN  94.82
RNN-EM  95.25
Table 2 lists the F1 scores of RNN-EM, together with the previous best results of alternative models in the literature. Since there are no previous results for GRNN, we use our own implementation of it for this study. These results are the best obtained with their respective systems. The previous best result was achieved using LSTM. A change of 0.38% in F1 score from the LSTM result is significant at the 90% confidence level. RNN-EM improves on LSTM by 0.40% (95.25 versus 94.85), so the results in Table 2 show that RNN-EM is significantly better than the previous best result using LSTM.
4.3 Analysis on convergence and averaged performances
Model  hidden layer dimension  # of Parameters 

simple RNN  115  
LSTM  50  
GRNN  60  
RNN-EM†  100, 40 × 8

† 100-dimensional hidden layer and 8 memory slots, each of 40 dimensions.
Results in the previous section were obtained with models of different sizes. This section further compares neural network models that have approximately the same number of parameters, listed in Table 3. We use the AdaDelta [25] gradient update method for all these models. Figure 2 plots their training set entropy with respect to iteration number. To better illustrate convergence, we have converted entropy values to their logarithms. The results show that RNN-EM converges to a lower training entropy than the other models. RNN-EM also converges faster than the simple RNN and LSTM.
We further repeated the ATIS experiments 10 times with different random seeds for these neural network models. We evaluated their performance after convergence. Table 4 lists their averaged F1 scores, together with their maximum and minimum F1 scores. A change of 0.12% is significant at the 90% confidence level when comparing against the LSTM result. Results in Table 4 show that RNN-EM, on average, significantly outperforms LSTM. The best performance by RNN-EM is also significantly better than that of the best performing LSTM.
Table 4: Maximum, minimum, and averaged F1 scores (%) over 10 runs with different random seeds.
Method  Max  Min  Averaged
simple RNN  94.09  93.64  93.80
LSTM  94.81  94.62  94.73
GRNN  94.70  94.32  94.61
RNN-EM  95.22  94.71  94.96
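The comparison against LSTM in Table 4 can be checked with a few lines of arithmetic. The scores and the 0.12 threshold are copied from the text; the dictionary layout and the helper `gain_over_lstm` are only an illustrative assumption, and applying the same threshold to the maximum scores is a simplification.

```python
# F1 scores (%) copied from Table 4.
table4 = {
    "simple RNN": {"max": 94.09, "min": 93.64, "avg": 93.80},
    "LSTM": {"max": 94.81, "min": 94.62, "avg": 94.73},
    "GRNN": {"max": 94.70, "min": 94.32, "avg": 94.61},
    "RNN-EM": {"max": 95.22, "min": 94.71, "avg": 94.96},
}

# Change reported as significant at the 90% confidence level against LSTM.
THRESHOLD = 0.12


def gain_over_lstm(model, column):
    """Difference in F1 (%) between a model and LSTM, rounded to 2 decimals."""
    return round(table4[model][column] - table4["LSTM"][column], 2)


avg_gain = gain_over_lstm("RNN-EM", "avg")
max_gain = gain_over_lstm("RNN-EM", "max")
print(avg_gain, avg_gain > THRESHOLD)
print(max_gain, max_gain > THRESHOLD)
```

The averaged gain of 0.23 and the best-run gain of 0.41 both exceed the stated threshold, which is consistent with the claim that RNN-EM significantly outperforms LSTM.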
4.4 Analysis on memory size
The size of the external memory is proportional to the number of memory slots $n$. We fixed the dimension of memory slots to 40 and varied the number of slots. Table 5 lists their test set F1 scores. The best performing RNN-EM was with $n = 8$. Notice that RNN-EM with $n = 1$ performed better than the simple RNN, which had a 94.09% F1 score in Table 4. This can be explained by the gate functions in Eqs. (17) and (18) in RNN-EM, which are absent in simple RNNs. RNN-EM with $n = 1$ also performed similarly to the gated RNN, which had a 94.70% F1 score in Table 4, partly because of these gate functions.
Memory capacity may be measured using training set entropy. Table 5 shows that training set entropy decreases initially as $n$ increases from 1 to 8, showing that the memory capacity of RNN-EM is improved. However, the entropy increases as $n$ is increased further. This suggests that the memory capacity of RNN-EM cannot be increased simply by increasing the number of slots.
Table 5: Test set F1 scores (%) and training set entropy for different numbers of memory slots.
Number of slots  F1 score  Entropy
1  94.67  2.23
2  94.87  1.96
4  94.91  1.91
8  95.22  1.90
16  94.75  2.05
32  94.87  2.16
64  94.77  2.30
128  94.57  2.36
256  94.84  3.43
512  94.53  6.10
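The trend in Table 5 can be verified directly from its numbers. The values below are copied from the table; the variable names are illustrative assumptions.

```python
# Values copied from Table 5.
slot_counts = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
f1 = [94.67, 94.87, 94.91, 95.22, 94.75, 94.87, 94.77, 94.57, 94.84, 94.53]
entropy = [2.23, 1.96, 1.91, 1.90, 2.05, 2.16, 2.30, 2.36, 3.43, 6.10]

# Slot count with the highest test set F1 and with the lowest training entropy.
best_f1_slots = max(zip(f1, slot_counts))[1]
min_entropy_slots = min(zip(entropy, slot_counts))[1]

# Entropy falls up to 8 slots and rises for every larger setting.
falls_to_8 = all(a > b for a, b in zip(entropy[:3], entropy[1:4]))
rises_after_8 = all(a < b for a, b in zip(entropy[3:], entropy[4:]))
print(best_f1_slots, min_entropy_slots, falls_to_8, rises_after_8)
```

Both the best F1 score and the lowest training entropy occur at 8 slots, and the entropy grows monotonically beyond that point, which supports the observation that adding slots alone does not increase memory capacity.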

5 Related works
RNN-EM follows the same line of research as [19, 29], which uses external memory to improve the memory capacity of neural networks. Perhaps the closest work is the Neural Turing Machine (NTM) [19], which focuses on tasks that require simple inference and has proved effective in copy, repeat, and sorting tasks. NTM requires complex models because of these tasks. The proposed model is considerably simpler than NTM and can be considered an extension of the simple RNN. Importantly, we have shown through experiments on a common language understanding dataset that the external memory architecture yields promising results.
6 Conclusions and discussions
In this paper, we have proposed a novel neural network architecture, RNN-EM, that uses external memory to improve the memory capacity of simple recurrent neural networks. On a common language understanding task, RNN-EM achieves new state-of-the-art results and performs significantly better than the previous best result using long short-term memory neural networks. We have conducted experiments to analyze its convergence and memory capacity. These experiments provide insights for future research directions such as mechanisms for accessing memory contents and methods to increase memory capacity.
7 Acknowledgement
The authors would like to thank Shawn Tan and Kai Sheng Tai for useful discussions on NTM structure and implementation.
References
 [1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
 [2] R. Collobert and J. Weston, “A unified architecture for natural language processing: deep neural networks with multitask learning,” in ICML, 2008, pp. 160–167.
 [3] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH, 2010, pp. 1045–1048.
 [4] K. Yao, G. Zweig, M. Hwang, Y. Shi, and D. Yu, “Recurrent neural networks for language understanding,” in INTERSPEECH, 2013, pp. 2524–2528.
 [5] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. M. Schwartz, and J. Makhoul, “Fast and robust neural network joint models for statistical machine translation,” in ACL, 2014, pp. 1370–1380.
 [6] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in EMNLP, 2014, pp. 1724–1734.
 [7] W. Ward et al., “The CMU air travel information service: Understanding spontaneous speech,” in Proceedings of the DARPA Speech and Natural Language Workshop, 1990, pp. 127–129.
 [8] C. Raymond and G. Riccardi, “Generative and discriminative algorithms for spoken language understanding,” in INTERSPEECH, 2007, pp. 1605–1608.
 [9] R. de Mori, “Spoken language understanding: a survey,” in ASRU, 2007, pp. 365–376.
 [10] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in ICML, 2001, pp. 282–289.
 [11] T. Kudo and Y. Matsumoto, “Chunking with support vector machines,” in NAACL, 2001.
 [12] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig, “Using recurrent neural networks for slot filling in spoken language understanding,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 530–539, 2015.
 [13] Y. Bengio, P. Y. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
 [14] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in ICML, 2013, pp. 1310–1318.
 [15] J. Martens and I. Sutskever, “Training deep and recurrent networks with Hessian-free optimization,” in Neural Networks: Tricks of the Trade, Second Edition, 2012, pp. 479–535.
 [16] J. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990.
 [17] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [18] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” in ICASSP, 2013, pp. 6645–6649.
 [19] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” CoRR, vol. abs/1410.5401, 2014.
 [20] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, “Theano: new features and speed improvements,” Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
 [21] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), Jun. 2010, oral presentation.
 [22] D. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg, “Expanding the scope of the ATIS task: The ATIS-3 corpus,” in Proceedings of the workshop on Human Language Technology. Association for Computational Linguistics, 1994, pp. 43–48.
 [23] Y.-Y. Wang, A. Acero, M. Mahajan, and J. Lee, “Combining statistical and knowledge-based spoken language understanding in conditional models,” in COLING/ACL, 2006, pp. 882–889.
 [24] G. Tur, D. Hakkani-Tür, and L. Heck, “What’s left to be understood in ATIS?” in IEEE Workshop on Spoken Language Technologies, 2010.
 [25] M. D. Zeiler, “ADADELTA: An adaptive learning rate method,” arXiv:1212.5701, 2012.
 [26] G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of recurrent-neural-network architectures and learning methods for language understanding,” in INTERSPEECH, 2013.
 [27] P. Xu and R. Sarikaya, “Convolutional neural network based triangular CRF for joint intent detection and slot filling,” in ASRU, 2013, pp. 78–83.
 [28] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “Spoken language understanding using long shortterm memory neural networks,” in IEEE SLT, 2014.
 [29] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” submitted to ICLR, vol. abs/1410.3916, 2015. [Online]. Available: http://arxiv.org/abs/1410.3916