Language Modeling through Long Term Memory Network
*The article was accepted to IJCNN 2019
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Recurrent Neural Networks (RNN), Long Short-Term Memory Networks (LSTM), and Memory Networks, all of which contain memory, are widely used to learn patterns in sequential data. Sequential data contains long sequences that hold relationships. RNN can handle long sequences but suffers from the vanishing and exploding gradient problems. While LSTM and other memory networks address this problem, they are not capable of handling long sequences (sequence patterns of 50 or more data points). Language modelling tasks that require learning from longer sequences are affected by the need to hold more information in memory. This paper introduces the Long Term Memory network (LTM), which can tackle the exploding and vanishing gradient problems and handle long sequences without forgetting. LTM is designed to scale the data in memory and to give a higher weight to the input in the sequence. LTM avoids overfitting by scaling the cell state after achieving the optimal results. LTM is tested on the Penn Treebank and Text8 datasets, where it achieves test perplexities of 83 and 82 respectively. 650 LTM cells achieved a test perplexity of 67 for Penn Treebank, and 600 cells achieved a test perplexity of 77 for Text8. LTM achieves state-of-the-art results using only ten hidden LTM cells for both datasets.
Natural language understanding requires processing sequential data. Natural language is time-dependent, and past information can influence the current and future output. Therefore, models which are capable of processing sequential data are required. Memory determines a model's capability of recalling past information. Sequential deep learning models have been shown to achieve state-of-the-art results in natural language understanding tasks such as question answering, machine translation, and language modelling.
Memory networks have a recurrent behaviour, which uses past outputs to influence the current output. As the sequence length increases, the effect of earlier steps on the current input is reduced, and after a certain number of steps that effect becomes invisible. In order to understand a language, the model is required to learn from past knowledge. The information relevant to understanding language is spread throughout the sequence. Therefore, long-term memory is required for natural language understanding.
Recurrent Neural Networks (RNNs) are capable of handling long sequences but suffer from the exploding and vanishing gradient problems. In order to overcome the issue, Long Short-Term Memory Networks (LSTMs), the Simple Recurrent Network and the Memory Network clip the gradient. These models still suffer from the vanishing gradient problem when the sequences are long. The gradients of the non-linear functions are close to zero, and the gradient is multiplied repeatedly as it is back-propagated through time. When the eigenvalues are small, the gradient converges to zero rapidly. Therefore, these models are capable of handling only short-term dependencies.
LSTM, GRU and the SRN proposed by Mikolov use gates to control the vanishing gradient problem. The gates control the memory sequence and prevent the overflow of data. The forget gate in the LSTM is a crucial element which forgets the past sequence. The gates control or forget the previous sequences which influence the current input. Therefore, these memory networks do not handle long-term sequences.
Holding longer sequences in memory is important for properly understanding a language, and it is also necessary in many long-term dependency tasks. In order to remember long sequences and to prevent the learning model from suffering from the vanishing gradient problem, the Long Term Memory Network (LTM) is introduced in this paper. LTM does not forget the past sequences. LTM incorporates the past outputs and current inputs. LTM generalises the past sequences and places a higher emphasis on the new inputs in order to support natural language understanding. LTM was tested on language modelling tasks based on long-term memory dependencies, using the Penn Treebank and Text8 datasets, and it outperformed the current state-of-the-art memory network models.
Long-term memory dependencies require learning from patterns. Memory networks are used in order to learn long-term dependencies. Memory networks including RNN and LSTM are used for many natural language tasks such as question answering, speech to text, language modelling and time series analysis. These memory networks have been shown to achieve state-of-the-art results on benchmark datasets. However, RNN, LSTM and other memory networks perform differently from each other, and each has its own merits.
RNN is capable of handling an infinite continuous sequence. It takes an input and passes the value on continuously. The output is looped back and combined with the input. Learning long-term dependencies fails due to the exploding and vanishing gradient problem. This is due to the direct influence of the past information on the current input (1). The internal state of an RNN for the current input can be defined as:
where the activation can be any activation function (e.g. tanh, ReLU), one weight is applied to the current input, and W is the weight applied to the past state. Therefore, the overall output is affected by the past outputs. When the weighted past state is added to the weighted current input, the past state directly affects the current state, as shown in (1).
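A common way to write the recurrent update described above is sketched here. The symbols $h_t$ (current state), $h_{t-1}$ (past state), $x_t$ (current input) and $W_x$ (input weight) are names assumed for illustration; only $W$ and the role of the activation function come from the text, so this should be read as a standard form consistent with the description rather than a verbatim copy of (1).

```latex
\begin{equation*}
h_t = \operatorname{activation}\left( W_x\, x_t + W\, h_{t-1} \right)
\end{equation*}
```

Because $h_{t-1}$ enters the sum directly, every earlier state contributes to $h_t$ through repeated application of $W$ and the activation.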
LSTM was introduced in order to handle the vanishing and exploding gradient problem. The forget gate was later added to the original LSTM. It prevents the internal state from growing indefinitely and the network from breaking down. The forget gate resets the cell state when it decides to forget the past sequence. The cell state either holds the past inputs within the network or is reset to forget the past information held in the network. LSTM has been shown to be a stable model that is not affected by the vanishing and exploding gradient problem. However, the LSTM is only capable of handling short-term dependencies.
Traditional memory networks (RNN and LSTM) have been shown to handle natural language understanding tasks. RNN is capable of handling continuous data streams which are entered into the network, as in speech recognition and language modelling. LSTM has been shown to perform more complex tasks such as question answering. The traditional memory networks and the specialised memory networks (Dynamic Memory Network and Reinforced Memory Network) benefit from learning longer dependencies in order to understand language. Longer dependencies are captured by adding more hidden layers. The hidden layers would also contribute towards the vanishing and exploding gradient. Therefore, forgetting the past sequences is one main approach used in memory networks. This affects long-term dependency learning.
The vanishing and exploding gradient is one of the most problematic issues when training memory networks through backpropagation. A memory network is trained by deriving the gradients of the network weights using backpropagation and the chain rule. Consider a long sequence of more than 30 words as the input: “I was born in France. I moved to UK when I was 5 years old … I speak fluent French”. For a language model, predicting the last word of the paragraph, “French”, requires learning through a long dependency reaching back to the word “France”. Passing the paragraph through an RNN can cause the vanishing and exploding gradient problem. This problem occurs while the RNN is training. Gradients from the deeper layers have to go through matrix multiplications under the chain rule, and if the previous layers have small values, the gradient declines exponentially. Such gradient values are too insignificant for the model to learn from; this is the vanishing gradient problem. If the gradient is large, it grows larger and explodes, which negatively affects the model's training; this is the exploding gradient problem.
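The exponential decline and growth described above can be illustrated with a minimal sketch. It replaces the repeated matrix products of backpropagation through time with a single scalar factor per time step, which is a simplifying assumption; the function name `gradient_magnitude` and the factors 0.5 and 1.5 are chosen here only for illustration.

```python
def gradient_magnitude(per_step_factor: float, num_steps: int) -> float:
    """Magnitude of a gradient after it has been multiplied by the same
    factor once per time step (a scalar stand-in for the repeated
    Jacobian products of backpropagation through time)."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= per_step_factor
    return grad


if __name__ == "__main__":
    for steps in (5, 30, 50):
        shrinking = gradient_magnitude(0.5, steps)  # factor below 1: vanishes
        growing = gradient_magnitude(1.5, steps)    # factor above 1: explodes
        print(f"{steps:>2} steps: 0.5 -> {shrinking:.3e}, 1.5 -> {growing:.3e}")
```

With a factor below one the product shrinks towards zero as the number of steps grows, and with a factor above one it grows without bound, which is the scalar analogue of the two failure modes.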
Clipping the gradients places a predefined threshold on the gradient, which changes the gradient length and attempts to control the vanishing and exploding gradient problem of RNN. Gradient clipping affects the convergence of the gradient. Clipping also requires a target to be defined at every time step, which increases the complexity. LSTM and other memory networks avoid the vanishing and exploding gradient by using gates which control the passing of past outputs to the current input. Memory networks including LSTM forget the past outputs which the network deems irrelevant. Attention-based memory networks avoid the vanishing and exploding gradient by focusing on only a few factors which are relevant to the task. These methods of avoiding the vanishing and exploding gradient prevent the network from prolonging its memory. In the example, if either the sequence is long or the model does not identify “France” as relevant, it is removed from memory. A model that no longer knows “France” is directly hindered in predicting the last word, “French”.
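As a rough illustration of threshold-based clipping, the sketch below rescales a gradient vector whenever its Euclidean norm exceeds a predefined threshold. This is one widely used variant (clipping by norm) and is not necessarily the exact scheme used by the models cited above; the function name `clip_by_norm` is hypothetical. Rescaling of this kind bounds large gradients but does not enlarge small ones.

```python
import math
from typing import List


def clip_by_norm(gradient: List[float], threshold: float) -> List[float]:
    """Return a copy of `gradient` whose Euclidean norm is at most `threshold`.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm == 0.0 or norm <= threshold:
        return list(gradient)  # already within the threshold
    scale = threshold / norm
    return [g * scale for g in gradient]


if __name__ == "__main__":
    print(clip_by_norm([3.0, 4.0], threshold=1.0))  # norm 5 is rescaled to norm 1
    print(clip_by_norm([0.3, 0.4], threshold=1.0))  # norm 0.5 is left unchanged
```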
A long-term memory network should be capable of holding all the past sequences without being affected by the vanishing or exploding gradient.
III Proposed Methodology for Long Term Memory
The proposed model has two main objectives: 1) to handle longer sequences; and 2) to overcome the vanishing gradient. The proposed LTM is structured such that it is capable of holding and generalising old sequences while placing an emphasis on the recent information. Fig. 1 shows a single LTM cell, which holds long-term memory and generalises the past sequences.
Retaining longer memory sequences is a crucial requirement in natural language understanding, since the past sequences affect the current inputs. Furthermore, LTM gives an emphasised weight to the current input. The LTM holds three states:
input state: handles the current input to pass on to the output.
cell state: carries the past information from each step to the next.
output state: handles the current output and passes the output to the cell state.
The LTM's functionality relies on the gate structure within it. The LTM cell contains four gates: the first three gates act on the inputs, and the last gate controls and generalises the cell state. However, unlike the LSTM with its forget gate, the LTM does not reset its cell state. Therefore, LTM is capable of holding longer sequences in memory. The following sections provide the details of the architecture.
III-A Input state
The input is combined with the previous output and passed on to Sigmoid_1, as shown in (2). In (2), the sigmoid function is applied together with the weight for this gate. The result is a by-product that affects the LTM cell and depends on the current input and the previous output.
Similarly, (3) has the same form with its own gate weight, which gives a higher impact to the current input, although scaled through the sigmoid functions (Sigmoid_1 and Sigmoid_2). The two equations (2) and (3) support long-term memory by emphasising the current input and adding it to the past input.
In order to emphasise the effect of the current input on the output, the results of (2) and (3) are combined through a dot operation, as shown in (4).
The result of (4) amplifies the effect of the current input and the past output on the output. It is passed on to the cell state, which carries it along to the future steps of the sequence.
III-B Cell state
The cell state, similar to the LSTM's cell state, carries forward the past outputs to the present cell. Natural language understanding requires both past outputs and current inputs. The current input is emphasised over the past outputs. Therefore, the result of (4), which has a higher value that incorporates the current input, is passed on to the cell state as shown in (5). As a result, the current input has a higher effect on the output. As shown in (5), the current cell state combines the current input and the past output.
The final cell state, as shown in (6), is calculated by passing the result of (5) through Sigmoid_4 with its own weight. Through this, the LTM scales the cell state. The cell state carries a scaled value on to the final output state.
III-C Output state
The cell state and the gated input are combined through the dot operation to create the final output, as shown in (8). The final output is strongly influenced by the current input as well as by the past outputs. Therefore, the impacts of both the past and the current input are combined, as shown in (8).
The output and the cell state are passed on to the next time step, as shown in Fig. 3. LTM is used as a cell, and each cell passes on its output and its cell state. This also shows how the cells pass the past outputs on and combine them with the current inputs.
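The three states described in this section can be sketched as a single scalar cell step. This is an illustrative reading of the prose, not the authors' implementation: it assumes that "combined" means addition, that each of the four gates is a sigmoid with one scalar weight, that the third gate acts on the same combined input as the first two, and that the output is the product of the scaled cell state and that third gate. All names (`LTMWeights`, `ltm_step`, `run_sequence`, `w1` to `w4`) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


@dataclass
class LTMWeights:
    """Hypothetical scalar weights, one per sigmoid gate."""
    w1: float = 1.0  # Sigmoid_1, first input gate
    w2: float = 1.0  # Sigmoid_2, second input gate
    w3: float = 1.0  # third input gate (assumed form)
    w4: float = 1.0  # Sigmoid_4, scales the cell state


def ltm_step(x_t: float, h_prev: float, c_prev: float,
             w: LTMWeights) -> Tuple[float, float]:
    """One assumed LTM step; returns (output, cell state)."""
    combined = x_t + h_prev              # assumption: input and past output are added
    gate1 = sigmoid(w.w1 * combined)     # first gate on the combined input
    gate2 = sigmoid(w.w2 * combined)     # second gate, different weight
    emphasised = gate1 * gate2           # product that emphasises the current input
    c_raw = c_prev + emphasised          # cell state accumulates; no reset or forget
    c_t = sigmoid(w.w4 * c_raw)          # scaling keeps the cell state in (0, 1)
    gate3 = sigmoid(w.w3 * combined)     # assumed third gate on the input
    h_t = c_t * gate3                    # output combines cell state and gated input
    return h_t, c_t


def run_sequence(inputs: Iterable[float], w: LTMWeights) -> List[Tuple[float, float]]:
    """Feed a sequence through one cell, passing output and cell state forward."""
    h, c = 0.0, 0.0
    history = []
    for x in inputs:
        h, c = ltm_step(x, h, c, w)
        history.append((h, c))
    return history


if __name__ == "__main__":
    for step, (h, c) in enumerate(run_sequence([0.5, -1.0, 2.0], LTMWeights()), 1):
        print(f"step {step}: output={h:.4f}, cell state={c:.4f}")
```

Under these assumptions there is no gate that resets the cell state: every step adds to it, and the final sigmoid only rescales the accumulated value.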
In order to demonstrate long-term dependency learning, LTM is tested on language modelling. Three types of experiments are conducted to evaluate the LTM using the Penn Treebank and Text8 datasets. The Penn Treebank dataset contains 2499 stories from the Wall Street Journal in raw text format. The Text8 dataset contains over 240000 Wikipedia articles. Articles from both datasets contain long-range dependencies between words. LTM is evaluated on the two datasets against the current state-of-the-art models, and finally LTM is evaluated against itself by changing the number of cells to find the cell count which generates the best results.
LTM was first evaluated on the Penn Treebank dataset. Following the pre-processing of Mikolov et al., the data is split into a training set of 930K tokens, a validation set of 74K tokens and a test set of 82K tokens. The dataset has a vocabulary of 10K words. In order to match the experiments of the current state-of-the-art models, 300 LTM cells are used.
The second dataset, Text8, has a vocabulary of 44K words from Wikipedia. The dataset has 15.3M training tokens, 848K validation tokens and 855K test tokens. The settings are similar to those of previous work on this dataset. Words which occur ten times or fewer are replaced by an unknown token. 500 LTM cells are used in the experiments.
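The rare-word rule for Text8 can be sketched as follows. The token string `<unk>` and the function name `replace_rare_words` are assumptions made for illustration; the text specifies only that words occurring ten times or fewer are mapped to an unknown token.

```python
from collections import Counter
from typing import List


def replace_rare_words(tokens: List[str], max_rare_count: int = 10,
                       unk_token: str = "<unk>") -> List[str]:
    """Replace every token whose corpus frequency is at most
    `max_rare_count` with `unk_token`."""
    counts = Counter(tokens)
    return [tok if counts[tok] > max_rare_count else unk_token for tok in tokens]


if __name__ == "__main__":
    corpus = ["the"] * 11 + ["rare"] * 10 + ["once"]
    cleaned = replace_rare_words(corpus)
    print(Counter(cleaned))
```

In this toy corpus only "the" (11 occurrences) survives; "rare" (10 occurrences) and "once" (1 occurrence) fall at or below the threshold and are replaced.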
In order to evaluate how the model's performance depends on its size, the number of cells is gradually increased and tested on both the Penn Treebank and Text8 datasets. The experimental conditions are the same as in the experiments above, except for the number of layers. All the learning models on the Penn Treebank dataset follow the settings of previous work on that dataset, and the experiments on Text8 follow those of previous work on Text8; this includes the inputs and the hyper-parameters.
LTM's long-term memory is tested on the Penn Treebank and Text8 datasets. The results are evaluated using perplexity, shown in (9). Perplexity is the inverse probability of the test set, normalised by the number of words. The lower the perplexity, the better the model.
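The definition of perplexity given above, the inverse probability of the test set normalised by the number of words, can be computed as the exponential of the average negative log-probability. The sketch below assumes the model's probability for each test word is already available as a list; the function name `perplexity` is chosen here for illustration.

```python
import math
from typing import Sequence


def perplexity(word_probabilities: Sequence[float]) -> float:
    """Perplexity of a test set given the probability the model assigned
    to each of its words: exp(-(1/N) * sum(log p_i))."""
    if not word_probabilities:
        raise ValueError("at least one word probability is required")
    n = len(word_probabilities)
    avg_neg_log_prob = -sum(math.log(p) for p in word_probabilities) / n
    return math.exp(avg_neg_log_prob)


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every word is as uncertain
    # as a uniform choice among four words.
    print(perplexity([0.25] * 8))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities directly would cause on a long test set.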
The first experiment was based on the Penn Treebank dataset. Results are shown in Table I. LTM is tested against the traditional memory and recurrent networks and the current state-of-the-art model (Delta-RNN). RNN with 300 hidden layers had the lowest performance of the tested models, achieving a test perplexity of 129. This indicates that RNN is not capable of handling long-term dependencies. Although LSTM outperforms the RNN, specialised models built for long-term memory outperform the generalised models. LTM achieves a test perplexity of 83 with 300 units, which is 20 points better than the current state-of-the-art result. Furthermore, LTM achieves state-of-the-art results with ten hidden layers (Table III).
LTM was also tested on the Text8 dataset with 500 hidden layers. The LTM was compared against the traditional memory networks and the current state-of-the-art model (MemNet) (Table II). LTM outperforms all the state-of-the-art models using only ten hidden units (Table III). The specialised memory networks built for long-term dependencies are shown to outperform the generic memory networks.
LTM was tested on the Text8 and Penn Treebank datasets with an increasing number of hidden layers in order to identify the best-performing number of hidden layers. Table III shows the validation and testing perplexity for Text8 and Penn Treebank as the number of hidden layers increases. Table III also shows that LTM achieves state-of-the-art results with only ten hidden layers, whereas other networks require 300 hidden layers or more to achieve state-of-the-art results. The results are further improved by increasing the number of hidden layers. LTM achieved its best results on the Penn Treebank dataset with 650 hidden layers and on Text8 with 600 hidden layers.
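Table I (Penn Treebank)
| # hidden layers | Validation perplexity | Test perplexity |
Table II (Text8)
| # hidden layers | Validation perplexity | Test perplexity |
Table III (increasing number of hidden layers)
| # hidden layers | Text8 train perplexity | Text8 test perplexity | Penn Treebank train perplexity | Penn Treebank test perplexity |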
The structure of the LTM, as shown in Fig. 1, is designed to hold the inputs passed through the LTM cell and to scale the output. The use of the sigmoid functions is crucial to maintaining a scaled output. Equation (6) is used to create the cell state and the output. The sigmoid function in (6) scales the cell state in order to prevent the exploding or vanishing gradient problem. Since the cell state is scaled and passed on from one time step to the next, the cell state value does not explode or vanish, which prevents the vanishing or exploding gradient. The vanishing and exploding gradient is the main reason for a memory network to forget or underperform. In order to prevent the exploding or vanishing gradient, LSTM introduced the forget gate. Using the forget gate, the LSTM can handle longer sequences and forget the sequence when irrelevant sequences are presented to it. However, the past sequences, although not substantially relevant, have an effect in long-term natural language understanding tasks. LSTM therefore has a shortcoming in long-term memory. LTM scales the outputs and holds them in memory. Therefore, even long dependencies affect the final output of the LTM.
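The effect of scaling the cell state at every time step can be illustrated with a small sketch. It assumes, as a simplification, that each step adds a contribution to the previous cell state and then passes the sum through a sigmoid with a scalar weight; the names `scaled_states` and `unscaled_states` are hypothetical. The sketch shows only that the forward value of the state stays bounded, in contrast to plain accumulation.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def scaled_states(contributions: Sequence[float], weight: float = 1.0) -> List[float]:
    """Accumulate each contribution, then rescale with a sigmoid every step."""
    state, states = 0.0, []
    for u in contributions:
        state = sigmoid(weight * (state + u))
        states.append(state)
    return states


def unscaled_states(contributions: Sequence[float]) -> List[float]:
    """Plain accumulation without any scaling, for comparison."""
    state, states = 0.0, []
    for u in contributions:
        state += u
        states.append(state)
    return states


if __name__ == "__main__":
    steps = [0.9] * 1000
    print("scaled, last state:  ", scaled_states(steps)[-1])
    print("unscaled, last state:", unscaled_states(steps)[-1])
```

After a thousand steps the scaled state is still strictly between 0 and 1, while the unscaled sum has grown in proportion to the sequence length.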
LTM gives a high impact to the new inputs. LTM combines the results of (2) and (3) in order to pass a higher impact from the current input to the output, as shown in (4). Therefore, the LTM gives a higher priority to the new inputs, which are more relevant to the current output. Equation (8) shows the effect on the final output, which combines both the processed input and the cell state, which carries the past sequential information.
Language modelling is one evaluation method for analysing the long-term dependencies of LTM. The Penn Treebank and Text8 datasets require learning over longer ranges. Language modelling requires a clear understanding of the entire text, rather than a window of text. Predicting and understanding text is easier for a model that holds the entire article. Through scaling, LTM holds all the information passed through it. Therefore, LTM is capable of forming a clear picture of the entire article. Attention-based memory networks identify the most relevant information, and the network predicts based on the information the attention has captured. Attention-based memory networks are capable of handling shorter sequences but fail to hold long sequences, because the attention diverts when given longer sequences. LTM does not focus on selected parts of memory and holds all past inputs. Unlike attention-based networks, which forget the least relevant information even though it might become relevant later in the sequence, LTM holds all the information passed through the model.
Table I and Table II compare the LTM with other state-of-the-art models and traditional memory networks. They show that LTM is capable of handling longer sequences and produces state-of-the-art results. LTM’s longer memory plays a crucial role in language modelling tasks. Table III shows that increasing the number of LTM cells further improves the results and produces lower perplexity scores. LTM has been shown to hold longer sequences and to be unaffected by the vanishing and exploding gradient.
Similar to LSTM, LTM avoids the vanishing or exploding gradient using gates. LTM uses gates to enhance the input passed to the network. LTM handles long-term dependencies by using sigmoid functions to scale the new inputs and carry on the past outputs at the gates. LTM handles long sequences through the scaling. In the example “I was born in France. I moved to UK when I was 5 years old … I speak fluent French”, predicting “French” is attainable since the model holds the entire sequence in memory. LTM carries forward the entire sequence, allowing the model to use the entirety of the sequence, which holds the most important factor the model needs, to predict the final word. LTM is capable of handling the vanishing and exploding gradient as well as long-term dependencies.
Fig. 3 shows the LSTM cell, which holds three gates (forget gate, input gate and output gate). LSTM uses a combination of sigmoid and tanh activation functions, while LTM relies only on sigmoid. Comparing Fig. 1 with Fig. 3 indicates the core difference between LSTM and LTM. LTM uses generalisation through the sigmoid activation functions to hold a longer sequence without forgetting the past information. LSTM, however, forgets longer sequences through the forget gate in order to maintain the network's stability. LSTM sacrifices long-term dependencies for network stability.
This paper presents a long-term memory network which is capable of handling long-term dependencies. LTM is capable of handling long sequences without being affected by the vanishing or exploding gradient. LTM has been shown to outperform the traditional LSTM and RNN as well as the memory-specific networks in language modelling. LTM was tested on both the Penn Treebank and Text8 datasets, on which it outperformed all state-of-the-art memory networks using minimal hidden units. Increasing the number of hidden units has shown that the LTM is not affected by the vanishing and exploding gradient. With more hidden units, the LTM achieved lower perplexity scores and stabilised.
This work was partially supported by a Murdoch University internal grant on the high-power computer.
- A. Nugaliyadde, K. W. Wong, F. Sohel, and H. Xie, “Reinforced memory network for question answering,” in International Conference on Neural Information Processing. Springer, 2017, pp. 482–490.
- D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
- Z. Yang, W. Chen, F. Wang, and B. Xu, “Multi-sense based neural machine translation,” in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 3491–3497.
- T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato, “Learning longer memory in recurrent neural networks,” arXiv preprint arXiv:1412.7753, 2014.
- A. G. Ororbia II, T. Mikolov, and D. Reitter, “Learning simpler language models with the differential state framework,” Neural Computation, vol. 29, no. 12, pp. 3327–3352, 2017.
- M. D. Singh and M. Lee, “Temporal hierarchies in multilayer gated recurrent neural networks for language models,” in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 2152–2157.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- J. Weston, S. Chopra, and A. Bordes, “Memory networks,” arXiv preprint arXiv:1410.3916, 2014.
- Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
- F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” 1999.
- A. Nugaliyadde, K. W. Wong, F. Sohel, and H. Xie, “Enhancing semantic word representations by embedding deeper word relationships,” arXiv preprint arXiv:1901.07176, 2019.
- A. Graves, G. Wayne, and I. Danihelka, “Neural Turing machines,” arXiv preprint arXiv:1410.5401, 2014.
- S. Sukhbaatar, J. Weston, R. Fergus et al., “End-to-end memory networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2440–2448.
- J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov, “Towards AI-complete question answering: A set of prerequisite toy tasks,” arXiv preprint arXiv:1502.05698, 2015.
- S. Boukoros, A. Nugaliyadde, A. Marnerides, C. Vassilakis, P. Koutsakis, and K. W. Wong, “Modeling server workloads for campus email traffic using recurrent neural networks,” in International Conference on Neural Information Processing. Springer, 2017, pp. 57–66.
- R. Pascanu and Y. Bengio, “Learning to deal with long-term dependencies,” Neural Computation, vol. 9, pp. 1735–1780, 1986.
- H. Salehinejad, “Learning over long time lags,” arXiv preprint arXiv:1602.04335, 2016.
- R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, 2013, pp. 1310–1318.
- S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” 2001.
- T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
- S.-H. Chen, S.-H. Hwang, and Y.-R. Wang, “An RNN-based prosodic information synthesizer for Mandarin text-to-speech,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 3, pp. 226–239, 1998.
- T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
- A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher, “Ask me anything: Dynamic memory networks for natural language processing,” in International Conference on Machine Learning, 2016, pp. 1378–1387.
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
- H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
- E. Cambria and B. White, “Jumping NLP curves: A review of natural language processing research,” IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48–57, 2014.
- A. Taylor, M. Marcus, and B. Santorini, “The Penn Treebank: an overview,” in Treebanks. Springer, 2003, pp. 5–22.
- T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur, “Extensions of recurrent neural network language model,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5528–5531.
- Z. Xie, S. I. Wang, J. Li, D. Lévy, A. Nie, D. Jurafsky, and A. Y. Ng, “Data noising as smoothing in neural network language models,” arXiv preprint arXiv:1703.02573, 2017.