Feedforward Sequential Memory Neural Networks without Recurrent Feedback

Shiliang Zhang, Hui Jiang, Si Wei, Lirong Dai
National Engineering Laboratory for Speech and Language Information Processing
University of Science and Technology of China, Hefei, Anhui, China
Department of Electrical Engineering and Computer Science
York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
IFLYTEK Research, Hefei, Anhui, China
zsl2008@mail.ustc.edu.cn, hj@cse.yorku.ca, siwei@iflytek.com, lrdai@ustc.edu.cn
Abstract

We introduce a new structure for memory neural networks, called feedforward sequential memory networks (FSMN), which can learn long-term dependency without using recurrent feedback. The proposed FSMN is a standard feedforward neural network equipped with learnable sequential memory blocks in its hidden layers. In this work, we apply FSMN to several language modeling (LM) tasks. Experimental results show that the memory blocks in FSMN can learn effective representations of long history, and that FSMN-based language models significantly outperform not only feedforward neural network (FNN) based LMs but also the popular recurrent neural network (RNN) LMs.

 


1 Introduction

When machine learning methods are applied to model sequential data such as text, speech and video, it is very important to take advantage of long-term dependencies. Traditional approaches capture the long-term structure within sequential data using recurrent feedback, as in regular recurrent neural networks (RNNs) or LSTM-based models [1, 2, 3, 4]. RNNs can learn and carry out complicated transformations of data over extended periods of time and store the memory in the weights of the network, so they are becoming more and more popular in sequential data modeling tasks. More recently, there has also been a surge of interest in constructing neural computing models with various forms of explicit memory units, which differ from RNNs [5, 6, 7, 8]. For example, in [6], the proposed memory networks employ a memory component that can be read from and written to. In [5], the proposed neural Turing machines (NTM) extend neural networks by coupling them with external memory resources, and can learn to sort a small set of numbers as well as perform other symbolic manipulation tasks.

In this work, we propose a simpler structure for memory neural networks, namely feedforward sequential memory networks (FSMN), which can learn long-term dependency in sequential data without using recurrent feedback. In an FSMN, we extend the standard feedforward neural network by introducing memory blocks in the hidden layers. Unlike RNNs, the overall FSMN remains a feedforward structure, so it can be learned in much more efficient and stable ways than RNNs and LSTMs. We have evaluated the performance of FSMN on two language modeling tasks: the Penn Treebank (PTB) and the English wiki data in the Large Text Compression Benchmark (LTCB). In both tasks, FSMN-based language models significantly outperform not only the standard FNN-LMs but also the popular recurrent neural network (RNN) LMs.

2 Our Approach

2.1 Feedforward Sequential Memory Neural Networks

Figure 1: Illustration of feedforward sequential memory networks and comparison with RNNs.

A feedforward sequential memory network (FSMN) is a standard feedforward neural network with one or multiple memory blocks in its hidden layers. For instance, Figure 1 (a) shows an FSMN with one memory block added into its second hidden layer. Given an input sequence $X = \{x_1, \cdots, x_T\}$, each $x_t$ represents the input at time instance $t$. The corresponding hidden layer outputs are denoted as $H = \{h_1, \cdots, h_T\}$, each $h_t \in \mathbb{R}^D$. As shown in Figure 1 (b), we can use a tapped-delay structure to encode $h_t$ and its previous $N$ histories into a fixed-sized representation $\tilde{h}_t$ in the memory block:

$$\tilde{h}_t = f\Big( \sum_{i=0}^{N} a_i \cdot h_{t-i} \Big) \qquad (1)$$

where all coefficients form a learnable filter vector $a = [a_0, a_1, \cdots, a_N]$, and $f(\cdot)$ is the activation function (sigmoid or ReLU). Furthermore, as shown in Figure 1 (a), $\tilde{h}_t$ may be fed into the next hidden layer in the same way as $h_t$.
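As a concrete illustration of eq. (1), the following minimal NumPy sketch (ours, not the authors' code; the toy sizes and coefficient values are arbitrary) computes the memory block output at every position by taking an activation of the weighted sum of the current hidden output and its previous N outputs:

    import numpy as np

    def memory_block(H, a, f=lambda z: np.maximum(0.0, z)):
        """Tapped-delay memory block of eq. (1).
        H: (T, D) hidden outputs h_1..h_T; a: filter coefficients a_0..a_N."""
        T, D = H.shape
        N = len(a) - 1
        H_tilde = np.zeros_like(H)
        for t in range(T):
            # weighted sum over the current step and up to N previous steps
            window = sum(a[i] * H[t - i] for i in range(min(N, t) + 1))
            H_tilde[t] = f(window)
        return H_tilde

    # toy usage: 5 time steps, 3 hidden units, a 2nd-order filter
    H = np.random.randn(5, 3)
    a = np.array([1.0, 0.5, 0.25])
    print(memory_block(H, a).shape)   # (5, 3)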

From the viewpoint of signal processing, the memory block in FSMN can be regarded as a high-order finite impulse response (FIR) filter, while the recurrent layer in RNNs, namely $h_t = f(W x_t + \tilde{W} h_{t-1} + b)$, may be viewed as a first-order infinite impulse response (IIR) filter, see Figure 1 (c). Obviously, the vector $a$ may be regarded as the coefficients of an $N$-order FIR filter. IIR filters are more compact than FIR filters, but they may be difficult to implement: in some cases IIR filters become unstable, whereas FIR filters are always stable. Moreover, learning IIR-filter-like RNNs requires the so-called back-propagation through time (BPTT), which significantly increases the computational complexity of learning and also causes the problems of gradient vanishing and exploding [10]. On the other hand, the proposed FIR-filter-like FSMNs can be efficiently learned using the standard back-propagation procedure. Therefore, the learning of FSMNs is more stable and efficient than that of RNNs.
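To make the FIR/IIR analogy concrete, the short sketch below (our illustration; the scalar weights are arbitrary) contrasts the two filters on a scalar signal: the FSMN-style memory is a finite weighted window over past values, while an RNN-style first-order recursion feeds its own previous output back, so each output depends on the entire history and the recursion diverges once the recurrent weight exceeds one in magnitude:

    import numpy as np

    x = np.random.randn(50)                 # a scalar input sequence

    # FIR (FSMN-like memory): each output is a finite weighted window of past inputs.
    a = np.array([0.5, 0.3, 0.2])           # 2nd-order FIR taps (arbitrary values)
    fir = np.array([sum(a[i] * x[t - i] for i in range(len(a)) if t - i >= 0)
                    for t in range(len(x))])

    # IIR (RNN-like recurrence): the previous output is fed back, so every output
    # depends on the whole history; with |w_rec| >= 1 the recursion can blow up.
    w, w_rec = 0.5, 0.9                     # input and recurrent weights (arbitrary)
    iir = np.zeros(len(x))
    for t in range(len(x)):
        prev = iir[t - 1] if t > 0 else 0.0
        iir[t] = w * x[t] + w_rec * prev

    print(fir[:3], iir[:3])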

2.2 Implementing FSMNs for Language Models

The goal of language modeling is to predict the next word in a text sequence given all previous words. We now explain how to implement FSMNs for this task. An FSMN is a standard feedforward neural network (FNN) except for the additional memory blocks. We will show that the memory blocks can be efficiently implemented as sentence-by-sentence matrix multiplications, which are suitable for mini-batch based stochastic gradient descent (SGD) training on GPUs.

Suppose the $N$-order FIR filter coefficients in the memory block are denoted as $a = [a_0, a_1, \cdots, a_N]$. For a given sentence consisting of $K$ words, whose hidden-layer outputs are collected column-wise into $H = [h_1, h_2, \cdots, h_K]$, we may construct a $K \times K$ square matrix $M$ as follows:

$$M = \begin{bmatrix}
a_0 & a_1 & \cdots & a_N &        &        \\
    & a_0 & a_1    & \cdots & a_N &        \\
    &     & \ddots & \ddots &     & \ddots \\
    &     &        & a_0    & \cdots & a_N \\
    &     &        &        & \ddots & \vdots \\
    &     &        &        &        & a_0
\end{bmatrix}$$

Therefore, the sequential memory operations in eq. (1) for the whole sequence can be computed with one matrix multiplication: $\tilde{H} = f(H M)$.
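A minimal NumPy sketch of this whole-sentence formulation (our illustration, consistent with eq. (1); the helper name build_memory_matrix and the toy sizes are ours): the banded matrix places $a_i$ on the $i$-th superdiagonal, so column $t$ of $HM$ reproduces the tapped-delay sum for position $t$:

    import numpy as np

    def build_memory_matrix(K, a):
        """Upper-triangular banded K x K matrix with M[j, j+i] = a[i] for i = 0..N."""
        M = np.zeros((K, K))
        N = len(a) - 1
        for j in range(K):
            for i in range(min(N, K - 1 - j) + 1):
                M[j, j + i] = a[i]
        return M

    relu = lambda z: np.maximum(0.0, z)

    K, D = 6, 4                        # toy sentence length and hidden size
    a = np.array([1.0, 0.5, 0.25])     # 2nd-order filter
    H = np.random.randn(D, K)          # hidden outputs, one column per word
    M = build_memory_matrix(K, a)

    H_tilde = relu(H @ M)              # whole-sentence memory outputs in one product

    # column t matches the per-step sum of eq. (1)
    t = 4
    ref = relu(sum(a[i] * H[:, t - i] for i in range(len(a))))
    assert np.allclose(H_tilde[:, t], ref)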

Similarly, we may extend the idea to a mini-batch composed of $L$ sentences, $\{H^{(1)}, H^{(2)}, \cdots, H^{(L)}\}$, and compute the sequential memory representation for all sentences in the mini-batch as follows:

$$\tilde{H} = f\left( \left[ H^{(1)}, H^{(2)}, \cdots, H^{(L)} \right] \cdot \begin{bmatrix} M^{(1)} & & \\ & \ddots & \\ & & M^{(L)} \end{bmatrix} \right) \qquad (2)$$

where $M^{(l)}$ denotes the memory matrix constructed above for the $l$-th sentence.
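The block-diagonal arrangement in eq. (2) can be sketched as follows (again our illustration, assuming sentences of different lengths are simply concatenated column-wise; build_memory_matrix is the same hypothetical helper as in the previous sketch):

    import numpy as np
    from scipy.linalg import block_diag

    def build_memory_matrix(K, a):
        """Same banded K x K matrix as in the previous sketch."""
        M = np.zeros((K, K))
        N = len(a) - 1
        for j in range(K):
            for i in range(min(N, K - 1 - j) + 1):
                M[j, j + i] = a[i]
        return M

    relu = lambda z: np.maximum(0.0, z)
    a = np.array([1.0, 0.5, 0.25])

    # three sentences of different lengths, hidden size D = 4 (toy values)
    lengths = [5, 3, 7]
    D = 4
    Hs = [np.random.randn(D, K) for K in lengths]

    # concatenate the sentences column-wise and build one block-diagonal memory matrix
    H_batch = np.concatenate(Hs, axis=1)                           # D x 15
    M_batch = block_diag(*[build_memory_matrix(K, a) for K in lengths])

    H_tilde_batch = relu(H_batch @ M_batch)                        # eq. (2) in one product
    print(H_tilde_batch.shape)                                     # (4, 15)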

In the backward pass, in addition to the gradients of the network weights, we also need to calculate the gradients of $M$, which are then used to update the filter coefficients $a$. These gradients can be calculated using the standard back-propagation (BP) algorithm. Therefore, all computation in FSMNs can be formulated as large matrix multiplications, which can be efficiently conducted on GPUs. As a result, FSMN-based LMs have the same computational complexity as standard FNN-LMs in training, which is much more efficient than RNN-LMs.
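Because the memory operation is just a matrix product, the gradients of the filter coefficients come out of standard BP in any automatic-differentiation toolkit. The following PyTorch sketch is our own illustration with toy sizes and a dummy loss, not the authors' implementation:

    import torch

    K, D, N = 6, 4, 2                       # toy sentence length, hidden size, filter order
    a = torch.nn.Parameter(torch.tensor([1.0, 0.5, 0.25]))   # learnable filter coefficients
    H = torch.randn(D, K)                   # hidden outputs, one column per word

    # Build the banded memory matrix differentiably: a_i sits on the i-th superdiagonal.
    M = sum(torch.diag(a[i] * torch.ones(K - i), diagonal=i) for i in range(min(N + 1, K)))

    H_tilde = torch.relu(H @ M)             # whole-sentence memory outputs, one matrix product

    # A dummy scalar loss; standard back-propagation yields the coefficient gradients.
    H_tilde.sum().backward()
    print(a.grad)                           # gradients for a_0, a_1, a_2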

3 Experiments

We have evaluated FSMNs on two benchmark LM tasks: i) the Penn Treebank (PTB) corpus of about 1M words, following the same setup as [11]; ii) the Large Text Compression Benchmark (LTCB) [12]. In LTCB, we use the enwik9 dataset, which is composed of the first $10^9$ bytes of enwiki-20060303-pages-articles.xml. We split it into three parts: training (153M), validation (8.9M) and test (8.9M) sets. We limit the vocabulary size to 80k for LTCB and replace all out-of-vocabulary words with UNK.

For FSMNs, the hidden units employ the rectified linear activation function, i.e., $f(x) = \max(0, x)$. The networks are initialized with the normalized initialization in [13], without using any pre-training. We use SGD with mini-batch sizes of 200 and 500 for the PTB and LTCB tasks, respectively. The initial learning rates are 0.4 for the weights and 0.002 for the filter coefficients, and they are kept fixed as long as the perplexity on the validation set decreases by at least 1. After that, we continue training for six more epochs, halving the learning rate after each epoch. In the PTB task, we also use momentum (0.9) and weight decay (0.00004) to avoid overfitting.
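The two-phase learning-rate schedule described above can be sketched as follows; train_one_epoch and evaluate_ppl are hypothetical placeholders standing in for the actual training and validation routines, which the paper does not specify:

    import random

    def train_one_epoch(model, lr_weights, lr_filter):
        """Hypothetical placeholder for one epoch of mini-batch SGD."""
        pass

    def evaluate_ppl(model):
        """Hypothetical placeholder returning the validation-set perplexity."""
        return random.uniform(90, 150)

    def train(model, lr_weights=0.4, lr_filter=0.002, final_epochs=6):
        # Phase 1: keep both learning rates fixed while the validation
        # perplexity still improves by at least 1 per epoch.
        best_ppl = float("inf")
        while True:
            train_one_epoch(model, lr_weights, lr_filter)
            ppl = evaluate_ppl(model)
            if best_ppl - ppl < 1.0:
                break
            best_ppl = ppl
        # Phase 2: six more epochs, halving the learning rates after each epoch.
        for _ in range(final_epochs):
            train_one_epoch(model, lr_weights, lr_filter)
            lr_weights *= 0.5
            lr_filter *= 0.5

    train(model=None)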

3.1 Results

We first evaluated the performance of FSMN-LMs on the PTB task. We trained an FSMN with an input context window of two, where the previous two words are used to predict the next word. The FSMN contains a linear projection layer (of 200 nodes), two hidden layers (of 400 nodes per layer) and a memory block in the first hidden layer. For the memory block, we use a 20-order FIR filter. Table 1 summarizes the perplexities on the PTB test set for various language models.
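For reference, a hedged PyTorch-style sketch of this PTB configuration is given below. The class name FSMNLM, the exact wiring of the memory output into the second hidden layer, and the toy vocabulary size are our assumptions; the layer sizes, context window, and filter order follow the description above:

    import torch
    import torch.nn as nn

    class FSMNLM(nn.Module):
        """Sketch of the PTB setup: 2-word context, 200-d projection, two 400-unit
        ReLU hidden layers, a 20-order memory block on the first hidden layer.
        Wiring details are assumptions, not the authors' code."""

        def __init__(self, vocab_size, context=2, proj=200, hidden=400, order=20):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, proj)           # linear projection layer
            self.h1 = nn.Linear(context * proj, hidden)           # first hidden layer
            self.a = nn.Parameter(0.1 * torch.randn(order + 1))   # memory filter a_0..a_N
            self.h2 = nn.Linear(hidden, hidden)                   # second hidden layer
            self.out = nn.Linear(hidden, vocab_size)              # output logits

        def memory(self, H):
            # tapped-delay memory block of eq. (1); H: (T, hidden) for one sentence
            T, D = H.shape
            out = torch.zeros_like(H)
            for i, a_i in enumerate(self.a):
                shifted = torch.cat([H.new_zeros(min(i, T), D), H[: max(T - i, 0)]], dim=0)
                out = out + a_i * shifted
            return torch.relu(out)

        def forward(self, contexts):
            # contexts: (T, 2) indices of the previous two words at each position
            e = self.embed(contexts).flatten(1)    # (T, 2 * proj)
            h1 = torch.relu(self.h1(e))            # first hidden layer outputs
            m = self.memory(h1)                    # memory block representation
            h2 = torch.relu(self.h2(m))            # assumption: feed memory output onward
            return self.out(h2)                    # logits over the vocabulary

    model = FSMNLM(vocab_size=10000)               # 10k is a toy vocabulary size
    logits = model(torch.randint(0, 10000, (12, 2)))
    print(logits.shape)                            # torch.Size([12, 10000])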

For the LTCB task, we trained several baseline systems: i) two n-gram LMs (3-gram and 5-gram) using modified Kneser-Ney smoothing without count cutoffs; ii) several traditional feedforward NNLMs with different model sizes and input context windows (bigram, trigram, 4-gram and 5-gram); iii) an RNN-LM with one hidden layer of 600 nodes, trained using the toolkit in [15]; iv) 2nd-order FOFE-based FNNLMs [14] with different hidden layer sizes. Moreover, we examined our FSMN-based LMs with different architectures: a 3-hidden-layer FSMN with a memory block in the first hidden layer, in the second hidden layer, or in both. The order of the FIR filter is 30 in these experiments. Table 2 summarizes the perplexities on the LTCB test set for various models.

The results in Table 1 and Table 2 show that the FSMN-based LMs significantly outperform the baseline higher-order feedforward neural network (FNN) LMs, the FOFE-based FNN LMs, as well as the popular RNN-based LMs.

Table 1: Perplexities on PTB for various LMs.
  Model               | Test PPL
  KN 5-gram [11]      | 141
  RNNLM [11]          | 123
  LSTM [2]            | 117
  MemN2N [7]          | 111
  trigram FNNLM [14]  | 131
  6-gram FNNLM [14]   | 113
  FOFE-FNNLM [14]     | 108
  FSMN-LM             | 102
Table 2: Perplexities on LTCB for various language models. (M) denotes a hidden layer with a memory block.
  Model       | Architecture                   | Test PPL
  KN 3-gram   | -                              | 156
  KN 5-gram   | -                              | 132
  RNN-LM      | [1*600]-80k                    | 112
  FNNLM       | [2*200]-3*600-80k              | 155
  FNNLM       | [2*200]-1200-2*600-80k         | 154
  FOFE-FNNLM  | [2*200]-3*600-80k              | 104
  FOFE-FNNLM  | [2*200]-1200-2*600-80k         | 107
  FSMN-LM     | [2*200]-600(M)-600-600-80k     | 95
  FSMN-LM     | [2*200]-600-600(M)-600-80k     | 96
  FSMN-LM     | [2*200]-600(M)-600(M)-600-80k  | 92

4 Conclusions and Future Work

In this work, we have proposed a novel neural network architecture, namely feedforward sequential memory networks (FSMN), which use FIR-filter-like memory blocks in the hidden layers of standard feedforward neural networks. Experimental results on language modeling tasks show that FSMNs can effectively learn long-term history. In future work, we will apply this model to other sequential data modeling tasks, such as acoustic modeling in speech recognition.

References

  • [1] S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural computation, 1997, 9(8): 1735-1780.
  • [2] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
  • [3] J. Chung, C. Gulcehre, K. H. Cho, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
  • [4] T. Mikolov, A. Joulin, S. Chopra, et al. Learning longer memory in recurrent neural networks. arXiv:1412.7753, 2014.
  • [5] A. Graves, G. Wayne, I. Danihelka. Neural Turing Machines. arXiv:1410.5401, 2014.
  • [6] J. Weston, S. Chopra, A. Bordes. Memory networks. arXiv:1410.3916, 2014.
  • [7] S. Sukhbaatar, A. Szlam, J. Weston. End-To-End Memory Networks. arXiv:1503.08895, 2015.
  • [8] A. Joulin, T. Mikolov. Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. arXiv:1503.01007, 2015.
  • [9] A. Bordes, N. Usunier, S. Chopra, J. Weston. Large-scale Simple Question Answering with Memory Networks. arXiv:1506.02075, 2015.
  • [10] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks. volume 5, no 2, pages 157-166, 1994.
  • [11] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky and S. Khudanpur. Extensions of recurrent neural network language model. In Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528-5531, 2011.
  • [12] M. Mahoney. Large Text Compression Benchmark. In http://mattmahoney.net/dc/textdata.html, 2011.
  • [13] X. Glorot, Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. of AISTATS, 2010.
  • [14] S. Zhang, H. Jiang, M. Xu, J. Hou and L.R. Dai. The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 495-500, 2015.
  • [15] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký and S. Khudanpur. Recurrent neural network based language model. In Proc. of Interspeech, pages 1045-1048, 2010.