# Feedforward Sequential Memory Neural Networks without Recurrent Feedback

###### Abstract

We introduce a new structure for memory neural networks, called feedforward sequential memory networks (FSMN), which can learn long-term dependency without using recurrent feedback. The proposed FSMN is a standard feedforward neural networks equipped with learnable sequential memory blocks in the hidden layers. In this work, we have applied FSMN to several language modeling (LM) tasks. Experimental results have shown that the memory blocks in FSMN can learn effective representations of long history. Experiments have shown that FSMN based language models can significantly outperform not only feedforward neural network (FNN) based LMs but also the popular recurrent neural network (RNN) LMs.

Feedforward Sequential Memory Neural Networks without Recurrent Feedback

Shiliang Zhang, Hui Jiang, Si Wei, Lirong Dai National Engineering Laboratory for Speech and Language Information Processing University of Science and Technology of China, Hefei, Anhui, China Department of Electrical Engineering and Computer Science York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada IFLYTEK Research, Hefei, Anhui, China zsl2008@mail.ustc.edu.cn, hj@cse.yorku.ca, siwei@iflytek.com, lrdai@ustc.edu.cn

## 1 Introduction

When machine learning methods are applied to model sequential data such as text, speech and video, it is very important to take advantage of the long-term dependency. Traditional approaches have explored to capture the long-term structure within the sequential data using recurrent feedbacks such as in regular recurrent neural networks (RNNs) or LSTM-based models. [1, 2, 3, 4]. RNNs can learn and carry out complicated transformations of data over extended periods of time and store the memory in the weights of the network. Therefore, they are gaining more and more popular in sequential data modeling tasks. More recently, different from RNNs, there has also been a surge in constructing neural computing models with varying forms of explicit memory units [5, 6, 7, 8] . For example, in [6], the proposed memory networks employ a memory component that can be read from and written to. In [5], the proposed neural turing machines (NTM) improve the memory of neural networks by coupling with external memory resources, which can learn to sort a small set of numbers as well as other symbolic manipulation tasks.

In this work, we have proposed a simpler structure for memory neural networks, namely feedforward sequential memory networks (FSMN), which can learn long-term dependency in sequential data without using the recurrent feedback. For FSMN, we extend the standard feedforward neural networks by introducing memory blocks in the hidden layers. Different from RNNs, the overall FSMNs remain as a feed-forward structure so that they can be learned in much more efficient and stable ways than RNNs and LSTMs. In our work, we have evaluated the performance of FSMN on two language modeling tasks: Penn Tree Bank (PTB) and English wiki in Large Text Compression Benchmark (LTCB). In both tasks, FSMN based language models can significantly outperform not only the standard FNN-LMs but also the popular recurrent neural network (RNN) LMs by a significant margin.

## 2 Our Approach

### 2.1 Feedforward Sequential Memory Neural Networks

Feedforward sequential memory network (FSMN) is a standard feedforward neural network with single or multiple memory blocks in the hidden layer. For instance, Figure 1 (a) shows an FSMN with one memory block added into its second hidden layer. Given a sequence, , each represents an input data at time instance t. The corresponding hidden layer outputs are denoted as , each . As shown in Figure 1 (b), we can use a tapped-delay structure to encode and its previous histories into a fixed-sized representation in the memory block:

(1) |

where all coefficients form an N-dimension learnable vector , and is the activation function (sigmoid or RELU). Furthermore, as shown in Figure 1 (a), may be fed into next hidden layer in the same way as .

From the viewpoint of signal processing, the memory block in FSMN can be regarded as a high-order finite impulse response (FIR) filter while the recurrent layer in RNNs, namely , may be viewed as a first-order infinite impulse response (IIR) filter, see Figure 1 (c). Obviously, the vector may be regarded as the coefficients of an N-order FIR filter. We know IIR filters are more compact than FIR filters. However, IIR filters may be difficult to implement. In some cases, IIR filters may become unstable but FIR filters are always stable. Moreover, the learning of IIR-filter-like RNNs requires to use the so-called back-propagation through time (BPTT) which significantly increases the computational complexity of the learning and also causes the problems of gradient vanishing and exploding [10]. On the other hand, the proposed FIR-filter-like FSMNs can be efficiently learned using the standard back-propagation procedure. Therefore, the learning of FSMNs is more stable and efficient than that of RNNs.

### 2.2 Implement FSMN for language models

The goal in language modeling is to predict the next word in a text sequence given all previous words. We now explain how to implement FSMNs for this task. FSMN is a standard feedforward neural network (FNN) except the additional memory blocks. We will show that the memory block can be efficiently implemented as sentence-by-sentence matrix multiplications, which are suitable for the mini-batch based stochastic gradient descent (SGD) method running on GPUs.

Suppose the N-order FIR filter coefficients in the memory block are denoted as . For a given sentence consisting of words, we may construct a square matrix as follow:

Therefore, the sequential memory operations in eq.(1) for the whole sequence can be computed with one matrix multiplication as: .

Similarly, we may extend the idea to a mini-batch composed of sentences, , we can compute the sequential memory representation for all sentences in the mini-batch as follows:

(2) |

In the backward pass, except the weights in the network, we also need to calculate the gradients of , which is then used to update the filter coefficients . We can calculate the gradients using the standard back-propagation (BP) algorithm. Therefore, all computation in FSMNs can be formulated as large matrix multiplications, which can be efficiently conducted in GPUs. As a result, FSMN based LMs have the same computational complexity as the standard NN LMs in training, which is much more efficient than RNN-LMs.

## 3 Experiments

We have evaluated FSMNs on two benchmark LM tasks: i) the Penn Treebank (PTB) corpus of about 1M words, following the same setup as [11]. ii) The Large Text Compression Benchmark (LTCB) [12]. In LTCB, we use the enwik9 dataset, which is composed of the first bytes of enwiki-20060303-pages-articles.xml. We split it into three parts: training (153M), validation (8.9M) and test (8.9M) sets. We limit the vocabulary size to 80k for LTCB and replace all out-of-vocabulary words by UNK.

For FSMNs, the hidden units employ the rectified linear activation function, i.e., . The nets are initialized based on the normalized initialization in [13], without using any pre-training. We use SGD with a mini-batch size of 200 and 500 for PTB and LTCB tasks respectively. The initial learning rate is 0.4 and 0.002 for the weights and filter coefficients respectively, which is kept fixed as long as the perplexity on the validation set decreases by at least 1. After that, we continue six more epochs of training, where the learning rate is halved after each epoch. In PTB task, we also use momentum (0.9) and weight decay (0.00004) to avoid overfitting.

### 3.1 Results

We have first evaluated the performance of FSMN-LMs on the PTB task. We have trained FSMN with an input context window of two, where the previous two words are used to predict the next word. The FSMN contains a linear projection layer (of 200 nodes), two hidden layers (of 400 nodes pre layer) and a memory block in the first hidden layer. For the memory block, we use a 20-order FIR filter. In Table 1, we have summarized the perplexities on the PTB test set for various language models.

For the LTCB task, we have trained several baseline systems: i) two n-gram LMs (3-gram and 5-gram) using the modified Kneser-Ney smoothing without count cutoffs; ii) several traditional feedforward NNLMs with different model sizes and input context windows (bigram, trigram, 4-gram and 5-gram); iii) an RNN-LM with one hidden layer of 600 nodes using the toolkit in [15]; iv) 2nd-order FOFE based FNNLM [14] with different hidden layer sizes. Moreover, we have examined our FSMN based LMs with different architectures. We have trained a 3-hidden-layer FSMN with a memory block in the first hidden layer, second hidden layer or both. The order of the FIR filter is 30 in these experiments. In Table 2, we have summarized the perplexities on the LTCB test set for various models.

Experimental results in Table 1 and Table 2 have shown that the FSMN based LMs can significantly outperform the baseline higher-order feedforward neural network (FNN) LMs, FOFE-based FNN LMs as well as the popular RNN-based LMs by a quite significant margin.

Model | Test PPL |
---|---|

KN 5-gram [11] | 141 |

RNNLM [11] | 123 |

LSTM [2] | 117 |

MemN2N[7] | 111 |

trigram FNNLM[14] | 131 |

6-gram FNNLM[14] | 113 |

FOFE-FNNLM[14] | 108 |

FSMN-LM | 102 |

Model | Architecture | Test PPL |
---|---|---|

KN 3-gram | - | 156 |

KN 5-gram | - | 132 |

RNN-LM | [1*600]-80k | 112 |

FNNLM | [2*200]-3*600-80k | 155 |

[2*200]-1200-2*600-80k | 154 | |

FOFE-FNNLM | [2*200]-3*600-80k | 104 |

[2*200]-1200-2*600-80k | 107 | |

FSMN-LM | [2*200]-600(M)-600-600-80k | 95 |

[2*200]-600-600(M)-600-80k | 96 | |

[2*200]-600(M)-600(M)-600-80k | 92 |

## 4 Conclusions and Future Work

In this work, we have proposed a novel neural network architecture, namely feedforward sequential memory networks (FSMN), which use FIR-filter-like memory blocks in the hidden layer of standard feedforward neural networks. Experimental results on language modeling tasks have shown that the FSMN can effectively learn the long term history. For the future work, we will try to apply this model to other sequential data modeling tasks, such as acoustic modeling in speech recognition.

## References

- [1] S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural computation, 1997, 9(8): 1735-1780.
- [2] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
- [3] J. Chung, C. Gulcehre, K. H. Cho, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
- [4] T. Mikolov, A. Joulin, S. Chopra, et al. Learning longer memory in recurrent neural networks. arXiv:1412.7753, 2014.
- [5] A. Graves, G. Wayne, I. Danihelka. Neural Turing Machines. arXiv:1410.5401, 2014.
- [6] J. Weston, S. Chopra, A. Bordes. Memory networks. arXiv:1410.3916, 2014.
- [7] S. Sukhbaatar, A. Szlam, J. Weston. End-To-End Memory Networks. arXiv:1503.08895, 2015.
- [8] A. Joulin, T. Mikolov. Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. arXiv:1503.01007, 2015.
- [9] A. Bordes, N. Usunier, S. Chopra, J. Weston. Large-scale Simple Question Answering with Memory Networks. arXiv:1506.02075, 2015.
- [10] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks. volume 5, no 2, pages 157-166, 1994.
- [11] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky and S. Khudanpur. Extensions of recurrent neural network language model. In Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528-5531, 2011.
- [12] M. Mahoney. Large Text Compression Benchmark. In http://mattmahoney.net/dc/textdata.html, 2011.
- [13] G. Xavier, Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. of AISTATS, 2010.
- [14] S. Zhang, H. Jiang, M. Xu, J. Hou and L.R. Dai. The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models. in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), pp.495-500, 2015.
- [15] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ and S. Khudanpur. Recurrent neural network based language model. In Proc. of Interspeech, pages 1045-1048, 2010.