ThickNet: Parallel Network Structure for Sequential Modeling
Abstract
Recurrent neural networks have been widely used in sequence learning tasks. In previous studies, the performance of the model has always been improved by either wider or deeper structures. However, the former becomes more prone to overfitting, while the latter is difficult to optimize. In this paper, we propose a simple new model named ThickNet, by expanding the network from another dimension: “thickness”. Multiple parallel values are obtained via more sets of parameters in each hidden state, and the maximum value is selected as the final output among parallel intermediate outputs. Notably, ThickNet can efficiently avoid overfitting, and is easier to optimize than the vanilla structures due to the large dropout affiliated with it. Our model is evaluated on four sequential tasks including adding problem, permuted sequential MNIST, text classification and language modeling. The results of these tasks demonstrate that our model can not only improve accuracy with faster convergence but also facilitate a better generalization ability.
keywords:
Recurrent Neural Networks, Sequential Learning, Natural Language Processing, Hidden State Size
1 Introduction
With the availability of large-scale datasets, high-capacity neural networks, and powerful computational technology and devices, numerous challenging problems in sequential learning tasks have been solved by employing artificial neural networks. An artificial neural network is an interconnected assembly of nodes (artificial neurons), inspired by a simplification of neurons in animal brains Gurney (1997). To enable neural networks to extract richer information and learn better features, increasing network width and increasing network depth are considered the top two options Gilboa and Gur-Ari (2019).
According to the Universal Approximation Theorem proposed by Cybenko (1989) and Hornik (1991), a network with one or more hidden layers can approximate any continuous function on compact subsets of $\mathbb{R}^n$ when the width of the network is sufficiently large (a large number of nodes in one layer). The hypothesis space is the set of all functions that can be returned by a network, and these functions, together with their internal parameters, are represented by the interconnections between nodes. As the hypothesis space of a network grows, a wider network can therefore learn richer structures. Theoretically, a sufficiently wide network can eventually memorize the corresponding output for every possible input, but in practical applications not every possible input is available for training. Besides, more difficulties may occur when using an extremely wide network. Despite its strong memorization, such a network becomes more prone to overfitting and its generalization ability tends to be relatively poor.
Inside a neural network, the size of the hypothesis space is determined by the total number of nodes. For a fixed number of nodes, there is always a basic trade-off between width and depth. Instead of increasing the width of one layer, using a network that contains many layers with a small number of nodes per layer can be an alternative Eldan and Shamir (2015). This type of node layout can be seen in various networks, ranging from 7 layers (AlexNet Krizhevsky et al. (2012)) to even thousands of layers (ResNet He et al. (2017)).
It is noteworthy that this trend, i.e., increased depth, widely exists in vision-related tasks and convolutional networks. It can be interpreted as multiple layers extracting features at various levels of abstraction. RNNs, in contrast, obtain a sequence representation by recursively updating hidden units using linear transformations and nonlinear activation functions. At each step, a single RNN cell can only draw on the current input and the previous hidden state. Owing to this structural limitation, when RNNs become deeper they repeatedly extract the previous features rather than features at various levels of abstraction. At the same time, deeper networks make the optimization more difficult. This is the reason why RNNs are generally not as deep as ResNets with thousands of layers. Therefore, relatively shallow RNNs are usually used for tasks such as text classification Bahdanau et al. (2016) and language modeling Merity et al. (2018).
Previous researchers have tended to enlarge the width or depth of networks. However, each choice has its own disadvantages in practice. For instance, wider networks can easily overfit and increase the generalization error, while deeper networks are more difficult to optimize.
In this paper, we increase the parameters in one RNN cell along another, smaller dimension without increasing the hidden state size (enlarging the hypothesis space), and simply define it as “ThickNet”. In each node of the RNN, a cluster of values is obtained from different sets of parameters; only the maximum value of each cluster is fed into the next node, and the remaining values are dropped out. This maximization operation is a form of nonlinear downsampling. In order to avoid the gradient vanishing caused by the selection of the maximum, we apply batch normalization before the nonlinear activation function Ioffe and Szegedy (2015a).
We summarize our contributions in this work as follows:

We present a novel RNN structure allowing more parameters in one single node which can be filtered through maximization operation. In other words, our model can learn richer structures by increasing the thickness of each node instead of increasing the number of layers.

The maximization operation applied in our paper reduces the dimension of the hypothesis space, which can also be understood as downsampling, and hence our model avoids the overfitting that appears in wider networks.

Although the proposed maximization operation leads to a higher dropout rate, gradient information is still preserved through backpropagation. All the parameters are optimized during training, which makes the model easier to optimize.
In our experiment, we test the effectiveness of our approach on four sequence modeling tasks: the adding problem, permuted sequential MNIST, text classification and language modeling. We run extensive comparisons with multiple baseline models and achieve state-of-the-art performance. Experiments show that our proposed ThickNet is easier to optimize and better in generalization.
The rest of the paper is organized as follows. The related work is reviewed in Section 2. The maximization operation, ThickNet and its embedding in Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber (1997) are described in Section 3. The performance of ThickNet is evaluated in Section 4. The conclusion and future work are described in Section 5.
2 Related Work
There is a large body of work focusing on sequence modeling tasks by applying various neural networks. So far, the RNN and its variants are best suited for tasks involving sequential or temporal data, and the most widely used ones are LSTM Hochreiter and Schmidhuber (1997) and Gated Recurrent Units (GRU) Cho et al. (2014). These two typical variants have been introduced to control the gradient vanishing and explosion that are commonly found in long-sequence RNN tasks Bengio et al. (1994). The gating mechanism in these two networks controls which parts of the present input and the previous state memory are used to update the current activation and current state.
Recent efforts have continued to improve performance on long sequences, including accelerating convergence during training and optimizing the internal parameters. Adding an extra state update gate allows the network to skip state updates and thereby reduces the number of sequential operations, so that the computation in the RNN may or may not be executed at each time step Campos et al. (2018).
In addition to changing the structure of recurrent neural networks, works on increasing the number of nodes in width and depth have also been proposed. Zhu et al. proposed a simple technique called parallel cells to enhance the learning ability of RNNs, in which each layer contains multiple small RNN cells rather than one single large cell. He et al. (2017) proposed the Tensorized LSTM, in which the network is widened efficiently by increasing the tensor size and deepened implicitly by delaying the output. The Fast-Slow RNN (FS-RNN) Mujika et al. (2017) is a recurrent neural network architecture that combines the strengths of both multiscale RNNs and deep transition RNNs. Wide linear models and deep neural networks have also been jointly trained to combine their advantages of memorization and generalization in recommendation system tasks Cheng et al. (2016).
There are other approaches that offer an interesting trade-off by increasing parameters without increasing the hidden state size. The recurrent highway network Zilly et al. (2017) extends the LSTM architecture to allow step-to-step transition depths larger than one while keeping the recurrent state size unchanged. The Maxout nonlinearity Goodfellow et al. (2013) has been considered as an activation function to replace the often-used ReLU or sigmoid functions in feed-forward networks. Using a Maxout unit between two layers allows us to train multiple sets of parameters and then select the set with the maximum activated value. The saturating nonlinearity (tanh activation functions) in an LSTM cell can be replaced by non-saturating activation functions (Maxout units) without causing instability of the model Gulcehre et al. (2014); Li and Wu (2015).
3 The Proposed ThickNet
In this section, we describe our ThickNet in detail. The maximization operation is first presented in Section 3.1. ThickNet and its embedding in the LSTM architecture, "ThickLSTM", are then described in Section 3.2 and Section 3.3, respectively.
3.1 Maximization Operation
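Firstly, we introduce a downsampling method called the maximization operation, which is the premise of ThickNet. The maximization operation takes $k$ vectors $x_1, x_2, \dots, x_k \in \mathbb{R}^m$ as inputs and produces an output $y \in \mathbb{R}^m$. The maximization operation is then defined as:
$$y^{(i)} = \max\left(x_1^{(i)}, x_2^{(i)}, \dots, x_k^{(i)}\right), \quad i = 1, \dots, m \qquad (1)$$
where $x_j^{(i)}$ represents the $i$-th element of $x_j$. The output value $y^{(i)}$ is determined by selecting the maximum among the corresponding values of the set of inputs.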
This maximization operation can be considered an extension of Maxout units Goodfellow et al. (2013) or max pooling, both of which similarly select the maximum among a cluster of values.
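The element-wise selection described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `maximization` and the example inputs are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def maximization(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over a set of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one input vector is required")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all input vectors must have the same length")
    # zip(*vectors) groups the i-th elements of every vector together.
    return [max(column) for column in zip(*vectors)]


# Three hypothetical input vectors of length 3.
inputs = [
    [0.2, -1.0, 3.0],
    [0.5, -2.0, 1.0],
    [-0.1, 0.0, 2.5],
]
output = maximization(inputs)
print(output)  # [0.5, 0.0, 3.0]
```

Each output element comes from whichever input vector is largest at that position, so the output has the same length as a single input regardless of how many vectors are supplied.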
3.2 ThickNet
Unlike traditional neural networks that map an input to a point in a high-dimensional space, our proposed ThickNet maps an input to a cluster of points in that space. The output is obtained by downsampling these points with the maximization operation described in Section 3.1. More features are acquired through multiple points, followed by a downsampling that avoids overfitting and improves generalization ability by controlling the size of the hypothesis space.
ThickNet performs multiple linear transforms and applies the maximization operation over all of them in each hidden state, which acts as a form of dropout. ThickNet is an expansion of matrix multiplication that can be applied to every linear transform of a network, whereas a maxout unit serves only as an alternative nonlinearity.
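For the input $x \in \mathbb{R}^n$, we use $k$ matrices $W_1, \dots, W_k \in \mathbb{R}^{m \times n}$ to linearly transform it to obtain an output of length $m$ and thickness $k$. Then the downsampling function along the thickness direction is performed by the maximization operation:
$$y^{(i)} = \max_{1 \le j \le k} (W_j \cdot x)^{(i)} = \max_{1 \le j \le k} \sum_{l=1}^{n} W_j^{(i,l)} x^{(l)} \qquad (2)$$
where $\cdot$ denotes matrix multiplication, $W_j^{(i,l)}$ represents the entry in the $i$-th row and the $l$-th column of $W_j$, and $x^{(l)}$ represents the $l$-th element of $x$.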
As shown in Figure 1, for the input $x$, $m$ nodes are obtained through $k$ sets of linear transformations, and each node contains $k$ values. Furthermore, the maximum value in each node is selected, and the rest are dropped out. Finally, the $m$-dimensional output vector is obtained by the maximization operation described in Section 3.1.
Owing to this operation, our proposed ThickNet has several advantages. The structure not only increases the number of parameters available to each node, but also introduces a large dropout rate through the maximization operation. As a result, ThickNet avoids the overfitting of the training dataset that occurs in wide neural networks.
Even though more parameters are introduced, all the parameters are set in parallel in a one-layer network. Notably, there is no complicated chain-like derivation process during backpropagation. Moreover, the above dropout only drops some of the values for each sample in a batch, but preserves the gradient information of the parameters. This makes the model easier to optimize than a neural network that is expanded in depth.
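A minimal sketch of one thick linear transform may make the parallel structure concrete. It assumes plain Python lists for matrices and hand-picked toy weights; the names `matvec`, `thick_linear` and `winners` are hypothetical helpers, not the authors' code. The `winners` helper reports which parallel matrix supplied each output value, which is the matrix that would receive gradient for that sample.

```python
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def thick_linear(weights: List[Matrix], x: List[float]) -> List[float]:
    """Apply the parallel linear maps and keep the element-wise maximum."""
    candidates = [matvec(w, x) for w in weights]  # one vector per matrix
    return [max(column) for column in zip(*candidates)]


def winners(weights: List[Matrix], x: List[float]) -> List[int]:
    """Index of the matrix that supplies each output element."""
    candidates = [matvec(w, x) for w in weights]
    size = len(candidates[0])
    return [
        max(range(len(candidates)), key=lambda j: candidates[j][i])
        for i in range(size)
    ]


# Toy setting: input size 2, output size 2, thickness 3 (values are made up).
weights = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[-1.0, -1.0], [3.0, 0.0]],
]
x_a = [1.0, 2.0]
x_b = [2.0, 1.0]

print(thick_linear(weights, x_a), winners(weights, x_a))  # [2.0, 3.0] [1, 2]
print(thick_linear(weights, x_b), winners(weights, x_b))  # [2.0, 6.0] [0, 2]
```

The two inputs select different matrices for the first output node. Across a batch, different samples therefore route gradient to different parameter sets, which is one way to read the statement that the dropout preserves gradient information for the parameters. The output length stays at 2 however large the thickness is.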
3.3 ThickLSTM
In this part, we embed our proposed ThickNet into LSTM architecture for sequential learning tasks. For these tasks, traditional RNNs update the hidden state over time with a fixed linear transformation. To extract features from more complex and variable inputs of the sequence modeling tasks, we update the hidden state of each step by introducing ThickNet. This structure allows each hidden state to be derived from a set of linear transformations to learn richer structures.
In order to avoid the gradient explosion and vanishing of traditional RNNs, we apply our ThickNet to a special recurrent neural network called the long short-term memory model (LSTM). The architectures of LSTM, Maxout and our proposed ThickLSTM are shown in Figure 2. In a Maxout unit, the parameters of the nonlinear activations are extended and only the set of parameters with the maximum activation value is kept, whereas in a ThickNet unit we compute multiple sets of linear transformations and then pass the maximum value on to the nonlinear activation functions.
The standard LSTM architecture applies a repeated module at each time step, as in an RNN, and these steps are controlled by a memory cell containing four components: the forget gate $f_t$, the input gate $i_t$, the output gate $o_t$, and the memory cell $c_t$. The gating mechanism determines which features are stored in or forgotten from the memory based on the current input and cell state. We replace the linear transformation in each gate with our nonlinear transformation ThickNet. For each node in each gate of this ThickLSTM, multiple parallel results are obtained by different sets of parameters, of which only the maximum value is passed to the next node, and the rest are dropped out. In order to avoid the gradient vanishing caused by the selection of the maximum, batch normalization is applied before the nonlinear activation function. The ThickLSTM transition functions are described as follows:
(3) 
(4) 
where $\tanh$ is the hyperbolic tangent function, which maps values into $(-1, 1)$; $\sigma$ denotes the logistic sigmoid function, whose output lies in $(0, 1)$; and ReLU is the rectified linear activation function. $\odot$ denotes element-wise multiplication, and $k$ denotes the thickness of each node. $\mathrm{BN}$ represents the batch-normalizing transform Ioffe and Szegedy (2015b), i.e., $\mathrm{BN}(x) = (x - \mathrm{E}[x]) / \sqrt{\mathrm{Var}[x]}$, where $\mathrm{E}[x]$ denotes the mean and $\mathrm{Var}[x]$ denotes the variance.
We assign these values from one of the transformations of the current input and the previous hidden state, respectively. This increases the thickness of each node instead of increasing the number of nodes in width or depth. According to the gate mechanism, $f_t$ determines how many features from the previous memory state should be forgotten. Conversely, $i_t$ determines to what extent the new features should be stored in the current memory cell. After the temporary value $\tilde{c}_t$ is generated, we combine $\tilde{c}_t$ and the preceding memory cell $c_{t-1}$ with the input gate and the forget gate, respectively, to get the current memory cell $c_t$. $o_t$ determines the output influenced by the current memory cell. Finally, we multiply $o_t$ with the updated memory cell to generate the current hidden state $h_t$.
Intuitively, at each step of the ThickNet hidden state, more parameters are involved in the update, and the max-pooling technique lets the model actively choose which values to keep and which to drop out. Therefore, the proposed ThickNet can achieve a stronger generalization capability for diverse inputs in sequence modeling tasks.
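The following sketch shows one possible ThickLSTM time step in plain Python. It is a hedged reading of the description in this section, not the authors' implementation, and several details are assumptions: the maximum is taken separately over the input and recurrent transformations and the two results are summed; batch normalization uses an added stabilizer `eps`; the candidate value uses the standard LSTM `tanh`; and bias terms are omitted. The text also mentions ReLU, but where it is applied is not recoverable here, so it is not used. All function names and sizes are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance


def matvec(w, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def thick_linear(weights, x):
    """Apply the parallel linear maps and keep the element-wise maximum."""
    candidates = [matvec(w, x) for w in weights]
    return [max(col) for col in zip(*candidates)]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch (eps is an added assumption)."""
    cols = list(zip(*batch))
    mus = [fmean(c) for c in cols]
    sds = [math.sqrt(pvariance(c) + eps) for c in cols]
    return [[(v - mu) / sd for v, mu, sd in zip(row, mus, sds)] for row in batch]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def thick_lstm_step(params, xs, hs, cs):
    """One time step for a batch; params[g] = (W_list, U_list) for g in 'fioc'."""
    pre = {}
    for gate, (w_list, u_list) in params.items():
        # ASSUMPTION: separate maxima for the input and recurrent maps, then summed.
        rows = [
            [a + b for a, b in zip(thick_linear(w_list, x), thick_linear(u_list, h))]
            for x, h in zip(xs, hs)
        ]
        pre[gate] = batch_norm(rows)  # normalization before the nonlinearity
    new_hs, new_cs = [], []
    for b, c_prev in enumerate(cs):
        f = [sigmoid(z) for z in pre["f"][b]]    # forget gate
        i = [sigmoid(z) for z in pre["i"][b]]    # input gate
        o = [sigmoid(z) for z in pre["o"][b]]    # output gate
        g = [math.tanh(z) for z in pre["c"][b]]  # ASSUMPTION: tanh candidate value
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
        new_cs.append(c)
        new_hs.append([oj * math.tanh(cj) for oj, cj in zip(o, c)])
    return new_hs, new_cs


def random_matrices(rng, count, rows, cols):
    """Hypothetical initializer: `count` matrices with small uniform entries."""
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        for _ in range(count)
    ]


rng = random.Random(0)
n_in, n_hidden, thickness, batch = 3, 2, 4, 5
params = {
    g: (
        random_matrices(rng, thickness, n_hidden, n_in),
        random_matrices(rng, thickness, n_hidden, n_hidden),
    )
    for g in "fioc"
}
xs = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(batch)]
hs = [[0.0] * n_hidden for _ in range(batch)]
cs = [[0.0] * n_hidden for _ in range(batch)]
hs, cs = thick_lstm_step(params, xs, hs, cs)
print(len(hs), len(hs[0]))  # 5 2
```

Changing `thickness` changes the number of matrices per gate but leaves the hidden state size at `n_hidden`, which is the property the section emphasizes.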
4 Experiments
In this section, we evaluate our proposed ThickNet on four sequential tasks: the adding problem, permuted sequential MNIST, text classification and language modeling. The experimental results are also compared with the results of several state-of-the-art models.
4.1 Adding Problem
The adding problem Arjovsky et al. (2016) is a basic simulation task for evaluating RNN models. Two vectors of length T are taken as input. The first vector is uniformly sampled over the range $(0, 1)$, while the second vector consists of two entries equal to $1$, with the remainder equal to $0$. The final output is the dot product of the two vectors. Two vector lengths are considered, i.e., T = 100 and T = 500.
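A single training example for this task can be generated as in the sketch below. The function name `make_adding_example` and the fixed seed are hypothetical choices for illustration; note that Python's `random()` samples from the half-open interval [0, 1).

```python
import random


def make_adding_example(length, rng):
    """Return (values, markers, target) for one adding-problem example."""
    # First vector: values sampled uniformly from [0, 1).
    values = [rng.random() for _ in range(length)]
    # Second vector: exactly two entries are 1, the rest are 0.
    first, second = rng.sample(range(length), 2)
    markers = [0.0] * length
    markers[first] = 1.0
    markers[second] = 1.0
    # Target: dot product, i.e. the sum of the two marked values.
    target = sum(v * m for v, m in zip(values, markers))
    return values, markers, target


rng = random.Random(0)
values, markers, target = make_adding_example(100, rng)
print(len(values), sum(markers))
print(target)
```

Because the two marked positions can be far apart, the model has to carry the first marked value across many time steps, which is why longer sequences (T = 500) are harder than shorter ones (T = 100).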
When dealing with the adding problem, the features extracted through the maximization operation are essential. In place of the maximization operation, other functions can also be applied, such as taking the average or choosing a random value. To test the effectiveness of choosing the maximum, we compare these three functions. Figure 3(a) shows that the LSTM using the maximization operation converges the fastest among the three.
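The three reduction functions compared here differ only in how the parallel values of each node are collapsed into one. A small sketch, with a hypothetical function `reduce_thickness` and made-up candidate values, shows the difference:

```python
import random
from statistics import fmean


def reduce_thickness(candidates, mode, rng=None):
    """Collapse the parallel candidate vectors into one vector, node by node."""
    columns = list(zip(*candidates))  # one tuple of parallel values per node
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [fmean(col) for col in columns]
    if mode == "random":
        rng = rng or random.Random()
        return [rng.choice(col) for col in columns]
    raise ValueError(f"unknown mode: {mode!r}")


# Three parallel candidate vectors for two nodes (values are made up).
candidates = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
by_max = reduce_thickness(candidates, "max")
by_mean = reduce_thickness(candidates, "mean")
by_random = reduce_thickness(candidates, "random", random.Random(0))
print(by_max)   # [3.0, 4.0]
print(by_mean)
print(by_random)
```

Only the `"max"` mode corresponds to the maximization operation used by ThickNet; the other two modes are the alternatives it is compared against in this experiment.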
The value of thickness has also been discussed in this part. The thickness is chosen to be during the trial. From Figure 3(b), ThickNet can converge faster with thickness , thereby being more efficient and easier to optimize.
One-layer traditional RNN and LSTM with hidden sizes of are used as baseline models for the experiments. The proposed ThickNet applies a one-layer neural network with the same size and thickness of . In order to draw a more comprehensive comparison, a deeper network (LSTM with 10 layers, hidden size of ) and a wider network (LSTM with 1 layer and hidden size of 1280) are included in the experiment separately.
Mean squared error (MSE) is used as the objective function and the Adam optimization method Kingma and Ba is used for training. The initial learning rate is set to . The training data and testing data are all generated randomly throughout the experiments.
The results compared with the baseline models are shown in Figures 3(c) and 3(d). For short sequences (T = 100), the LSTM performs well and the proposed ThickNet converges to a very small error even more quickly. Unlike ThickNet, increasing the width of the network slows down the convergence, and the deeper network, as well as the traditional RNN, fails to minimize the error. As the length of the vectors increases (T = 500), the traditional RNN and the 10-layer LSTM still cannot converge to the minimum error. ThickNet converges relatively quickly compared with the traditional LSTM and the wider LSTM with a hidden size of 1280.
Evidently, increasing the width or depth of the network cannot always improve the performance in this task. For the wider network, the growth of the hypothesis space does not enhance the generalization ability. For the deeper network, repeatedly learning from the previous hypothesis space is not able to minimize the error. The proposed ThickNet neither enlarges the hypothesis space nor increases the number of nodes in depth. Instead, it increases the number of parameters within each node. In this way, both the generalization ability and the optimization speed are significantly improved in this experiment.
4.2 Permuted Sequential MNIST
We evaluate our structure on sequential MNIST classification task Le et al. (2015). The model processes each image one pixel at each time step and finally predicts the label. The permuted MNIST (pMNIST) is also considered which makes the task harder. In pMNIST, the pixels are processed in a fixed random order.
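The pixel-by-pixel input described above can be sketched as follows. A toy 4×4 image stands in for a 28×28 MNIST image (which would give 784 time steps); the helper names and the seed are hypothetical, and the key point is that one fixed permutation is reused for every image.

```python
import random


def to_pixel_sequence(image):
    """Flatten a 2D image (list of rows) into a row-major pixel sequence."""
    return [pixel for row in image for pixel in row]


def make_permutation(length, seed=0):
    """Draw one fixed random order; the same order is reused for all images."""
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return order


def apply_permutation(sequence, order):
    """Reorder a pixel sequence according to the fixed permutation."""
    return [sequence[i] for i in order]


# Two toy 4x4 "images" (values are made up).
image_a = [[r * 4 + c for c in range(4)] for r in range(4)]
image_b = [[(r + c) % 2 for c in range(4)] for r in range(4)]

order = make_permutation(16, seed=0)
seq_a = apply_permutation(to_pixel_sequence(image_a), order)
seq_b = apply_permutation(to_pixel_sequence(image_b), order)
print(seq_a)
print(seq_b)
```

The permutation destroys the local spatial structure of neighbouring pixels, so the model must rely on long-range dependencies, which is why pMNIST is the harder variant.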
Model | MNIST | pMNIST
IRNN | 95.0% | 82%
LSTM | 98.2% | 88.0%
LSTM + Recurrent dropout | - | 92.5%
LSTM + Recurrent batchnorm | - | 95.4%
LSTM + Zoneout | - | 93.1%
ThickNet (one-layer) | 98.6% | 96.0%
Our baselines contain a traditional RNN and LSTM, an LSTM with a batch-normalizing transformation Li et al. (2018), and an LSTM with zoneout Semeniuta et al. (2016) on the recurrent connections. Each model has one layer of 100 hidden units. Our proposed ThickNet applies a one-layer network with the same size and a thickness of 10. Training uses stochastic gradient descent on minibatches of size 128, with gradient clipping at 1.0 and the step rule determined by Adam with learning rate .
The accuracy results are shown in Table 1 for comparison with the baseline models. Figure 4(a) reports the value of the loss function on the training data and the accuracy on the test data for ThickNet and the traditional LSTM. From the figure, the accuracy reaches the state-of-the-art level, and our model converges earlier than the LSTM. The loss curves show that our proposed ThickNet outperforms the traditional LSTM due to (1) faster optimization and (2) better handling of overfitting.
4.3 Text Classification
In this section, we evaluate our proposed ThickNet on text classification tasks. We test our model on three datasets of classic sentence classification tasks.
Model | MR | Subj | TREC
LSTM Bahdanau et al. (2016) | 75.9% | 89.3% | 86.8%
Bi-LSTM Bahdanau et al. (2016) | 79.3% | 90.5% | 89.6%
Tree-LSTM Tai et al. (2015) | 80.7% | 91.3% | 91.8%
LR-LSTM Qian et al. (2017) | 81.5% | 89.9% | -
ThickNet (one-layer) | 80.63% | 93.9% | 92.0%
We implement our model using Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. In this task, we set the learning rate as , the dropout rate as and we also apply the cross-entropy loss function to evaluate our results.
The results of accuracy are demonstrated in Table 2 for comparison with the baseline models.
4.4 Language Modeling
Language modeling (LM) task is to build the essential statistical model that can capture how meaningful sentences can be constructed from individual words, and then use this trained model to predict the next word.
We test our model over the Penn Treebank (PTB) Marcus et al. (1993). The PTB data set has been considered as a central data set in language modeling task, and it does not contain capital letters, numbers or punctuation. The vocabulary list contains unique words.
In this experiment, we choose different baseline methods to evaluate and compare with our structure, including: LSTM, Variational LSTM Gal and Ghahramani (2016), Pointer Sentinel-LSTM Merity et al. (2017), LSTM + continuous cache pointer Grave et al. (2017a), Variational LSTM + augmented loss Inan et al. (2017), Variational RHN Pundak and Sainath (2017), 4-layer skip connection LSTM Melis et al. (2018), and AWD-LSTM Merity et al. (2018).
We implement our model using the NT-ASGD algorithm for training and use a batch size of for PTB. In our experiment, we set the initial learning rate as , and we followed the practice in Merity et al. (2018) to set up the other initial parameters. To improve language modeling results, we run ASGD with and hot-started as a fine-tuning step, and pointer-based attention models Merity et al. (2017) have been applied in our model.
Model | Valid | Test
RNN Mikolov and Zweig (2012) | - | 124.7
LSTM Zaremba et al. (2014) | 82.2 | 78.4
CharCNN Kim et al. (2016) | - | 78.9
Pointer Sentinel-LSTM Merity et al. (2017) | 72.4 | 70.9
LSTM + continuous cache pointer Grave et al. (2017b) | - | 72.1
Variational LSTM + augmented loss Inan et al. (2017) | 71.1 | 68.5
Variational RHN Zilly et al. (2017) | 67.9 | 65.4
4-layer skip connection LSTM Melis et al. (2018) | 60.9 | 58.3
AWD-LSTM (finetune+pointer, 3 layers) Merity et al. (2018) | 53.9 | 52.8
ThickNet (one-layer) | 56.4 | 54.7
ThickNet (two-layer) | 51.3 | 50.2
As shown in Table 3, our model achieves a state-of-the-art level of performance on language modeling using only one or two layers. The two-layer ThickNet obtains a lower perplexity than all the other baselines. Thus, our proposed ThickNet improves the accuracy of prediction by implementing thick nodes which can extract more prior features.
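For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below uses the standard definition with natural logarithms; the function name `perplexity` and the example probabilities are hypothetical and unrelated to the numbers reported in the table.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    if not token_probs:
        raise ValueError("at least one token probability is required")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that assigns probability 0.25 to every correct next word
# is as uncertain as a uniform choice among four words.
uniform_four = perplexity([0.25] * 8)
# Higher probabilities on the correct words give a lower perplexity.
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform_four, confident)
```

A perplexity of about 50 can thus be read as the model being, on average, as uncertain as a uniform choice among roughly 50 words at each position.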
5 Conclusion and Future Work
In this paper, we have introduced a novel but simple architecture, named ThickNet, which can be applied flexibly in recurrent neural networks. Instead of width or depth, thickness, as another dimension, is increased while the hidden state size is kept unchanged. Unlike previous works, a maximization operation is applied, which can significantly strengthen the generalization capability of our proposed ThickNet. Overall, ThickNet offers three main benefits: a state-of-the-art level of accuracy, avoidance of overfitting, and easier optimization.
The selection of the maximum in each node is inspired by previous studies on maxout units and max pooling, and is supported by the experimental results. In future work, we will extend ThickNet with an attention mechanism to decide more precisely which value should be selected in each node. Additionally, more recurrent networks can benefit from ThickNet on sequential learning tasks.
References
 Arjovsky et al. (2016) Martín Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1120–1128. JMLR.org, 2016.
 Bahdanau et al. (2016) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. May 2016.
 Bengio et al. (1994) Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 Campos et al. (2018) Víctor Campos, Brendan Jou, Xavier Giró i Nieto, Jordi Torres, and Shih-Fu Chang. Skip rnn: Learning to skip state updates in recurrent neural networks. In ICLR. OpenReview.net, 2018.
 Cheng et al. (2016) Heng-Tze Cheng, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah, Levent Koc, Jeremiah Harmsen, et al. Wide and deep learning for recommender systems. Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, 2016.
 Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, EMNLP, pages 1724–1734. ACL, 2014.
 Cybenko (1989) G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.
 Eldan and Shamir (2015) Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. Computer Science, 2015.
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In NIPS, pages 1019–1027, 2016.
 Gilboa and Gur-Ari (2019) Dar Gilboa and Guy Gur-Ari. Wider networks learn better features, 2019.
 Goodfellow et al. (2013) Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks, 2013.
 Grave et al. (2017a) Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In ICLR. OpenReview.net, 2017.
 Grave et al. (2017b) Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. 2017.
 Gulcehre et al. (2014) Caglar Gulcehre, Kyunghyun Cho, Razvan Pascanu, and Yoshua Bengio. Learned-norm pooling for deep feedforward and recurrent neural networks. Lecture Notes in Computer Science, pages 530–546, 2014.
 Gurney (1997) Kevin N. Gurney. An introduction to neural networks. 1997.
 He et al. (2017) Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. Wider and deeper, cheaper and faster: Tensorized lstms for sequence learning. In NIPS, pages 1–11, 2017.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. 1997.
 Hornik (1991) Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
 Inan et al. (2017) Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR. OpenReview.net, 2017.
 Ioffe and Szegedy (2015a) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org, 2015.
 Ioffe and Szegedy (2015b) Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on International Conference on Machine Learning, 2015.
 Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. Character-aware neural language models. In AAAI, 2016.
 Kingma and Ba Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 1106–1114, 2012.
 Le et al. (2015) Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units. CoRR, abs/1504.00941, 2015.
 Li and Roth (2002) Xin Li and Dan Roth. Learning question classifiers. In COLING, 2002.
 Li and Wu (2015) X. Li and X. Wu. Improving long short-term memory networks using maxout units for large vocabulary speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4600–4604, April 2015.
 Li et al. (2018) Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, and Tie-Yan Liu. Towards binary-valued gates for robust lstm training. In Jennifer G. Dy and Andreas Krause, editors, ICML, volume 80 of Proceedings of Machine Learning Research, pages 3001–3010. PMLR, 2018.
 Marcus et al. (1993) MP Marcus, MA Marcinkiewicz, and B Santorini. Building a large annotated corpus of english: the penn treebank. Computational Linguistics, 19(2):313–330, 1993.
 Melis et al. (2018) Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In ICLR. OpenReview.net, 2018.
 Merity et al. (2017) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In ICLR. OpenReview.net, 2017.
 Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. In ICLR. OpenReview.net, 2018.
 Mikolov and Zweig (2012) Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. 2012 IEEE Workshop on Spoken Language Technology, SLT 2012, Proceedings, 2012.
 Mujika et al. (2017) Asier Mujika, Florian Meier, and Angelika Steger. Fast-slow recurrent neural networks, 2017.
 Pang and Lee (2004) Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Donia Scott, Walter Daelemans, and Marilyn A. Walker, editors, ACL, pages 271–278. ACL, 2004.
 Pang and Lee (2005) Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Kevin Knight, Hwee Tou Ng, and Kemal Oflazer, editors, ACL. The Association for Computer Linguistics, 2005.
 Pundak and Sainath (2017) Golan Pundak and Tara N. Sainath. Highway-lstm and recurrent highway networks for speech recognition. In Francisco Lacerda, editor, INTERSPEECH, pages 1303–1307. ISCA, 2017.
 Qian et al. (2017) Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan Zhu. Linguistically regularized lstms for sentiment classification. pages 1679–1689, 2017.
 Semeniuta et al. (2016) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. In Nicoletta Calzolari, Yuji Matsumoto, and Rashmi Prasad, editors, COLING, pages 1757–1766. ACL, 2016.
 Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In ACL (1), pages 1556–1566. The Association for Computer Linguistics, 2015.
 Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. Eprint Arxiv, 2014.
 Zhu et al. Danhao Zhu, Si Shen, Xin-Yu Dai, and Jiajun Chen. Going wider: Recurrent neural network with parallel cells.
 Zilly et al. (2017) Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. In Doina Precup and Yee Whye Teh, editors, ICML, volume 70 of Proceedings of Machine Learning Research, pages 4189–4198. PMLR, 2017.