Recurrently Controlled Recurrent Networks
Recurrent neural networks (RNNs) such as long short-term memory and gated recurrent units are pivotal building blocks across a broad spectrum of sequence modeling problems. This paper proposes a recurrently controlled recurrent network (RCRN) for expressive and powerful sequence encoding. More concretely, the key idea behind our approach is to learn the recurrent gating functions using recurrent networks. Our architecture is split into two components - a controller cell and a listener cell whereby the recurrent controller actively influences the compositionality of the listener cell. We conduct extensive experiments on a myriad of tasks in the NLP domain such as sentiment analysis (SST, IMDb, Amazon reviews, etc.), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Across all 26 datasets, our results demonstrate that RCRN not only consistently outperforms BiLSTMs but also stacked BiLSTMs, suggesting that our controller architecture might be a suitable replacement for the widely adopted stacked architecture.
Yi Tay, Luu Anh Tuan, and Siu Cheung Hui (Nanyang Technological University; Institute for Infocomm Research)
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
1 Introduction
Recurrent neural networks (RNNs) live at the heart of many sequence modeling problems. In particular, the incorporation of gated additive recurrent connections is extremely powerful, leading to the pervasive adoption of models such as Gated Recurrent Units (GRU) (Cho et al., 2014) or Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) across many NLP applications (Bahdanau et al., 2014; Xiong et al., 2016; Rocktäschel et al., 2015; McCann et al., 2017). In these models, the key idea is that the gating functions control information flow and compositionality over time, deciding how much information to read/write across time steps. This not only serves as a protection against vanishing/exploding gradients but also enables greater relative ease in modeling long-range dependencies.
There are two common ways to increase the representation capability of RNNs. Firstly, the number of hidden dimensions could be increased. Secondly, recurrent layers could be stacked on top of each other in a hierarchical fashion (El Hihi and Bengio, 1996), with each layer’s input being the output of the previous, enabling hierarchical features to be captured. Notably, the wide adoption of stacked architectures across many applications (Graves et al., 2013; Sutskever et al., 2014; Wang et al., 2017; Nie and Bansal, 2017) signifies the need for designing complex and expressive encoders. Unfortunately, these strategies may face limitations. For example, the former might risk overfitting and/or hitting a wall in performance. On the other hand, the latter might face the inherent difficulties of going deep, such as vanishing gradients or difficulty in feature propagation across deep RNN layers (Zhang et al., 2016b).
This paper proposes Recurrently Controlled Recurrent Networks (RCRN), a new recurrent architecture and a general purpose neural building block for sequence modeling. RCRNs are characterized by their use of two key components - a recurrent controller cell and a listener cell. The controller cell controls the information flow and compositionality of the listener RNN. The key motivation behind RCRN is to provide expressive and powerful sequence encoding. However, unlike stacked architectures, all RNN layers operate jointly on the same hierarchical level, effectively avoiding the need to go deeper. Therefore, RCRNs provide an alternative way of utilizing multiple RNN layers in conjunction, by allowing one RNN to control another RNN. As such, our key aim in this work is to show that our proposed controller-listener architecture is a viable replacement for the widely adopted stacked recurrent architecture.
To demonstrate the effectiveness of our proposed RCRN model, we conduct extensive experiments on a plethora of diverse NLP tasks where sequence encoders such as LSTMs/GRUs are highly essential. These tasks include sentiment analysis (SST, IMDb, Amazon Reviews), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Experimental results show that RCRN outperforms BiLSTMs and multi-layered/stacked BiLSTMs on all 26 datasets, suggesting that RCRNs are viable replacements for the widely adopted stacked recurrent architectures. Additionally, RCRN achieves close to state-of-the-art performance on several datasets.
2 Related Work
RNN variants such as LSTMs and GRUs are ubiquitous and indispensable building blocks in many NLP applications such as question answering (Seo et al., 2016; Wang et al., 2017), machine translation (Bahdanau et al., 2014), entailment classification (Chen et al., 2017) and sentiment analysis (Longpre et al., 2016; Huang et al., 2017). In recent years, many RNN variants have been proposed, ranging from multi-scale models (Koutnik et al., 2014; Chung et al., 2016; Chang et al., 2017) to tree-structured encoders (Tai et al., 2015; Choi et al., 2017). Models that are targeted at improving the internals of the RNN cell have also been proposed (Xingjian et al., 2015; Danihelka et al., 2016). Given the importance of sequence encoding in NLP, the design of effective RNN units for this purpose remains an active area of research.
Stacking RNN layers is the most common way to improve representation power. This has been used in many highly performant models ranging from speech recognition (Graves et al., 2013) to machine reading (Wang et al., 2017). The BCN model (McCann et al., 2017) similarly uses multiple BiLSTM layers within its architecture. Models that use shortcut/residual connections in conjunction with stacked RNN layers are also notable (Zhang et al., 2016b; Longpre et al., 2016; Nie and Bansal, 2017; Ding et al., 2018).
Notably, a recent emerging trend is to model sequences without recurrence. This is primarily motivated by the fact that recurrence is an inherent prohibitor of parallelism. To this end, many works have explored the possibility of using attention as a replacement for recurrence. In particular, self-attention (Vaswani et al., 2017) has been a popular choice. This has sparked many innovations, including general purpose encoders such as DiSAN (Shen et al., 2017) and Block Bi-DiSAN (Shen et al., 2018). The key idea in these works is to use multi-headed self-attention and positional encodings to model temporal information.
While attention-only models may come close in performance, some domains may still require the complex and expressive recurrent encoders. Moreover, we note that in (Shen et al., 2017, 2018), the scores on multiple benchmarks (e.g., SST, TREC, SNLI, MultiNLI) do not outperform (or even approach) the state-of-the-art, most of which are models that still heavily rely on bidirectional LSTMs (Zhou et al., 2016; Choi et al., 2017; McCann et al., 2017; Nie and Bansal, 2017). While self-attentive RNN-less encoders have recently been popular, our work moves in an orthogonal and possibly complementary direction, advocating a stronger RNN unit for sequence encoding instead. Nevertheless, it is also good to note that our RCRN model outperforms DiSAN in all our experiments.
Another line of work is also concerned with eliminating recurrence. SRUs (Simple Recurrent Units) (Lei and Zhang, 2017) are recently proposed networks that remove the sequential dependencies in RNNs. SRUs can be considered a special case of Quasi-RNNs (Bradbury et al., 2016), which perform incremental pooling using pre-learned convolutional gates. A recent work, Multi-range Reasoning Units (MRU) (Tay et al., 2018b), follows the same paradigm, trading convolutional gates for features learned via expressive multi-granular reasoning. Zhang et al. (2018) proposed sentence-state LSTMs (S-LSTM), which exchange incremental reading for a single global state.
Our work proposes a new way of enhancing the representation capability of RNNs without going deep. For the first time, we propose a controller-listener architecture that uses one recurrent unit to control another recurrent unit. Our proposed RCRN consistently outperforms stacked BiLSTMs and achieves state-of-the-art results on several datasets. We outperform above-mentioned competitors such as DiSAN, SRUs, stacked BiLSTMs and sentence-state LSTMs.
3 Recurrently Controlled Recurrent Networks (RCRN)
This section formally introduces the RCRN architecture. Our model is split into two main components - a controller cell and a listener cell. Figure 1 illustrates the model architecture.
3.1 Controller Cell
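The goal of the controller cell is to learn gating functions in order to influence the target cell. To control the target cell, the controller cell constructs a forget gate and an output gate, which are then used to influence the information flow of the listener cell. For each gate (output and forget), we use a separate RNN cell. As such, the controller cell comprises two cell states and an additional set of parameters. The equations of the controller cell are defined as follows, for $k \in \{1, 2\}$:

$$i^k_t = \sigma_s(W^k_i x_t + U^k_i h^k_{t-1} + b^k_i) \tag{1}$$
$$f^k_t = \sigma_s(W^k_f x_t + U^k_f h^k_{t-1} + b^k_f) \tag{2}$$
$$o^k_t = \sigma_s(W^k_o x_t + U^k_o h^k_{t-1} + b^k_o) \tag{3}$$
$$c^1_t = f^1_t \odot c^1_{t-1} + i^1_t \odot \sigma(W^1_c x_t + U^1_c h^1_{t-1} + b^1_c) \tag{4}$$
$$c^2_t = f^2_t \odot c^2_{t-1} + i^2_t \odot \sigma(W^2_c x_t + U^2_c h^2_{t-1} + b^2_c) \tag{5}$$
$$h^k_t = o^k_t \odot \sigma(c^k_t) \tag{6}$$

where $x_t$ is the input to the model at time step $t$. $W^k_*, U^k_*, b^k_*$ are the parameters of the model where $k \in \{1, 2\}$ and $* \in \{i, f, o, c\}$. $\sigma_s$ is the sigmoid function and $\sigma$ is the tanh nonlinearity. $\odot$ is the Hadamard product. The controller RNN has two cell states denoted as $c^1$ and $c^2$ respectively. $h^1_t, h^2_t$ are the outputs of the unidirectional controller cell at time step $t$. Next, we consider a bidirectional adaptation of the controller cell. Let Equations (1-6) be represented by the function $\text{CT}(\cdot)$; the bidirectional adaptation is represented as:

$$\overrightarrow{h}^1_t, \overrightarrow{h}^2_t = \overrightarrow{\text{CT}}(\overrightarrow{h}^1_{t-1}, \overrightarrow{h}^2_{t-1}, x_t), \quad t = 1, \ldots, \ell \tag{7}$$
$$\overleftarrow{h}^1_t, \overleftarrow{h}^2_t = \overleftarrow{\text{CT}}(\overleftarrow{h}^1_{t+1}, \overleftarrow{h}^2_{t+1}, x_t), \quad t = \ell, \ldots, 1 \tag{8}$$
$$h^1_t = [\overrightarrow{h}^1_t; \overleftarrow{h}^1_t], \quad h^2_t = [\overrightarrow{h}^2_t; \overleftarrow{h}^2_t] \tag{9}$$

The outputs of the bidirectional controller cell are $h^1_t, h^2_t$ for time step $t$. These hidden outputs act as gates for the listener cell.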
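The bidirectional adaptation described above can be sketched generically: run a recurrent step function left-to-right and right-to-left, then concatenate the two states at each time step. This is a minimal illustration only; `run_direction`, `bidirectional` and `toy_step` are hypothetical names, and the toy step (a running average) merely stands in for the LSTM-style controller cell.

```python
from typing import Callable, List, Optional

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]  # (previous state, input) -> new state


def run_direction(step: Step, xs: List[Vector], h0: Vector,
                  reverse: bool = False) -> List[Vector]:
    """Apply `step` along the sequence; states stay aligned with `xs`."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    states: List[Optional[Vector]] = [None] * len(xs)
    h = h0
    for t in order:
        h = step(h, xs[t])
        states[t] = h
    return states  # every slot is filled once the loop has finished


def bidirectional(step_fw: Step, step_bw: Step,
                  xs: List[Vector], h0: Vector) -> List[Vector]:
    """Concatenate forward and backward states at every time step."""
    forward = run_direction(step_fw, xs, h0)
    backward = run_direction(step_bw, xs, h0, reverse=True)
    return [f + b for f, b in zip(forward, backward)]  # list concatenation


def toy_step(h_prev: Vector, x: Vector) -> Vector:
    """Stand-in recurrent cell (NOT an LSTM): running average of the input."""
    return [0.5 * h + 0.5 * v for h, v in zip(h_prev, x)]


xs = [[1.0], [2.0], [3.0]]
out = bidirectional(toy_step, toy_step, xs, h0=[0.0])
# out == [[0.5, 1.375], [1.25, 1.75], [2.125, 1.5]]
```

In the controller, two such bidirectional cells would be run over the same input, one producing the forget-gate signal and one the output-gate signal.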
3.2 Listener Cell
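The listener cell is another recurrent cell. The final output of the RCRN is generated by the listener cell, which is influenced by the controller cell. First, the listener cell uses a base recurrent model to process the sequence input. The equations of this base recurrent model are defined as follows:

$$i^3_t = \sigma_s(W^3_i x_t + U^3_i h^3_{t-1} + b^3_i) \tag{10}$$
$$f^3_t = \sigma_s(W^3_f x_t + U^3_f h^3_{t-1} + b^3_f) \tag{11}$$
$$o^3_t = \sigma_s(W^3_o x_t + U^3_o h^3_{t-1} + b^3_o) \tag{12}$$
$$c^3_t = f^3_t \odot c^3_{t-1} + i^3_t \odot \sigma(W^3_c x_t + U^3_c h^3_{t-1} + b^3_c) \tag{13}$$
$$h^3_t = o^3_t \odot \sigma(c^3_t) \tag{14}$$

Similarly, a bidirectional adaptation is used, obtaining $h^3_t = [\overrightarrow{h}^3_t; \overleftarrow{h}^3_t]$. Next, using $h^1_t, h^2_t$ (outputs of the controller cell), we define another recurrent operation as follows:

$$c^4_t = \sigma_s(h^1_t) \odot c^4_{t-1} + (1 - \sigma_s(h^1_t)) \odot h^3_t \tag{15}$$
$$h^4_t = h^2_t \odot c^4_t \tag{16}$$

where $c^j_t$ and $h^j_t$ are the cell and hidden states at time step $t$. $W^3_*, U^3_*, b^3_*$ are the parameters of the listener cell where $* \in \{i, f, o, c\}$. Note that $h^1_t$ and $h^2_t$ are the outputs of the controller cell. In this formulation, $\sigma_s(h^1_t)$ acts as the forget gate for the listener cell. Likewise $h^2_t$ acts as the output gate for the listener.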
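To make the controller-listener interaction concrete, the following is a minimal sketch of the final listener recurrence. It assumes an SRU-style update in which the sigmoid of the first controller output is the forget gate over the base listener output, and the second controller output multiplies the resulting cell state. The function name `listener_fuse`, the zero initial cell state and the list-of-lists layout are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def listener_fuse(h1: List[Vector], h2: List[Vector],
                  h3: List[Vector]) -> List[Vector]:
    """Fuse controller outputs (h1, h2) with base listener outputs (h3).

    All arguments have shape [time][dim]. Assumed update per time step:
        f_t = sigmoid(h1_t)
        c_t = f_t * c_{t-1} + (1 - f_t) * h3_t
        out_t = h2_t * c_t
    """
    dim = len(h3[0])
    c = [0.0] * dim  # assumed zero initial cell state
    outputs = []
    for h1_t, h2_t, h3_t in zip(h1, h2, h3):
        f = [sigmoid(v) for v in h1_t]
        # Purely element-wise: no matrix product inside the time loop.
        c = [f_j * c_j + (1.0 - f_j) * x_j
             for f_j, c_j, x_j in zip(f, c, h3_t)]
        outputs.append([o_j * c_j for o_j, c_j in zip(h2_t, c)])
    return outputs


# One-dimensional example: a controller output of 0 gives a forget gate of 0.5.
fused = listener_fuse(h1=[[0.0], [0.0]], h2=[[1.0], [0.5]], h3=[[2.0], [4.0]])
# fused == [[1.0], [1.25]]
```

Because every dimension is updated independently of the others, only the time axis is sequential in this step.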
3.3 Overall RCRN Architecture, Variants and Implementation
Intuitively, the overall architecture of the RCRN model can be explained as follows: Firstly, the controller cell can be thought of as two BiRNN models whose hidden states are used as the forget and output gates for another recurrent model, i.e., the listener. The listener uses a single BiRNN model for sequence encoding and then allows this representation to be altered by listening to the controller. An alternative interpretation of our model architecture is that it is essentially a ‘recurrent-over-recurrent’ model. The formulation above uses BiLSTMs as the atomic building block for RCRN. It is also possible to have a simplified variant of RCRN that uses GRUs as the atomic block (we omit technical descriptions due to the lack of space), which we found to perform slightly better on certain datasets.
For efficiency purposes, we use the cuDNN optimized version of the base recurrent unit (LSTMs/GRUs). Additionally, note that the final recurrent cell (Equation (15)) can be subject to cuda-level optimization following simple recurrent units (SRU) (Lei and Zhang, 2017). (We adapt the cuda kernel as a custom Tensorflow op in our experiments. While the authors of SRU release their cuda-op at https://github.com/taolei87/sru, we use a third-party open-source Tensorflow version which can be found at https://github.com/JonathanRaiman/tensorflow_qrnn.git.) The key idea is that this operation can be performed along the dimension axis, enabling greater parallelization on the GPU. For the sake of brevity, we refer interested readers to (Lei and Zhang, 2017). Note that this form of cuda-level optimization was also performed in the Quasi-RNN model (Bradbury et al., 2016), which effectively subsumes the SRU model.
On Parameter Cost and Memory Efficiency
Note that a single RCRN model is equivalent to a stacked BiLSTM of 3 layers. This is clear when we consider how two controller BiRNNs are used to control a single listener BiRNN. As such, for our experiments, when considering only the encoder and keeping all other components constant, 3L-BiLSTM has equal parameters to RCRN while RCRN and 3L-BiLSTM are approximately three times larger than BiLSTM.
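The parameter comparison above can be checked with a short back-of-the-envelope count. This sketch assumes the standard LSTM parameterization (four gate blocks, each with an input matrix, a recurrent matrix and a bias) and that the final listener recurrence adds no parameters of its own. The function names and the sizes used are hypothetical.

```python
def lstm_params(d_in: int, d_hidden: int) -> int:
    """Parameters of one unidirectional LSTM: 4 blocks of (W, U, b)."""
    return 4 * (d_hidden * (d_in + d_hidden) + d_hidden)


def bilstm_params(d_in: int, d_hidden: int) -> int:
    """Two directions with separate parameters."""
    return 2 * lstm_params(d_in, d_hidden)


def stacked_bilstm_params(d_in: int, d_hidden: int, layers: int) -> int:
    """Layer 1 reads the input; later layers read the 2*d_hidden BiLSTM output."""
    first = bilstm_params(d_in, d_hidden)
    rest = (layers - 1) * bilstm_params(2 * d_hidden, d_hidden)
    return first + rest


def rcrn_params(d_in: int, d_hidden: int) -> int:
    """Two controller BiLSTMs plus one listener BiLSTM, all reading the input."""
    return 3 * bilstm_params(d_in, d_hidden)


# Hypothetical sizes chosen so that the input width equals 2 * d_hidden.
d_in, d_hidden = 200, 100
single = bilstm_params(d_in, d_hidden)             # 240800
stacked = stacked_bilstm_params(d_in, d_hidden, 3)  # 722400
rcrn = rcrn_params(d_in, d_hidden)                  # 722400
ratio = rcrn / single                               # 3.0
```

Under these assumptions the RCRN and 3L-BiLSTM counts coincide exactly when the encoder input width equals the BiLSTM output width; for other input widths the two counts differ, since only the first stacked layer reads the raw input.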
4 Experiments
This section discusses the overall empirical evaluation of our proposed RCRN model.
4.1 Tasks and Datasets
In order to verify the effectiveness of our proposed RCRN architecture, we conduct extensive experiments across several tasks in the NLP domain. (While we agree that other tasks such as language modeling or NMT would be interesting to investigate, we could not muster enough GPU resources to conduct any extra experiments. We leave this for future work.)
Sentiment analysis is a text classification problem in which the goal is to determine the polarity of a given sentence/document. We conduct experiments on both sentence and document level. More concretely, we use 16 Amazon review datasets from (Liu et al., 2017), the well-established Stanford Sentiment TreeBank (SST-5/SST-2) (Socher et al., 2013) and the IMDb Sentiment dataset (Maas et al., 2011). All tasks are binary classification tasks with the exception of SST-5. The metric is the accuracy score.
The goal of question classification is to classify questions into fine-grained categories such as number or location. We use the TREC question classification dataset (Voorhees et al., 1999). The metric is the accuracy score.
Entailment classification is a well-established and popular task in the field of natural language understanding and inference. Given two sentences $s_1$ and $s_2$, the goal is to determine if $s_2$ entails or contradicts $s_1$. We use two popular benchmark datasets, i.e., the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and the SciTail (Science Entailment) dataset (Khot et al., 2018). This is a pairwise classification problem in which the metric is also the accuracy score.
Answer selection is a standard problem in information retrieval and learning-to-rank. Given a question, the task at hand is to rank candidate answers. We use the popular WikiQA (Yang et al., 2015) and TrecQA (Wang et al., 2007) datasets. For TrecQA, we use the cleaned setting as denoted by Rao et al. (2016). The evaluation metrics are Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
Reading comprehension involves reading documents and answering questions about these documents. We use the recent NarrativeQA (Kočiskỳ et al., 2017) dataset, which involves reasoning and answering questions over story summaries. We follow the original paper and report scores on BLEU-1, BLEU-4, Meteor and Rouge-L.
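For readers unfamiliar with the two ranking metrics, the sketch below computes MAP and MRR from ranked relevance labels. The input format (one list of 0/1 labels per question, ordered by model score) and the function names are assumptions for illustration; official evaluation scripts may differ in how they treat questions without a correct answer.

```python
from typing import Callable, List


def average_precision(ranked_labels: List[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant answer."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(ranked_labels: List[int]) -> float:
    """1 / rank of the first relevant answer (0 if there is none)."""
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def mean_over_questions(metric: Callable[[List[int]], float],
                        questions: List[List[int]]) -> float:
    return sum(metric(labels) for labels in questions) / len(questions)


# Two toy questions; 1 marks a correct candidate, listed in ranked order.
questions = [[0, 1, 1], [1, 0, 0]]
map_score = mean_over_questions(average_precision, questions)  # 19/24
mrr_score = mean_over_questions(reciprocal_rank, questions)    # 0.75
```

MRR only rewards the position of the first correct answer, whereas MAP accounts for every correct answer in the ranking.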
4.2 Task-Specific Model Architectures and Implementation Details
In this section, we describe the task-specific model architectures for each task.
This architecture is used for all text classification tasks (sentiment analysis and question classification datasets). We use 300D GloVe (Pennington et al., 2014) vectors with 600D CoVe (McCann et al., 2017) vectors as pretrained embedding vectors. An optional character-level word representation is also added (constructed with a standard BiGRU model). The output of the embedding layer is passed into the RCRN model directly without using any projection layer. Word embeddings are not updated during training. Given the hidden output states of the dimensional RCRN cell, we take the concatenation of the max, mean and min pooling of all hidden states to form the final feature vector. This feature vector is passed into a single dense layer with ReLU activations of dimensions. The output of this layer is then passed into a softmax layer for classification. This model optimizes the cross entropy loss. We train this model using Adam (Kingma and Ba, 2014) and learning rate is tuned amongst .
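The pooling step of the classification model can be sketched as follows: max, mean and min are taken over time for each hidden dimension and the three results are concatenated. The function name `pool_features` and the toy values are hypothetical.

```python
from typing import List


def pool_features(hidden_states: List[List[float]]) -> List[float]:
    """Concatenate max, mean and min pooling over time.

    `hidden_states` has shape [time][dim]; the result has length 3 * dim.
    """
    columns = list(zip(*hidden_states))  # one tuple per hidden dimension
    max_pool = [max(col) for col in columns]
    mean_pool = [sum(col) / len(col) for col in columns]
    min_pool = [min(col) for col in columns]
    return max_pool + mean_pool + min_pool


# Three time steps, two hidden dimensions.
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
features = pool_features(states)
# features == [3.0, 5.0, 2.0, 1.0, 1.0, -2.0]
```

The resulting vector has a fixed length regardless of sequence length, which is what allows it to feed a dense layer.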
This architecture is used for entailment tasks. It is a pairwise classification model with two input sequences. Similar to the singleton classification model, we utilize the identical input encoder (GloVe, CoVe and character RNN) but include an additional part-of-speech (POS tag) embedding. We pass the input representation into a two layer highway network (Srivastava et al., 2015) of hidden dimensions before passing into the RCRN encoder. The feature representation of and is the concatenation of the max and mean pooling of the RCRN hidden outputs. To compare and , we pass into a two layer highway network. This output is then passed into a softmax layer for classification. We train this model using Adam and the learning rate is tuned amongst . We mainly focus on the encoder-only setting which does not allow cross sentence attention. This is a commonly tested setting on the SNLI dataset.
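As a reminder of what a highway layer computes, here is a minimal sketch of a single layer in the sense of Srivastava et al. (2015): a transform gate mixes a nonlinear transform of the input with the input itself. The choice of ReLU for the transform, the helper names and the toy weights are assumptions; the text above does not specify them.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def highway_layer(x: Vector, w_h: Matrix, b_h: Vector,
                  w_t: Matrix, b_t: Vector) -> Vector:
    """y = t * relu(W_h x + b_h) + (1 - t) * x, with t = sigmoid(W_t x + b_t)."""
    transform = [max(0.0, a + b) for a, b in zip(matvec(w_h, x), b_h)]
    gate = [1.0 / (1.0 + math.exp(-(a + b)))
            for a, b in zip(matvec(w_t, x), b_t)]
    return [t * h + (1.0 - t) * v for t, h, v in zip(gate, transform, x)]


x = [1.0, -1.0]
w_h = [[2.0, 0.0], [0.0, 2.0]]   # toy transform weights
w_t = [[0.0, 0.0], [0.0, 0.0]]   # zero gate weights: gate depends on bias only
zeros = [0.0, 0.0]

half_open = highway_layer(x, w_h, zeros, w_t, zeros)
# gate = 0.5 everywhere, so half_open == [1.5, -0.5]

# A two layer highway network simply applies the layer twice.
two_layers = highway_layer(half_open, w_h, zeros, w_t, zeros)
```

When the gate bias is strongly negative the layer passes its input through almost unchanged, which is why highway layers require input and output widths to match.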
This architecture is used for the ranking tasks (i.e., answer selection). We use the model architecture from Attentive Pooling BiLSTMs (AP-BiLSTM) (dos Santos et al., 2016) as our base and swap the RNN encoder with our RCRN encoder. The dimensionality is set to . The similarity scoring function is the cosine similarity and the objective function is the pairwise hinge loss with a margin of . We use negative sampling of to train our model. We train our model using Adadelta (Zeiler, 2012) with a learning rate of .
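The ranking objective can be illustrated with a short sketch of cosine similarity and the pairwise hinge loss over one (question, positive answer, negative answer) triple. The vectors and the margin value below are hypothetical placeholders, not the settings used in the experiments.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_hinge(question: Vector, positive: Vector, negative: Vector,
                   margin: float) -> float:
    """max(0, margin - s(q, a+) + s(q, a-)) with cosine similarity s."""
    return max(0.0, margin - cosine(question, positive)
               + cosine(question, negative))


margin = 0.5  # hypothetical value for illustration only
q = [1.0, 0.0]
good = [1.0, 0.0]   # identical direction to q: cosine 1
bad = [0.0, 1.0]    # orthogonal to q: cosine 0

loss_correct_order = pairwise_hinge(q, good, bad, margin)  # 0.0
loss_wrong_order = pairwise_hinge(q, bad, good, margin)    # 1.5
```

The loss is zero once the positive answer scores higher than the negative one by at least the margin, so only violating pairs contribute gradient.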
Reading Comprehension Model
We use R-NET (Wang et al., 2017) as the base model. Since R-NET uses three Bidirectional GRU layers as the encoder, we replaced this stacked BiGRU layer with RCRN. For fairness, we use the GRU variant of RCRN instead. The dimensionality of the encoder is set to . We train both models using Adam with a learning rate of .
For all datasets, we include additional ablative baselines, swapping the RCRN with (1) a standard BiLSTM model and (2) a stacked BiLSTM of 3 layers (3L-BiLSTM). This allows us to fairly observe the impact of different encoder models within the same overall model framework.
4.3 Overall Results
This section discusses the overall results of our experiments.
On the 16 review datasets (Table 3) from (Liu et al., 2017; Zhang et al., 2018), our proposed RCRN architecture achieves the highest score on all 16 datasets, outperforming the existing state-of-the-art model - sentence state LSTMs (SLSTM) (Zhang et al., 2018). The macro average performance gain over BiLSTMs () and Stacked (2 X BiLSTM) () is also notable. On the same architecture, our RCRN outperforms ablative baselines BiLSTM by and 3L-BiLSTM by on average across 16 datasets.
Results on SST-5 (Table 3) and SST-2 (Table 3) are also promising. More concretely, our RCRN architecture achieves state-of-the-art results on SST-5 and SST-2. RCRN also outperforms many strong baselines such as DiSAN (Shen et al., 2017), a self-attentive model, and the Bi-Attentive Classification Network (BCN) (McCann et al., 2017), which also uses CoVe vectors. On SST-2, strong baselines such as Neural Semantic Encoders (Munkhdalai and Yu, 2016) and the BCN model are similarly outperformed by our RCRN model.
Finally, on the IMDb sentiment classification dataset (Table 5), RCRN achieved accuracy. Our proposed RCRN outperforms Residual BiLSTMs (Longpre et al., 2016), 4-layered Quasi Recurrent Neural Networks (QRNN) (Bradbury et al., 2016) and the BCN model which can be considered to be very competitive baselines. RCRN also outperforms ablative baselines BiLSTM () and 3L-BiLSTM ().
Our results on the TREC question classification dataset (Table 5) are also promising. RCRN achieved a state-of-the-art score of on this dataset. A notable baseline is the Densely Connected BiLSTM (Ding et al., 2018), a deep residual stacked BiLSTM model which RCRN outperforms (). Our model also outperforms BCN (+0.4%) and SRU (). Our ablative BiLSTM baselines achieve reasonably high scores, possibly due to CoVe embeddings. However, our RCRN can further increase the performance score.
Results on entailment classification are also encouraging. On SNLI (Table 9), RCRN achieves accuracy, which is competitive with Gumbel LSTM. Moreover, RCRN outperforms a wide range of baselines, including self-attention based models such as multi-head (Vaswani et al., 2017) and DiSAN (Shen et al., 2017). There is also a performance gain of over Bi-SRU even though our model does not use attention at all. RCRN also outperforms shortcut stacked encoders, which use a series of BiLSTMs connected by shortcut layers. Post review, as per reviewer request, we experimented with adding cross sentence attention, in particular adding the attention of Parikh et al. (2016) on 3L-BiLSTM and RCRN. We found that they performed comparably (both at ). We did not have the resources to experiment further, even though intuitively incorporating different/newer variants of attention (Kim et al., 2018; Tay et al., 2018a; Chen et al., 2017) and/or ELMo (Peters et al., 2018) would likely raise the score further. However, we hypothesize that cross sentence attention forces less reliance on the encoder, which would explain why stacked BiLSTMs and RCRNs perform similarly in this setting.
The results on SciTail similarly show that RCRN is more effective than BiLSTM (). Moreover, RCRN outperforms several baselines in (Khot et al., 2018) including models that use cross sentence attention such as DecompAtt (Parikh et al., 2016) and ESIM (Chen et al., 2017). However, it still falls short of recent state-of-the-art models such as OpenAI’s Generative Pretrained Transformer (Radford et al., 2018).
Results on the answer selection task (Table 9) show that RCRN leads to considerable improvements on both the WikiQA and TrecQA datasets. We investigate two settings. First, we reimplement AP-BiLSTM and swap the BiLSTM for RCRN encoders. Second, we completely remove all attention layers from both models to test the ability of the standalone encoder. Without attention, RCRN gives an improvement of on both datasets. With attentive pooling, RCRN maintains a improvement in terms of MAP score. However, the gains on MRR are greater (). Notably, the AP-RCRN model outperforms the official results reported in (dos Santos et al., 2016). Overall, we observe that RCRN is much stronger than BiLSTMs and 3L-BiLSTMs on this task.
Results on reading comprehension (Table 9) show that enhancing R-NET with RCRN can lead to considerable improvements. This leads to an improvement of on all four metrics. Note that our model only uses a single layered RCRN while R-NET uses 3 layered BiGRUs. This empirical evidence might suggest that RCRN is a better way to utilize multiple recurrent layers.
Across all 26 datasets, RCRN outperforms not only standard BiLSTMs but also 3L-BiLSTMs which have approximately equal parameterization. 3L-BiLSTMs were overall better than BiLSTMs but lose out on a minority of datasets. RCRN outperforms a wide range of competitive baselines such as DiSAN, Bi-SRUs, BCN and LSTM-CNN, etc. We achieve (close to) state-of-the-art performance on SST, TREC question classification and 16 Amazon review datasets.
4.4 Runtime Analysis
This section aims to get a benchmark on model performance with respect to model efficiency. In order to do that, we benchmark RCRN along with BiLSTMs and 3 layered BiLSTMs (with and without cuDNN optimization) on different sequence lengths (i.e., ). We use the IMDb sentiment task. We use the same standard hardware (a single Nvidia GTX1070 card) and an identical overarching model architecture. The dimensionality of the model is set to with a fixed batch size of . Finally, we also benchmark a CUDA optimized adaptation of RCRN which has been described earlier (Section 3.3).
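Table 10: Training and inference times in seconds per epoch. Within each group, the five columns correspond to the benchmarked sequence lengths.
| Model | Training time (seconds/epoch) | | | | | Inference (seconds/epoch) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 layer BiLSTM | 29 | 50 | 113 | 244 | 503 | 12 | 20 | 38 | 72 | 150 |
| 1 layer BiLSTM (cuDNN) | 5 | 6 | 9 | 14 | 26 | 2 | 3 | 4 | 6 | 10 |
| 3 layer BiLSTM (cuDNN) | 10 | 14 | 23 | 42 | 80 | 4 | 5 | 9 | 16 | 32 |
| RCRN (cuDNN + cuda optimized) | 10 | 13 | 21 | 40 | 78 | 4 | 5 | 8 | 15 | 29 |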
Table 10 reports training/inference times of all benchmarked models. The fastest model is naturally the 1 layer BiLSTM (cuDNN). Intuitively, the speed of RCRN should be roughly equivalent to using 3 BiLSTMs. Surprisingly, we found that the cuda optimized RCRN performs consistently slightly faster than the 3 layer BiLSTM (cuDNN). At the very least, RCRN provides comparable efficiency to using stacked BiLSTM and empirically we show that there is nothing to lose in this aspect. However, we note that cuda-level optimizations have to be performed. Finally, the non-cuDNN optimized BiLSTM and stacked BiLSTMs are also provided for reference.
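The comparison between the cuda optimized RCRN and the 3 layer BiLSTM (cuDNN) can be read off the reported timings directly. The sketch below only restates the numbers from the table and computes the RCRN-to-3L-BiLSTM time ratio per column; the variable names are illustrative.

```python
# Seconds per epoch as reported in the runtime table, one entry per
# benchmarked sequence length (columns in the order they appear).
train_3l_cudnn = [10, 14, 23, 42, 80]
train_rcrn = [10, 13, 21, 40, 78]
infer_3l_cudnn = [4, 5, 9, 16, 32]
infer_rcrn = [4, 5, 8, 15, 29]


def ratios(candidate, reference):
    """Time of `candidate` relative to `reference` (below 1.0 means faster)."""
    return [c / r for c, r in zip(candidate, reference)]


train_ratio = ratios(train_rcrn, train_3l_cudnn)
infer_ratio = ratios(infer_rcrn, infer_3l_cudnn)

# RCRN is never slower than the 3 layer cuDNN BiLSTM in any column.
never_slower = all(r <= 1.0 for r in train_ratio + infer_ratio)

for column, (tr, inf) in enumerate(zip(train_ratio, infer_ratio), start=1):
    print(f"column {column}: train x{tr:.3f}, inference x{inf:.3f}")
```

In the last column the ratios are 78/80 = 0.975 for training and 29/32 ≈ 0.906 for inference, which matches the "slightly faster" reading in the text.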
5 Conclusion and Future Directions
We proposed Recurrently Controlled Recurrent Networks (RCRN), a new recurrent architecture and encoder for a myriad of NLP tasks. RCRN operates in a novel controller-listener architecture which uses RNNs to learn the gating functions of another RNN. We apply RCRN to a wide range of NLP tasks and achieve promising/highly competitive results on all tasks and 26 benchmark datasets. Overall findings suggest that our controller-listener architecture is more effective than stacking RNN layers. Moreover, RCRN remains equally (or slightly more) efficient compared to stacked RNNs of approximately equal parameterization. There are several potentially interesting directions for further investigating RCRNs: firstly, RCRNs controlling other RCRNs and, secondly, RCRNs in other domains where recurrent models are also prevalent for sequence modeling. The source code of our model can be found at https://github.com/vanzytay/NIPS2018_RCRN.
We thank the anonymous reviewers and area chair from NIPS 2018 for their constructive and high quality feedback.
References
- Bahdanau et al.  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bowman et al.  Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642, 2015.
- Bradbury et al.  James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. CoRR, abs/1611.01576, 2016.
- Chang et al.  Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems, pages 76–86, 2017.
- Chen et al.  Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1657–1668, 2017.
- Cho et al.  Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- Choi et al.  Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Unsupervised learning of task-specific tree structures with Tree-LSTMs. arXiv preprint arXiv:1707.02786, 2017.
- Chung et al.  Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
- Danihelka et al.  Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1986–1994, 2016.
- Dieng et al.  Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702, 2016.
- Ding et al.  Zixiang Ding, Rui Xia, Jianfei Yu, Xiang Li, and Jian Yang. Densely connected bidirectional LSTM with applications to sentence classification. arXiv preprint arXiv:1802.00889, 2018.
- dos Santos et al.  Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. Attentive pooling networks. CoRR, abs/1602.03609, 2016.
- El Hihi and Bengio  Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in neural information processing systems, pages 493–499, 1996.
- Graves et al.  Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.
- Guo et al.  Hongyu Guo, Colin Cherry, and Jiang Su. End-to-end multi-view networks for text classification. arXiv preprint arXiv:1704.05907, 2017.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Huang et al.  Minlie Huang, Qiao Qian, and Xiaoyan Zhu. Encoding syntactic knowledge in neural networks for sentiment classification. ACM Transactions on Information Systems (TOIS), 35(3):26, 2017.
- Johnson and Zhang  Rie Johnson and Tong Zhang. Supervised and semi-supervised text categorization using LSTM for region embeddings. arXiv preprint arXiv:1602.02373, 2016.
- Khot et al.  Tushar Khot, Ashish Sabharwal, and Peter Clark. SciTail: A textual entailment dataset from science question answering. In AAAI, 2018.
- Kiela et al.  Douwe Kiela, Changhan Wang, and Kyunghyun Cho. Context-attentive embeddings for improved sentence representations. arXiv preprint arXiv:1804.07983, 2018.
- Kim et al.  Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360, 2018.
- Kim  Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Kočiskỳ et al.  Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040, 2017.
- Koutnik et al.  Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.
- Kumar et al.  Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
- Lei and Zhang  Tao Lei and Yu Zhang. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755, 2017.
- Liu et al.  Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Adversarial multi-task learning for text classification. arXiv preprint arXiv:1704.05742, 2017.
- Longpre et al.  Shayne Longpre, Sabeek Pradhan, Caiming Xiong, and Richard Socher. A way out of the odyssey: Analyzing and combining recent insights for lstms. arXiv preprint arXiv:1611.05104, 2016.
- Looks et al.  Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181, 2017.
- Maas et al.  Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 142–150. Association for Computational Linguistics, 2011.
- McCann et al.  Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308, 2017.
- Miyato et al.  Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.
- Munkhdalai and Yu  Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. CoRR, abs/1607.04315, 2016.
- Nie and Bansal  Yixin Nie and Mohit Bansal. Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, RepEval@EMNLP 2017, Copenhagen, Denmark, September 8, 2017, pages 41–45, 2017.
- Parikh et al.  Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2249–2255, 2016.
- Pennington et al.  Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543, 2014.
- Peters et al.  Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
- Radford et al.  Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
- Radford et al.  Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
- Rao et al.  Jinfeng Rao, Hua He, and Jimmy J. Lin. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 1913–1916, 2016.
- Rocktäschel et al.  Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.
- Seo et al.  Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.
- Shen et al.  Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. arXiv preprint arXiv:1709.04696, 2017.
- Shen et al.  Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Bi-directional block self-attention for fast and memory-efficient sequence modeling. arXiv preprint arXiv:1804.00857, 2018.
- Socher et al.  Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. Citeseer, 2013.
- Srivastava et al.  Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.
- Sutskever et al.  Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
- Tai et al.  Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
- Tay et al.  Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. A compare-propagate architecture with alignment factorization for natural language inference. arXiv preprint arXiv:1801.00102, 2017.
- Tay et al. [2018a] Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Co-stack residual affinity networks with multi-level attention refinement for matching text sequences. arXiv preprint arXiv:1810.02938, 2018a.
- Tay et al. [2018b] Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Multi-range reasoning for machine comprehension. arXiv preprint arXiv:1803.09074, 2018b.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
- Voorhees et al.  Ellen M Voorhees et al. The TREC-8 question answering track report. In TREC, volume 99, pages 77–82, 1999.
- Wang et al.  Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. What is the Jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 22–32, 2007.
- Wang et al.  Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198, 2017.
- Wieting et al.  John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198, 2015.
- Xingjian et al.  Shi Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
- Xiong et al.  Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. CoRR, abs/1611.01604, 2016.
- Yang et al.  Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2013–2018, 2015.
- Yu and Munkhdalai  Hong Yu and Tsendsuren Munkhdalai. Neural tree indexers for text understanding. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 11–21, 2017.
- Zeiler  Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Zhang et al. [2016a] Rui Zhang, Honglak Lee, and Dragomir Radev. Dependency sensitive convolutional neural networks for modeling sentences and documents. arXiv preprint arXiv:1611.02361, 2016a.
- Zhang et al. [2016b] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory RNNs for distant speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5755–5759. IEEE, 2016b.
- Zhang et al.  Yue Zhang, Qi Liu, and Linfeng Song. Sentence-state LSTM for text representation. arXiv preprint arXiv:1805.02474, 2018.
- Zhou et al.  Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.