A Lightweight Recurrent Network for Sequence Modeling
Recurrent networks have achieved great success on various sequential tasks with the assistance of complex recurrent units, but suffer from severe computational inefficiency due to weak parallelization. One direction to alleviate this issue is to shift heavy computations outside the recurrence. In this paper, we propose a lightweight recurrent network, or LRN. LRN uses input and forget gates to handle long-range dependencies as well as gradient vanishing and explosion, with all parameter-related calculations factored outside the recurrence. The recurrence in LRN only manipulates the weight assigned to each token, tightly connecting LRN with self-attention networks. We apply LRN as a drop-in replacement of existing recurrent units in several neural sequential models. Extensive experiments on six NLP tasks show that LRN yields the best running efficiency with little or no loss in model performance.111Source code is available at https://github.com/bzhangGo/lrn.
Biao Zhang Rico Sennrich School of Informatics, University of Edinburgh B.Zhang@ed.ac.uk, email@example.com Institute of Computational Linguistics, University of Zurich
Various natural language processing (NLP) tasks can be categorized as sequence modeling tasks, where recurrent networks (RNNs) are widely applied and contribute greatly to state-of-the-art neural systems (Yang et al., 2018; Peters et al., 2018; Zhang et al., 2018; Chen et al., 2018; Kim et al., 2019). To avoid the optimization bottleneck caused by gradient vanishing and/or explosion (Bengio et al., 1994), Hochreiter and Schmidhuber (1997) and Cho et al. (2014) develop gate structures to ease information propagation from distant words to the current position. Nevertheless, integrating these traditional gates inevitably increases computational overhead which is accumulated along token positions due to the sequential nature of RNNs. As a result, the weak parallelization of RNNs makes the benefits from improved model capacity expensive in terms of computational efficiency.
Recent studies introduce different solutions to this issue. Zhang et al. (2018) introduce the addition-subtraction twin-gated recurrent unit (ATR), which reduces the amount of matrix operations by developing parameter-shared twin-gate mechanism. Lei et al. (2018) introduce the simple recurrent unit (SRU), which improves model parallelization by moving matrix computations outside the recurrence. Nevertheless, both ATR and SRU perform affine transformations of the previous hidden state for gates, though SRU employs a vector parameter rather than a matrix parameter. In addition, SRU heavily relies on its highway component, without which the recurrent component itself suffers from weak capacity and generalization (Lei et al., 2018).
In this paper, we propose a lightweight recurrent network (LRN), which combines the strengths of ATR and SRU. The structure of LRN is simple: an input gate and a forget gate are applied to weight the current input and previous hidden state, respectively. LRN has fewer parameters than SRU, and compared to ATR, removes heavy calculations outside the recurrence, generating gates based on the previous hidden state without any affine transformation. In this way, computation inside each recurrent step is highly minimized, allowing better parallelization and higher speed.
The gate structure endows LRN with the capability of memorizing distant tokens as well as handling the gradient vanishing and explosion issue. This ensures LRN’s expressiveness and performance on downstream tasks. In addition, decomposing its recurrent structure discovers the correlation of input/forget gate with key/query in self-attention networks (Vaswani et al., 2017), where these two gates together manipulate the weight assigned to each token. We also reveal how LRN manages long-term and short-term memories with the decomposition.
We carry out extensive experiments on six NLP tasks, ranging from natural language inference, document classification, machine translation, question answering and part-of-speech tagging to language modeling. We use LRN as a drop-in replacement of existing recurrent units in different neural models without any other modification of model structure. Experimental results show that LRN outperforms SRU by 10%20% in terms of running speed, and is competitive with respect to performance and generalization compared against all existing recurrent units.
2 Related Work
Past decades have witnessed the rapid development of RNNs since the Elman structure was proposed (Elman, 1990). Bengio et al. (1994) point out that the gradient vanishing and explosion issue impedes the optimization and performance of RNNs. To handle this problem, Hochreiter and Schmidhuber (1997) develop LSTM where information and gradient from distant tokens can successfully pass through the current token via a gate structure and a memory cell. Unfortunately, the enhanced expressivity via complex gates comes at the cost of sacrificing computational efficiency, which becomes more severe when datasets are scaled up. Simplifying computation but keeping model capacity in RNNs raises a new challenge.
One direction is to remove redundant structures in LSTM. Cho et al. (2014) remove the memory cell and introduce the gated recurrent unit (GRU) with only two gates. Lee et al. (2017) introduce an additive structure to generate hidden representations with linear transformed inputs directly, though we empirically observe that non-linear activations can stabilize model training. Zhang et al. (2018) propose a twin-gate mechanism where input and forget gate are simultaneously produced from the same variables. We extend this mechanism by removing the affine transformation of previous hidden states.
Another direction is to shift recurrent matrix multiplications outside the recurrence so as to improve the parallelization of RNNs. Bradbury et al. (2016) propose the quasi-recurrent network (QRNN). QRNN factors all matrix multiplications out of the recurrence and employs a convolutional network to capture local input patterns. A minimal recurrent pooling function is used in parallel across different channels to handle global input patterns. Lei et al. (2017) apply the kernel method to simplify recurrence and show improved model capacity with deep stacked RNNs. This idea is extended to SRU (Lei et al., 2018) where a minimal recurrent component is strengthened via an external highway layer. The proposed LRN falls into this category with the advantage over SRU of the non-dependence on the highway component.
Orthogonal to the above work, recent studies also show the potential of accelerating matrix computation with low-level optimization. Diamos et al. (2016) emphasize persistent computational kernels to exploit GPU’s inverted memory hierarchy for reusing/caching purpose. Appleyard et al. (2016) upgrade NIVIDIA’s cuDNN implementation through exposing parallelism between operations within the recurrence. Kuchaiev and Ginsburg (2017) reduce the number of model parameters by factorizing or partitioning LSTM matrices. In general, all these techniques can be applied to any recurrent units to reduce computational overhead.
Our work is closely related with ATR and SRU. Although recent work shows that novel recurrent units derived from weighted finite state automata are effective without the hidden-to-hidden connection (Balduzzi and Ghifary, 2016; Peng et al., 2018), we empirically observe that including previous hidden states for gates is crucial for model capacity which also resonates with the evolution of SRU. Unlike ATR and SRU, however, we demonstrate that the affine transformation on the previous hidden state for gates is unnecessary. In addition, our model has a strong connection with self-attention networks.
3 Lightweight Recurrent Network
Given a sequence of input = with length of , LRN operates as follows222Bias terms are removed for clarity.:
where are model parameters and is an activation function, such as identity and tanh. and indicate the element-wise multiplication and sigmoid activation function, respectively. and correspond to the -th row of the projected sequence representation . We use the term q, k and v to denote the implicit correspondence to query, key and value in self-attention networks which is elaborated in the next section.
As shown in Eq. (1), all matrix-related operations are shifted outside the recurrence and can be pre-calculated, thereby reducing the complexity of the recurrent computation from to and easing model parallelization. The design of the input gate and forget gate is inspired by the twin-gate mechanism in ATR (Zhang et al., 2018). Unlike ATR, however, we eschew the affine transformation on the previous hidden state. By doing so, the previous hidden state directly offers positive contribution to the input gate but negative to the forget gate, ensuring adverse correlation between these two gates.
The current hidden state is a weighted average of the current input and the previous hidden state followed by an element-wise activation. When identity function is employed, our model shows analogous properties to ATR. However, we empirically observe that this leads to gradually increased hidden representation values, resulting in optimization instability. Unlike SRU, which controls stability through a particular designed scaling term, we replace the identity function with the tanh function, which is simple but effective.
4 Structure Decomposition
In this section, we show an in-depth analysis of LRN by decomposing the recurrent structure. With an identity activation, the -th hidden state can be expanded as follows:
where the representation of the current token is composed of all previous tokens with their contribution distinguished by both input and forget gates.
Relation with self-attention network. After grouping these gates, we observe that:
Each weight can be regarded as a query from the current token to the -th input token . This query chain can be decomposed into two parts: a key represented by and a query represented by . The former is modulated through the weight matrix , and tightly associated with the corresponding input token. Information carried by the key remains intact during the evolution of time step . In contrast, the latter, induced by the weight matrix , highly depends on the position and length of this chain, which dynamically changes between different token pairs.
The weights generated by keys and queries are assigned to values represented by and manipulated by the weight matrix . Compared with self-attention networks, LRN shows analogous weight parameters and model structure. The difference is that weights in self-attention networks are normalized across all input tokens. Instead, weights in LRN are unidirectional, unnomalized and spanned over all channels.
Memory in LRN Alternatively, we can view the gating mechanism in LRN as a memory that gradually forgets information.
Given the value representation at -th time step , the information delivered to later time step () in LRN is as follows:
The input gate indicates the moment that LRN first accesses the input token , whose value reflects the amount of information or knowledge allowed from this token. A larger input gate corresponds to a stronger input signal, thereby a large change of activating short-term memory. This information is then delivered through a forget chain where memory is gradually decayed by a forget gate at each time step. The degree of memory decaying is dynamically controlled by the input sequence itself. When a new incoming token is more informative, the forget gate would increase so that previous knowledge is erased so as to make way for new knowledge in the memory. By contrast, meaningless tokens would be simply ignored.
|Rocktäschel et al. (2016)||250K||83.50||-||-||-||-||-||-||-|
5 Gradient Analysis
Gradient vanishing and explosion are the bottleneck that impedes training of vanilla RNNs (Pascanu et al., 2013). Consider a vanilla RNN formulated as follows:
The gradient back-propagated from the -th step heavily depends on the following one-step derivation:
Due to the chain rule, the recurrent weight matrix will be repeatedly multiplied along the sequence length. Gradient vanishing/explosion results from a weight matrix with small/large norm (Pascanu et al., 2013).
In LRN, however, the recurrent weight matrix is removed. The current hidden state is generated by directly weighting the current input and the previous hidden state. The one-step derivation of Eq. (2-4) is as follows:
where and denote the derivation of Eq. (2) and Eq. (3), respectively. The difference between Eq. (9) and Eq. (10) is that the recurrent weight matrix is substituted by a more expressive component denoted as in Eq. (10). Unlike the weight matrix , the norm of is input-dependent and varies dynamically along different positions. The dependence on inputs provides LRN with the capability of avoiding gradient vanishing/explosion.
|Zhang et al. (2015)||-||6.10||-||29.16||-||40.57||-||5.26||-|
We verify the effectiveness of LRN on six diverse NLP tasks. For each task, we adopt (near) state-of-the-art neural models with RNNs handling sequence representation. We compare LRN with several cutting-edge recurrent units, including LSTM, GRU, ATR and SRU. For all comparisons, we keep the neural architecture intact and only alter the recurrent unit.333Due to possible dimension mismatch, we include an additional affine transformation on the input matrix for the highway component in SRU. In addition, we only report and compare speed statistics when all RNNs are optimally implemented where computations that can be done before the recurrence are moved outside. All RNNs are implemented without specialized cuDNN kernels. Unless otherwise stated, different models on the same task share the same set of hyperparameters.
6.1 Natural Language Inference
Settings Natural language inference reasons about the entailment relationship between a premise sentence and a hypothesis sentence. We use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and treat the task as a three-way classification task. This dataset contains 549,367 premise-hypothesis pairs for training, 9,842 pairs for developing and 9,824 pairs for testing. We employ accuracy for evaluation.
We implement a variant of the word-by-word attention model (Rocktäschel et al., 2016) using Tensorflow for this task, where we stack two additional bidirectional RNNs upon the final sequence representation and incorporate character embedding for word-level representation. The pretrained GloVe (Pennington et al., 2014) word vectors are used to initialize word embedding. We also integrate the base BERT (Devlin et al., 2018) to improve contextual modeling.
We set the character embedding size and the RNN hidden size to 64 and 300 respectively. Dropout is applied between consecutive layers with a rate of 0.3. We train models within 10 epochs using the Adam optimizer (Kingma and Ba, 2014) with a batch size of 128 and gradient norm limit of 5.0. We set the learning rate to , and apply an exponential moving average to all model parameters with a decay rate of 0.9999. These hyperparameters are tuned according to development performance.
Results Table 1 shows the test accuracy and training time of different models. Our implementation outperforms the original model where Rocktäschel et al. (2016) report an accuracy of 83.50. Overall results show that LRN achieves competitive performance but consumes the least training time. Although LSTM and GRU outperform LRN by 0.30.9 in terms of accuracy, these recurrent units sacrifice running efficiency (about 7%48%) depending on whether LN and BERT are applied. No significant performance difference is observed between SRU and LRN, but LRN has fewer model parameters and shows a speedup over SRU of 8%21%.
Models with layer normalization (LN) (Ba et al., 2016) tend to be more stable and effective. However, for LSTM, GRU and ATR, LN results in significant computational overhead (about 27%71%). In contrast, quasi recurrent models like SRU and LRN only suffer a marginal speed decrease. This is reasonable because layer normalization is moved together with matrix multiplication out of the recurrence.
Results with BERT show that contextual information is valuable for performance improvement. LRN obtains additional 4 percentage points gain with BERT and reaches an accuracy of around 89.9. This shows the compatibility of LRN with existing pretrained models. In addition, although the introduction of BERT brings in heavy matrix computation, the benefits from LRN do not disappear. LRN is still the fastest model, outperforming other recurrent units by 8%27%.
6.2 Document Classification
Settings Document classification poses challenges in the form of long-range dependencies where information from distant tokens that contribute to the correct category should be captured. We use Amazon Review Polarity (AmaPolar, 2 labels, 3.6M/0.4M for training/testing), Amazon Review Full (AmaFull, 5 labels, 3M/0.65M for training/testing), Yahoo! Answers (Yahoo, 10 labels, 1.4M/60K for training/testing) and Yelp Review Polarity (YelpPolar, 2 labels, 0.56M/38K for training/testing) from Zhang et al. (2015) for experiments. We randomly select 10% of training data for validation. Models are evaluated by test error.
We treat a document as a sequence of words. Our model is a bidirectional RNN followed by an attentive pooling layer. The word-level representation is composed of a pretrained GloVe word vector and a convolutional character vector. We use Tensorflow for implementation and do not use layer normalization. We set character embedding size to 32, RNN hidden size to 64 and dropout rate to 0.1. Model parameters are tuned by Adam optimizer with initial learning rate of . Gradients are clipped when their norm exceeds 5. We limit the maximum document length to 400 and maximum training epochs to 6. Parameters are smoothed by an exponential moving average with a decay rate of 0.9999. These hyperparameters are tuned according to development performance.
Results Table 2 summarizes the classification results. LRN achieves comparable classification performance against ATR and SRU, but slightly underperforms LSTM and GRU (-0.45-1.22). This indicates that LRN is capable of handling long-range dependencies though not as strong as complex recurrent units. Instead, the simplification endows LRN with less computational overhead than these units. Particularly, LRN accelerates the training over LSTM and SRU by about 20%, or several days of training time on GeForce GTX 1080 Ti.444We notice that ATR operates faster than SRU. This is because though in theory SRU can be highly optimized for parallelization, computational framework like Tensorflow can not handle it automatically and the smaller amount of calculation in ATR has more advantage in practice.
6.3 Machine Translation
Settings Machine translation is the task of transforming meaning from a source language to a target language. We experiment with the WMT14 English-German translation task (Bojar et al., 2014) which consists of 4.5M training sentence pairs.555Preprocessed data is available at (Zhang et al., 2018): https://drive.google.com/open?id=15WRLfle66CO1zIGKbyz0FsFmUcINyb4X. We use newstest2013 as our development set and newstest2014 as our test set. Case-sensitive tokenized BLEU score is used for evaluation.
We implement a variant of the GNMT system (Wu et al., 2016) using Tensorflow, enhanced with residual connections, layer normalization, label smoothing, a context-aware component (Zhang et al., 2017) and multi-head attention (Vaswani et al., 2017). Byte-pair encoding (Sennrich et al., 2016) is used to reduce the vocabulary size to 32K. We set the hidden size and embedding size to 1024. Models are trained using Adam optimizer with adaptive learning rate schedule (Chen et al., 2018). We cut gradient norm to 1.0 and set the token size to 32K. Label smoothing rate is set to 0.1.
Model Variant Apart from LRN, we develop an improved variant for machine translation that includes an additional output gate. Formally, we change the Eq. (4) to the following one:
We denote this variant oLRN. Like LRN, the added matrix transformation in oLRN can be shifted out of the recurrence, bringing in little computational overhead. The design of this output gate is inspired by the LSTM structure, which acts as a controller to adjust information flow. In addition, this gate helps stabilize the hidden activation to avoid value explosion, and also improves model fitting capacity.
Results The results in Table 3 show that translation quality of LRN is slightly worse than that of GRU (-0.02 BLEU). After incorporating the output gate, however, oLRN yields the best BLEU score of 26.73, outperforming GRU (+0.45 BLEU). In addition, the training time results in Table 3 confirm the computational advantage of LRN over all other recurrent units, where LRN speeds up over ATR and SRU by approximately 25%. For decoding, nevertheless, the autoregressive schema of GNMT disables position-wise parallelization. In this case, the recurrent unit with the least computation operations, i.e. ATR, becomes the fastest. Still, both LRN and oLRN translate sentences faster than SRU (+15%/+6%).
|Yang et al. (2018)||22M||55.97||54.44||47.69||63.33||61.45||40.68|
6.4 Reading Comprehension
Settings Reading comprehension aims at providing correct answers to a query based on a given document, which involves complex sentence matching, reasoning and knowledge association. We use the SQuAD corpus (Rajpurkar et al., 2016) for this task and adopt span-based extraction method. This corpus contains over 100K document-question-answer triples. We report exact match (EM) and F1-score (F1) on the development set for evaluation.
We employ the public available rnet model (Wang et al., 2017)666https://github.com/HKUST-KnowComp/R-Net in Tensorflow. We use the default model settings: character embedding size 8, hidden size 75, batch size 64, and Adadelta optimizer (Zeiler, 2012) with initial learning rate of 0.5. Gradient norm is cut to 5.0. We also experiment with Elmo (Peters et al., 2018), and feed the Elmo representation in before the encoding layer and after the matching layer with a dropout of 0.5.
Results Table 4 lists the EM/F1 score of different models. In this task, LRN outperforms ATR and SRU in terms of both EM and F1 score. After integrating Elmo for contextual modeling, the performance of LRN reaches the best (76.14 EM and 83.83 F1), beating both GRU and LSTM (+0.33EM, +0.71F1). As recent studies show that cases in SQuAD are dominated by local pattern matching (Jia and Liang, 2017), we argue that LRN is good at handling local dependencies.
6.5 Named Entity Recognition
Settings Named entity recognition (NER) classifies named entity mentions into predefined categories. We use the CoNLL-2003 English NER dataset (Tjong Kim Sang and De Meulder, 2003) and treat NER as a sequence labeling task. We use the standard train, dev and test split. F1 score is used for evaluation.
We adopt the bidirectional RNN with CRF inference architecture (Lample et al., 2016). We implement different models based on the public codebase in Tensorflow.777https://github.com/Hironsan/anago We use the default hyperparameter settings. Word embedding is initialized by GloVe vectors.
Results As shown in Table 6888Notice that our implementation falls behind the original model (Lample et al., 2016) because we do not use specifically trained word embedding., the performance of LRN matches that of ATR and SRU, though LSTM and GRU operate better (+1.05 and +0.79). As in the SQuAD task, the goal of NER is to detect local entity patterns and figure out the entity boundaries. However, the performance gap between LSTM/GRU and LRN in NER is significantly larger than that in SQuAD. We ascribe this to the weak model architecture and the small scale NER dataset where entity patterns are not fully captured by LRN.
6.6 Language Modeling
Settings Language modeling aims to estimate the probability of a given sequence, which requires models to memorize long-term structure of language. We use two widely used datasets, Penn Treebank (PTB) (Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2016) for this task. Models are evaluated by perplexity.
We modify the mixture of softmax model (MoS) (Yang et al., 2018)999https://github.com/zihangdai/mos in PyTorch to include different recurrent units. We apply weight dropout to all recurrent-related parameters instead of only hidden-to-hidden connection. We follow the experimental settings of MoS, and manually tune the initial learning rate based on whether training diverges.
Results Table 5 shows the test perplexity of different models.101010Our re-implementation of LSTM model is worse than the original model (Yang et al., 2018) because the system is sensitive to hyperparameters, and we apply weight dropout to all LSTM parameters which makes the original best choices not optimal. In this task, LRN significantly outperforms GRU, ATR and SRU, and achieves near the same perplexity as LSTM. This shows that in spite of its simplicity, LRN can memorize long-term language structures and capture a certain degree of language variation. In summary, LRN generalizes well to different tasks and can be used as a drop-in replacement of existing recurrent units.
6.7 Ablation Study
Part of LRN can be replaced with some alternatives. In this section, we conduct ablation analysis to examine two possible designs:
The above design can be further simplified into an extreme case where the forget gate is only generated from the previous hidden state without the query vector:
We experiment with SNLI and PTB tasks. Results in Table 7 show that although the accuracy on SNLI is acceptable, gLRN and eLRN perform significantly worse on the PTB task. This suggests that these alternative structures suffer from weak generalization.
6.8 Structure Analysis
In this section, we provide a visualization to check how the gates work in LRN.
We experiment with a unidirectional LRN on the AmaPolar dataset, where the last hidden state is used for document classification. Figure 1 shows the decay curve of each token along the token position. The memory curve of each token decays over time. However, important clues that contribute significantly to the final decision, as the token “great” does, decrease slowly, as shown by the red curve. Different tokens show different decay rate, suggesting that input and forget gate are capable of learning to propagate relevant signals. All these demonstrate the effectiveness of our LRN model.
7 Conclusion and Future Work
This paper presents LRN, a lightweight recurrent network that factors matrix operations outside the recurrence and enables higher parallelization. Theoretical and empirical analysis shows that the input and forget gate in LRN can learn long-range dependencies and avoid gradient vanishing and explosion. LRN has a strong correlation with self-attention networks. Experiments on six different NLP tasks show that LRN achieves competitive performance against existing recurrent units. It is simple, effective and reaches better trade-off among parameter number, running speed, model performance and generalization.
In the future, we are interested in testing low-level optimizations of LRN, which are orthogonal to this work, such as dedicated cuDNN kernels.
We thank the reviewers for their insightful comments. Biao Zhang acknowledges the support of the Baidu Scholarship. This work has been performed using resources provided by the Cambridge Tier-2 system operated by the University of Cambridge Research Computing Service (http://www.hpc.cam.ac.uk) funded by EPSRC Tier-2 capital grant EP/P020259/1.
- Appleyard et al. (2016) Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. 2016. Optimizing performance of recurrent neural networks on gpus. arXiv preprint arXiv:1604.01946.
- Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Balduzzi and Ghifary (2016) David Balduzzi and Muhammad Ghifary. 2016. Strongly-typed recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 1292–1300. JMLR.org.
- Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
- Bojar et al. (2014) Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- Bradbury et al. (2016) James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576.
- Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86. Association for Computational Linguistics.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Diamos et al. (2016) Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. 2016. Persistent rnns: Stashing recurrent weights on-chip. In International Conference on Machine Learning, pages 2024–2033.
- Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
- Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031. Association for Computational Linguistics.
- Kim et al. (2019) Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2019. Semantic sentence matching with densely-connected recurrent and co-attentive information. AAAI19.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kuchaiev and Ginsburg (2017) Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for lstm networks. arXiv preprint arXiv:1703.10722.
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Computational Linguistics.
- Lee et al. (2017) Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent additive networks. arXiv preprint arXiv:1705.07393.
- Lei et al. (2017) T. Lei, W. Jin, R. Barzilay, and T. Jaakkola. 2017. Deriving neural architectures from sequence and graph kernels. In International Conference on Machine Learning (ICML).
- Lei et al. (2018) Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481. Association for Computational Linguistics.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
- Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.
- Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318.
- Peng et al. (2018) Hao Peng, Roy Schwartz, Sam Thomson, and Noah A. Smith. 2018. Rational recurrences. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1203–1214. Association for Computational Linguistics.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.
- Rocktäschel et al. (2016) Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In International Conference on Learning Representations (ICLR).
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
- Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Wang et al. (2017) Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198. Association for Computational Linguistics.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Åukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
- Yang et al. (2018) Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2018. Breaking the softmax bottleneck: A high-rank rnn language model. ICLR.
- Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- Zhang et al. (2018) Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1789–1798, Melbourne, Australia. Association for Computational Linguistics.
- Zhang et al. (2018) Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Neural machine translation with deep attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1.
- Zhang et al. (2017) Biao Zhang, Deyi Xiong, Jinsong Su, and Hong Duan. 2017. A context-aware recurrent encoder for neural machine translation. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 25(12):2424–2432.
- Zhang et al. (2018) Biao Zhang, Deyi Xiong, jinsong su, Qian Lin, and Huiji Zhang. 2018. Simplifying neural machine translation with addition-subtraction twin-gated recurrent networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4273–4283. Association for Computational Linguistics.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 649–657, Cambridge, MA, USA. MIT Press.