Hierarchical Pointer Memory Network for Task Oriented Dialogue
We observe that end-to-end memory networks (MN) trained for task-oriented dialogue, such as for recommending restaurants to a user, suffer from an out-of-vocabulary (OOV) problem – the entities returned by the Knowledge Base (KB) may not be seen by the network at training time, making it impossible for it to use them in dialogue. We propose a Hierarchical Pointer Memory Network (HyP-MN), in which the next word may be generated from the decode vocabulary or copied from a hierarchical memory maintaining KB results and previous utterances. Evaluating over the dialog bAbI tasks, we find that HyP-MN drastically outperforms MN obtaining 12% overall accuracy gains. Further analysis reveals that MN fails completely in recommending any relevant restaurant, whereas HyP-MN recommends the best next restaurant 80% of the time.
Dinesh Raghu IBM Research, IIT Delhi email@example.com Nikhil Gupta IIT Delhi firstname.lastname@example.org Mausam IIT Delhi email@example.com
Dialogue systems that converse with a goal of accomplishing a specific task are referred to as task oriented dialogue systems. Some examples include restaurant reservation (Henderson et al., 2014), movie ticket booking (Li et al., 2017) and bus information system (Raux et al., 2005). Task oriented dialogues are often grounded to a knowledge-base (KB). For example, the restaurant reservation task is grounded to a KB that contains names of restaurants, along with their details.
For effective dialogue, the system should be able to gather necessary information from the user, query a KB using an API call, and utilize the retrieved results from KB to respond to the user. Recently several deep learning based solutions have been proposed for end-to-end learning of such dialogues, e.g., end to end memory net (MN) (Bordes and Weston, 2017), query-reduction net (QRN) (Seo et al., 2017), and copy augmented Seq2Seq (CopyS) (Eric and Manning, 2017).
We observe that MN and QRN handle long utterance sequences well due to their multi-hop architecture, but suffer from an out-of-vocabulary (OOV) problem. In the extreme case, when none of the relevant restaurants returned by the KB are seen at training time, MN ends up recommending completely irrelevant restaurants to the user! CopyS is a sequence to sequence model, with the additional possibility of copying a word from the input. CopyS has the potential to fix the OOV problem, but its use of a flat location addressing scheme makes it difficult for it to handle long utterance sequences, which are common in dialog.
Contributions: We propose a novel Hierarchical Pointer Memory Network (HyP-MN) architecture that builds upon the strengths of both MN and CopyS. It maintains all its context in a hierarchical memory, which has two levels of addressing, one at utterance level, and the other at word level.
HyP-MN uses an encoder-decoder architecture so that it can stitch the next utterance one word at a time. Its encoder has a multi-hop architecture over hierarchical memory allowing it to model long utterance sequences. Its decoder has the ability to generate the next word using vocabulary seen at training time, or copy it from the memory. Because the memory is hierarchical, it uses a novel hierarchical attention that multiplies utterance-level attention with word-level attention to compute the importance of each word.
Evaluation on dialog bAbI tasks (Bordes and Weston, 2017) shows that, on average, HyP-MN outperforms MN by 12 and CopyS by 9 accuracy points. The maximum benefit is observed in Task 3, which evaluates the system’s ability to suggest the next best restaurant option to the user. In this task, MN never picks a relevant restaurant due to the OOV problem. CopyS does better than MN, but because of long sequences and lack of multi-hop reasoning, it picks the right restaurants only about half the time. HyP-MN, because of its combination of multi-hop encoder and hierarchical pointer based decoder, makes the correct suggestions around 80% of the time.
2 Related Work
HyP-MN architecture has a multi-hop encoder, and a pointer network decoder with a hierarchical attention over the memory. We briefly survey these three strands of related research.
Multi-hop Networks reason over a sequence of sentences fed as input. A hop refers to reading the sentences and generating an encoded vector. Multi-hop refers to making multiple updates to the encoded vector by iteratively reading the input. End-to-end memory network (MN) (Sukhbaatar et al., 2015) represents the input as a set of sentences. Here the encoded vector is updated by adding iterative reads. Query reduction network (Seo et al., 2017) reads the sentences sequentially using an RNN-like unit called the QRN unit. Dynamic memory network (Kumar et al., 2016) also reads the sentences sequentially, and also updates the encoded vector using a recurrent cell. Gated memory network (Liu and Perez, 2017) uses a gating mechanism to update the encoded vector.
MN (Bordes and Weston, 2017), gated MN and QRN have been used to learn task-oriented dialogues. HyP-MN has two key differences from such architectures. First, existing models select a response from a predefined list of candidates (retrieval model), whereas HyP-MN has a decoder that generates the response one word at a time. Second, the memory in HyP-MN is hierarchical, i.e., each memory element is a sequence of word vectors rather than just a single utterance vector. This enables the generator to copy any word from the memory during generation.
Pointer Networks are sequence to sequence (Seq2Seq) models, where each token in the output sequence corresponds to a token at a certain position in the input sequence (Vinyals et al., 2015). By enabling pointing in Seq2Seq models (Cho et al., 2014; Sutskever et al., 2014), the effective decode vocabulary becomes the union of the fixed decode vocabulary and the vocabulary of the input sequence. Two main methods (Gu et al., 2016; Eric and Manning, 2017) exist for incorporating pointing in standard Seq2Seq models – hard decision (Nallapati et al., 2016; Gu et al., 2016; Eric and Manning, 2017) and soft switch (See et al., 2017). The former makes a hard choice between using the pointer distribution and the decode vocabulary distribution. It usually requires the hard decision to be labeled. The latter approach learns a soft interpolation between the two distributions without explicit labels. HyP-MN employs a soft switch in its decoder.
Eric and Manning (2017) use a copy-augmented Seq2Seq model for learning task-oriented dialogues. This approach uses a hard decision to pick between the generate and pointer distributions. The model is explicitly trained to point only to words that are from the KB and to generate the rest. This is the closest work to our approach, but it has a flat memory and does not incorporate multi-hop reasoning.
Hierarchical Attention was first introduced for document classification (Yang et al., 2016). Here, each document is represented as a set of sentences and each sentence as a set of words. For each sentence, an attention distribution is computed over words to identify informative words and compute a sentence representation. A similar approach identifies informative sentences to compute a document representation. Hierarchical attention has also been used for abstractive text summarization (Nallapati et al., 2016). HyP-MN similarly computes two attention distributions over different levels of the input: a word-level distribution over the words in each utterance and an utterance-level distribution over all the input utterances. A function of these two distributions is used when copying a word in the decode process.
In this section, we briefly describe the preliminaries upon which the proposed Hierarchical Pointer Memory Network (HyP-MN) is built. These are (1) the multi-hop encoder in MN and (2) a standard sequence decoder with attention (Bahdanau et al., 2015).
3.1 Multi-Hop Encoder in MN
The multi-hop encoder described in the end-to-end memory network (Sukhbaatar et al., 2015) takes as input a query and a memory and generates a reduced query, where the query, the memory elements and the reduced query are all vectors of the embedding dimension. Augmenting the query by attending it over the memory elements, to capture relevant information necessary to generate the response, is referred to as a hop. In a single hop, the query is scored against each memory element, the scores are normalized into an attention distribution, and the attention-weighted read of the memory is combined with the query to give the reduced query.
The hop step can be iterated by assigning the output of the previous hop as the new input query. The final output of the encoder is the reduced query produced by its last hop. The multiple hops enable inference over multiple memory elements.
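The hop iteration can be illustrated with a minimal pure-Python sketch. It follows the standard end-to-end memory network update (dot-product scores, softmax, weighted read added to the query). The function names `hop` and `multi_hop` are hypothetical, and any learned projection inside the hop is omitted, which is an assumption of this sketch rather than a statement of the exact model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hop(query, memory):
    """One hop: attend over the memory elements and add the read to the query."""
    attention = softmax([dot(query, m) for m in memory])
    read = [sum(p * m[k] for p, m in zip(attention, memory))
            for k in range(len(query))]
    reduced = [q + r for q, r in zip(query, read)]
    return reduced, attention

def multi_hop(query, memory, num_hops):
    """Feed the reduced query of each hop back in as the next input query."""
    attention = []
    for _ in range(num_hops):
        query, attention = hop(query, memory)
    return query, attention

# Toy memory of three 2-dimensional utterance vectors (made-up values).
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reduced, attn = multi_hop([1.0, 0.0], memory, num_hops=3)
```

With zero hops the query is returned unchanged; each additional hop lets the query absorb information from further memory elements.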
3.2 Sequence Decoder with Attention
At each time step, the sequence decoder predicts the next token in the output sequence, given the decoder state at that time step and a set of input contexts. For simplicity, we refer to this conditional distribution over the next word as the generate distribution. To compute it, an attention distribution is first computed over the input contexts using Luong attention (Luong et al., 2015).
Then, a context vector is generated as a weighted sum of the input contexts under this attention distribution. The context vector, concatenated with the decoder state, is then used to compute the generate distribution over the decode vocabulary at that time step.
The parameters of the attention scoring function and of the output layer are learnt during training.
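A single decode step of this kind can be sketched as follows. This is a minimal illustration, assuming the "general" (bilinear) form of Luong attention and a single linear output layer followed by a softmax. The names `decoder_step`, `W_att`, `W_out` and `b_out` are hypothetical, and the toy weights are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def decoder_step(state, contexts, W_att, W_out, b_out):
    """Attend over contexts, build a context vector, and score the vocabulary."""
    # Bilinear attention score: state^T W_att context_i (assumed form).
    scores = [dot(state, matvec(W_att, c)) for c in contexts]
    attention = softmax(scores)
    # Context vector: attention-weighted sum of the input contexts.
    context = [sum(a * c[k] for a, c in zip(attention, contexts))
               for k in range(len(contexts[0]))]
    # Concatenate decoder state and context, then project to the vocabulary.
    features = state + context
    logits = [l + b for l, b in zip(matvec(W_out, features), b_out)]
    return softmax(logits), attention

state = [1.0, 0.0]
contexts = [[1.0, 0.0], [0.0, 1.0]]
W_att = [[1.0, 0.0], [0.0, 1.0]]
W_out = [[0.1, 0.0, 0.2, 0.0],
         [0.0, 0.3, 0.0, 0.1],
         [0.2, 0.1, 0.0, 0.0]]
b_out = [0.0, 0.0, 0.0]
generate_dist, attention = decoder_step(state, contexts, W_att, W_out, b_out)
```

The returned attention is the distribution over input contexts that HyP-MN later reuses at the utterance level.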
During training, the objective is to minimize the average negative log-likelihood for all the words in the response. The total loss is computed by adding the loss of all the responses in the training data.
4 The HyP-MN Architecture
We now describe the architecture of our proposed Hierarchical Pointer Memory Network. At each time step, the model takes the sequence of previous user utterances and system responses and generates the next system response word by word.
Each utterance is a sequence of words. Utterance representations are generated using single-layer bidirectional GRUs. A context-dependent vector representation of each word, and a vector representation of each utterance, are computed from the outputs of the forward and backward GRUs.
HyP-MN has an encoder-decoder architecture as shown in Figure 1. The encoder is a multi-hop encoder as described in Section 3.1 and the decoder is built upon the standard RNN decoder described in Section 3.2.
4.1 The HyP-MN Encoder
The encoder takes in the last user utterance as the query. The remaining past utterances are placed in the memory.
The memory has two views: word level and utterance level. The multi-hop encoder uses only the utterance-level representation of the memory to compute the reduced query. The output of the encoder after its final hop is assigned as the initial state of the HyP-MN decoder.
4.2 The HyP-MN Decoder
HyP-MN models a hierarchical pointer decoder, which generates the response one word at a time. At any time step, the decoder can either generate a word from the decode vocabulary or copy a word from the memory. To perform each decode step, the decoder has to compute: (1) generate distribution over the decode vocabulary, (2) copy distribution over the words in the memory, and (3) a soft switch to interpolate the two distributions to compute the final decode distribution.
The generate distribution over the decode vocabulary is computed as described in Section 3.2, with the input contexts set to the utterance-level representations in the memory. A simplistic copy distribution is obtained by performing Luong attention over the words in the memory, using the decoder state at the current time step.
This attention has a learnable parameter. The copy distribution at each time step is then computed by taking a softmax over all the words in the memory. This flat attention (Eq. 10) is an elementary way of computing the copy distribution over the words in the memory. It views the memory as a bag of words and expects the contextual semantics of each word to be implicitly captured by the bidirectional GRUs used to generate the vector representations of words.
We propose a more explicit way of capturing the contextual semantics, by re-using the attention distribution computed in Eq. 5. The enhanced copy distribution is generated hierarchically: a softmax is computed over the words present in each utterance, and the result is multiplied by the corresponding utterance-level attention.
This copy distribution, termed hierarchical attention, is illustrated in Figure 1. This approach forces the copy distribution to place the majority of its mass only on memory elements that are important for selecting the next word at the current time step.
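The difference between the flat and hierarchical copy distributions can be seen in a small sketch. The word scores and utterance-level attention below are made-up toy values, and the function names `flat_copy` and `hierarchical_copy` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def flat_copy(word_scores):
    """Softmax over all words in the memory, treated as one bag of words."""
    flat = softmax([s for utt in word_scores for s in utt])
    out, k = [], 0
    for utt in word_scores:
        out.append(flat[k:k + len(utt)])
        k += len(utt)
    return out

def hierarchical_copy(word_scores, utterance_attention):
    """Per-utterance word softmax, scaled by the utterance-level attention."""
    return [[u_att * w for w in softmax(scores)]
            for scores, u_att in zip(word_scores, utterance_attention)]

# Two utterances: the first has one high-scoring word but low utterance attention.
word_scores = [[2.0, 0.0], [0.0, 0.0, 0.0]]
utterance_attention = [0.1, 0.9]

flat = flat_copy(word_scores)
hier = hierarchical_copy(word_scores, utterance_attention)
total_mass = sum(p for utt in hier for p in utt)
```

In this toy case the flat distribution puts most of its mass on the high-scoring word of the first utterance, whereas the hierarchical distribution suppresses it because its utterance receives little attention. The hierarchical distribution still sums to one whenever the utterance-level attention does.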
Inspired by See et al. (2017), we combine the generate and copy distributions through a generate probability that interpolates between them. The generate probability is computed from the decoder state and the input to the sequence decoder at the current time step, by applying the sigmoid function to a function of the two with learnable parameters.
The final decode probability of a word is its probability under the generate distribution weighted by the generate probability, plus its probability under the copy distribution weighted by one minus the generate probability.
The final distribution is over the union of words in the decode vocabulary and the words present in the input utterances. The generate probability explicitly helps the network to learn when to copy and when to generate based on the patterns in the data.
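The interpolation can be sketched in a few lines, in the style of the pointer-generator soft switch of See et al. (2017). The function name `final_distribution`, the toy vocabulary and all probabilities are illustrative assumptions. A word that occurs several times in the memory accumulates the copy mass of all its positions.

```python
def final_distribution(p_gen, generate_dist, memory_words, copy_dist):
    """Mix generate and copy distributions over the union vocabulary.

    generate_dist: dict mapping decode-vocabulary words to probabilities.
    memory_words:  words in the memory, flattened into one list.
    copy_dist:     copy probabilities aligned with memory_words.
    """
    final = {w: p_gen * p for w, p in generate_dist.items()}
    for word, p in zip(memory_words, copy_dist):
        final[word] = final.get(word, 0.0) + (1.0 - p_gen) * p
    return final

generate_dist = {"what": 0.5, "about": 0.3, "<unk>": 0.2}
memory_words = ["resto_tokyo_moderate_korean_8stars", "about",
                "resto_tokyo_moderate_korean_8stars"]
copy_dist = [0.6, 0.1, 0.3]

# A low generate probability means the decoder prefers copying.
final = final_distribution(0.2, generate_dist, memory_words, copy_dist)
best_word = max(final, key=final.get)
```

Here the restaurant name is absent from the decode vocabulary, yet it becomes the most probable next word because the copy distribution supplies its mass.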
4.3 Embedding Dropout
As described in Section 4, vector representations of words are generated using bidirectional GRUs. Usually, the input to the (forward and backward) GRUs at a given position would be the embedding of the word at that position. To handle the unusually large number of OOV words in the memory, we enrich the input by concatenating the word embedding with a character embedding (each character in the word is fed to a character RNN one at a time). Since OOV words are encountered only during testing, we simulate such conditions during training by randomly dropping all the word embeddings, thereby forcing the model to learn to generate the response based on just the character embeddings.
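A minimal sketch of this input construction is given below. It assumes that dropping a word embedding means replacing it with zeros and that the drop decision is made independently per word. The function name `embed_with_dropout` and the `drop_prob` parameter are hypothetical, as the text does not specify the granularity or rate of the dropout.

```python
import random

def embed_with_dropout(word_vec, char_vec, drop_prob, rng, training=True):
    """Concatenate word and character embeddings for one token.

    During training, the word embedding is zeroed with probability drop_prob,
    so the model must sometimes rely on the character embedding alone.
    """
    if training and rng.random() < drop_prob:
        word_vec = [0.0] * len(word_vec)
    return word_vec + char_vec

rng = random.Random(0)
word_vec, char_vec = [1.0, 2.0], [0.5]

always_dropped = embed_with_dropout(word_vec, char_vec, 1.0, rng)
never_dropped = embed_with_dropout(word_vec, char_vec, 0.0, rng)
at_test_time = embed_with_dropout(word_vec, char_vec, 1.0, rng, training=False)
```

At test time no dropout is applied, so seen words keep both embeddings while OOV words are represented mainly through their characters.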
Our experiments evaluate three research questions. (1) How does the performance of HyP-MN compare with the state-of-the-art task-oriented dialogue systems (Section 5.1)? (2) How effective is HyP-MN specifically for the utterances requiring reasoning over several OOV entities (Section 5.2)? And (3) What is the incremental contribution of each of HyP-MN’s modifications over the vanilla memory network (Section 5.3)?
Dataset: We perform our experiments using the bAbI dialog dataset (Bordes and Weston, 2017). The dataset consists of synthetically generated dialogues for the task of restaurant reservation. The dataset is divided into five tasks: (T1) gather user choices and generate API calls, (T2) allow the user to modify her choice and update API calls, (T3) display the options available, (T4) provide more information about the option selected and (T5) all four tasks combined. The dialogues are grounded to a KB. The KB is divided into two halves. One half is used to generate train, validation and test set, while the other half is used to generate an OOV test set. We note that while bAbI dataset does not model all complexities of real world dialogues, it provides an excellent setup to assess several important capabilities needed in a generic neural dialogue system.
Baselines: We compare HyP-MN with the state-of-the-art dialogue systems, which have reported results on the bAbI dataset, including MN (Bordes and Weston, 2017), Gated MN (Liu and Perez, 2017) and QRN (Seo et al., 2017). Previous works have also reported improved results by including a domain-specific feature called “+match type”. For a fair comparison, we keep this feature off in all models.
There exists no generative neural model that reports results on bAbI, so we compare against copy augmented Seq2Seq model (CopyS). Our CopyS implements a soft switch to tradeoff between generation and copying, instead of a hard decision.
Evaluation Metric: We evaluate HyP-MN and the baselines by their ability to generate system responses in both the test set and the OOV test set. We use the per-response accuracy metric proposed by Bordes and Weston (2017), defined as the percentage of responses that are correct. For retrieval-based approaches like MN, GMN and QRN, the predicted response is correct if it is exactly the same as the actual response. (The error rates reported in Seo et al. (2017) assign partial credit based on word overlap between the predicted and actual response; for a fair comparison, we report strict-match accuracies.) For approaches that generate responses word by word (HyP-MN and CopyS), we create two measures. The strict measure declares a predicted response correct only if it is exactly the same as the actual response. We also report an almost-match score, a slightly more lenient measure that allows up to two spurious words, as long as the original response is a substring of the predicted response.
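The two measures can be sketched as follows. This is one possible reading, assuming whitespace tokenization and interpreting "substring" as a contiguous run of tokens. The function names and the example responses are hypothetical.

```python
def strict_match(predicted, actual):
    """Correct only if the predicted tokens equal the actual tokens."""
    return predicted.split() == actual.split()

def almost_match(predicted, actual, max_spurious=2):
    """Correct if the actual response occurs contiguously in the prediction
    and the prediction has at most max_spurious extra tokens."""
    pred, gold = predicted.split(), actual.split()
    extra = len(pred) - len(gold)
    if extra < 0 or extra > max_spurious:
        return False
    return any(pred[i:i + len(gold)] == gold for i in range(extra + 1))

def per_response_accuracy(pairs, match):
    """Percentage of (predicted, actual) pairs judged correct by `match`."""
    return 100.0 * sum(match(p, a) for p, a in pairs) / len(pairs)

pairs = [
    ("ok let me look into some options for you",
     "ok let me look into some options for you"),
    ("here it is resto_a_phone phone", "here it is resto_a_phone"),
    ("hello", "hello what can i help you with today"),
]
strict_acc = per_response_accuracy(pairs, strict_match)
almost_acc = per_response_accuracy(pairs, almost_match)
```

The second pair shows the intended leniency: one spurious trailing word fails the strict measure but passes almost-match.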
5.1 Comparison against Baseline Systems
Table 1 reports the performance of all models. For CopyS, we report only almost-match score. For HyP-MN, we report both, with strict measure in parentheses. In aggregate, HyP-MN (almost-match) outperforms the baselines by about 12% on test set and by 6.6-12% on the OOV test set. Even with strict measure, HyP-MN achieves a significant aggregate win, losing only a little in Task 5.
We note that in tasks 1 and 2, all approaches have relatively similar performance. However, significant variations in tasks 3-5 highlight individual strengths and weaknesses of each model. For example, CopyS and HyP-MN substantially outperform MNs and QRN in OOV test set for tasks 3 and 4. This emphasizes the importance of copying words from context in the response.
Task 4 itself is much easier than Task 3, since the former requires looking up only a single KB entity, whereas the latter needs to perform inference over multiple KB entities based on the conversation so far (see Section 5.2 for a detailed discussion of Task 3). The 8% improvement of HyP-MN over CopyS in Task 3 (both test and OOV test set) suggests that HyP-MN is much better at performing inferences over multiple entities over long utterance sequences.
We also observe that CopyS has a rather weak performance in Task 5. Since this task is generated by combining all the sub-tasks 1-4, the dialogues are much longer. This exposes CopyS’s inability to model long dialogues, rendering it unsuitable for real-world systems. On the test set, HyP-MN has comparable performance on Task 5, but it doesn’t achieve significant gain over MNs and QRN for OOV test set. We discuss this in error analysis below.
Issues in bAbI Dataset: While we expected significant improvement of HyP-MN over MNs in tasks 3 and 4 on OOV test set, we were surprised to find similar improvements on the non-OOV test sets. Careful analysis revealed an error in bAbI dataset construction. Unfortunately, we find that KB entities present in validation and non-OOV test sets do not overlap with those in the train set. This effectively means that non-OOV and OOV test conditions for tasks 3 and 4 are the same. This explains HyP-MN’s improvements in both the settings.
Error Analysis for HyP-MN: We observe that HyP-MN (and also CopyS) could not improve results on tasks 1 and 2 on OOV test set. There are two reasons for this, one pertaining to the specific construction of OOV data sets and the other to the model. First, the validation sets of tasks 3 and 4 have OOV entities – this encourages the model to learn that in specific cases, copying is better than generating. Unfortunately, this is not so in validation sets of tasks 1 and 2 – they only contain entities from the train set. Thus, the model gets no feedback that it needs to learn to copy. To remove this bias, we generate new validation sets with only OOV KB entities.
Retraining HyP-MN with these new validation sets helps the accuracy on Task 1 jump to 83%. However, it is still far behind the performance on the non-OOV test set, for a second, subtle reason. The relevant entities in tasks 1 and 2 are cuisines (whereas they are restaurant names and their attributes in tasks 3 and 4). Each KB entity in tasks 3 and 4 is repeated in only about 10 out of 1,000 dialogs, whereas KB entities in tasks 1 and 2 are repeated 100-200 times out of 1,000 dialogs (there are only a handful of cuisine types). Given this high frequency, HyP-MN learns to confidently recognize cuisines just from word and character embeddings, and is not encouraged to carefully model the context in which a new cuisine may be mentioned. We believe that curriculum training, which drops out both character and word embeddings, might force it to model context better. We plan to investigate this extension in the future.
In Task 5, the strict measure does significantly worse than almost-match score. We observe that after generating the correct response, sometimes, the model adds a spurious word or two. A language model guided generation or training on a larger dataset will likely be able to fix this issue.
Finally, we also observe that in Task 5 (OOV) HyP-MN accuracy is not that much better than MNs. Task 5 has extremely long utterance sequences with complexities of each previous task. While we believe HyP-MN has the ability to learn a much better representation for this consolidated task, it is possible that the variable starting points of KB results is confusing the model, and it is not able to learn which memory cells are worth pointing to. We expect that a much larger train set could show significantly improved performance.
5.2 A Closer Look at Task 3
The goal of Task 3 is to suggest a restaurant based on the KB results present in the memory until the user accepts one. The restaurants should ideally be suggested in decreasing order of their star ratings, and previously rejected restaurants (by the user) should not be suggested again. A high performance on Task 3 is particularly important because (1) it is the only task that performs inference over the KB results, a necessary quality for any KB-grounded dialogue system, and (2) suggesting the restaurant is a crucial exchange in the entire task of restaurant recommendations. This would be critical for the user to be satisfied with the system.
In order to better assess whether the goal of the dialogue is being achieved, we define two task-specific metrics for Task 3. Our first metric, restaurant recommendation accuracy, measures the percentage of times the best next restaurant is suggested by a system. Our second metric, restaurant relevance accuracy, is defined as the percentage of times a suggested restaurant is picked from the KB results in the memory. Table 2 shows the two metrics for MN, CopyS and HyP-MN. Additionally, we also compute these metrics for Pointer-Memory Network (P-MN), a variant of HyP-MN with non-hierarchical pointers, in which each memory element contains a single word in a flat sequence.
The results show that all approaches that can copy successfully learn to pick restaurants from KB results, thus achieving a perfect score on the relevance metric. MN and QRN, on the other hand, have the worst possible score on this metric: they always recommend a restaurant that is completely irrelevant to the context. This underscores the importance of pointer networks.
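The two Task 3 metrics can be sketched as below, assuming that the best next restaurant is the highest-rated KB result not yet suggested in the dialogue. The data layout (dictionaries with `suggested`, `kb_ratings` and `already_suggested` keys), the function names and the toy restaurant names are all hypothetical.

```python
def best_next(kb_ratings, already_suggested):
    """Highest-rated restaurant in the KB results that was not suggested yet."""
    candidates = {name: stars for name, stars in kb_ratings.items()
                  if name not in already_suggested}
    return max(candidates, key=candidates.get) if candidates else None

def task3_metrics(examples):
    """Return (recommendation accuracy, relevance accuracy) in percent."""
    n = len(examples)
    relevant = sum(ex["suggested"] in ex["kb_ratings"] for ex in examples)
    best = sum(
        ex["suggested"] == best_next(ex["kb_ratings"], ex["already_suggested"])
        for ex in examples
    )
    return 100.0 * best / n, 100.0 * relevant / n

kb = {"rest_8_str": 8, "rest_3_str": 3, "rest_2_str": 2}
examples = [
    # Best option suggested first.
    {"suggested": "rest_8_str", "kb_ratings": kb, "already_suggested": set()},
    # Relevant, but rest_3_str should have come before rest_2_str.
    {"suggested": "rest_2_str", "kb_ratings": kb,
     "already_suggested": {"rest_8_str"}},
    # Not among the KB results at all.
    {"suggested": "rest_other", "kb_ratings": kb, "already_suggested": set()},
]
recommendation_acc, relevance_acc = task3_metrics(examples)
```

A system can score perfectly on relevance while still scoring poorly on recommendation, which is why the two metrics are reported separately.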
In order to predict the best next restaurant, the system must (1) sort the restaurants based on the rating and (2) keep track of all the restaurants suggested so far, and avoid repeating them. We find that CopyS makes several errors, as it is unable to sort as well as refrain from repeating an already suggested restaurant. P-MN, on the other hand, is unable to sort, but is able to learn that restaurants should not be repeated. HyP-MN achieves the best recommendation accuracy by learning to sort much better than other networks.
5.3 Ablation Study
We assess the value of each model element, by incrementally adding functionality to a basic encoder-decoder model. Our basic encoder-decoder model has a multi-hop memory network encoder and a standard sequence decoder with attention over memory. We refer to this model as generative. We then supplement input word embeddings with the character embedding of the word. Our next model adds the ability to copy via a flat pointer network, P-MN. Finally, we make our memory hierarchical and implement hierarchical attention to obtain HyP-MN.
Table 3 reports the almost-match accuracy scores for this experiment, averaged across all tasks. We find that character embeddings help OOV accuracy slightly more than the non-OOV one. We note that vanilla generative models, with or without character embeddings, are not able to surpass MN accuracies. We see a dramatic improvement of 19% after adding the ability to copy. This provides the necessary boost for the model to go past the basic MN. Adding hierarchical pointing provides better inferencing capability and improves accuracies further.
|Model||Test||OOV Test|
|+ Char Embeddings||75.2||62.6|
|+ Hierarchical Pointers||98.3||82.3|
Figure 2 visualizes the attention weights on words in the memory, at a single decode step. We compare two models from our ablation study: HyP-MN and P-MN. In this example, we are at the step of the decoder aimed at predicting the next restaurant, given that the restaurant with rating 8 stars has already been suggested. We show attention only on the KB entries and shorten the restaurant names, phone numbers, and addresses for brevity. For example, resto_tokyo_moderate_korean_8stars is abbreviated to rest_8_str.
Both models share several similarities in their distribution of attention weights. First, we notice that attention weights are localized over the restaurant names, indicating the preference of the system to point to a specific restaurant. This is supported by the soft-switch values of HyP-MN and P-MN at this step, i.e., both models prefer copying the next word to generating it. Moreover, entries with the same restaurant name have similar attention weights, reflecting the robustness of the distribution.
We also observe that HyP-MN is able to learn the difficult task of sorting the restaurant entries in decreasing order of rating (number of stars). It gives more weight to entries with a high rating (3 stars > 2 stars > 1 star) and suppresses the weights of any previously suggested restaurant, for example, 8 stars. We also see a high correlation between utterance-level attention (ULA) and hierarchical attention. This leads us to believe that ULA plays an important role in sorting the restaurants based on ratings.
In case of P-MN, we deduce that sorting restaurants based on rating is not an intrinsic property of the model – the attention distribution is haphazardly spread, without a clear favorite, across all the restaurant names not yet suggested by the system. This explains the higher accuracy of HyP-MN over P-MN in the task of restaurant prediction shown in Table 2.
We propose a Hierarchical Pointer Memory Network (HyP-MN) architecture for learning task-oriented dialogues that are grounded in a KB. HyP-MN combines a multi-hop encoder with a sequence decoder that has the ability to copy words from the input. Moreover, it stores the input in a hierarchical memory, and applies a hierarchical attention at decode time. Task-oriented dialogues need the ability to refer to OOV entities returned by the KB (which may not have been seen at training time) and to construct reasoning chains over long utterance sequences. HyP-MN accomplishes both owing to its pointer network and multi-hop encoder, respectively.
Several experiments over the bAbI dialog datasets demonstrate HyP-MN’s significantly improved performance over various state-of-the-art baselines. It achieves 6-12% accuracy improvements, in aggregate, and has a particularly good performance on Task 3, one of the hardest tasks in bAbI. Further analysis reveals that it recommends the best next restaurant option to the user much more effectively than the baselines. In the future, we will build extensions towards improving performance on tasks 1, 2 and 5, with OOV entities.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
- Bordes and Weston (2017) Antoine Bordes and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.
- Eric and Manning (2017) Mihail Eric and Christopher Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 468–473. Association for Computational Linguistics.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640. Association for Computational Linguistics.
- Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Steve Young. 2014. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 292–299.
- Kumar et al. (2016) Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387.
- Li et al. (2017) Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 733–743. Asian Federation of Natural Language Processing.
- Liu and Perez (2017) Fei Liu and Julien Perez. 2017. Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1–10.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. Association for Computational Linguistics.
- Raux et al. (2005) Antoine Raux, Brian Langner, Dan Bohus, Alan W. Black, and Maxine Eskénazi. 2005. Let’s go public! taking a spoken dialog system to the real world. In INTERSPEECH.
- See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.
- Seo et al. (2017) Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Query-reduction networks for question answering. In International Conference on Learning Representations.
- Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
- Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.