Multi-style Generative Reading Comprehension
This study focuses on the task of multi-passage reading comprehension (RC) where an answer is provided in natural language. Current mainstream approaches treat RC by extracting the answer span from the provided passages and cannot generate an abstractive summary from the given question and passages. Moreover, they cannot utilize and control different styles of answers, such as concise phrases and well-formed sentences, within a model. In this study, we propose a style-controllable Multi-source Abstractive Summarization model for QUEstion answering, called Masque. The model is an end-to-end deep neural network that can generate answers conditioned on a given style. Experiments with MS MARCO 2.1 show that our model achieved state-of-the-art performance on two tasks with different answer styles.
Kyosuke Nishida, Itsumi Saito, Kosuke Nishida, Kazutoshi Shinoda (work done during an internship at NTT), Atsushi Otsuka, Hisako Asano, Junji Tomita
NTT Media Intelligence Laboratory, NTT Corporation; The University of Tokyo
email@example.com
1 Introduction
Question answering has been a long-standing research problem. Recently, reading comprehension (RC), a challenge to answer a question given textual evidence provided in a document set, has received much attention. Here, current mainstream studies have treated RC as a process of extracting an answer span from one passage (Rajpurkar et al., 2016, 2018) or multiple passages (Joshi et al., 2017), which is usually done by predicting the start and end positions of the answer (Yu et al., 2018; Devlin et al., 2018).
The demand for answering questions in natural language is increasing rapidly, and this has led to the development of smart devices such as Siri and Alexa. However, in comparison with answer span extraction, the natural language generation (NLG) ability for RC has been less studied. While datasets such as MS MARCO (Bajaj et al., 2018) have been proposed for providing abstractive answers in natural language, the state-of-the-art methods (Wu et al., 2018; Yan et al., 2018) are based on answer span extraction, even for these datasets. Generative models such as S-Net (Tan et al., 2018) suffer from a dearth of training data to cover open-domain questions.
Moreover, to satisfy various information needs, intelligent agents should be capable of answering one question in multiple styles, such as concise phrases that do not contain the context of the question and well-formed sentences that make sense even without the context of the question. These capabilities complement each other; however, the methods used in previous studies cannot utilize and control different answer styles within a model.
In this study, we propose a generative model, called Masque, for multi-passage RC. On the MS MARCO 2.1 dataset, Masque achieves state-of-the-art performance on the dataset’s two tasks, Q&A and NLG, with different answer styles. The main contributions of this study are that our model enables the following two abilities.
Multi-source abstractive summarization based RC.
The first idea is to use a pointer-generator mechanism for multi-passage RC, which was originally proposed for text summarization (See et al., 2017). Hasselqvist et al. (2017) and McCann et al. (2018) had introduced its RNN-based mechanism to query-based abstractive summarization and question answering, respectively; however, their models cannot handle multiple passages effectively. We extend the mechanism to a Transformer (Vaswani et al., 2017) based one that allows words to be generated from a fixed vocabulary and words to be copied from both the question and multiple passages.
Style-controllable RC.
The second novel idea is to control multiple answer styles with a single model, taking advantage of multi-style answers to improve RC for all styles involved. We also extend the pointer-generator mechanism to a conditional decoder simply by introducing an artificial token corresponding to the target style, as in (Johnson et al., 2017; Takeno et al., 2017). The decoder controls the mixture weights over three probability distributions with the given style at each decoding step, as shown in Figure 1.
2 Problem Formulation
The task considered in this paper is defined as:
Given a question with $J$ words $x^q = \{x^q_1, \ldots, x^q_J\}$, a set of $K$ passages, where each $k$-th passage is composed of $L$ words $x^{p_k} = \{x^{p_k}_1, \ldots, x^{p_k}_L\}$, and an answer style $s$, an RC system outputs an answer $y = \{y_1, \ldots, y_T\}$ conditioned on the style.
In short, for inference, given a set of 3-tuples $(x^q, \{x^{p_k}\}, s)$, the system predicts $P(y)$. The training data is a set of 6-tuples: $(x^q, \{x^{p_k}\}, s, y, a, \{r^{p_k}\})$, where $a$ is 1 if the question is answerable with the provided passages and 0 otherwise, and $r^{p_k}$ is 1 if the $k$-th passage is required to formulate the answer and 0 otherwise.
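As a concrete illustration of these tuples, a training example could be held in a small record like the one below. This is only a sketch: the class name, the field names, and the passage texts are assumptions made up for illustration, not names or data from the paper or from the MS MARCO release. Only the question "tablespoon in cup" and its NLG-style answer come from the paper's own example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingExample:
    """One training 6-tuple (hypothetical field names)."""
    question: List[str]          # question words
    passages: List[List[str]]    # one word list per passage
    style: str                   # answer style label, e.g. "NLG" or "Q&A"
    answer: List[str]            # target answer words
    answerable: int              # 1 if answerable from the passages, else 0
    relevance: List[int]         # per passage: 1 if needed for the answer, else 0

    def inference_input(self) -> Tuple[List[str], List[List[str]], str]:
        """The 3-tuple available at inference time: question, passages, style."""
        return (self.question, self.passages, self.style)


example = TrainingExample(
    question=["tablespoon", "in", "cup"],
    passages=[
        ["there", "are", "16", "tablespoons", "in", "a", "cup"],  # invented passage
        ["a", "cup", "is", "a", "unit", "of", "volume"],          # invented passage
    ],
    style="NLG",
    answer=["there", "are", "16", "tablespoons", "in", "a", "cup", "."],
    answerable=1,
    relevance=[1, 0],
)
```

The answer, answerability flag, and relevance labels are only present in training data; `inference_input()` returns what the system actually sees when it has to produce an answer.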
3 Proposed Model
Our proposed model, Masque, is based on multi-source abstractive summarization; the answer our model generates can be viewed as a summary from the question and multiple passages. It is also style-controllable; one model can generate the answer with the target style.
Masque directly models the conditional probability $p(y \mid x^q, \{x^{p_k}\}, s)$ of the answer $y$ given the question $x^q$, the passages $\{x^{p_k}\}$, and the style $s$. In addition to multi-style learning, it considers passage ranking and answer possibility classification together as multi-task learning in order to improve accuracy. Figure 2 shows the model architecture. It consists of the following modules.
The question-passages reader (§3.1) models interactions between the question and passages.
The passage ranker (§3.2) finds relevant passages to the question.
The answer possibility classifier (§3.3) identifies answerable questions.
The answer sentence decoder (§3.4) outputs a sequence of words conditioned on the style.
3.1 Question-Passages Reader
Given a question and passages, the question-passages reader matches them so that the interactions among the question (passage) words conditioned on the passages (question) can be captured.
3.1.1 Word Embedding Layer
Let $x^q$ and $x^{p_k}$ represent one-hot vectors of words in the question and the $k$-th passage. First, this layer projects each of the one-hot vectors (of size $V$) into a $d_{\mathrm{word}}$-dimensional continuous vector space with a pre-trained weight matrix such as GloVe (Pennington et al., 2014). Next, it uses contextualized word representations from ELMo (Peters et al., 2018), which is a character-level two-layer bidirectional language model pre-trained on a large-scale corpus. ELMo representations allow our model to use morphological clues to form robust representations for out-of-vocabulary words unseen in training. Then, the concatenation of the word and contextualized embedding vectors is passed to a two-layer highway network (Srivastava et al., 2015) that is shared for the question and passages.
3.1.2 Shared Encoder Layer
This layer uses a stack of Transformer blocks, which are shared for the question and passages, on top of the embeddings provided by the word embedding layer. The input of the first block is immediately mapped to a $d$-dimensional vector by a linear transformation. The outputs of this layer are sequences of $d$-dimensional vectors: $E^{p_k} \in \mathbb{R}^{d \times L}$ for the $k$-th passage (of $L$ words) and $E^q \in \mathbb{R}^{d \times J}$ for the question (of $J$ words).
Transformer encoder block.
It consists of two sub-layers: a self-attention layer and a position-wise feed-forward network. For the self-attention layer, we adopt the multi-head attention mechanism defined in (Vaswani et al., 2017). The feed-forward network consists of two linear transformations with a GELU (Hendrycks and Gimpel, 2016) activation in between, following OpenAI GPT (Radford et al., 2018). Each sub-layer is placed inside a residual block (He et al., 2016). For an input $x$ and a given sub-layer function $f$, the output is $\mathrm{LayerNorm}(f(x) + x)$, where $\mathrm{LayerNorm}$ indicates the layer normalization proposed in (Ba et al., 2016). To facilitate these residual connections, all sub-layers produce outputs of dimension $d$. Note that our model does not use any position embeddings because ELMo gives the positional information of the words in each sequence.
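The residual wrapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the post-norm form LayerNorm(x + f(x)) from Vaswani et al. (2017) and a layer normalization without learned gain and bias; the toy weights and function names are assumptions for illustration, and the self-attention sub-layer is not shown.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(v, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def feed_forward(v, w1, b1, w2, b2):
    """Two linear maps with a GELU in between; w1 is d x d_ff, w2 is d_ff x d."""
    hidden = [gelu(sum(v[i] * w1[i][j] for i in range(len(v))) + b1[j])
              for j in range(len(b1))]
    return [sum(hidden[j] * w2[j][k] for j in range(len(hidden))) + b2[k]
            for k in range(len(b2))]


def residual_sublayer(v, f):
    """Residual block around a sub-layer function f: LayerNorm(x + f(x))."""
    fx = f(v)
    return layer_norm([a + b for a, b in zip(v, fx)])


# Toy sizes (d = 4, d_ff = 2) and toy weights, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]
w1 = [[0.1, 0.1] for _ in range(4)]
b1 = [0.0, 0.0]
w2 = [[0.1, 0.1, 0.1, 0.1] for _ in range(2)]
b2 = [0.0, 0.0, 0.0, 0.0]
out = residual_sublayer(x, lambda v: feed_forward(v, w1, b1, w2, b2))
```

Because the sub-layer output is added to its input before normalization, both must have the same dimension, which is why every sub-layer is required to produce vectors of the same size.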
3.1.3 Dual Attention Layer
This layer fuses information from the passages to the question as well as from the question to the passages in a dual mechanism.
It first computes a similarity matrix between the question and -th passage, as is done in (Seo et al., 2017), where
indicates the similarity between the -th word of the -th passage and the -th question word. are learnable parameters. The operator denotes the Hadamard product, and the operator means vector concatenation across the rows. Next, it obtains the row and column normalized similarity matrices and . We use DCN (Xiong et al., 2017) as the dual attention mechanism to obtain question-to-passage representations :
and passage-to-question ones :
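The first steps of this layer can be sketched as follows. The equations themselves are not recoverable from this text, so the code assumes the trilinear similarity of Seo et al. (2017), in which the score for a passage word and a question word is a learned weight vector applied to the concatenation of the two vectors and their Hadamard product. The function names are assumptions, and the subsequent DCN fusion step is deliberately left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_matrix(passage, question, w):
    """U[l][j] = w . [p_l ; q_j ; p_l * q_j] for passage word l and question word j."""
    U = []
    for p in passage:
        row = []
        for q in question:
            feats = p + q + [a * b for a, b in zip(p, q)]  # concatenation
            row.append(sum(wi * fi for wi, fi in zip(w, feats)))
        U.append(row)
    return U


def normalize_rows(U):
    """Softmax over question words, separately for each passage word."""
    return [softmax(row) for row in U]


def normalize_columns(U):
    """Softmax over passage words, separately for each question word."""
    cols = [softmax(list(col)) for col in zip(*U)]
    return [list(row) for row in zip(*cols)]


# Toy inputs: 3 passage words and 2 question words with d = 2.
passage = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [[1.0, 0.0], [0.5, 0.5]]
w = [0.1] * 6                      # one weight per concatenated feature (3d)
U = similarity_matrix(passage, question, w)
row_norm = normalize_rows(U)
col_norm = normalize_columns(U)
```

The two normalized matrices give attention weights in the two directions (each passage word over question words, and each question word over passage words), which is what the dual mechanism then uses to build the fused representations.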
3.1.4 Modeling Encoder Layer
This layer uses a stack of Transformer encoder blocks for question representations and obtains $M^q$ from the passage-to-question representations. It also uses another stack for passage representations and obtains $M^{p_k}$ from the question-to-passage representations for each $k$-th passage. The outputs of this layer, $M^q$ and $\{M^{p_k}\}$, are passed on to the answer sentence decoder; the $\{M^{p_k}\}$ are also passed on to the passage ranker and answer possibility classifier.
3.2 Passage Ranker
The passage ranker maps the output of the modeling layer for each passage, $M^{p_k}$, to the relevance score of that passage. To obtain a fixed-dimensional pooled representation of each passage sequence, this layer takes the output for the first passage word, $M^{p_k}_1$, which corresponds to the beginning-of-sentence token. It calculates the relevance of each $k$-th passage to the question as:
$$\beta^{p_k} = \mathrm{sigmoid}\left({w^r}^\top M^{p_k}_1\right),$$
where $w^r \in \mathbb{R}^{d}$ are learnable parameters.
3.3 Answer Possibility Classifier
The answer possibility classifier maps the output of the modeling layer, $\{M^{p_k}\}$, to the probability of the answer possibility. The classifier takes the output for the first word, $M^{p_k}_1$, for all passages and concatenates them to obtain a fixed-dimensional representation. It calculates the answer possibility to the question as:
$$P(a) = \mathrm{sigmoid}\left({w^c}^\top \left[M^{p_1}_1; \ldots; M^{p_K}_1\right]\right),$$
where $w^c \in \mathbb{R}^{Kd}$ are learnable parameters.
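Both the passage ranker and the answer possibility classifier pool each passage through its first-token vector. The sketch below assumes each score is a sigmoid of a single learned linear projection without a bias term, which is consistent with the binary cross-entropy losses used later but is an assumption here; the function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def passage_relevance(first_token_vecs, w_rank):
    """One relevance score per passage, computed from its first-token vector."""
    return [sigmoid(dot(w_rank, m1)) for m1 in first_token_vecs]


def answer_possibility(first_token_vecs, w_cls):
    """A single probability computed from the concatenated first-token vectors."""
    concat = [x for m1 in first_token_vecs for x in m1]
    return sigmoid(dot(w_cls, concat))


# Toy example: K = 3 passages, d = 2.
first_vecs = [[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]]
scores = passage_relevance(first_vecs, w_rank=[0.5, 0.5])
p_answerable = answer_possibility(first_vecs, w_cls=[0.0] * 6)
```

Note the difference in shape: the ranker applies the same weight vector to each passage independently, while the classifier needs a weight vector whose length is the number of passages times the hidden size.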
3.4 Answer Sentence Decoder
Given the outputs provided by the reader, the decoder generates a sequence of answer words one element at a time. It is auto-regressive (Graves, 2013), consuming the previously generated words as additional input at each decoding step.
3.4.1 Word Embedding Layer
Let $y = \{y_1, \ldots, y_T\}$ represent one-hot vectors of words in the answer. This layer has the same components as the word embedding layer of the question-passages reader, except that it uses a unidirectional ELMo in order to ensure that the predictions for position $t$ depend only on the known outputs at positions less than $t$.
Moreover, to be able to make use of multiple answer styles within a single system, our model introduces an artificial token corresponding to the target style at the beginning of the answer sentence ($y_1$), as in (Takeno et al., 2017). At test time, the user can specify the first token to control the answer styles. This modification does not require any changes to the model architecture. Note that introducing the tokens on the decoder side prevents the passage ranker and answer possibility classifier from depending on the answer style.
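The style-token idea amounts to a one-line change to the decoder input. A minimal sketch follows; the token strings `<nlg>` and `<qa>` and the function names are hypothetical, since the paper does not state how the artificial tokens are spelled.

```python
# Hypothetical spellings for the artificial style tokens.
STYLE_TOKENS = {"NLG": "<nlg>", "Q&A": "<qa>"}


def decoder_input(answer_words, style):
    """Training: prepend the style token to the target answer words."""
    return [STYLE_TOKENS[style]] + list(answer_words)


def start_sequence(style):
    """Test time: decoding starts from the chosen style token alone."""
    return [STYLE_TOKENS[style]]


nlg_input = decoder_input(["there", "are", "16", "tablespoons", "in", "a", "cup", "."], "NLG")
qa_input = decoder_input(["16"], "Q&A")
```

Because the token only appears on the decoder side, everything computed by the reader, ranker, and classifier is identical for both styles of the same question.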
3.4.2 Attentional Decoder Layer
This layer uses a stack of Transformer decoder blocks on top of the embeddings provided by the word embedding layer. The input is immediately mapped to a $d$-dimensional vector by a linear transformation, and the output of this layer is a sequence of $d$-dimensional vectors: $\{s_1, \ldots, s_T\}$.
Transformer decoder block.
In addition to the sub-layers of the encoder block, this block has second and third sub-layers after the self-attention sub-layer and before the feed-forward network, as shown in Figure 2. As in (Vaswani et al., 2017), the self-attention sub-layer uses a subsequent mask to prevent positions from attending to subsequent positions. The second and third sub-layers perform multi-head attention over $M^q$ and $M^{p_{\mathrm{all}}}$, respectively, where $M^{p_{\mathrm{all}}}$ is the concatenation of the outputs of the encoder stack for the passages,
$$M^{p_{\mathrm{all}}} = \left[M^{p_1}, \ldots, M^{p_K}\right].$$
The $[,]$ operator means vector concatenation across the columns. This attention over the concatenated passages enables our model to produce attention weights that are comparable between passages.
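Two mechanical pieces of this block are easy to show concretely: the subsequent mask and the concatenation of the per-passage outputs into one sequence. The sketch below uses plain lists; the function names are assumptions, and the record of which passage each position came from is an added convenience for later steps, not something stated at this point in the text.

```python
def subsequent_mask(length):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(length)] for t in range(length)]


def concat_passages(passage_outputs):
    """Join per-passage sequences of vectors into one long sequence.

    Also returns, for every position in the joined sequence, the index of the
    passage it came from.
    """
    joined, passage_index = [], []
    for k, seq in enumerate(passage_outputs):
        joined.extend(seq)
        passage_index.extend([k] * len(seq))
    return joined, passage_index


mask = subsequent_mask(3)
joined, passage_index = concat_passages([[[1.0], [2.0]], [[3.0]]])
```

Attending over the joined sequence means a single softmax spans all passages, so a weight on a word in one passage is directly comparable to a weight on a word in another.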
3.4.3 Multi-source Pointer-Generator
Our extended mechanism allows both words to be generated from a fixed vocabulary and words to be copied from both the question and multiple passages. Figure 3 shows the overview.
Extended vocabulary distribution.
Let the extended vocabulary, , be the union of the common words (a small subset of the full vocabulary, , defined by the reader-side word embedding matrix) and all words appearing in the input question and passages. denotes the probability distribution of the -th answer word, , over the extended vocabulary. It is defined as:
where the output embedding is tied with the corresponding part of the input embedding (Inan et al., 2017), and and are learnable parameters. is zero if is an out-of-vocabulary word for .
The copy mechanism used in the original pointer-generator is based on the attention weights of a single-layer attentional RNN decoder (See et al., 2017). The attention weights in our decoder stack are the intermediate outputs in multi-head attentions and are not suitable for the copy mechanism. Therefore, our model also uses additive attentions for the question and multiple passages on top of the decoder stack.
The layer takes as the query and outputs () as the attention weights and () as the context vectors for the question (passages):
where , , , , , , and , are learnable parameters.
and are the copy distributions over the extended vocabulary, defined as:
where means the passage index corresponding to the -th word in the concatenated passages.
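The final distribution of the $t$-th answer word, $y_t$, is defined as a mixture of the three distributions:
$$P(y_t) = \lambda^v P^v(y_t) + \lambda^q P^q(y_t) + \lambda^p P^p(y_t),$$
where $P^v$ is the extended vocabulary distribution, $P^q$ and $P^p$ are the copy distributions for the question and passages, and the mixture weights are given by
$$\lambda^v, \lambda^q, \lambda^p = \mathrm{softmax}\left(W^m \left[s_t; c^q_t; c^p_t\right] + b^m\right).$$
Here $s_t$ is the decoder output at step $t$, $c^q_t$ and $c^p_t$ are the context vectors for the question and passages, and $W^m$, $b^m$ are learnable parameters.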
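The mixture can be illustrated with dictionaries over a tiny extended vocabulary. This is a sketch under stated assumptions: the copy distribution is taken to be the sum of attention weights over all source positions holding a given word (as in the original pointer-generator), and the mixture weights are produced here from three toy logits rather than from learned parameters. All names and numbers are made up for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def copy_distribution(attention, source_words, ext_vocab):
    """Sum the attention weights over source positions holding the same word."""
    dist = {w: 0.0 for w in ext_vocab}
    for weight, word in zip(attention, source_words):
        dist[word] += weight
    return dist


def final_distribution(p_vocab, p_question, p_passages, mix):
    """Weighted sum of the three distributions; the three weights sum to 1."""
    lam_v, lam_q, lam_p = mix
    return {w: lam_v * p_vocab[w] + lam_q * p_question[w] + lam_p * p_passages[w]
            for w in p_vocab}


# "there" and "are" stand in for common words; "16" and "cup" occur only in the inputs.
ext_vocab = ["there", "are", "16", "cup"]
p_vocab = {"there": 0.6, "are": 0.4, "16": 0.0, "cup": 0.0}  # zero for out-of-vocabulary words
p_question = copy_distribution([1.0], ["cup"], ext_vocab)
p_passages = copy_distribution([0.2, 0.1, 0.5, 0.2],
                               ["there", "are", "16", "cup"], ext_vocab)
mix = softmax([0.0, 0.0, 0.0])  # toy logits giving equal weights
p_final = final_distribution(p_vocab, p_question, p_passages, mix)
```

The word "16" has zero probability under the fixed-vocabulary distribution but still receives mass in the final distribution through the passage copy distribution, which is the point of the mechanism.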
3.4.4 Combined Attention
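In order not to use words in irrelevant passages, our model introduces the concept of combined attention (Sun et al., 2018b). While the original technique combines the word and sentence level attentions, our model combines the passage-level relevance and word-level attentions by using simple scalar multiplication and re-normalization. The updated word attention is:
$$\alpha^p_{tl} \leftarrow \frac{\alpha^p_{tl}\, \beta^{p_{k(l)}}}{\sum_{l'} \alpha^p_{tl'}\, \beta^{p_{k(l')}}},$$
where $\alpha^p_{tl}$ is the attention weight on the $l$-th word of the concatenated passages at decoding step $t$, $\beta^{p_k}$ is the relevance of the $k$-th passage, and $k(l)$ is the index of the passage containing the $l$-th word.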
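The scalar multiplication and re-normalization can be sketched directly. The function and argument names are assumptions; `passage_index` gives, for each word position in the concatenated passages, the passage it belongs to.

```python
def combined_attention(word_attention, passage_index, relevance):
    """Scale each word's attention by its passage's relevance, then re-normalize."""
    scaled = [a * relevance[k] for a, k in zip(word_attention, passage_index)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Four words, two per passage; the first passage is judged far more relevant.
updated = combined_attention(
    word_attention=[0.25, 0.25, 0.25, 0.25],
    passage_index=[0, 0, 1, 1],
    relevance=[0.9, 0.1],
)
```

With uniform word attention, the words from the passage scored 0.9 end up with nine times the weight of the words from the passage scored 0.1, so copying from a passage the ranker considers irrelevant becomes unlikely.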
3.5 Loss Function
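We define the training loss as the sum of the losses in
$$L(\theta) = L_{\mathrm{dec}} + \gamma_{\mathrm{rank}} L_{\mathrm{rank}} + \gamma_{\mathrm{cls}} L_{\mathrm{cls}},$$
where $\theta$ is the set of all learnable parameters, and $\gamma_{\mathrm{rank}}$ and $\gamma_{\mathrm{cls}}$ are balancing parameters.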
The loss of the decoder, , is the negative log likelihood of the whole target answer sentence averaged over answerable examples:
where is the training dataset.
The losses of the passage ranker, , and the answer possibility classifier, , are the binary cross entropy between the true and predicted values averaged over all examples:
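The three losses and their weighted sum can be sketched as follows. Several details are assumptions because the equations are missing from this text: the decoder loss here sums the token log-likelihoods without dividing by answer length, the ranker loss is averaged over the passages of each example before averaging over examples, and the dictionary keys and function names are hypothetical.

```python
import math


def binary_cross_entropy(target, prob, eps=1e-12):
    """Cross entropy between a target in [0, 1] and a predicted probability."""
    return -(target * math.log(prob + eps) + (1 - target) * math.log(1 - prob + eps))


def decoder_nll(token_probs):
    """Negative log likelihood of one target answer: -sum_t log P(y_t)."""
    return -sum(math.log(p) for p in token_probs)


def total_loss(examples, gamma_rank, gamma_cls):
    """Decoder loss plus weighted ranker and classifier losses."""
    answerable = [ex for ex in examples if ex["answerable"] == 1]
    # Decoder loss: averaged over answerable examples only.
    l_dec = sum(decoder_nll(ex["token_probs"]) for ex in answerable) / max(len(answerable), 1)
    # Classifier loss: averaged over all examples.
    l_cls = sum(binary_cross_entropy(ex["answerable"], ex["p_answerable"])
                for ex in examples) / len(examples)
    # Ranker loss: averaged over passages, then over all examples.
    l_rank = sum(
        sum(binary_cross_entropy(r, b) for r, b in zip(ex["relevance"], ex["p_relevance"]))
        / len(ex["relevance"])
        for ex in examples
    ) / len(examples)
    return l_dec + gamma_rank * l_rank + gamma_cls * l_cls


examples = [
    {"answerable": 1, "token_probs": [0.5, 0.5], "p_answerable": 0.5,
     "relevance": [1, 0], "p_relevance": [0.5, 0.5]},
    {"answerable": 0, "token_probs": [], "p_answerable": 0.5,
     "relevance": [0, 0], "p_relevance": [0.5, 0.5]},
]
loss = total_loss(examples, gamma_rank=0.5, gamma_cls=0.1)
```

The unanswerable example contributes nothing to the decoder term but still trains the ranker and classifier, which is how the model can learn from the full dataset. The values 0.5 and 0.1 are the two balancing factors reported in the experimental setup; assigning them to the ranker and classifier in that order is an assumption.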
4 Experiments
Datasets and styles.
We conducted experiments on the two tasks of MS MARCO 2.1 (Bajaj et al., 2018). The answer styles considered in the experiments corresponded to the two tasks. The NLG task requires a well-formed answer that is an abstractive summary of the question and ten passages, averaging 16.6 words. The Q&A task also requires an abstractive answer but prefers a more concise answer than the NLG task, averaging 13.1 words, where many of the answers do not contain the context of the question. For instance, for the question “tablespoon in cup”, the answer in the Q&A task will be “16”, and the answer in the NLG task will be “There are 16 tablespoons in a cup.” In addition to the ALL dataset, we prepared two subsets (Table 1). The ANS set consists of answerable questions, and the WFA set consists of the answerable questions and well-formed answers, where WFA $\subset$ ANS $\subset$ ALL.
We trained our model on a machine with eight NVIDIA P100 GPUs. Our model was jointly trained with the two answer styles in the ALL set for a total of eight epochs with a batch size of 80. The training took roughly six days. The ensemble model consists of six training runs with identical architecture and hyperparameters. The hidden size was 304, and the number of attention heads was 8. The inner state size of the feed-forward networks was 256. The numbers of shared encoding blocks, modeling blocks for the question, modeling blocks for the passages, and decoder blocks were 3, 2, 5, and 8, respectively. We used the pre-trained uncased 300-dimensional GloVe (Pennington et al., 2014) and the original 512-dimensional ELMo (Peters et al., 2018). We used the spaCy tokenizer, and all words were lowercased except the input for ELMo. The number of common words in the extended vocabulary was 5,000.
We used the Adam optimization (Kingma and Ba, 2015) with , , and . Weights were initialized using , except that the biases of all the linear transformations were initialized with zero vectors. The learning rate was increased linearly from zero to in the first 2,000 steps and annealed to 0 using a cosine schedule. All parameter gradients were clipped to a maximum norm of . An exponential moving average was applied to all trainable variables with a decay rate 0.9995. The balancing factors of joint learning, and , were set to 0.5 and 0.1.
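The learning-rate schedule and the exponential moving average can be sketched as plain functions. The peak learning rate and the total number of training steps are not recoverable from this text, so `peak_lr` and `total_steps` are left as parameters and the values used below are placeholders, not the paper's settings. Only the 2,000 warm-up steps and the decay rate of 0.9995 come from the description above.

```python
import math


def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warm-up from zero to peak_lr, then cosine annealing to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))


def ema_update(shadow, value, decay=0.9995):
    """One exponential-moving-average step for a single trainable value."""
    return decay * shadow + (1.0 - decay) * value


# Placeholder settings: peak_lr = 1.0 and total_steps = 10000 are not from the paper.
schedule = [learning_rate(s, peak_lr=1.0, warmup_steps=2000, total_steps=10000)
            for s in (0, 1000, 2000, 6000, 10000)]
```

With these placeholder settings the rate rises linearly to its peak at step 2,000, falls to half the peak midway through the remaining steps, and reaches zero at the end of training.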
We used a modified version of the L2 regularization proposed in (Loshchilov and Hutter, 2017), with . We additionally used a dropout (Srivastava et al., 2014) rate of 0.3 for all highway networks and residual and scaled dot-product attention operations in the multi-head attention mechanism. We also used one-sided label smoothing (Szegedy et al., 2016) for the passage relevance and answer possibility labels. We smoothed only the positive labels to 0.9.
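One-sided label smoothing as described here changes only the positive targets. A minimal sketch, with a hypothetical function name:

```python
def smooth_labels(labels, positive_value=0.9):
    """One-sided smoothing: positive labels become 0.9, negative labels stay at 0."""
    return [positive_value if y == 1 else 0.0 for y in labels]


relevance_targets = smooth_labels([1, 0, 0, 1])
```

The smoothed values are then used as the targets of the binary cross-entropy losses for passage relevance and answer possibility, so the model is not pushed to predict exactly 1 for positive cases.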
Does our model achieve state-of-the-art performance for generative RC?
Table 2 shows that our ensemble model, controlled with the NLG and Q&A styles, achieved state-of-the-art performance on the NLG and Q&A tasks in terms of Rouge-L. In particular, for the NLG task, our single model outperformed competing models in terms of both Rouge-L and Bleu-1. The capability of creating abstractive summaries from the question and passages contributed to its improvements over the state-of-the-art extractive approaches (Wu et al., 2018; Yan et al., 2018).
| Model | NLG Rouge-L | NLG Bleu-1 | Q&A Rouge-L | Q&A Bleu-1 |
|---|---|---|---|---|
| Deep Cascade QA (2018) | 35.14 | 37.35 | 52.01 | 54.64 |
| S-Net (2018) [1] | 45.04 | 40.62 | 44.96 | 46.36 |
| Masque (NLG; single) | 49.19 | 49.63 | 48.42 | 48.68 |
| Masque (Q&A; single) | 25.66 | 36.62 | 50.93 | 42.37 |
| Masque (NLG; ensemble) | 49.61 | 50.13 | 48.92 | 48.75 |
| Masque (Q&A; ensemble) | 28.53 | 39.87 | 52.20 | 43.77 |

[1] An unpublished variant by Bo Shao of SYSU University.
| Model | Train | Rouge-L | Bleu-1 |
|---|---|---|---|
| Masque (NLG style; single) | ALL | 69.77 | 65.56 |
| Masque w/ gold passage ranker | ALL | 78.70 | 78.14 |
Does our multi-style learning improve NLG performance?
Table 3 shows the results of the ablation test for our model (controlled with the NLG style) on the well-formed answers of the WFA dev. set. Our model, which was trained with the ALL set consisting of the two styles, outperformed the model trained with the WFA set consisting of the single style. Multi-style learning allowed our model to improve NLG performance by also using non-sentence answers.
Does our Transformer-based pointer-generator improve NLG performance?
Does our joint learning with the ranker and classifier improve NLG performance?
Table 3 shows that our model (jointly trained with the passage ranker and answer possibility classifier) outperformed the model that did not use the ranker and classifier. The joint learning has a regularization effect on the question-passages reader.
We also confirmed that the gold passage ranker, which can predict passage relevances perfectly, improves RC performance significantly. Passage re-ranking will be a key to developing a system that can outperform humans.
Does our joint learning improve the passage re-ranking performance?
|Bing (initial ranking)||-||34.62||35.00|
Table 4 shows the passage re-ranking performance for the ten given passages on the ANS dev. set. Our ranker improved the initial ranking provided by Bing by a significant margin. Also, the ranker shares the question-passages reader with the answer decoder, and this sharing contributed to the improvements over the ranker trained without the answer decoder. This result is similar to those reported in (Nishida et al., 2018). Moreover, the joint learning with the answer possibility classifier and multiple answer styles, which enables our model to learn from a larger amount of data, improved the re-ranking.
Does our model accurately identify answerable questions?
Figure 4 shows the precision-recall curve of answer possibility classification on the ALL dev. set, where the positive class is the answerable data. Our model identified the answerable questions well. The maximum F1 score was 0.7893. This is the first report on answer possibility classification with MS MARCO 2.1.
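A maximum score read off a precision-recall curve can be computed by sweeping the decision threshold over the predicted probabilities. The sketch below assumes the reported maximum is an F1 score and uses made-up labels and probabilities; the function names are hypothetical.

```python
def precision_recall_f1(labels, probs, threshold):
    """Precision, recall, and F1 when predicting positive for prob >= threshold."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat)
    fp = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat)
    fn = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and not yhat)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def max_f1(labels, probs):
    """Best F1 over all thresholds that change the set of positive predictions."""
    return max(precision_recall_f1(labels, probs, t)[2] for t in sorted(set(probs)))


# Made-up data: 1 = answerable, 0 = unanswerable.
labels = [1, 1, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.4, 0.2]
best = max_f1(labels, probs)
```

In this toy case the best threshold is 0.4, which recovers all three answerable questions at the cost of one false positive.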
Does our model accurately control answers with different styles?
Figure 5 shows the lengths of the answers generated by our model, broken down by answer style and query type. The generated answers were shorter than the reference answers but well controlled with the target style in every query type.
Also, we should note that our model does not guarantee the consistency in terms of meaning across the answer styles. We randomly selected 100 questions and compared the answers our model generated with the NLG and Q&A styles. The consistency ratio was 0.81, where major errors were due to copying words from different parts of the passages and generating different words, especially yes/no, from a fixed vocabulary.
Appendix A shows examples of generated answers. We found (d) style errors; (e) yes/no classification errors; (f) copy errors with respect to numerical values; and (c,e) grammatical errors that were originally contained in the inputs.
5 Related Work and Discussion
RC with NLG.
MQAN (McCann et al., 2018) frames various tasks as question answering tasks that take a 3-tuple (question, context, answer) as inputs. It uses a pointer-generator decoder to jointly learn all tasks without any task-specific parameters; unlike ours, it cannot modify answers with the target style or handle multiple passages. S-Net (Tan et al., 2018) uses a generative model for multi-passage RC. It uses answer extraction to predict the most important spans from the passage as evidence; then it uses the evidence to generate the final answers. However, it does not handle the extended vocabulary in order to generate words appearing in the question and passages.
To the best of our knowledge, there are no datasets for providing answers in natural language with multiple styles except MS MARCO 2.1, although there are some datasets that provide abstractive answers. DuReader (He et al., 2018), a Chinese multi-document RC dataset, provides the top-10 ranked entire documents from Baidu Search and Zhidao. Many of the answers are long and relatively far from the source documents compared with those from MS MARCO. NarrativeQA (Kociský et al., 2018) proposed a dataset about stories or summaries of books or movie scripts. The documents are long, averaging 62,528 (659) words in stories (summaries), while the answers are relatively short, averaging 4.73 words. Moreover, DuoRC (Saha et al., 2018) and CoQA (Reddy et al., 2018) contain abstractive answers; most of the answers are short phrases.
Controllable text generation.
Many studies have been carried out in the framework of style transfer, which is the task of rephrasing the text so that it contains specific styles such as sentiment. Recent work uses artificial tokens (Sennrich et al., 2016; Johnson et al., 2017), variational auto-encoders (Hu et al., 2017), adversarial training (Fu et al., 2018; Tsvetkov et al., 2018), or prior knowledge (Li et al., 2018b) to separate the content and style on the encoder side. On the decoder side, conditional language modeling has been used to generate an output sentence with the target style. In addition to style transfer, output length control with conditional language modeling has been well studied (Kikuchi et al., 2016; Takeno et al., 2017; Fan et al., 2018). Our style-controllable RC relies on conditional language modeling on the decoder side.
Multi-passage RC.
The simplest approach is to concatenate the passages and find the answer from the concatenated one as in (Wang et al., 2017). Earlier pipeline models find a small number of relevant passages with a TF-IDF based ranker and pass them to a neural reader (Chen et al., 2017; Clark and Gardner, 2018), while more recent pipeline models use a neural re-ranker to more accurately select the relevant passages (Wang et al., 2018; Nishida et al., 2018). Also, non-pipelined models (including ours) consider all the provided passages and find the answer by comparing scores between passages (Tan et al., 2018; Wu et al., 2018). The most recent models make a proper trade-off between efficiency and accuracy (Yan et al., 2018; Min et al., 2018).
RC with unanswerable question identification.
Previous work (Levy et al., 2017; Clark and Gardner, 2018) outputs a no-answer score depending on the probability of all answer spans. Hu et al. (2018) proposed an answer verifier to compare the answer sentence with the question. Sun et al. (2018a) proposed a unified model that jointly learns an RC model and an answer verifier. Our model introduces a classifier on the basis of question-passages matching, which is not dependent on the generated answer, unlike the previous methods.
Abstractive summarization.
Current state-of-the-art models use pointer-generator mechanisms (See et al., 2017). In particular, content selection approaches, which decide what to summarize, have recently been used with abstractive models. Most methods select content at the sentence level (Hsu et al., 2018; Chen and Bansal, 2018) and the word level (Pasunuru and Bansal, 2018; Li et al., 2018a; Gehrmann et al., 2018); our model incorporates content selection at the passage level in the combined attention.
Query-based abstractive summarization has been rarely studied. Nema et al. (2017) proposed an attentional encoder-decoder model, and Saha et al. (2018) reported that it performed worse than BiDAF on DuoRC. Hasselqvist et al. (2017) proposed a pointer-generator based model; however, it does not consider copying words from the question and multiple passages.
6 Conclusion
We believe our study makes two contributions to the study of multi-passage RC with NLG. Our model enables 1) multi-source abstractive summarization based RC and 2) style-controllable RC. The key strength of our model is its high accuracy of generating abstractive summaries from the question and passages; our model achieved state-of-the-art performance in terms of Rouge-L on the Q&A and NLG tasks of MS MARCO 2.1 that have different answer styles (Bajaj et al., 2018).
The styles considered in this paper are only related to the context of the question in the answer sentence; our model will be promising for controlling other styles such as length and speaking styles. Future work will involve exploring the potential of hybrid models combining extractive and abstractive approaches and improving the passage re-ranking and answerable question identification.
References
- Ba et al. (2016) Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv, 1607.06450.
- Bajaj et al. (2018) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A human generated machine reading comprehension dataset. arXiv, 1611.09268v3.
- Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL), pages 1870–1879.
- Chen and Bansal (2018) Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Association for Computational Linguistics (ACL), pages 675–686.
- Clark and Gardner (2018) Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Association for Computational Linguistics (ACL), pages 845–855.
- Craswell (2009) Nick Craswell. 2009. Mean reciprocal rank. In Encyclopedia of Database Systems, page 1703.
- Craswell and Robertson (2009) Nick Craswell and Stephen Robertson. 2009. Average precision at n. In Encyclopedia of Database Systems, pages 193–194.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv, 1810.04805.
- Fan et al. (2018) Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Workshop on Neural Machine Translation and Generation (NMT@ACL), pages 45–54.
- Fu et al. (2018) Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Association for the Advancement of Artificial Intelligence (AAAI), pages 663–670.
- Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. In Empirical Methods in Natural Language Processing (EMNLP), pages 4098–4109.
- Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv, 1308.0850.
- Hasselqvist et al. (2017) Johan Hasselqvist, Niklas Helmertz, and Mikael Kågebäck. 2017. Query-based abstractive summarization using neural networks. arXiv, 1712.06100.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778.
- He et al. (2018) Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. Dureader: a chinese machine reading comprehension dataset from real-world applications. In Workshop on Machine Reading for Question Answering (MRQA@ACL), pages 37–46.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv, 1606.08415.
- Hsu et al. (2018) Wan Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Association for Computational Linguistics (ACL), pages 132–141.
- Hu et al. (2018) Minghao Hu, Furu Wei, Yuxing Peng, Zhen Huang, Nan Yang, and Ming Zhou. 2018. Read + verify: Machine reading comprehension with unanswerable questions. arXiv, 1808.05759.
- Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning (ICML), pages 1587–1596.
- Inan et al. (2017) Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In International Conference on Learning Representations (ICLR).
- Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics (TACL), 5:339–351.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL), pages 1601–1611.
- Kikuchi et al. (2016) Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Empirical Methods in Natural Language Processing (EMNLP), pages 1328–1338.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Kociský et al. (2018) Tomás Kociský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics (TACL), 6:317–328.
- Levy et al. (2017) Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Computational Natural Language Learning (CoNLL), pages 333–342.
- Li et al. (2018a) Chenliang Li, Weiran Xu, Si Li, and Sheng Gao. 2018a. Guiding generation for abstractive text summarization based on key information guide network. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 55–60.
- Li et al. (2018b) Juncen Li, Robin Jia, He He, and Percy Liang. 2018b. Delete, retrieve, generate: a simple approach to sentiment and style transfer. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1865–1874.
- Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. arXiv, 1711.05101.
- McCann et al. (2018) Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv, 1806.08730.
- Min et al. (2018) Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In Association for Computational Linguistics (ACL), pages 1725–1735.
- Nema et al. (2017) Preksha Nema, Mitesh M. Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven attention model for query-based abstractive summarization. In Association for Computational Linguistics (ACL), pages 1063–1072.
- Nishida et al. (2018) Kyosuke Nishida, Itsumi Saito, Atsushi Otsuka, Hisako Asano, and Junji Tomita. 2018. Retrieve-and-read: Multi-task learning of information retrieval and reading comprehension. In Conference on Information and Knowledge Management (CIKM), pages 647–656.
- Pasunuru and Bansal (2018) Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 646–653.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2227–2237.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL), pages 784–789.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.
- Reddy et al. (2018) Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. arXiv, 1808.07042.
- Saha et al. (2018) Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Association for Computational Linguistics (ACL), pages 1683–1693.
- See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Association for Computational Linguistics (ACL), pages 1073–1083.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 35–40.
- Seo et al. (2017) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations (ICLR).
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
- Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv, 1505.00387.
- Sun et al. (2018a) Fu Sun, Linyang Li, Xipeng Qiu, and Yang Liu. 2018a. U-Net: Machine reading comprehension with unanswerable questions. arXiv, 1810.06638.
- Sun et al. (2018b) Min Sun, Wan Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, and Jing Tang. 2018b. A unified model for extractive and abstractive summarization using inconsistency loss. In Association for Computational Linguistics (ACL), pages 132–141.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.
- Takeno et al. (2017) Shunsuke Takeno, Masaaki Nagata, and Kazuhide Yamamoto. 2017. Controlling target features in neural machine translation via prefix constraints. In Workshop on Asian Translation (WAT@IJCNLP), pages 55–63.
- Tan et al. (2018) Chuanqi Tan, Furu Wei, Nan Yang, Bowen Du, Weifeng Lv, and Ming Zhou. 2018. S-Net: From answer extraction to answer synthesis for machine reading comprehension. In Association for the Advancement of Artificial Intelligence (AAAI), pages 5940–5947.
- Tsvetkov et al. (2018) Yulia Tsvetkov, Alan W. Black, Ruslan Salakhutdinov, and Shrimai Prabhumoye. 2018. Style transfer through back-translation. In Association for Computational Linguistics (ACL), pages 866–876.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000–6010.
- Wang et al. (2018) Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2018. R^3: Reinforced reader-ranker for open-domain question answering. In Association for the Advancement of Artificial Intelligence (AAAI).
- Wang et al. (2017) Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Association for Computational Linguistics (ACL), pages 189–198.
- Wu et al. (2018) Hua Wu, Haifeng Wang, Sujian Li, Wei He, Yizhong Wang, Jing Liu, Kai Liu, and Yajuan Lyu. 2018. Multi-passage machine reading comprehension with cross-passage answer verification. In Association for Computational Linguistics (ACL), pages 1918–1927.
- Xiong et al. (2017) Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In International Conference on Learning Representations (ICLR).
- Yan et al. (2018) Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang, and Haiqing Chen. 2018. A deep cascade model for multi-document reading comprehension. arXiv, 1811.11374.
- Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In International Conference on Learning Representations (ICLR).
Appendix A Reading Comprehension Examples generated by Masque from MS MARCO 2.1
(a) Question: why your body would feel like it is shaking
Relevant Passage: Shaking is a symptom in which a person has tremors (shakiness or small back and forth movements) in part or all of his body. Shaking can be due to cold body temperatures, rising fever (such as with infections), neurological problems, medicine effects, drug abuse, etc. …Read more.
Reference Answer (Q&A): Shaking can be due to cold body temperatures, rising fever (such as with infections), neurological problems, medicine effects, drug abuse, etc.
Prediction (Q&A): because of cold body temperatures , rising fever , neurological problems , medicine effects , drug abuse . ✓
Reference Answers (NLG): Shaking can be due to cold body temperatures, rising fever, neurological problems, medicine effects and drug abuse. / Body would feel like it is shaking due to cold body temperatures, rising fever, neurological problems, medicine effects, drug abuse.
Prediction (NLG): your body would feel like it is shaking because of cold body temperatures , rising fever , neurological problems , medicine effects , drug abuse . ✓
(b) Question: _____ is the name used to refer to the era of legalized segregation in the united states
Relevant Passage: Jim Crow law, in U.S. history, any of the laws that enforced racial segregation in the South between the end of Reconstruction in 1877 and the beginning of the civil rights movement in the 1950s.
Reference Answer (Q&A): Jim Crow
Prediction (Q&A): jim crow ✓
Reference Answer (NLG): Jim Crow is the name used to refer to the era of legalized segregation in the United States.
Prediction (NLG): jim crow is the name used to refer to the era of legalized segregation in the united states . ✓
(c) Question: average height nba player
Relevant Passage: The average height of an NBA player is around 6 feet 7 inches. The tallest NBA player ever was Gheorghe Muresan, who was 7 feet 7 inches tall. In contrast, the shortest NBA player ever was Tyrone Muggsy Bogues, who was 5 feet 3 inches tall.
Reference Answer (Q&A): Around 6 feet 7 inches
Prediction (Q&A): 6 feet 7 inches ✓
Reference Answers (NLG): The average height of NBA players is around 6 feet, 7 inches. / The height of NBA player is around 6 feet 7 inches.
Prediction (NLG): the average height of an national basketball association player is 6 feet 7 inches . ✓
(d) Question: population of henryetta ok
Relevant Passage: With it’s population of 5,927 people, Henryetta represented 14.79% of the county’s total. In addition, where the city of Henryetta has a population density of 891 people per square mile [344.2 people/km2], Okmulgee County averages 57.5 people per square mile [22.19 people/km2].
Reference Answer (Q&A): 5927
Prediction (Q&A): the population of henryetta , oklahoma is 5,927 . ✓ (the content is correct, but the style is wrong: a full sentence was generated where the Q&A style expects a concise answer ✗)
Reference Answer (NLG): The population of Henryetta, Oklahoma is 5,927.
Prediction (NLG): the population of henryetta , oklahoma is 5,927 . ✓
(e) Question: does gameplay programmer need math skill
Relevant Passage: A good computer programmer is more of a problem solver and logical thinker than a math buff. And besides, the industry is peppered with many successful computer programmers who do not really know much about mathematics.
Reference Answer (Q&A): No
Prediction (Q&A): yes ✗
Reference Answers (NLG): No, a gameplay programmer doesn’t need math skill. / No, gameplay programmer do not need an math skill.
Prediction (NLG): no , gameplay programmer does not need math skill . ✓
(f) Question: how long does a freezer take to cool down
Relevant Passage: Quick Answer. It takes anywhere from three to 24 hours for a refrigerator to reach safe temperatures for storing food, depending on the size and type of unit. When the refrigerator compartment reaches 40 degrees Fahrenheit and the freezer reaches 5 degrees Fahrenheit, it is safe to transfer food items. Keep Learning.
Reference Answer (Q&A): 24 hours
Prediction (Q&A): 4 to 5 hours ✗
Reference Answers (NLG): A freezer takes 24 hours to cool down. / A freezer take to cool down is 24 hours.
Prediction (NLG): a freezer takes 4 to 12 hours to cool down . ✗