# A Unified Query-based Generative Model for Question Generation and Question Answering

Linfeng Song, Zhiguo Wang, Wael Hamza
Department of Computer Science, University of Rochester, Rochester, NY 14627
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598

## Introduction

Recently, both question generation and question answering have been receiving increasing attention from the industrial and academic communities. The task of question generation (QG) is to generate a fluent and relevant question given a passage and a target answer, while the task of question answering (QA) is to generate a correct answer given a passage and a question. Both tasks have significant industrial value: QA has been used in products such as search engines, while QG can improve QA systems by automatically enlarging their training data. QG can also be used to generate questions for educational purposes such as language learning.

We first pretrain the model with the cross-entropy loss, then fine-tune it with policy-gradient reinforcement learning to alleviate the exposure bias problem caused by sequence learning with the cross-entropy loss. In our policy-gradient reinforcement learning algorithm, we adopt a sampling strategy similar to scheduled sampling (Bengio et al., 2015) for generating the sampled output. We perform experiments on the SQuAD dataset (Rajpurkar et al., 2016) for the QG task, and on the “description” subset of the MS-MARCO dataset (Nguyen et al., 2016) for the generative QA task. Experimental results on the QG task show that our model outperforms previous state-of-the-art methods, and that the automatically generated questions can even improve an extractive QA system. For the generative QA task, our model performs better than other generative systems.

## Model

Figure 1 shows the architecture of our model. The model takes two components as input, a passage of length $N$ and a query of length $M$, and generates the output sequence word by word. Specifically, the model follows the encoder-decoder framework. The encoder matches each time-step of the passage against all time-steps of the query from multiple perspectives, and encodes the matching results into a “Multi-perspective Memory”. The decoder then generates the output sequence one word at a time based on this memory.

### Multi-Perspective Matching Encoder

The left-hand side of Figure 1 depicts the architecture of our encoder. Its goal is to perform comprehensive understanding of the query and the passage. The encoder first represents all words within the passage and the query with word embeddings (Mikolov et al., 2013). In order to incorporate contextual information into the representation of each time-step of the passage or the query, we utilize a bi-directional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) layer to encode the passage and the query individually:

 $\overleftarrow{h}^q_i = \mathrm{LSTM}(\overleftarrow{h}^q_{i+1}, q_i) \qquad \overrightarrow{h}^q_i = \mathrm{LSTM}(\overrightarrow{h}^q_{i-1}, q_i)$
 $\overleftarrow{h}^p_j = \mathrm{LSTM}(\overleftarrow{h}^p_{j+1}, p_j) \qquad \overrightarrow{h}^p_j = \mathrm{LSTM}(\overrightarrow{h}^p_{j-1}, p_j),$

where $q_i$ and $p_j$ are the embeddings of the $i$-th word in the query and the $j$-th word in the passage, respectively. Then, the contextual vector for each time-step of the query and the passage is constructed by concatenating the outputs of the BiLSTM layer: $h^q_i = [\overleftarrow{h}^q_i; \overrightarrow{h}^q_i]$ and $h^p_j = [\overleftarrow{h}^p_j; \overrightarrow{h}^p_j]$.

We utilize a matching layer on top of the contextual vectors to match each time-step of the passage with all time-steps of the query; this is the most crucial layer in our encoder. Inspired by Wang, Hamza, and Florian (2017), we adopt the multi-perspective matching method for this layer, and define four matching strategies, shown in Figure 2, to match the passage with the query at multiple granularities.

(1) Full-Matching. As shown in Figure 2 (a), each forward (or backward) contextual vector of the passage is compared with the last time-step of the forward (or backward) representation of the query.

(2) Maxpooling-Matching. As shown in Figure 2 (b), each forward (or backward) contextual vector of the passage is compared with every forward (or backward) contextual vector of the query, and only the maximum value of each dimension is retained.

(3) Attentive-Matching. As shown in Figure 2 (c), we first calculate the cosine similarity between each forward (or backward) contextual vector of the passage and every forward (or backward) contextual vector of the query. Then, taking these cosine similarities as weights, we calculate an attentive vector for the entire query as the weighted sum of all its contextual vectors. Finally, we match each forward (or backward) contextual vector of the passage with its corresponding attentive vector.

(4) Max-Attentive-Matching. As shown in Figure 2 (d), this strategy is similar to the Attentive-Matching strategy. However, instead of taking the weighted sum of all the contextual vectors as the attentive vector, we pick the contextual vector with the highest cosine similarity as the attentive vector. Then, we match each contextual vector of the passage with this new attentive vector.
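As an illustration, the four strategies can be sketched as follows, with plain cosine similarity standing in for the vector matching function (with plain cosine, Maxpooling-Matching and Max-Attentive-Matching coincide; the distinction matters once a multi-perspective matching function is used). `Hp` and `Hq` are hypothetical arrays of passage and query contextual vectors for one direction.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def full_matching(Hp, Hq):
    """Match each passage vector against the last time-step of the query."""
    return np.array([cosine(hp, Hq[-1]) for hp in Hp])

def maxpooling_matching(Hp, Hq):
    """Compare each passage vector with every query vector; keep the maximum."""
    return np.array([max(cosine(hp, hq) for hq in Hq) for hp in Hp])

def attentive_matching(Hp, Hq):
    """Weight the query vectors by cosine similarity, then match against
    the resulting attentive (weighted-average) vector."""
    out = []
    for hp in Hp:
        w = np.array([cosine(hp, hq) for hq in Hq])
        attentive = (w[:, None] * Hq).sum(axis=0) / (w.sum() + 1e-8)
        out.append(cosine(hp, attentive))
    return np.array(out)

def max_attentive_matching(Hp, Hq):
    """Match each passage vector against the single most similar query vector."""
    out = []
    for hp in Hp:
        sims = [cosine(hp, hq) for hq in Hq]
        out.append(cosine(hp, Hq[int(np.argmax(sims))]))
    return np.array(out)
```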

These four matching strategies require a function for matching two vectors. In principle, any such function would work here. Inspired by Wang et al. (2016), we adopt the multi-perspective cosine matching function defined as:

 $m = f_m(v_1, v_2; W),$

where $v_1$ and $v_2$ are $d$-dimensional input vectors, $W \in \mathbb{R}^{l \times d}$ is a learnable multi-perspective weight matrix, and $l$ is the number of perspectives. Each row $W_k$ represents the weights associated with one perspective, and the similarity according to that perspective is defined as:

 $m_k = \cos(W_k \circ v_1,\, W_k \circ v_2),$

where $\circ$ is the element-wise multiplication operation, so $m = [m_1, \dots, m_l]$ represents the matching results between $v_1$ and $v_2$ from all $l$ perspectives. Intuitively, each perspective calculates a cosine similarity between the two input vectors, and is associated with a weight vector trained to highlight different dimensions of those vectors. This can be regarded as attending to different parts of the semantics captured in the vectors.
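A minimal sketch of this matching function, assuming $W$ is stored as an $l \times d$ matrix so that each row re-weights the input dimensions before a cosine similarity is taken:

```python
import numpy as np

def multi_perspective_match(v1, v2, W, eps=1e-8):
    """m_k = cos(W_k ∘ v1, W_k ∘ v2) for each of the l perspectives.

    W has shape (l, d); each row re-weights the d dimensions of both
    vectors before a cosine similarity, yielding an l-dimensional
    matching vector m."""
    a = W * v1                                # (l, d) via broadcasting
    b = W * v2
    num = (a * b).sum(axis=1)                 # per-perspective dot products
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den                          # shape (l,)
```

For identical inputs every perspective returns a similarity of 1, and for opposite inputs −1, regardless of the learned weights.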

The final matching vector for each time-step of the passage is the concatenation of the matching results of all four strategies. We employ another BiLSTM layer on top of the matching layer to smooth the matching results. Finally, we concatenate the contextual vectors $h^p_j$ of the passage with the matching vectors to form the “Multi-perspective Memory”, which contains both the passage information and the matching information.

### LSTM Decoder

The right-hand side of Figure 1 is our decoder. Basically, it is an attention-based LSTM model (Bahdanau, Cho, and Bengio, 2014) with copy and coverage mechanisms. The decoder takes the “Multi-perspective Memory” as the attention memory, and generates the output one word at a time.

Concretely, while generating the $t$-th word $y_t$, the decoder considers five factors as input: (1) the “Multi-perspective Memory” $\{h_1, \dots, h_N\}$, where each vector $h_i$ aligns to the $i$-th word in the passage; (2) the previous hidden state of the LSTM model, $s_{t-1}$; (3) the embedding of the previously generated word, $x_{t-1}$; (4) the previous context vector $c_{t-1}$, calculated by the attention mechanism over the memory; and (5) the previous coverage vector $u_{t-1}$, which is the accumulation of all attention distributions so far. When $t = 1$, we initialize $c_0$ and $u_0$ as zero vectors, and fix $x_0$ to be the embedding of the sentence-start token “<s>”.

For each time-step $t$, the decoder first feeds the concatenation of the previous word embedding $x_{t-1}$ and context vector $c_{t-1}$ into the LSTM model to update its hidden state:

 $s_t = \mathrm{LSTM}(s_{t-1}, [x_{t-1}, c_{t-1}])$

Second, the attention distribution for each time-step of the “Multi-perspective Memory” is calculated with the following equations:

 $e_{t,i} = v_e^{\top}\tanh(W_h h_i + W_s s_t + W_u u_{t-1} + b_e), \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{N}\exp(e_{t,j})},$

where $W_h$, $W_s$, $W_u$, $v_e$, and $b_e$ are learnable parameters. The coverage vector is then updated by $u_t = u_{t-1} + \alpha_t$, and the new context vector is calculated via:

 $c_t = \sum_{i=1}^{N} \alpha_{t,i} h_i$

Then, the output probability distribution over a vocabulary of words at the current state is calculated by:

 $P_{vocab} = \mathrm{softmax}(V_2(V_1[s_t, c_t] + b_1) + b_2),$

where $V_1$, $V_2$, $b_1$, and $b_2$ are learnable parameters. The number of rows in $V_2$ equals the number of words in the vocabulary.

On top of the LSTM decoder, we adopt the copy mechanism (Gulcehre et al., 2016; Gu et al., 2016; See, Liu, and Manning, 2017) to integrate the attention distribution into the final vocabulary distribution. The probability distribution is defined as the interpolation between two probability distributions:

 $P_{final} = g_t\, P_{vocab} + (1 - g_t)\, P_{attn},$

where $g_t$ is a switch controlling whether to generate a word from the vocabulary or to copy it directly from the passage. $P_{vocab}$ is the generation distribution defined above, and $P_{attn}$ is calculated from the attention distribution $\alpha_t$ by merging the probabilities of duplicated words. Intuitively, $g_t$ depends on the current decoder state, the attention results, and the input. Therefore, inspired by See, Liu, and Manning (2017), we define it as:

 $g_t = \sigma(w_c^{\top} c_t + w_s^{\top} s_t + w_x^{\top} x_{t-1} + b_g),$

where the vectors $w_c$, $w_s$, $w_x$ and the scalar $b_g$ are learnable parameters.
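Under the equations above, one decoding step can be sketched as follows. This is an illustrative NumPy version rather than the training implementation: the parameter names mirror the symbols in the equations, the shapes are hypothetical, and the LSTM update is assumed to have already produced `s_t`.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_step(H, s_t, x_prev, u_prev, passage_ids, vocab_size, params):
    """One decoding step: attention with coverage, then the copy switch.

    H: (N, dh) multi-perspective memory; s_t: decoder state after the
    LSTM update; x_prev: previous word embedding; u_prev: (N,) coverage
    vector; passage_ids: vocabulary id of each passage token. `params`
    maps the symbols in the equations to (hypothetical) weight arrays."""
    p = params
    # e_{t,i} = v_e^T tanh(W_h h_i + W_s s_t + W_u u_{t-1,i} + b_e)
    e = np.array([p["v_e"] @ np.tanh(p["W_h"] @ h + p["W_s"] @ s_t
                                     + p["W_u"] * u + p["b_e"])
                  for h, u in zip(H, u_prev)])
    alpha = softmax(e)                       # attention distribution
    u_t = u_prev + alpha                     # coverage update
    c_t = alpha @ H                          # context vector
    # P_vocab = softmax(V2 (V1 [s_t, c_t] + b1) + b2)
    hidden = p["V1"] @ np.concatenate([s_t, c_t]) + p["b1"]
    P_vocab = softmax(p["V2"] @ hidden + p["b2"])
    # copy switch g_t = sigma(w_c^T c_t + w_s^T s_t + w_x^T x_{t-1} + b_g)
    g = sigmoid(p["w_c"] @ c_t + p["w_s"] @ s_t + p["w_x"] @ x_prev + p["b_g"])
    # P_attn: scatter attention mass onto vocabulary ids, merging duplicates
    P_attn = np.zeros(vocab_size)
    np.add.at(P_attn, passage_ids, alpha)
    P_final = g * P_vocab + (1 - g) * P_attn
    return P_final, c_t, u_t
```

Since both $P_{vocab}$ and $P_{attn}$ sum to one, their interpolation $P_{final}$ is itself a valid probability distribution.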

## Policy Gradient Reinforcement Learning via Scheduled Sampling

A common way of training a sequence generation model is to optimize the log-likelihood of the gold-standard output sequence with the cross-entropy loss:

 $l_{ce} = -\sum_{t=1}^{T} \log p(y^*_t \mid y^*_{t-1}, \ldots, y^*_0, X; \theta),$

where $X$ is the model input, $y^*_t$ is the $t$-th gold-standard output word, and $\theta$ represents the trainable model parameters.
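For concreteness, this loss can be computed from the per-step output distributions obtained under teacher forcing; a minimal sketch:

```python
import numpy as np

def sequence_cross_entropy(step_probs, gold_ids):
    """l_ce = -sum_t log p(y*_t | y*_{<t}, X).

    step_probs[t] is the model's vocabulary distribution at step t, already
    conditioned on the gold prefix (teacher forcing); gold_ids[t] is the
    index of the gold word y*_t."""
    return -sum(np.log(step_probs[t][y]) for t, y in enumerate(gold_ids))
```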

However, this method suffers from two main issues. First, during training, the ground-truth previous word $y^*_{t-1}$ is taken as input when predicting the distribution of the next word $y_t$; at test time, the ground truth is not available, and the model has to rely on the previously generated word $y_{t-1}$. If the model selects a $y_{t-1}$ different from the ground truth $y^*_{t-1}$, the following generated sequence can deviate from the gold-standard sequence. This issue is known as the “exposure bias problem”. Second, models trained with the cross-entropy loss are optimized for the log-likelihood of a sequence, which differs from the evaluation metrics.

In this work, we utilize a reinforcement learning method to address the exposure bias problem and directly optimize the evaluation metrics. Concretely, we adopt the “REINFORCE with a baseline” algorithm (Williams, 1992), a well-known policy-gradient reinforcement learning algorithm, to train our model, because it has proven effective for several sequence generation tasks (Paulus, Xiong, and Socher, 2017; Rennie et al., 2016). Formally, the loss function is defined as:

 $l_{rl} = \big(r(\hat{Y}) - r(Y^s)\big) \sum_{t=1}^{T} \log p(y^s_t \mid y^s_{t-1}, \ldots, y^s_0, X; \theta),$

where $Y^s$ is the sampled sequence, $\hat{Y}$ is the sequence generated by a baseline, and $r(\cdot)$ is the reward calculated from the evaluation metric. Intuitively, minimizing this loss enlarges the log-probability of the sampled sequence $Y^s$ if $Y^s$ is better than the baseline $\hat{Y}$ in terms of the reward $r(\cdot)$, and vice versa. In this work, we use the BLEU score (Papineni et al., 2002) as the reward for the QG task, and the ROUGE score (Lin, 2004) as the reward for the QA task.
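A minimal sketch of this loss for one sampled sequence, assuming the per-word log-probabilities of $Y^s$ have already been computed:

```python
def reinforce_with_baseline_loss(sample_logprobs, r_sample, r_baseline):
    """l_rl = (r(Y_hat) - r(Y^s)) * sum_t log p(y^s_t | ...).

    sample_logprobs: log-probabilities of the sampled words; r_sample and
    r_baseline: rewards (e.g. BLEU or ROUGE) of the sampled and the greedy
    baseline sequences. Minimizing this loss increases the sampled
    sequence's log-probability exactly when its reward beats the baseline's."""
    return (r_baseline - r_sample) * sum(sample_logprobs)
```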

Following Rennie et al. (2016), we take the greedy search result of the current model as the baseline sequence $\hat{Y}$. Rennie et al. (2016) generated the sampled sequence $Y^s$ according to the model's probability distribution, but this sampling strategy does not work well for our tasks; one possible reason is that our tasks have a much larger search space. Inspired by Bengio et al. (2015), we therefore design a new “scheduled sampling” strategy that constructs the sampled sequence $Y^s$ from both the gold-standard sequence $Y^*$ and the greedy search sequence $\hat{Y}$. As shown in Algorithm 1, it goes through the gold-standard sequence word by word (Line 2), replacing each word with the corresponding word from the greedy search sequence with a flip probability (Lines 4-8). If the greedy search sequence is shorter than the gold-standard sequence, the ground-truth word is used beyond the end of the greedy search sequence (Line 10). Our experiments show that sampling according to the model distribution, as Rennie et al. (2016) do, usually produces outputs worse than the greedy search sequence, so it does not help much; our sampling strategy, on the other hand, usually generates better outputs than the greedy search sequence.
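Algorithm 1 as described above can be sketched as follows (the function and argument names are our own):

```python
import random

def scheduled_sample(gold, greedy, flip_prob=0.1, rng=random):
    """Build the sampled sequence Y^s by walking the gold-standard sequence
    and, with probability flip_prob, substituting the greedy-search word at
    the same position; past the end of the greedy sequence, the gold word
    is always kept."""
    sampled = []
    for t, y_gold in enumerate(gold):
        if t < len(greedy) and rng.random() < flip_prob:
            sampled.append(greedy[t])   # flip to the greedy word
        else:
            sampled.append(y_gold)      # keep the gold word
    return sampled
```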

## Experimental Setup

We conduct experiments on two tasks: question generation (QG) and generative question answering (QA).

Question Generation For the QG task, we evaluate the quality of the generated questions with automatic evaluation metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), as well as by their effectiveness in improving an extractive QA system. We conduct experiments on the SQuAD dataset (Rajpurkar et al., 2016), comparing our model with Du, Shao, and Cardie (2017) and Zhou et al. (2017) in terms of BLEU, METEOR (Banerjee and Lavie, 2005), and ROUGE. The dataset contains 536 articles and over 100k questions related to the articles. Following Du, Shao, and Cardie (2017) and Zhou et al. (2017), we use the publicly accessible part as our entire dataset. Since these two works used different training/dev/test splits, we conduct experiments on both splits and compare with their reported performance.

In addition, we evaluate our model from a more practical angle by examining whether the automatically generated questions help improve an extractive QA system. We use the data split of Du, Shao, and Cardie (2017), and run experiments in low-resource settings, where only a fraction (10%, 20%, or 50%) of the human-labeled questions in the training data is available. For example, in the 10% setting, we first train our QG model with the 10% of the training data that is available, then generate questions for the remaining 90% of training instances, whose human-labeled questions are discarded. (We assume the gold answers are available when generating questions for the remaining 90% of instances, and leave automatic answer selection as future work, since the primary goal here is to evaluate the quality of the automatically generated questions.) Finally, we train an extractive QA system with the 10% human-labeled questions and the 90% automatically generated questions. The extractive QA system we choose is that of Wang et al. (2016), but our framework makes no assumptions about the extractive QA system being used.

Generative QA For this task, we conduct experiments on the MS MARCO dataset (Nguyen et al., 2016), which contains around 100k queries and 1M passages. The dataset was designed for generating an answer given the top 10 documents returned by a search engine, where the answer is not necessarily contained in those documents. Even though the answers in this dataset are human-generated rather than extracted from candidate documents, we found that the answers to around 66% of the questions can be exactly matched in the passages, and a large fraction of the remaining answers differ only slightly from the passage content. (They are generated by dropping or paraphrasing one span from the supporting sentence, containing the answer, in the passage.) Among all question types (“numeric”, “entity”, “location”, “person”, and “description”), the “description” subset has the highest percentage of answers that cannot be exactly matched in the passage. Therefore, for the generative QA experiments, we follow Nguyen et al. (2016) in using the “description” subset, and compare with their reported results.

For both tasks, our model is first trained for 15 epochs with the cross-entropy loss, then fine-tuned for 15 epochs with our policy gradient algorithm. Adam (Kingma and Ba, 2014) is used for parameter optimization, with separate learning rates for the cross-entropy and policy gradient phases. The encoder and decoder share the same pre-trained word embeddings: 300-dimensional GloVe (Pennington, Socher, and Manning, 2014) vectors pre-trained on the 840B common crawl corpus, which are not updated during training. For all experiments, the flip probability is set to 0.1, the number of perspectives is set to 5, and the weight for the coverage loss is set to 0.1. The model yielding the best performance on the dev set is picked for evaluation on the test set.

## Experimental Results

### Question Generation

We compare our model with Du, Shao, and Cardie (2017) and Zhou et al. (2017) on the question generation task, and show the results in Table 1. Since Zhou et al. (2017) adopt rich features (such as named entity tags and part-of-speech tags), we re-implement a version without these features (w/o rich feature) for a fair comparison. We also implement two versions of our model: (1) MPQG, trained only with the cross-entropy loss, and (2) MPQG+R, fine-tuned with the policy gradient reinforcement learning algorithm after pretraining.

First, our MPQG model outperforms the compared systems on both data splits, which shows the effectiveness of our multi-perspective matching encoder. MPQG, which takes only word features, even outperforms the feature-rich (POS tag and NER) system of Zhou et al. (2017). Du, Shao, and Cardie (2017) used a sequence-to-sequence model that takes the passage as input and generates questions, entirely ignoring the target answer; the generated questions are therefore independent of it. Zhou et al. (2017) hard-coded the target answer positions into the passage and employed a sequence-to-sequence model over the position-encoded passages. This method considers the target answer positions but neglects the relations between the target answer and other parts of the passage, and it fails when the target answer does not literally occur in the passage. In contrast, our MPQG model matches the target answer against the passage from multiple perspectives; it can therefore capture more interactions between the target answer and the passage, resulting in questions more suitable for the target answer.

Second, our MPQG+R model works better than the MPQG model on both splits, showing the effectiveness of our policy gradient training algorithm.

To better illustrate the advantage of our model, we show comparative outputs of the different models in Table 2, where the Baseline system is our implementation of Zhou et al. (2017) without rich features. Generally, our MPQG model generates better questions than the baseline. Taking the first case as an example, the baseline fails to recognize that “1856” is the year when “nikola tesla” was born, while our MPQG model learns this from the pattern “day month year - day month year”, which frequently occurs in the training data. In the third case, the baseline fails to generate the correct output because the query is long and complicated, whereas our MPQG model succeeds, since it performs comprehensive matching between the target answer and the passage. In addition, our MPQG+R model fixes some small mistakes of MPQG by directly optimizing the evaluation metrics, as in the second case in Table 2.

### Question Generation for Extractive QA

Table 3 shows the results of improving an extractive QA system with automatically generated questions. Here F1 measures the overlap between the prediction and the reference in terms of bags of tokens, and exact match (EM) measures the percentage of predictions identical to the reference (Rajpurkar et al., 2016). The baseline is trained only on the portion with gold questions, while the other systems are trained on the combination of the gold questions and automatically generated questions, with different question generation methods: (1) w/ window, a strong baseline from Yang et al. (2017), uses the 5 tokens before and after the target answer as the pseudo question, and (2) w/ MPQG+R generates questions with our MPQG+R model.
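For reference, simplified versions of these two metrics can be sketched as follows (the official SQuAD evaluation additionally strips punctuation and articles before comparing):

```python
from collections import Counter

def exact_match(pred, ref):
    """1 if the prediction equals the reference (case-insensitive), else 0."""
    return int(pred.strip().lower() == ref.strip().lower())

def f1_score(pred, ref):
    """Token-level F1 over bags of tokens: harmonic mean of the precision
    and recall of the overlapping token counts."""
    p, r = pred.lower().split(), ref.lower().split()
    common = sum((Counter(p) & Counter(r)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)
```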

First, w/ MPQG+R outperforms the baseline under all settings in terms of both F1 and EM, especially in the 10% setting, where we observe gains of 3 and 5 points in F1 and EM, respectively. This shows the effectiveness of our model. Second, the comparison between w/ MPQG+R and w/ window shows that the improvements of w/ MPQG+R do not come from simply enlarging the training data, but from the higher quality of the generated questions. Yang et al. (2017) showed that w/ window significantly improves their baseline, but this does not hold in our experiments. One reason could be that our baseline is much stronger than theirs: for example, ours achieves a 50.54% EM score in the 10% setting, while theirs only reaches 24.92%.

### Generative QA

For the generative QA experiment, we compare our model with the generative models of Nguyen et al. (2016) on the “description” subset of the MS-MARCO dataset. Table 4 shows the corresponding performance. Among the compared methods, Best Passage selects the best passage in terms of ROUGE-L, and so obviously accesses the reference. Passage Ranking ranks the passages with the deep structured semantic model of Huang et al. (2013). Sequence to Sequence is a vanilla sequence-to-sequence model (Sutskever, Vinyals, and Le, 2014). Memory Network adopts an end-to-end memory network (Sukhbaatar et al., 2015) as the encoder and a vanilla RNN as the decoder. We also implement a baseline system, “vanilla-cosine”, which applies only the vanilla cosine similarity as the matching function in our encoder and is trained only with the cross-entropy loss.

First, our MPQG+R model outperforms all other systems by a large margin, and comes close to Best Passage, even though the latter accesses the reference. Besides, our MPQG model outperforms the vanilla-cosine model, showing the effectiveness of our multi-perspective matching encoder. Finally, MPQG+R outperforms MPQG by around 1.4 ROUGE-L points, showing the effectiveness of our policy-gradient learning strategy.

## Related Work

For question generation (QG), our work extends previous work (Du, Shao, and Cardie, 2017; Zhou et al., 2017; Yang et al., 2017; Subramanian et al., 2017; Tang et al., 2017; Wang, Yuan, and Trischler, 2017; Yuan et al., 2017) by performing query understanding. Tang et al. (2017) and Yang et al. (2017) jointly train the QG and QA tasks, but still only perform the QG task; the difference is that they directly optimize QA performance rather than a general metric such as BLEU. Our model, in contrast, can perform both QG and QA.

For question answering (QA), most previous work (Wang and Jiang, 2016; Wang et al., 2016; Shen et al., 2016; Wang et al., 2017; Chen et al., 2017; Xiong, Zhong, and Socher, 2016; Seo et al., 2016; Lee et al., 2016; Yu et al., 2016; Dhingra et al., 2017) focuses on the extractive QA scenario, which predicts a contiguous span of the passage as the answer, and thus relies on the assumption that the answer can be exactly matched in the passage. Our model instead performs generative QA, generating the answer word by word, without relying on this assumption. Generative QA is worth studying, since the assumption cannot be guaranteed to hold in all scenarios. Tan et al. (2017) claim to perform generative QA, but still rely on an extractive QA system, generating answers from the extractive results. One notable exception is Yin et al. (2015), which generates factoid answers from a knowledge base (KB); a significant difference is that their method matches the query against a KB, whereas ours matches against unstructured text. Besides, we leverage policy gradient learning to alleviate the exposure bias problem, from which their model also suffers.

## Conclusion

In this paper, we introduced a query-based generative model that can be applied to both question generation and question answering. Following the encoder-decoder framework, a multi-perspective matching encoder performs query and passage understanding, and an LSTM decoder with coverage and copy mechanisms generates the target sequence. In addition, we leverage a policy gradient learning algorithm to alleviate the exposure bias problem that generative models suffer from when trained with the cross-entropy loss. Experiments on both question generation and question answering show that our model outperforms the state-of-the-art models. From the results we conclude that query understanding is important for question generation, and that policy gradient learning is effective at tackling the exposure bias problem caused by training with the cross-entropy loss.

## References

• Bahdanau, Cho, and Bengio (2014) Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
• Banerjee and Lavie (2005) Banerjee, S., and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization.
• Bengio et al. (2015) Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS 2015, 1171–1179.
• Chen et al. (2017) Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading wikipedia to answer open-domain questions. In Proceedings of ACL 2017.
• Dhingra et al. (2017) Dhingra, B.; Liu, H.; Yang, Z.; Cohen, W.; and Salakhutdinov, R. 2017. Gated-attention readers for text comprehension. In Proceedings of ACL 2017.
• Du, Shao, and Cardie (2017) Du, X.; Shao, J.; and Cardie, C. 2017. Learning to ask: Neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106.
• Gu et al. (2016) Gu, J.; Lu, Z.; Li, H.; and Li, V. O. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of ACL 2017.
• Gulcehre et al. (2016) Gulcehre, C.; Ahn, S.; Nallapati, R.; Zhou, B.; and Bengio, Y. 2016. Pointing the unknown words. In Proceedings of ACL 2017.
• Hochreiter and Schmidhuber (1997) Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
• Huang et al. (2013) Huang, P.-S.; He, X.; Gao, J.; Deng, L.; Acero, A.; and Heck, L. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of CIKM 2013, 2333–2338.
• Kingma and Ba (2014) Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Lee et al. (2016) Lee, K.; Salant, S.; Kwiatkowski, T.; Parikh, A.; Das, D.; and Berant, J. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.
• Lin (2004) Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop. Barcelona, Spain.
• Mi et al. (2016) Mi, H.; Sankaran, B.; Wang, Z.; and Ittycheriah, A. 2016. A coverage embedding model for neural machine translation. In EMNLP 2016.
• Mikolov et al. (2013) Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
• Nguyen et al. (2016) Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; and Deng, L. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
• Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002.
• Paulus, Xiong, and Socher (2017) Paulus, R.; Xiong, C.; and Socher, R. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
• Pennington, Socher, and Manning (2014) Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP 2014, 1532–1543.
• Rajpurkar et al. (2016) Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016, 2383–2392.
• Rennie et al. (2016) Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2016. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563.
• See, Liu, and Manning (2017) See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
• Seo et al. (2016) Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
• Shen et al. (2016) Shen, Y.; Huang, P.-S.; Gao, J.; and Chen, W. 2016. Reasonet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284.
• Subramanian et al. (2017) Subramanian, S.; Wang, T.; Yuan, X.; and Trischler, A. 2017. Neural models for key phrase detection and question generation. arXiv preprint arXiv:1706.04560.
• Sukhbaatar et al. (2015) Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. End-to-end memory networks. In NIPS 2015, 2440–2448.
• Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS 2014, 3104–3112.
• Tan et al. (2017) Tan, C.; Wei, F.; Yang, N.; Lv, W.; and Zhou, M. 2017. S-net: From answer extraction to answer generation for machine reading comprehension. arXiv preprint arXiv:1706.04815.
• Tang et al. (2017) Tang, D.; Duan, N.; Qin, T.; and Zhou, M. 2017. Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027.
• Tu et al. (2016) Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. arXiv preprint arXiv:1601.04811.
• Wang and Jiang (2016) Wang, S., and Jiang, J. 2016. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905.
• Wang et al. (2016) Wang, Z.; Mi, H.; Hamza, W.; and Florian, R. 2016. Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211.
• Wang et al. (2017) Wang, W.; Yang, N.; Wei, F.; Chang, B.; and Zhou, M. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL 2017.
• Wang, Hamza, and Florian (2017) Wang, Z.; Hamza, W.; and Florian, R. 2017. Bilateral multi-perspective matching for natural language sentences. In IJCAI 2017.
• Wang, Yuan, and Trischler (2017) Wang, T.; Yuan, X.; and Trischler, A. 2017. A joint model for question answering and question generation. arXiv preprint arXiv:1706.01450.
• Williams (1992) Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
• Xiong, Zhong, and Socher (2016) Xiong, C.; Zhong, V.; and Socher, R. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.
• Yang et al. (2017) Yang, Z.; Hu, J.; Salakhutdinov, R.; and Cohen, W. W. 2017. Semi-supervised qa with generative domain-adaptive nets. arXiv preprint arXiv:1702.02206.
• Yin et al. (2015) Yin, J.; Jiang, X.; Lu, Z.; Shang, L.; Li, H.; and Li, X. 2015. Neural generative question answering. arXiv preprint arXiv:1512.01337.
• Yu et al. (2016) Yu, Y.; Zhang, W.; Hasan, K.; Yu, M.; Xiang, B.; and Zhou, B. 2016. End-to-end answer chunk extraction and ranking for reading comprehension. arXiv preprint arXiv:1610.09996.
• Yuan et al. (2017) Yuan, X.; Wang, T.; Gulcehre, C.; Sordoni, A.; Bachman, P.; Subramanian, S.; Zhang, S.; and Trischler, A. 2017. Machine comprehension by text-to-text neural question generation. arXiv preprint arXiv:1705.02012.
• Zhou et al. (2017) Zhou, Q.; Yang, N.; Wei, F.; Tan, C.; Bao, H.; and Zhou, M. 2017. Neural question generation from text: A preliminary study. arXiv preprint arXiv:1704.01792.