Neural Models for Key Phrase Detection and Question Generation


Sandeep Subramanian (MILA, Université de Montréal), Tong Wang (Microsoft Maluuba), Xingdi Yuan (Microsoft Maluuba), Saizheng Zhang (MILA, Université de Montréal), Adam Trischler (Microsoft Maluuba), Yoshua Bengio (MILA, Université de Montréal)

We propose a two-stage neural model to tackle question generation from documents. Our model first estimates the probability that word sequences in a document compose “interesting” answers using a neural model trained on a question-answering corpus. We thus take a data-driven approach to interestingness. Predicted key phrases then act as target answers that condition a sequence-to-sequence question generation model with a copy mechanism. Empirically, our neural key phrase detection model significantly outperforms an entity-tagging baseline system and existing rule-based approaches. We demonstrate that the question generator formulates good quality natural language questions from extracted key phrases, and a human study indicates that our system’s generated question-answer pairs are competitive with those of an earlier approach. We foresee our system being used in an educational setting to assess reading comprehension and also as a data augmentation technique for semi-supervised learning.




Corresponding author. Research conducted during an internship at Microsoft Maluuba.

1 Introduction

Many educational applications can benefit from automatic question generation, including vocabulary assessment Brown, Frishkoff, and Eskenazi (2005), writing support Liu, Calvo, and Rus (2012), and assessment of reading comprehension Mitkov and Ha (2003); Kunichika et al. (2004). Formulating questions that test for certain skills at certain levels requires significant human effort that is difficult to scale, e.g., to massive open online courses (MOOCs). Despite their applications, the majority of existing models for automatic question generation rely on rule-based methods that likewise do not scale well across different domains and/or writing styles. To address this limitation, we propose and compare several neural models for automatic question generation.

We focus specifically on the assessment of reading comprehension. In this domain, question generation typically involves two inter-related components: first, a system to identify interesting entities or events (key phrases) within a passage or document Becker, Basu, and Vanderwende (2012); second, a question generator that constructs questions in natural language that ask specifically about the given key phrases. Key phrases thus act as the “correct” answers for generated questions. This procedure ensures that we can assess a student’s performance against a ground-truth target.

We formulate key phrase detection as modeling the probability of potential answers conditioned on a given document, i.e., $P(a \mid d)$. Inspired by successful work in question answering, we propose a sequence-to-sequence model that generates a set of key-phrase boundaries. This model can flexibly select an arbitrary number of key phrases from a document. To teach it to assign high probability to interesting answers, we train it on human-selected answers from large-scale, crowd-sourced question-answering datasets. We thus take a purely data-driven approach to the concept of interestingness, working from the premise that crowdworkers tend to select entities or events that interest them when they formulate their own comprehension questions. If this premise is correct, then the growing collection of crowd-sourced question-answering datasets Rajpurkar et al. (2016); Trischler et al. (2016) can be harnessed to learn models for key phrases of interest to human readers.

Given a set of extracted key phrases, we approach question generation by modeling the conditional probability of a question given a document-answer pair, i.e., $P(q \mid a, d)$. For this we use a sequence-to-sequence model with attention Bahdanau, Cho, and Bengio (2014) and the pointer-softmax mechanism Gulcehre et al. (2016). This component is also trained on a QA dataset by maximizing the likelihood of questions in the dataset.

Empirically, our proposed model for key phrase detection outperforms two baseline systems by a significant margin. We support these quantitative findings with qualitative examples of generated question-answer pairs given documents.

2 Related Work

2.1 Question Generation

Automatic question generation systems are often used to alleviate (or even eliminate) the burden of human generation of questions to assess reading comprehension Mitkov and Ha (2003); Kunichika et al. (2004). Various NLP techniques have been adopted in these systems to improve generation quality, including parsing Heilman and Smith (2010a); Mitkov and Ha (2003), semantic role labeling Lindberg et al. (2013), and the use of lexicographic resources like WordNet Miller (1995); Mitkov and Ha (2003). However, the majority of proposed methods resort to simple rule-based techniques such as slot-filling with templates Lindberg et al. (2013); Chali and Golestanirad (2016); Labutov, Basu, and Vanderwende (2015) or syntactic transformation heuristics Agarwal and Mannem (2011); Ali, Chali, and Hasan (2010) (e.g., subject-auxiliary inversion Heilman and Smith (2010a)). These techniques can be inadequate to capture the diversity of natural language questions.

To address this limitation, end-to-end-trainable neural models have recently been proposed for question generation in both vision Mostafazadeh et al. (2016) and language. For the latter, Du, Shao, and Cardie used a sequence-to-sequence model with an attention mechanism derived from the encoder states. Yuan et al. proposed a similar architecture but in addition improved model performance through policy gradient techniques. Wang, Yuan, and Trischler proposed a generative model that learns jointly to generate questions and answers based on documents.

2.2 Key Phrase Detection

Meanwhile, a highly relevant aspect of question generation is to identify which parts of a given document are important or interesting to ask questions about. Existing studies formulate key phrase extraction as a two-step process. In the first step, lexical features (e.g., part-of-speech tags) are used to extract a list of key phrase candidates of certain types Liu et al. (2011); Wang, Zhao, and Huang (2016); Le, Nguyen, and Shimazu (2016); Yang et al. (2017). In the second step, ranking models are often used to select key phrases from these candidates. Medelyan, Frank, and Witten used bagged decision trees, while Lopez and Romary used a Multi-Layer Perceptron (MLP) and a Support Vector Machine (SVM) to label the candidates in a binary fashion. Mihalcea and Tarau; Wan and Xiao; Le, Nguyen, and Shimazu scored key phrases using PageRank. Heilman and Smith asked crowdworkers to rate the acceptability of computer-generated natural language questions as quiz questions, and Becker, Basu, and Vanderwende solicited quality ratings of text chunks as potential gaps for Cloze-style questions.

These studies are closely related to our proposed work by the common goal of modeling the distribution of key phrases given a document. The major difference is that previous studies begin with a prescribed list of candidates, which might significantly bias the distribution estimate. In contrast, we adopt a dataset that was originally designed for question answering, where crowdworkers presumably tend to pick entities or events that interest them most. We postulate that the resulting distribution, learned directly from data, is more likely to reflect the true importance and appropriateness of answers.

Recently, Meng et al. proposed a generative model for key phrase prediction based on an encoder-decoder framework that can both generate words from a vocabulary and point to words in the document. Their model achieved state-of-the-art results on several keyword-extraction datasets of scientific publications. This model shares a core idea with our key phrase extractor, i.e., using a single neural model to learn the probability that words are key phrases. (In scientific publications, keywords may be absent from the abstract, whereas in the SQuAD dataset all answers are extractive spans of the document. Given the different nature of the two tasks, it would be unfair to compare the two models directly by evaluating on either dataset.)

Yang et al. used a rule-based method to extract potential answers from unlabeled text, generated questions for these documents and extracted answers using a pre-trained question generation model, and then combined the model-generated questions with human-generated questions to train question answering models. Experiments showed that question answering models benefit from the augmented data provided by their approach.

3 Model Description

3.1 Key Phrase Detection

In this section, we describe a simple baseline as well as the two proposed neural models for extracting key phrases (answers) from documents.

3.1.1 Entity Tagging Baseline

Our baseline model (ENT) predicts all entities tagged by spaCy as key phrases. This is motivated by the fact that over 50% of the answers in SQuAD are entities (Table 2 in Rajpurkar et al. (2016)). These include dates (September 1967), numeric entities (3, five), people (William Smith), locations (the British Isles), and other entities (Buddhism).

3.1.2 Neural Entity Selection

The baseline model above naïvely selects all entities as candidate answers. One pitfall is that it exhibits high recall at the expense of precision (Table 1). We first attempt to address this with a neural entity selection model (NES) that selects a subset of entities from a list of candidates. The model takes a document $d$ (i.e., a sequence of words) and a list of entities, each represented as a (start, end) location pair within the document. It is then trained on the binary classification task of predicting whether an entity overlaps with any of the gold answers.

Specifically, we maximize $\sum_i \log P(e_i \mid d)$ over the candidate entities $e_i$. We parameterize $P(e_i \mid d)$ using a neural model that first embeds each word in the document as a distributed vector using a word-embedding lookup table. We then encode the document with a bidirectional Long Short-Term Memory (LSTM) network into annotation vectors $h = (h_1, \dots, h_n)$. We then compute $P(e_i \mid d)$ using a three-layer multilayer perceptron (MLP) that takes as input the concatenation of three vectors $[h_{\mathrm{avg}}; h_n; h_{e_i}]$. Here, $h_{\mathrm{avg}}$ and $h_n$ are the average and the final state of the annotation vectors, respectively, and $h_{e_i}$ is the average of the annotation vectors corresponding to the $i$-th entity (i.e., those between its start and end locations).

During inference, we select the top $k$ entities with the highest likelihood under our model, with the value of $k$ determined by hyper-parameter search.
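The inference step described above reduces to ranking candidate entity spans by model likelihood and keeping the top $k$. A minimal sketch in plain Python, where `score_entity` is a hypothetical stand-in for the trained NES scorer:

```python
# Inference-time selection for the NES model: rank candidate entity spans by
# the scorer's likelihood and keep the k best. `score_entity` is a stand-in
# for the trained three-layer MLP described above (hypothetical name).
def select_top_k(entities, score_entity, k):
    """entities: list of (start, end) spans; returns the k highest-scoring."""
    ranked = sorted(entities, key=score_entity, reverse=True)
    return ranked[:k]

# Toy usage with a stand-in scorer (here, longer spans simply score higher).
spans = [(0, 2), (5, 6), (8, 12), (14, 15)]
top = select_top_k(spans, lambda s: s[1] - s[0], k=2)
# top == [(8, 12), (0, 2)]
```

In practice the scorer would be the MLP over $[h_{\mathrm{avg}}; h_n; h_{e_i}]$; only the ranking logic is shown here.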

3.1.3 Pointer Networks

While a significant fraction of answers in SQuAD are entities, extracting interesting aspects of a document requires looking beyond entities. Many documents of interest may be entity-less, or an entity tagger may fail to recognize some of the entities. To this end, we build a neural model that is trained from scratch to extract all of the answer key phrases in a particular document. We parameterize this model as a pointer network Vinyals, Fortunato, and Jaitly (2015) that is trained to point sequentially to the start and end locations of all key phrase answers. As in our entity selection model, we first encode the document into a sequence of annotation vectors. A decoder LSTM is then trained to point to all of the start and end locations of answers in the document from left to right, conditioned on the annotation vectors via an attention mechanism. We add a special termination token to the document, which the decoder is trained to attend to once it has generated all key phrases. This gives the model the flexibility to learn how many key phrases it should extract from a particular document, in contrast to the work of Meng et al. (2017), where a fixed number of key phrases is generated per document.

A pointer network is an extension of sequence-to-sequence models Sutskever, Vinyals, and Le (2014) in which the target sequence consists of positions in the source sequence. We encode the source document into a sequence of annotation vectors $h = (h_1, \dots, h_n)$ using an embedding lookup table followed by a bidirectional LSTM. The decoder consists of an embedding lookup table, shared with the encoder, followed by a unidirectional LSTM with an attention mechanism. We denote the decoder's annotation vectors as $(u_1^s, u_1^e, \dots, u_k^s, u_k^e)$, where $k$ is the number of answer key phrases, and $u_1^s$ and $u_1^e$ correspond to the start and end annotation vectors for the first answer key phrase, and so on. We parameterize the probability of pointing to document position $i$ using the dot-product attention mechanism Luong, Pham, and Manning (2015) between the decoder and encoder annotation vectors,

$$P(i \mid u_j) = \mathrm{softmax}_i\!\left(u_j^\top W h_i\right),$$

where $W$ is an affine transformation matrix. The inputs at each step of the decoder are the words from the document that correspond to the start and end locations pointed to by the decoder.

During inference, we employ a greedy decoding strategy that picks the most likely location from the softmax vector at every step, and then post-process the results to remove duplicate key phrases. Since the output sequences are relatively short, we observed similar performance with greedy decoding and beam search.
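The decoding procedure above can be sketched as follows: greedily take the argmax position at each step, stop at the termination token, pair consecutive pointers into (start, end) spans, and drop duplicates. `step_distributions` and `term_index` are illustrative names, and the distributions here are hand-made toy values:

```python
# Greedy decoding for the pointer network (a sketch, not the full model):
# each element of `step_distributions` stands in for one step's attention
# softmax over document positions; `term_index` is the position of the
# appended termination token.
def greedy_decode(step_distributions, term_index):
    positions = []
    for dist in step_distributions:
        pos = max(range(len(dist)), key=dist.__getitem__)  # greedy argmax
        if pos == term_index:
            break
        positions.append(pos)
    # Consecutive pointers form (start, end) pairs; an unmatched start is dropped.
    spans = list(zip(positions[0::2], positions[1::2]))
    seen, unique = set(), []
    for span in spans:
        if span not in seen:           # post-process: remove duplicate phrases
            seen.add(span)
            unique.append(span)
    return unique

dists = [
    [0.10, 0.70, 0.10, 0.05, 0.05],  # points to 1 (start)
    [0.10, 0.10, 0.60, 0.10, 0.10],  # points to 2 (end)
    [0.10, 0.60, 0.10, 0.10, 0.10],  # duplicate span: start
    [0.10, 0.10, 0.60, 0.10, 0.10],  # duplicate span: end
    [0.05, 0.05, 0.10, 0.10, 0.70],  # termination token (index 4)
]
print(greedy_decode(dists, term_index=4))  # [(1, 2)]
```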

3.2 Question Generation

The question generation model adopts a sequence-to-sequence framework Sutskever, Vinyals, and Le (2014) with an attention mechanism Bahdanau, Cho, and Bengio (2014) and a pointer-softmax decoder Gulcehre et al. (2016). It takes a document and an answer as input, and outputs a question .

An input word is first embedded by concatenating its word- and character-level embeddings. Character-level information is captured with the final states of a BiLSTM run over the word's character sequence. The concatenated embeddings are subsequently encoded with another BiLSTM into annotation vectors $h$.

To take better advantage of the extractive nature of answers in documents, we encode the answer by extracting the document encodings at the answer word positions. Specifically, the hidden states of the document encoder that correspond to the answer phrase are passed through another condition-aggregation BiLSTM, and we use the final state of this BiLSTM as the encoding of the answer.

The RNN decoder employs the pointer-softmax module Gulcehre et al. (2016). At each step of the question generation process, the decoder decides adaptively whether to (a) generate from a decoder vocabulary or (b) point to a word in the source sequence (and copy over). The pointer-softmax decoder thus has two components - a pointer attention mechanism and a generative decoder.

In the pointing decoder, recurrence is implemented with two cascading LSTM cells $c_1$ and $c_2$:

$$s_t^{(1)} = c_1\!\left(\mathbf{y}_{t-1}, s_{t-1}^{(2)}\right), \qquad (1)$$
$$s_t^{(2)} = c_2\!\left(v_t, s_t^{(1)}\right), \qquad (2)$$

where $s_t^{(1)}$ and $s_t^{(2)}$ are the recurrent states, $\mathbf{y}_{t-1}$ is the embedding of the decoder output from the previous time step, and $v_t$ is the context vector (to be defined shortly in Equation (3)).

At each time step $t$, the pointing decoder computes a distribution $\alpha_t$ over the document word positions (i.e., a document attention, Bahdanau, Cho, and Bengio (2014)). Each element is defined as:

$$\alpha_t^{(i)} = f\!\left(h_i, s_t^{(1)}\right),$$

where $f$ is a two-layer MLP with tanh and softmax activation, respectively. The context vector $v_t$ used in Equation (2) is the sum of the document encoding weighted by the document attention:

$$v_t = \sum_i \alpha_t^{(i)} h_i. \qquad (3)$$
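As a toy illustration of the document attention and the context vector of Equation (3), the sketch below replaces the scoring MLP with raw scores; all names, values, and dimensions are illustrative only:

```python
import math

# Document attention and context vector, in miniature: scores are turned
# into a distribution with a softmax, then used to take a weighted sum of
# the annotation vectors. Raw scores stand in for the scoring MLP.
def softmax(scores):
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def context_vector(annotation_vectors, scores):
    """Weighted sum of annotation vectors under the attention distribution."""
    alpha = softmax(scores)
    dim = len(annotation_vectors[0])
    return [sum(a * h[j] for a, h in zip(alpha, annotation_vectors))
            for j in range(dim)]

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three 2-d annotation vectors
v = context_vector(h, scores=[2.0, 0.0, 0.0])
# v is dominated by the first annotation vector, since its score is largest
```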


The generative decoder, on the other hand, defines a distribution over a prescribed decoder vocabulary with a two-layer MLP $g$:

$$o_t = g\!\left(s_t^{(2)}, v_t\right).$$

The switch scalar $s_t$ at each time step is computed by a three-layer MLP $\phi$:

$$s_t = \phi\!\left(s_t^{(2)}, v_t, \alpha_t, o_t\right).$$

The first two layers of this MLP use tanh activation and the final layer uses a sigmoid. Highway connections are present between the first and the second layer. (We also attach the entropy of the softmax distributions to the input of the final layer, postulating that this guides the switching mechanism by indicating the confidence of pointing versus generating. We observed an improvement in question quality with this technique.)
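The entropy attached to the switch MLP's input acts as a confidence signal: a peaked softmax distribution (confident pointing or generating) has low entropy, a flat one high. A quick stdlib sketch of that feature:

```python
import math

# Shannon entropy of a probability distribution, as fed to the switch MLP.
# Low entropy = a peaked (confident) softmax; high entropy = a flat one.
def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0.0)

peaked = [0.97, 0.01, 0.01, 0.01]   # confident distribution
flat = [0.25, 0.25, 0.25, 0.25]     # maximally uncertain distribution
assert entropy(peaked) < entropy(flat)
```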

Finally, the resulting switch is used to interpolate the pointing and the generative probabilities for predicting the next word:

$$p(\hat{y}_t) = s_t \, \alpha_t + (1 - s_t) \, o_t.$$
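The interpolation can be sketched directly: the switch scales the pointing distribution over document positions and the generative distribution over the shortlist vocabulary, and their concatenation remains a valid probability distribution over the combined output space. A toy example with made-up values:

```python
# Pointer-softmax interpolation (sketch): the switch s_t mixes the pointing
# distribution (over document positions) with the generative distribution
# (over the decoder vocabulary). Values below are illustrative only.
def interpolate(switch, pointing_dist, generative_dist):
    """p(y_t) = switch * pointing + (1 - switch) * generative."""
    return ([switch * p for p in pointing_dist] +
            [(1.0 - switch) * q for q in generative_dist])

alpha = [0.6, 0.3, 0.1]     # attention over 3 document positions
o = [0.5, 0.25, 0.25]       # distribution over a 3-word shortlist
p = interpolate(0.8, alpha, o)
assert abs(sum(p) - 1.0) < 1e-9  # still a valid probability distribution
```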

                  Validation                Test
Models       F1     Prec.   Rec.       F1     Prec.   Rec.
SQuAD
  H&S        -      -       -          0.292  0.252   0.403
  ENT        0.308  0.249   0.523      0.347  0.295   0.547
  NES        0.334  0.335   0.354      0.362  0.375   0.380
  PtrNet     0.352  0.387   0.337      0.404  0.448   0.387
NewsQA
  ENT        0.187  0.127   0.491      0.183  0.125   0.479
  PtrNet     0.452  0.480   0.444      0.435  0.467   0.427
Table 1: Model evaluation on key phrase extraction (multi-span F1, precision, and recall). The lower block reports the transfer setting on NewsQA.
Doc. inflammation is one of the first responses of the immune system to infection . the symptoms of inflammation are redness , swelling , heat , and pain , which are caused by increased blood flow into tissue . inflammation is produced by eicosanoids and cytokines , which are released by injured or infected cells . eicosanoids include prostaglandins that produce fever and the dilation of blood vessels associated with inflammation , and leukotrienes that attract certain white blood cells ( leukocytes )
Q-A H&S: by eicosanoids and cytokines — who is inflammation produced by ? | of the first responses of the immune system to infection — what is inflammation one of ?
Q-A PtrNet: leukotrienes — what can attract certain white blood cells ? | eicosanoids and cytokines — what are bacteria produced by ?
Q-A Gold SQuAD: inflamation — what is one of the first responses the immune system has to infection ? | eicosanoids and cytokines — what compounds are released by injured or infected cells , triggering inflammation ?

Doc. research shows that student motivation and attitudes towards school are closely linked to student-teacher relationships . enthusiastic teachers are particularly good at creating beneficial relations with their students . their ability to create effective learning environments that foster student achievement depends on the kind of relationship they build with their students . useful teacher-to-student interactions are crucial in linking academic success with personal achievement . here , personal success is a student 's internal goal of improving himself , whereas academic success includes the goals he receives from his superior . a teacher must guide his student in aligning his personal goals with his academic goals . students who receive this positive influence show stronger self-confidence and greater personal and academic success than those without these teacher interactions .
Q-A H&S: research — what shows that student motivation and attitudes towards school are closely linked to student-teacher relationships ? | useful teacher-to-student interactions — what are crucial in linking academic success with personal achievement ? | to student-teacher relationships — what does research show that student motivation and attitudes towards school are closely linked to ? | that student motivation and attitudes towards school are closely linked to student-teacher relationships — what does research show to ?
Q-A PtrNet: student-teacher relationships — what are the student motivation and attitudes towards school closely linked to ? | enthusiastic teachers — who are particularly good at creating beneficial relations with their students ? | teacher-to-student interactions — what is crucial in linking academic success with personal achievement ? | a teacher — who must guide his student in aligning his personal goals ?
Q-A Gold SQuAD: student-teacher relationships — what is student motivation about school linked to ? | beneficial — what type of relationships do enthusiastic teachers cause ? | aligning his personal goals with his academic goals . — what should a teacher guide a student in ? | student motivation and attitudes towards school — what is strongly linked to good student-teacher relationships ?

Doc. the yuan dynasty was the first time that non-native chinese people ruled all of china . in the historiography of mongolia , it is generally considered to be the continuation of the mongol empire . mongols are widely known to worship the eternal heaven
Q-A H&S: the first time — what was the yuan dynasty that non-native chinese people ruled all of china ? | the yuan dynasty — what was the first time that non-native chinese people ruled all of china ?
Q-A PtrNet: the mongol empire — the yuan dynasty is considered to be the continuation of what ? | worship the eternal heaven — what are mongols widely known to do in historiography of mongolia ?
Q-A Gold SQuAD: non-native chinese people — the yuan was the first time all of china was ruled by whom ? | the eternal heaven — what did mongols worship ?

Doc. on july 31 , 1995 , the walt disney company announced an agreement to merge with capital cities/abc for $ 19 billion . in 1998 , abc premiered the aaron sorkin-created sitcom sports night , centering on the travails of the staff of a sportscenter-style sports news program ; despite earning critical praise and multiple emmy awards , the series was cancelled in 2000 after two seasons .
Q-A H&S: an agreement to merge with capital cities/abc for $19 billion — what did the walt disney company announce on july 31 , 1995 ? | the walt disney company — what announced an agreement to merge with capital cities/abc for $19 billion on july 31 , 1995 ?
Q-A PtrNet: 2000 — in what year was the aaron sorkin-created sitcom sports night cancelled ? | walt disney company — who announced an agreement to merge with capital cities/abc for $ 19 billion ?
Q-A Gold SQuAD: july 31 , 1995 — when was the disney and abc merger first announced ? | sports night — what aaron sorkin created show did abc debut in 1998 ?

Table 2: Qualitative examples of detected key phrases and generated questions.
Figure 1: A comparison of key phrase extraction methods. Red phrases are extracted by the pointer network, violet by H&S, green by the baseline, and brown phrases correspond to SQuAD gold answers; cyan indicates an overlap between the pointer model and the SQuAD gold questions. The last paragraph is an exception, where lyndon b. johnson and april 20 are extracted by both H&S and the baseline model.

4 Experiments

4.1 Dataset

We conduct our experiments on the SQuAD Rajpurkar et al. (2016) and NewsQA Trischler et al. (2016) corpora. Both are machine comprehension datasets consisting of over 100k crowd-sourced question-answer pairs. SQuAD contains 536 articles from Wikipedia, while NewsQA was created from 12,744 news articles. Simple preprocessing is performed, including lower-casing and word tokenization using NLTK. The test split of SQuAD is hidden from the public. We therefore take 5,158 question-answer pairs (self-contained in 23 Wikipedia articles) from the training set as a validation set, and use the official development data to report test results. We use NewsQA only to evaluate our key phrase detection models in a transfer setting.

4.2 Implementation Details

All models were trained using stochastic gradient descent with a minibatch size of 32 using the ADAM optimization algorithm.

4.2.1 Key Phrase Detection

Key phrase detection models used pretrained word embeddings of 300 dimensions, generated using a word2vec extension Ling et al. (2015) trained on the English Gigaword 5 corpus. We used bidirectional LSTMs of 256 dimensions (128 forward and backward) to encode the document and an LSTM of 256 dimensions as our decoder in the pointer network model. A dropout of 0.5 was used at the outputs of every layer in the network.

4.2.2 Question Generation

In question generation, the decoder vocabulary uses the top 2000 words sorted by their frequency in the gold questions in the training data. The word embedding matrix is initialized with the 300-dimensional GloVe vectors Pennington, Socher, and Manning (2014). The dimensionality of the character representations is 32. The number of hidden units is 384 for both of the encoder/decoder RNN cells. Dropout is applied at a rate of 0.3 to all embedding layers as well as between the hidden states in the encoder/decoder RNNs across time steps.

4.3 Quantitative Evaluation of Key Phrase Extraction

Since each key phrase is itself a multi-word unit, we believe that a naïve F1 that treats an entire key phrase as a single unit is not well suited to evaluating these models. We thus propose an extension of the SQuAD F1 evaluation metric (defined for a single answer span) to multiple spans within a document, which we call the multi-span F1 score.

The metric is calculated as follows. Given predicted phrases $\{p_i\}_{i=1}^{m}$ and gold phrases $\{g_j\}_{j=1}^{n}$, we first construct an $m \times n$ pairwise, token-level score matrix $S$, where $S_{ij}$ is the token-level F1 score between phrases $p_i$ and $g_j$ (identical to the evaluation of a single answer in SQuAD). Max-pooling along the gold-label axis essentially assesses the precision of each prediction, with partial matches accounted for by the pairwise F1 in the cells: $\mathrm{prec}_i = \max_j S_{ij}$. Analogously, recall for gold label $g_j$ is defined by max-pooling along the prediction axis: $\mathrm{rec}_j = \max_i S_{ij}$. The multi-span F1 score is then defined from the mean precision $P = \frac{1}{m}\sum_i \mathrm{prec}_i$ and mean recall $R = \frac{1}{n}\sum_j \mathrm{rec}_j$:

$$F1_{\mathrm{MS}} = \frac{2 P R}{P + R}.$$
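The computation described above can be sketched in a few lines, assuming whitespace-tokenized phrases (the helper names are ours, not from the paper):

```python
from collections import Counter

# Multi-span F1 (sketch): token-level F1 between each predicted and gold
# phrase fills a score matrix; precision max-pools over gold phrases,
# recall max-pools over predictions, and the final score is the harmonic
# mean of the two averages.
def token_f1(pred, gold):
    p_toks, g_toks = pred.split(), gold.split()
    common = sum((Counter(p_toks) & Counter(g_toks)).values())
    if common == 0:
        return 0.0
    prec, rec = common / len(p_toks), common / len(g_toks)
    return 2 * prec * rec / (prec + rec)

def multi_span_f1(predictions, golds):
    matrix = [[token_f1(p, g) for g in golds] for p in predictions]
    precision = sum(max(row) for row in matrix) / len(predictions)
    recall = sum(max(matrix[i][j] for i in range(len(predictions)))
                 for j in range(len(golds))) / len(golds)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = multi_span_f1(["the mongol empire", "eternal heaven"],
                      ["mongol empire", "the eternal heaven"])
assert 0.0 < score <= 1.0  # partial (fuzzy) matches still earn credit
```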

Existing evaluations (e.g., that of Meng et al.) can be seen as the above computation performed on the matrix of exact match scores between predicted and gold key phrases. By using token-level F1 scores between phrase pairs, we allow fuzzy matches.

4.4 Human Evaluation of QA pairs

While key phrase extraction has a fairly well defined quantitative evaluation metric, evaluating generated text as in question generation is a harder problem. Instead of using automatic evaluation metrics such as BLEU, ROUGE, METEOR or CIDEr, we performed a human evaluation of our generated questions in conjunction with the answer key phrases.

We used two different evaluation approaches: an ambitious one that compares our generated question-answer pairs to the human-generated ones from SQuAD, and another that compares our model with that of Heilman and Smith (2010a) (henceforth referred to as H&S).

Comparison to human generated questions - We presented annotators with documents from the SQuAD official development set and two sets of question-answer pairs, one from our model (machine generated) and the other from SQuAD (human generated). Annotators are tasked with identifying which of the question-answer pairs is machine generated. The order in which the question-answer pairs appear in each example is randomized. The annotators are free to use any criterion of their choice to make a distinction such as poor grammar, the answer phrase not correctly answering the generated question, uninteresting answer phrases, etc.

Implicit comparison to H&S - To compare our system to existing methods (H&S, available at ark/mheilman/questions/), we use human-generated SQuAD question-answer pairs to set up an implicit comparison. Human annotators are presented with a document and two question-answer pairs: one from the SQuAD official development set and another from either our system or H&S (chosen at random). Annotators are not made aware that two different models generate the QA pairs. The annotators are once again tasked with identifying which QA pair is "human" generated. We evaluate the accuracy with which annotators can distinguish human from machine for each of the two models.

Comparison to H&S - In a more direct evaluation strategy, we present annotators with documents from the SQuAD official development set but instead of a human generated question-answer pair, we use one generated by the H&S model and one from ours. We then ask annotators which one they prefer.

4.5 Results and Discussion

Our evaluation of the key phrase extraction systems is presented in Table 1. We compare answer phrases extracted by H&S, our baseline entity tagger, the neural entity selection module, and the pointer network. As expected, the entity tagging baseline achieved the best recall, likely by over-generating candidate answers. The NES model, on the other hand, exhibits a much larger advantage in precision and consequently outperforms the entity tagging baseline by notable margins in F1. This trend persists in the comparison between the NES model and the pointer-network model. The H&S model exhibits high recall but lacks precision, similar to the baseline entity tagger. This is not very surprising, since that model has not been exposed to the SQuAD answer-phrase distribution.

Qualitatively, we observe that the entity-based models have a strong bias towards numeric types, which often fail to capture interesting information in an article.

In addition, we notice that the entity-based systems tend to select the central topical entity as the answer, which can contradict the distribution of interesting answers selected by humans. For example, given a Wikipedia article on Kenya and the fact that agriculture is the second largest contributor to kenya 's gross domestic product ( gdp ), the entity-based systems propose kenya as a key phrase and ask what country is nigeria 's second largest contributor to ? (Since the answer word kenya cannot appear in the generated question, the decoder produced a similar word, nigeria, instead.) Given the same information, the pointer model picked agriculture as the answer and asked what is the second largest contributor to kenya 's gross domestic product ?

Qualitative results with the question generation and key phrase extraction modules are presented in Table 2 and contrast H&S, our system, and human generated QA pairs from SQuAD.

H&S - Key phrases selected by this model appear to be different from the PtrNet and human-generated ones; for example, they may start with prepositions such as "of", "by" and "to", or be very long noun phrases such as that student motivation and attitudes towards school are closely linked to student-teacher relationships. In addition, their key phrases, as seen in Figure 1 (document 1), do not seem "interesting" and contain somewhat arbitrary phrases such as "this theory", "some studies", "a person", etc. Their question generation module also produces a few ungrammatical sentences, e.g., the first time — what was the yuan dynasty that non-native chinese people ruled all of china ?

Our system - Since our key phrase extraction module was trained on SQuAD, the selected key phrases more closely resemble gold SQuAD answers. However, some of these answers do not answer the questions generated about them, e.g., eicosanoids and cytokines — what are bacteria produced by ? (first document in Table 2). Our model is sometimes able to effectively handle coreferent entities: to generate the mongol empire — the yuan dynasty is considered to be the continuation of what ?, the model had to resolve the pronoun it to yuan dynasty in "it is generally considered to be the continuation of the mongol empire" (third document in Table 2).

Comparison to human generated questions - We presented 14 annotators with a total of 740 documents, each containing 2 question-answer pairs. We observed that annotators were able to identify the machine generated question-answer pairs 77.8% of the time with a standard deviation of 8.34%.

Implicit comparison to H&S - We presented 2 annotators with the same 100 documents, 45 paired with output from our model and 55 with output from H&S; all examples were also paired with SQuAD gold questions and answers. The first annotator labeled 30 (66.7%) correctly between ours and gold, and 45 (81.8%) correctly between H&S and gold; the second annotator labeled 9 (20%) correctly between ours and gold, and 13 (23.6%) correctly between H&S and gold. Neither annotator had substantial prior knowledge of the SQuAD dataset. The experiment shows that both annotators had a harder time distinguishing our generated question-answer pairs from gold than distinguishing H&S pairs from gold.

Comparison to H&S - We presented 2 annotators with the same 200 examples, each containing a document with question-answer pairs generated by both our model and H&S's model. The first annotator preferred 107 (53.5%) of the question-answer pairs generated by our model, while the second preferred 90 (45%). This experiment suggests that, without being shown the ground-truth question-answer pairs, humans consider both models' outputs to be roughly equally good.

5 Conclusion

We proposed a two-stage framework to tackle the problem of question generation from documents. First, we use a question answering corpus to train a neural model to estimate the distribution of key phrases that are interesting to question-asking humans. We proposed two neural models, one that ranks entities proposed by an entity tagging system, and another that points to key-phrase start and end boundaries with a pointer network. When compared to an entity tagging baseline, the proposed models exhibit significantly better results.

We adopt a sequence-to-sequence model to generate questions conditioned on the key phrases selected in the framework’s first stage. Our question generator is inspired by an attention-based translation model, and uses the pointer-softmax mechanism to dynamically switch between copying a word from the document and generating a word from a vocabulary. Qualitative examples show that the generated questions exhibit both syntactic fluency and semantic relevance to the conditioning documents and answers, and appear useful for assessing reading comprehension in educational settings.
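The pointer-softmax switching mechanism can be sketched as follows. This is an illustrative simplification, not the authors' code: the inputs (`copy_scores`, `vocab_scores`, `switch_logit`) stand in for quantities the decoder would compute at each time step, and the scatter-add over document token ids is the key idea.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_softmax(copy_scores, vocab_scores, switch_logit, doc_token_ids):
    """Mix a copy distribution over document positions with a generative
    distribution over the shortlist vocabulary, weighted by a switch gate."""
    p_copy = softmax(copy_scores)            # attention over document tokens
    p_gen = softmax(vocab_scores)            # distribution over the vocabulary
    g = 1.0 / (1.0 + np.exp(-switch_logit))  # sigmoid gate: prob. of generating
    p_final = g * p_gen
    # scatter copy mass onto the vocabulary ids of the document tokens
    for pos, tok in enumerate(doc_token_ids):
        p_final[tok] += (1.0 - g) * p_copy[pos]
    return p_final

# toy example: 3-token document, vocabulary of size 5
p = pointer_softmax(
    copy_scores=np.array([0.2, 1.5, -0.3]),
    vocab_scores=np.array([0.1, 0.4, -1.0, 2.0, 0.0]),
    switch_logit=0.7,
    doc_token_ids=[2, 0, 4],  # vocabulary ids of the document tokens
)
print("mixed output distribution:", p)
```

Because the copy and generation distributions each sum to one and the gate is a convex weight, the mixed distribution remains a valid distribution over the vocabulary; in the full model the gate lets the decoder copy rare or out-of-vocabulary document words it could not otherwise generate.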

In future work we will investigate fine-tuning the complete framework end to end. Another interesting direction is to explore abstractive key phrase detection.


  • Agarwal and Mannem (2011) Agarwal, M., and Mannem, P. 2011. Automatic gap-fill question generation from text books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, 56–64. Association for Computational Linguistics.
  • Ali, Chali, and Hasan (2010) Ali, H.; Chali, Y.; and Hasan, S. A. 2010. Automation of question generation from sentences. In Proceedings of QG2010: The Third Workshop on Question Generation, 58–67.
  • Bahdanau, Cho, and Bengio (2014) Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
  • Becker, Basu, and Vanderwende (2012) Becker, L.; Basu, S.; and Vanderwende, L. 2012. Mind the gap: learning to choose gaps for question generation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 742–751. Association for Computational Linguistics.
  • Brown, Frishkoff, and Eskenazi (2005) Brown, J. C.; Frishkoff, G. A.; and Eskenazi, M. 2005. Automatic question generation for vocabulary assessment. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, 819–826. Association for Computational Linguistics.
  • Chali and Golestanirad (2016) Chali, Y., and Golestanirad, S. 2016. Ranking automatically generated questions using common human queries. In The 9th International Natural Language Generation conference, 217.
  • Du, Shao, and Cardie (2017) Du, X.; Shao, J.; and Cardie, C. 2017. Learning to ask: Neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106.
  • Gulcehre et al. (2016) Gulcehre, C.; Ahn, S.; Nallapati, R.; Zhou, B.; and Bengio, Y. 2016. Pointing the unknown words. arXiv preprint arXiv:1603.08148.
  • Heilman and Smith (2010a) Heilman, M., and Smith, N. A. 2010a. Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 609–617. Association for Computational Linguistics.
  • Heilman and Smith (2010b) Heilman, M., and Smith, N. A. 2010b. Rating computer-generated questions with mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, 35–40. Association for Computational Linguistics.
  • Kunichika et al. (2004) Kunichika, H.; Katayama, T.; Hirashima, T.; and Takeuchi, A. 2004. Automated question generation methods for intelligent english learning systems and its evaluation. In Proc. of ICCE.
  • Labutov, Basu, and Vanderwende (2015) Labutov, I.; Basu, S.; and Vanderwende, L. 2015. Deep questions without deep understanding. In ACL (1), 889–898.
  • Le, Nguyen, and Shimazu (2016) Le, T. T. N.; Nguyen, M. L.; and Shimazu, A. 2016. Unsupervised keyphrase extraction: Introducing new kinds of words to keyphrases. 29th Australasian Joint Conference, Hobart, TAS, Australia, December 5-8, 2016.
  • Lindberg et al. (2013) Lindberg, D.; Popowich, F.; Nesbit, J.; and Winne, P. 2013. Generating natural language questions to support learning on-line. ENLG 2013 105.
  • Ling et al. (2015) Ling, W.; Dyer, C.; Black, A. W.; and Trancoso, I. 2015. Two/too simple adaptations of word2vec for syntax problems. In HLT-NAACL, 1299–1304.
  • Liu et al. (2011) Liu, Z.; Chen, X.; Zheng, Y.; and Sun, M. 2011. Automatic keyphrase extraction by bridging vocabulary gap. the Fifteenth Conference on Computational Natural Language Learning.
  • Liu, Calvo, and Rus (2012) Liu, M.; Calvo, R. A.; and Rus, V. 2012. G-asks: An intelligent automatic question generation system for academic writing support. D&D 3(2):101–124.
  • Lopez and Romary (2010) Lopez, P., and Romary, L. 2010. HUMB: Automatic key term extraction from scientific articles in GROBID. The 5th International Workshop on Semantic Evaluation.
  • Luong, Pham, and Manning (2015) Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • Medelyan, Frank, and Witten (2009) Medelyan, O.; Frank, E.; and Witten, I. H. 2009. Human-competitive tagging using automatic keyphrase extraction. Empirical Methods in Natural Language Processing.
  • Meng et al. (2017) Meng, R.; Zhao, S.; Han, S.; He, D.; Brusilovsky, P.; and Chi, Y. 2017. Deep keyphrase generation. In the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  • Mihalcea and Tarau (2004) Mihalcea, R., and Tarau, P. 2004. TextRank: Bringing order into text. Empirical Methods in Natural Language Processing.
  • Miller (1995) Miller, G. A. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
  • Mitkov and Ha (2003) Mitkov, R., and Ha, L. A. 2003. Computer-aided generation of multiple-choice tests. In Proceedings of the HLT-NAACL 03 workshop on Building educational applications using natural language processing-Volume 2, 17–22. Association for Computational Linguistics.
  • Mostafazadeh et al. (2016) Mostafazadeh, N.; Misra, I.; Devlin, J.; Mitchell, M.; He, X.; and Vanderwende, L. 2016. Generating natural questions about an image. arXiv preprint arXiv:1603.06059.
  • Pennington, Socher, and Manning (2014) Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
  • Rajpurkar et al. (2016) Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
  • Trischler et al. (2016) Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni, A.; Bachman, P.; and Suleman, K. 2016. Newsqa: A machine comprehension dataset. 2nd Workshop on Representation Learning for NLP.
  • Vinyals, Fortunato, and Jaitly (2015) Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems, 2692–2700.
  • Wan and Xiao (2008) Wan, X., and Xiao, J. 2008. Single document keyphrase extraction using neighborhood knowledge. AAAI.
  • Wang, Yuan, and Trischler (2017) Wang, T.; Yuan, X.; and Trischler, A. 2017. A joint model for question answering and question generation. 1st Workshop on Learning to Generate Natural Language.
  • Wang, Zhao, and Huang (2016) Wang, M.; Zhao, B.; and Huang, Y. 2016. Ptr: Phrase-based topical ranking for automatic keyphrase extraction in scientific publications. 23rd International Conference, ICONIP 2016.
  • Yang et al. (2017) Yang, Z.; Hu, J.; Salakhutdinov, R.; and Cohen, W. W. 2017. Semi-supervised qa with generative domain-adaptive nets. In the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  • Yuan et al. (2017) Yuan, X.; Wang, T.; Gulcehre, C.; Sordoni, A.; Bachman, P.; Subramanian, S.; Zhang, S.; and Trischler, A. 2017. Machine comprehension by text-to-text neural question generation. 2nd Workshop on Representation Learning for NLP.