Difficulty Controllable Question Generation for Reading Comprehension1

Abstract

Question generation aims to generate natural language questions from a range of data sources, such as free text and images. In this paper, we investigate the difficulty levels of questions and propose a new task called Difficulty-controllable Question Generation (Dico-QG). Taking as input a reading comprehension paragraph and some text fragments (i.e., answers) in the paragraph that we want to ask about, a Dico-QG method needs to generate questions, each of which has a given text fragment as its answer and is associated with a difficulty label. To solve this task, we propose a two-step approach. The first step estimates what difficulty level of question could be generated for a given answer. In the generation step, the estimated difficulty is then employed together with other information as input to generate a question. For evaluation, we prepared the first dataset of reading comprehension questions with difficulty labels. The results show that our approach not only generates questions of better quality under metrics like BLEU, but also has the capability of difficulty awareness, generating questions that comply with the given difficulty label.


1 Introduction

Question Generation (QG) aims to generate natural and human-like questions from a range of data sources, such as images Mostafazadeh et al. (2016), knowledge bases Serban et al. (2016); Su et al. (2016), and free text Du et al. (2017). Besides constructing SQuAD-like datasets Rajpurkar et al. (2016), QG is also helpful for intelligent tutoring systems: the tutor can actively ask the learner questions according to reading comprehension materials Heilman and Smith (2010) or its knowledge base Danon and Last (2017). In this paper, we focus on QG for reading comprehension text. For example, Figure 1 gives a reading comprehension paragraph and three questions; the goal of QG is to generate such questions.

Figure 1: An example from the SQuAD dataset. The answers of Q1 and Q2 are facts described in the paragraph, so they are easy to answer. But to answer Q3, some reasoning capability is needed.

QG for reading comprehension is a challenging problem because it requires the selection of specific aspects of a paragraph to ask about; moreover, the generation should follow the syntactic structure of questions. Some template-based approaches Vanderwende (2007); Mazidi and Nielsen (2014); Lindberg et al. (2013); Becker et al. (2012); Heilman and Smith (2010) were proposed initially, where well-designed rules and heavy human labor are required for declarative-to-interrogative sentence transformation. With the rise of data-driven learning approaches and the sequence-to-sequence (seq2seq) framework Sutskever et al. (2014), some researchers have formulated QG as a seq2seq problem Du et al. (2017): the question is regarded as the decoding target given the encoded information of its corresponding input sentence. However, unlike other seq2seq learning tasks such as machine translation and summarization, which can be loosely regarded as learning a one-to-one mapping, a descriptive sentence can be asked about from different aspects, and hence multiple, significantly different questions can be generated. Several recent works try to tackle this problem by incorporating the answer information to indicate what aspect to ask about Hu et al. (2018); Song et al. (2018); Zhou et al. (2017).

In this paper, we advocate generating questions in a difficulty-aware manner, which has not been investigated yet. Formally, we propose a new task, named Difficulty-controllable Question Generation (Dico-QG). In the setting of this task, a framework takes as its input a reading comprehension paragraph and the text fragments (i.e., answers) in the paragraph that we want to ask questions about. The framework needs to generate questions of different difficulty levels, so that each text fragment corresponds to one question and acts as its answer; moreover, each question is associated with one difficulty label. Dico-QG is a task with rich application scenarios. In actual teaching, when instructors prepare learning materials for students, they also want to balance the numbers of hard questions and easy questions. From the perspective of dataset preparation, generating difficulty-aware question and answer pairs can be used to test how well a QA system works for questions with diverse difficulty levels.

Our proposed framework has two modules: a difficulty estimator for the potential questions, and a difficulty-aware question generator. Given an answer, the difficulty estimator estimates what difficulty level of question could be asked in a straightforward way. For example, in Figure 1, instead of “What is the atomic number of the element Oxygen?”, we humans can ask “What is the atomic number of the element right after Nitrogen in the periodic table?”. The latter is an indirect way of questioning, because the paragraph does not state that Oxygen comes right after Nitrogen in the table. Such questions with uncontrollable variants are not discussed here. Taking the estimated difficulty, the question generator employs a characteristic-rich encoder-decoder architecture, with attention and copy mechanisms integrated, to generate questions for the answers.

For evaluation, we prepared the first dataset of reading comprehension questions with difficulty labels. Specifically, we design a method to automatically label the SQuAD questions, and obtain 76K questions with confident difficulty labels. In the quantitative evaluation, we compare our Dico-QG method with state-of-the-art models and ablation baselines. The results show that our model not only generates questions of better quality under metrics like BLEU and ROUGE, but also has the capability of difficulty awareness, generating questions that comply with the difficulty label. We will release the prepared dataset and the code of our model to facilitate further research along this line.

2 Related Work

In this section, we primarily review QG works based on free text. Vanderwende (2007) proposed this task and pointed out that question generation is an important but challenging task in NLP. Several rule-based approaches have been proposed since then. They usually manually design question templates and transform declarative sentences into interrogative questions Mazidi and Nielsen (2014); Labutov et al. (2015); Lindberg et al. (2013); Heilman and Smith (2010).

Du et al. (2017) proposed the first automatic QG framework. They view QG as a seq2seq learning problem and learn the mapping between sentences and questions in reading comprehension. They also propose to use automatic metrics from machine translation and summarization for evaluation. Later on, Du and Cardie (2017) proposed to first identify the important sentences in the paragraph and then ask questions accordingly. However, the procedure of QG from a sentence is not a one-to-one mapping, because given a sentence, different questions can be asked from different aspects. As Du et al. (2017) mentioned, in their dataset, each sentence corresponds to 1.4 questions on average. Seq2seq learning may not be capable of learning such a one-to-many mapping, and hence may not achieve decent results.

Figure 2: Overview of our Dico-QG framework (best viewed in color).

Some recent works attempt to solve this issue by assuming that the aspect is already known when asking a question Zhou et al. (2017); Subramanian et al. (2017); Yuan et al. (2017); Hu et al. (2018); Song et al. (2018) or can be detected with a third-party pipeline Du and Cardie (2018). This assumption makes sense, because when humans ask questions, we usually first read the sentence to decide which aspect to ask about. One way is to append an answer indicator to each word embedding to indicate whether the current word is part of the answer or not Zhou et al. (2017). Some other works employ a matching mechanism that first fuses the answer information into the sentence, and then use a seq2seq model with attention and copy mechanisms to generate the question Hu et al. (2018); Song et al. (2018). Bahuleyan et al. (2017) view this one-to-many mapping problem from another angle. They proposed a variant of the variational autoencoder to generate several diverse questions for a single sentence. The idea is novel, but there is no widely accepted evaluation metric to quantitatively assess their proposed model. In this paper, we explore another important dimension in QG, namely question difficulty, which has never been studied before.

3 Task Definition and Our Approach

In the task of Difficulty-controllable Question Generation (Dico-QG), our goal is to generate questions of diverse difficulty levels for a given paragraph. For now, we assume that the answers for asking questions are given, and that they appear as text fragments in the paragraph, following the paradigm of SQuAD. We propose a two-step approach to handle Dico-QG. The first step predicts what difficulty level of question could be generated for a given answer. After that, in the generation step, the estimated difficulty is employed together with other information as input to generate a question.

Formally, let $p$ denote the input paragraph, $A = \{a_1, \dots, a_m\}$ denote the answers for asking questions, and $S = \{s_1, \dots, s_m\}$ denote the sentences such that each $s_i$ contains the answer $a_i$. Mathematically, we first predict the difficulty level of the potential question for each answer as follows:

$$\hat{d}_i = \operatorname*{arg\,max}_{d \in \mathcal{D}} P(d \mid p, s_i, a_i), \tag{1}$$

where $\mathcal{D}$ is a set of predefined difficulty levels. Then, for each answer, we generate a question with the predicted difficulty level considered. Thus, the goal is to find the word sequence $\bar{q}$ (i.e., a question of arbitrary length) that maximizes the conditional likelihood given the sentence $s_i$, the answer $a_i$, and the difficulty level $\hat{d}_i$:

$$\bar{q} = \operatorname*{arg\,max}_{q} P(q \mid s_i, a_i, \hat{d}_i). \tag{2}$$

4 Framework Description

Our framework consists of two modules: the difficulty estimator for potential questions and the difficulty-aware question generator. When generating questions, the generator takes into account the difficulty level estimated by the estimator. We first introduce the question generator, and defer the description of the difficulty estimator to Section 4.2.

4.1 Difficulty-Aware Question Generator

Given a sentence $s$, a text fragment $a$ appearing in $s$, and the estimated difficulty level $d$ of the potential question that can be asked with $a$ as its answer, the generator produces a question that has $a$ as its answer and matches the difficulty $d$. The architecture of our difficulty-aware question generator is depicted in Figure 2. It first encodes the word embeddings of the sentence $s$ and the position indicator of the answer $a$ in $s$ with bidirectional LSTMs into contextualized representations. Then, to make the decoder aware of the difficulty of the potential question, the model employs the embedding of $d$, retrieved from a learned difficulty embedding lookup table, to initialize the decoder.

Characteristic-Rich Encoder

We employ a characteristic-rich encoder to fuse the word semantic representations of $s$ and the position indicator of $a$ in $s$. An embedding lookup table is first used to map the sentence $s = (w_1, \dots, w_n)$ into dense vectors $(\mathbf{w}_1, \dots, \mathbf{w}_n)$, where $\mathbf{w}_t \in \mathbb{R}^{d_a}$ denotes the word embedding of $w_t$. We assign binary labels to indicate whether each token in $s$ is inside the answer $a$ or not, and map these binary labels into answer indicator embeddings $(\mathbf{b}_1, \dots, \mathbf{b}_n)$, where $\mathbf{b}_t$ is the indicator embedding of $w_t$. Such indicator embeddings let our model know which aspect to ask about, so as to generate a question to the point. We concatenate the word embedding and the answer indicator embedding of each token to derive the characteristic-rich embedding $\mathbf{x}_t = [\mathbf{w}_t; \mathbf{b}_t]$. Then, the encoder takes $(\mathbf{x}_1, \dots, \mathbf{x}_n)$ as input. Specifically, we use bidirectional LSTMs to encode the sequence to get a contextualized representation for each token:

$$\overrightarrow{\mathbf{h}}_t = \overrightarrow{\mathrm{LSTM}}(\mathbf{x}_t, \overrightarrow{\mathbf{h}}_{t-1}), \qquad \overleftarrow{\mathbf{h}}_t = \overleftarrow{\mathrm{LSTM}}(\mathbf{x}_t, \overleftarrow{\mathbf{h}}_{t+1}),$$

where $\overrightarrow{\mathbf{h}}_t$ and $\overleftarrow{\mathbf{h}}_t$ are the hidden states at the $t$-th timestep of the forward and the backward LSTMs. We concatenate them together as $\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]$.
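To make the encoder concrete, here is a minimal PyTorch sketch (not the authors' code): the class and argument names are ours, and the per-direction hidden size is an assumption chosen so that the concatenated state matches the 600-dimensional setting reported in Section 5.2.

```python
import torch
import torch.nn as nn

class CharacteristicRichEncoder(nn.Module):
    """Sketch: concatenate word embeddings with binary answer-indicator
    embeddings and encode the sequence with a bidirectional LSTM."""
    def __init__(self, vocab_size, d_word=300, d_ans=50, hidden=600, num_layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_word)
        self.ans_emb = nn.Embedding(2, d_ans)  # 0 = outside the answer, 1 = inside the answer
        self.bilstm = nn.LSTM(d_word + d_ans, hidden // 2, num_layers=num_layers,
                              batch_first=True, bidirectional=True)

    def forward(self, tokens, answer_mask):
        # tokens: (batch, seq_len) word ids; answer_mask: (batch, seq_len) binary labels
        x = torch.cat([self.word_emb(tokens), self.ans_emb(answer_mask)], dim=-1)
        outputs, (h_n, c_n) = self.bilstm(x)
        # outputs: (batch, seq_len, hidden); forward/backward states concatenated per token
        return outputs, (h_n, c_n)
```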

Difficulty-Aware Decoder

We use another LSTM as the decoder to generate the question. We employ the estimated difficulty level to initialize the hidden state of the decoder. During decoding, we incorporate the attention and copy mechanisms to enhance the performance.

Difficulty-Aware Initialization. To generate difficulty-aware questions, we introduce a difficulty embedding vector to make the decoder aware of the difficulty of the potential question. Recall that the estimated difficulty $d$ comes from a set of predefined difficulty levels. We first map $d$ to its corresponding difficulty embedding $\mathbf{e}_d$. Then we concatenate $\mathbf{e}_d$ with the final hidden state $\mathbf{h}_n$ of the encoder to initialize the decoder hidden state $\mathbf{s}_0$.
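A minimal sketch of this initialization step follows. Since the concatenated vector and the decoder state generally differ in size, we assume a learned linear projection with a tanh nonlinearity bridges them; this projection is our assumption, not a detail given in the text.

```python
import torch
import torch.nn as nn

class DifficultyAwareInit(nn.Module):
    """Sketch: look up the difficulty embedding and combine it with the final
    encoder state to produce the decoder's initial hidden state."""
    def __init__(self, num_levels=2, d_diff=100, enc_hidden=600, dec_hidden=600):
        super().__init__()
        self.diff_emb = nn.Embedding(num_levels, d_diff)        # e.g. 0 = easy, 1 = hard
        self.proj = nn.Linear(enc_hidden + d_diff, dec_hidden)  # assumed bridging projection

    def forward(self, enc_final, difficulty):
        # enc_final: (batch, enc_hidden); difficulty: (batch,) integer difficulty labels
        combined = torch.cat([enc_final, self.diff_emb(difficulty)], dim=-1)
        return torch.tanh(self.proj(combined))  # decoder initial hidden state s_0
```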

Decoder with Attention & Copy. The decoder predicts the word probability distribution at each decoding timestep to generate the question. At the $t$-th timestep, it reads the word embedding $\mathbf{w}_{t-1}$ and the hidden state $\mathbf{s}_{t-1}$ of the previous timestep to generate the current hidden state $\mathbf{s}_t$. Then the decoder employs the attention mechanism Luong et al. (2015) to calculate the context vector $\mathbf{c}_t$ from the hidden states of the encoder:

$$\mathbf{c}_t = \sum_{i=1}^{n} \alpha_{t,i}\, \mathbf{h}_i, \qquad \alpha_{t,i} = \frac{\exp(\mathbf{s}_t^{\top} \mathbf{W}_a \mathbf{h}_i)}{\sum_{j=1}^{n} \exp(\mathbf{s}_t^{\top} \mathbf{W}_a \mathbf{h}_j)}, \tag{3}$$

where $\alpha_{t,i}$ is the attention score between $\mathbf{s}_t$ and $\mathbf{h}_i$, and $\mathbf{W}_a$ is a learnable parameter matrix. Then, the predicted probability distribution over the vocabulary at the current step is computed as:

$$P_V = \operatorname{softmax}\big(\mathbf{W}_v \tanh(\mathbf{W}_c [\mathbf{s}_t; \mathbf{c}_t]) + \mathbf{b}_v\big), \tag{4}$$

where $\mathbf{W}_v$, $\mathbf{W}_c$, and $\mathbf{b}_v$ are learnable parameters. To deal with rare and unknown words, the decoder applies the pointing method See et al. (2017); Gu et al. (2016); Gulcehre et al. (2016) to allow copying a token from the input sentence at the $t$-th decoding step. Here we reuse the attention scores to derive the copy probability over the input tokens:

$$P_C(w) = \sum_{i:\, w_i = w} \alpha_{t,i}. \tag{5}$$

Then, a soft switch for combining $P_V$ and $P_C$ is computed as $g_t = \sigma(\mathbf{w}_c^{\top} \mathbf{c}_t + \mathbf{w}_s^{\top} \mathbf{s}_t + b_g)$, where $\mathbf{w}_c$, $\mathbf{w}_s$, and $b_g$ are all learnable parameters, and $\sigma$ is the sigmoid function. Eventually, we get the probability of predicting $w$ as the $t$-th token of the question:

$$P(w) = g_t\, P_V(w) + (1 - g_t)\, P_C(w). \tag{6}$$
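The following sketch ties Equations (3)–(6) together for a single decoding step. It uses dot-product attention as a stand-in for the learned attention score in Equation (3), and all tensor and parameter names are ours rather than the paper's.

```python
import torch
import torch.nn.functional as F

def decode_step(s_t, enc_outputs, src_ids, W_c, W_v, w_gen):
    """One decoding step: attention (Eq. 3), vocabulary distribution (Eq. 4),
    copy distribution (Eq. 5), and their soft combination (Eq. 6).
    Shapes: s_t (B, H), enc_outputs (B, L, H), src_ids (B, L)."""
    # Eq. (3): attention over encoder states and the resulting context vector
    scores = torch.bmm(enc_outputs, s_t.unsqueeze(-1)).squeeze(-1)   # (B, L)
    alpha = F.softmax(scores, dim=-1)
    c_t = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)      # (B, H)

    # Eq. (4): distribution over the vocabulary from [s_t; c_t]
    p_vocab = F.softmax(W_v(torch.tanh(W_c(torch.cat([s_t, c_t], dim=-1)))), dim=-1)

    # Soft switch between generating from the vocabulary and copying
    g_t = torch.sigmoid(w_gen(torch.cat([s_t, c_t], dim=-1)))        # (B, 1)

    # Eqs. (5)-(6): scatter copy probabilities onto the source token ids
    p_final = g_t * p_vocab
    p_final = p_final.scatter_add(1, src_ids, (1.0 - g_t) * alpha)
    return p_final, c_t

# Hypothetical parameter shapes, with H = 600 and vocabulary size V:
# W_c = nn.Linear(2 * H, H); W_v = nn.Linear(H, V); w_gen = nn.Linear(2 * H, 1)
```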

4.2 Difficulty Estimator

Given an answer $a$, a sentence $s$ containing $a$, and the context sentences $C$ (obtained by removing $s$ from the paragraph), this step aims at estimating the difficulty of the question that can be asked in a straightforward way and has $a$ as its answer. Our difficulty estimation method is inspired by how humans judge the difficulty of a question. Intuitively, a question is harder if there are more distractors in the context. One type of distraction arises when some context sentences are similar to $s$, which increases the difficulty of identifying $s$ in order to extract the answer. Similarly, another type of distraction arises from similarity with the answer itself. Moreover, some topics are inherently hard, e.g., science and philosophy.

The estimator consists of five layers: (1) the input layer, taking the word embeddings as input; (2) the individual semantic layer, using a CNN to map each of $s$ and $a$ into fixed-length representations; (3) the similarity matching layer, finding the distractors in $C$ with respect to $s$ and $a$; (4) the interactive semantic layer, emphasizing the semantics of the distractors; and (5) the output layer, aggregating the information from the previous three layers for prediction.

The individual semantic layer employs a CNN with max pooling to encode text fragments of different lengths, i.e., $s$ and $a$, into fixed-dimension representations (refer to Kim (2014) for full details). Let $\mathbf{u}^s$ and $\mathbf{u}^a$ denote such representations.

The similarity matching layer aims to find high-similarity context sentences from $C$ with respect to $s$ and $a$, which are viewed as distractors when locating the correct sentence and extracting the answer. Let $\mathbf{u}^{c_j}$ denote the representation of the $j$-th context sentence $c_j$. The cosine similarity is calculated between $\mathbf{u}^{c_j}$ and $\mathbf{u}^s$, and between $\mathbf{u}^{c_j}$ and $\mathbf{u}^a$, respectively. Over all sentences in $C$, we obtain two matching vectors $\mathbf{m}^s$ and $\mathbf{m}^a$.

The interactive semantic layer employs the above matching scores to emphasize the semantics of the distractors. Specifically, the context semantic representations emphasized by $s$ and $a$ are computed as similarity-weighted combinations of the context sentence representations, using $\mathbf{m}^s$ and $\mathbf{m}^a$ as weights, yielding $\mathbf{c}^s$ and $\mathbf{c}^a$.
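A small sketch of the similarity matching and interactive semantic layers follows. It assumes the sentence, answer, and context representations come from the individual semantic layer, and it realizes "emphasis" as a similarity-weighted sum of the context representations, which is our reading of the description rather than a formula given in the text.

```python
import torch
import torch.nn.functional as F

def match_and_emphasize(context_reps, sent_rep, ans_rep):
    """Sketch: cosine-match each context sentence against the answer sentence s
    and the answer a, then re-weight the context semantics with those scores."""
    # context_reps: (num_ctx, d); sent_rep, ans_rep: (d,)
    m_s = F.cosine_similarity(context_reps, sent_rep.unsqueeze(0), dim=-1)  # matching vector w.r.t. s
    m_a = F.cosine_similarity(context_reps, ans_rep.unsqueeze(0), dim=-1)   # matching vector w.r.t. a

    # Emphasized context semantics: similarity-weighted sums of context representations
    c_s = torch.matmul(m_s, context_reps)  # (d,)
    c_a = torch.matmul(m_a, context_reps)  # (d,)
    return (m_s, m_a), (c_s, c_a)
```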

The output layer estimates the difficulty from the above three factors, i.e., the outputs of the individual semantic, similarity matching, and interactive semantic layers. Note that these factors have rather different dimension sizes. In order to consider them fairly, we apply a fully connected layer to each factor to transform them into the same dimension. The transformed factors are then concatenated into a single vector $\mathbf{x}_o$, which is passed to a fully connected layer to predict the difficulty label $\hat{d}$:

$$\hat{d} = \operatorname{softmax}(\mathbf{W}_o \mathbf{x}_o + \mathbf{b}_o), \tag{7}$$

where $\mathbf{W}_o$ and $\mathbf{b}_o$ are learnable parameters.
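A sketch of this output layer is given below; the common projection size and the ReLU nonlinearity are our assumptions.

```python
import torch
import torch.nn as nn

class DifficultyOutputLayer(nn.Module):
    """Sketch: project each factor to a common size, concatenate, and predict
    the difficulty label with a softmax over the predefined levels."""
    def __init__(self, factor_dims, d_common=128, num_levels=2):
        super().__init__()
        # One fully connected layer per factor so factors of different sizes
        # are considered fairly after transformation.
        self.projections = nn.ModuleList([nn.Linear(d, d_common) for d in factor_dims])
        self.out = nn.Linear(d_common * len(factor_dims), num_levels)

    def forward(self, factors):
        # factors: list of tensors, factors[i] with shape (batch, factor_dims[i])
        transformed = [torch.relu(p(f)) for p, f in zip(self.projections, factors)]
        return torch.softmax(self.out(torch.cat(transformed, dim=-1)), dim=-1)
```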

Question difficulty estimation is a new research topic which has not been substantially investigated yet. Huang et al. (2017) tried to tackle this problem in the standard test scenarios (e.g. TOEFL or SAT). Their inputs include the paragraph, the question and its four answer choices, which is different from the setting in our framework.

5 Experiment for Question Generation

5.1 Dataset Preparation

SQuAD Rajpurkar et al. (2016) is a reading comprehension dataset containing 100,000+ questions on a set of Wikipedia articles, where the answer of each question is a text segment from the corresponding reading passage. Since the SQuAD questions do not have difficulty labels, we use two machine reading comprehension systems, namely R-Net Wang et al. (2017)2 and BiDAF Seo et al. (2017)3, to automatically label the questions.4

Specifically, we define two difficulty levels, namely hard and easy. The labeling protocol is: a question is labelled easy if both R-Net and BiDAF answer it correctly under the exact match metric, and labelled hard if both systems fail to answer it. The remaining questions are eliminated from our dataset to suppress ambiguity.5 Finally, we obtain 44,723 easy questions and 31,332 hard questions. The dataset is split according to articles, and Table 1 provides some data statistics. Across the training, validation and test sets, the splitting ratio is around 7:1:1, and the easy sample ratio is around 58% for all three.
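The labeling protocol can be summarized by the small sketch below (the function name and the boolean inputs, i.e., whether each system answers the question correctly under exact match, are hypothetical).

```python
def label_difficulty(rnet_correct: bool, bidaf_correct: bool):
    """Label a SQuAD question: 'easy' if both R-Net and BiDAF answer it correctly
    (exact match), 'hard' if both fail, and None (discarded) otherwise."""
    if rnet_correct and bidaf_correct:
        return "easy"
    if not rnet_correct and not bidaf_correct:
        return "hard"
    return None  # ambiguous case: exactly one system answers correctly

# Keep only questions that receive a confident label:
# labeled = [(q, label_difficulty(r, b)) for q, r, b in predictions
#            if label_difficulty(r, b) is not None]
```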

                    Train    Dev     Test
# easy questions    34813    4973    4937
# hard questions    24317    3573    3442
Easy ratio          58.88%   58.19%  58.92%

Table 1: The statistics of our dataset.
                  # examples      # AVG skills
                  Easy    Hard    Easy    Hard
BiDAF             64      36      1.20    1.44
R-Net             70      30      1.11    1.7
Our protocol      55      21      1.16    1.81

Table 2: Number of skills needed to answer easy and hard questions.

To verify the reasonability of our labeling protocol, we evaluate its consistency with human judgment. Sugawara et al. (2017) manually labelled 100 questions by analyzing the skills needed to answer each question correctly. In total, they defined 13 skills, such as coreference resolution and elaboration. Table 2 shows the number of skills needed for answering the easy and hard questions according to BiDAF, R-Net and our labeling protocol. Note that for BiDAF and R-Net, the total number of questions is 100, but for our protocol, 76 questions are left after filtering out the ambiguous ones. Under our protocol, answering the hard questions requires 1.81 skills on average, while answering the easy questions requires 1.16 skills. This shows that the labeling protocol is basically consistent with human intuition, i.e., answering hard questions requires more skills.

5.2 Model Details and Parameter Settings

For the question generator, we use GloVe word embeddings6 for initialization and fine-tune them during training. The embedding dimensions for the answer indicator and the difficulty level, i.e., $d_a$ and $d_d$, are set to 50 and 100, respectively. We set the number of LSTM layers to 2 in both the encoder and the decoder, and the LSTM hidden unit size is set to 600. We apply dropout Srivastava et al. (2014) during training. All trainable parameters, except word embeddings, are randomly initialized. For optimization in the training, we use Adadelta as the optimizer with a minibatch size of 64 for all baselines and our model. We adopt teacher forcing in the encoder-decoder training and use the ground truth difficulty labels. Beam search with beam size 3 is employed for question generation in the testing procedure. All important hyper-parameters, such as $d_a$ and $d_d$, are selected on the validation set.
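For quick reference, the settings above can be summarized in a configuration sketch; values not stated in the text (e.g., the exact dropout probability) are deliberately omitted rather than guessed.

```python
# Training configuration as described in Section 5.2 (summary only, not the authors' code)
CONFIG = {
    "word_embeddings": "GloVe, fine-tuned during training",
    "answer_indicator_dim": 50,       # d_a
    "difficulty_embedding_dim": 100,  # d_d
    "lstm_layers": 2,                 # both encoder and decoder
    "lstm_hidden_size": 600,
    "optimizer": "Adadelta",
    "batch_size": 64,
    "beam_size": 3,                   # beam search at test time
    "teacher_forcing": True,          # with ground-truth difficulty labels during training
}
```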

5.3 Evaluation Metric

We evaluate the generated questions from two perspectives. First, we employ BLEU (B), METEOR (MET) and ROUGE-L (R-L) scores, following Du et al. (2017), to evaluate the similarity between our generated questions and the ground truth questions. Furthermore, we run R-Net and BiDAF to assess the difficulty of the generated hard and easy questions. Here the R-Net and BiDAF systems are trained using the same train/validation splits as shown in Table 1, and we report their performance under the standard reading comprehension measures for SQuAD questions, i.e., Exact Match (EM) and macro-averaged F1 score (F1), on the easy and hard question sets respectively.
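For reference, Exact Match is typically computed with the SQuAD-style answer normalization sketched below (lowercasing, removing punctuation and articles, collapsing whitespace); this is the common recipe, not code taken from the paper.

```python
import re
import string

def exact_match(prediction: str, gold: str) -> bool:
    """SQuAD-style exact match: normalize both strings, then compare."""
    def normalize(text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in set(string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop articles
        return " ".join(text.split())                 # collapse whitespace
    return normalize(prediction) == normalize(gold)
```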

5.4 Baselines

                        B1      B2      B3      B4      MET     R-L
Att                     39.63   22.75   14.96   10.33   15.37   37.73
Att+Ans                 41.04   25.32   17.44   12.46   17.72   41.45
Att+Ans+Diff(G)         41.57   26.02   18.08   12.99   18.05   41.98
Att+Ans+Diff(P)         41.41   25.83   17.93   12.88   17.97   41.80
Copy+Att                40.57   25.50   18.09   13.38   17.65   41.27
Copy+Att+Ans            43.50   29.09   21.39   16.24   20.61   45.82
Copy+Att+Ans+Diff(G)    43.75   29.28   21.53   16.38   20.81   46.01
Copy+Att+Ans+Diff(P)    44.05   29.62   21.90   16.72   20.83   45.86

Table 3: Quality of generated questions, evaluated against the gold questions.
                               Easy Questions Set                       Hard Questions Set
                               R-Net EM  R-Net F1  BiDAF EM  BiDAF F1   R-Net EM  R-Net F1  BiDAF EM  BiDAF F1
Gold Split  Copy+Att+Ans       82.15     87.54     73.52     82.16      34.12     60.04     24.84     52.70
            Copy+Att+Ans+Diff  85.85     90.04     76.77     83.95      30.87     56.75     21.85     50.46
Pred Split  Copy+Att+Ans       74.41     82.76     66.02     77.05      32.53     61.33     22.51     54.44
            Copy+Att+Ans+Diff  76.35     84.10     68.29     78.62      30.59     59.09     21.73     52.31

Table 4: Quality of the generated questions, measured with R-Net and BiDAF. For easy questions, a higher score indicates better difficulty awareness, while for hard questions, lower indicates better.

For question generation, we employ Du et al. (2017) as a baseline, which models question generation as a seq2seq problem with an attention mechanism; we refer to it as Att. Att+Ans adds the answer indicator embedding to the seq2seq model, similar to Zhou et al. (2017). Att+Ans+Diff is our framework, which considers difficulty information in the generation. All of the above models are further experimented with the copy mechanism added. For all experiments, we show the difficulty-aware question generation performance by feeding both ground truth (Gold) difficulty labels and predicted (Pred) difficulty labels.

5.5 Results and Analysis

Main Results

Table 3 shows the quality of generated questions measured with BLEU, METEOR and ROUGE-L against the gold questions. Comparing the two groups of results, using the copy mechanism clearly outperforms not using it, because questions usually borrow some words from the input sentence as context. Att+Ans significantly outperforms Att, with or without copy. This demonstrates that the answer information is essential for QG to generate questions to the point. Att+Ans+Diff is slightly better than Att+Ans, which shows that incorporating embedding features related to the hard and easy levels helps the generator improve the quality of questions. “Diff(G)” and “Diff(P)” respectively indicate feeding the ground truth or the predicted difficulty labels into the generator. “Diff(G)” performs slightly better without copy, while “Diff(P)” performs slightly better with copy, so in general their performances are very similar.

Recall that the generated questions can be split into an easy set and a hard set, either according to the gold difficulty labels or to the predicted labels. Now we evaluate the generated questions from another, more interesting perspective: a reading comprehension system (e.g., R-Net or BiDAF) should perform better on the generated questions in the easy set and worse on those in the hard set. This evaluation is given in Table 4, which covers the pipelines with copy and answer information considered, because these components are found very useful for generating questions similar to the ground truth. Moreover, if a pipeline does not use the answer information, its generated questions are very likely not about the answers; therefore, neither BiDAF nor R-Net can work well, no matter whether the questions are easy or hard.

Table 4 shows that for both the gold and predicted splits, on the easy set, the questions generated using difficulty labels (i.e., the “easy” label) are easier to answer, i.e., Copy+Att+Ans+Diff achieves higher performance than Copy+Att+Ans. On the hard set, the questions generated by Copy+Att+Ans+Diff are more difficult to answer, i.e., the performance is lower. This observation is quite interesting, and it shows that incorporating the difficulty embeddings indeed guides the generator to generate easier or harder questions. Moreover, using the gold difficulty labels (i.e., Gold Split), all gaps between the two pipelines are larger than when using the predicted labels (i.e., Pred Split), which shows that the gold labels make the generator better aware of what difficulty level the generated questions should have (since the predicted labels are not 100% correct; refer to the results in Section 6).

Case Study

Figure 3: Example questions generated by human, Copy+Att+Ans baseline, and our Copy+Att+Ans+Diff model.
                             Acc.    Easy                     Hard                     Macro-F1
                                     P       R       F1       P       R       F1
Random                       50.07   59.02   49.88   54.07    41.20   50.34   45.31    49.69
Majority Baseline            58.91   58.91   100.0   74.14    -       -       -        37.07
CNN Kim (2014)               71.85   71.88   85.98   78.30    71.8    51.48   59.97    69.13
Ours                         73.42   72.25   89.28   79.87    76.59   50.55   60.91    70.39
Ours (w/o sim-match)         72.82   71.57   89.55   79.56    76.37   48.70   59.47    69.52
Ours (w/o inter-semantic)    72.75   72.43   86.94   79.03    73.52   52.29   61.11    70.07
Ours (w/o indiv-semantic)    73.24   72.14   89.10   79.72    76.21   50.37   60.66    70.19

Table 5: Results of the difficulty estimation model.

Figure 3 provides some example questions generated by humans, the Copy+Att+Ans baseline model, and our model for two inputs with the gold difficulty label “hard”. The questions generated by our model are more human-like and harder than the baseline questions: our questions are more abstract and use fewer continuous text fragments from the input sentences. For the baseline questions, a reader can answer them correctly by simply checking which tokens are not mentioned in the question, even without reading and understanding the whole sentences. Our questions, however, contain key information from the whole input sentence, yet use even fewer input tokens than the human-generated questions.

The performance of R-Net and BiDAF on these questions validates our analysis. Both R-Net and BiDAF fail to answer the human questions and our questions, but succeed in answering the baseline questions. This shows that our model is aware of the hard difficulty level and generates hard questions accordingly. Moreover, we also feed the opposite difficulty label, i.e., easy, into our model to see what questions it will generate. Interestingly, our model still performs decently, showing its awareness of the easy label, and generates questions appropriately by using more words from the input sentence to make the questions easier.

Effect of Embedding Dimensions

We investigate the effect of the answer indicator embedding dimension $d_a$ and the difficulty embedding dimension $d_d$ via grid search over the candidate values {2, 10, 20, 50, 100, 200}. Figure 4 shows the heatmaps of BLEU2 and BLEU4 on the validation dataset. We can see that the best model takes $d_a = 50$ and $d_d = 100$, which indicates that the dimensions should be neither too small nor too large, so as to avoid underfitting or overfitting.

Figure 4: Dimension selection for the answer/difficulty embedding. (a) BLEU2; (b) BLEU4.

6 Experiment for Difficulty Estimation

6.1 Experiment Settings

For the question difficulty estimation part, we employ commonly-used evaluation metrics: accuracy, precision, recall and F1. We report the performance of three baseline models. The Random baseline assigns a random label to each case. The Majority baseline labels all cases as “easy”, i.e., the majority class in our dataset. The CNN baseline is the classical sentence classification framework Kim (2014), which here takes a sentence and an answer as input. In addition, we also examine the performance of three ablations obtained by removing each of the three middle layers from our model. The parameters specific to the CNN part, for both the CNN baseline and our model, are set as follows: the filter windows are (3, 5, 7), each with 50 feature maps. Again Adadelta is used as the optimizer, and training termination is decided by early stopping based on the performance on the validation set.
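A minimal sketch of this kind of CNN sentence encoder (filter windows of 3, 5, 7 with 50 feature maps each, followed by max-over-time pooling) is given below; the padding scheme and the ReLU nonlinearity are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """Sketch of a Kim (2014)-style sentence encoder: parallel 1-D convolutions
    with window sizes (3, 5, 7) and 50 feature maps each, max-pooled over time."""
    def __init__(self, vocab_size, d_word=300, windows=(3, 5, 7), n_maps=50):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_word)
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_word, n_maps, kernel_size=k, padding=k // 2) for k in windows]
        )

    def forward(self, tokens):
        # tokens: (batch, seq_len) -> (batch, d_word, seq_len) for Conv1d
        x = self.emb(tokens).transpose(1, 2)
        # Max-over-time pooling per window size, then concatenate: (batch, 150)
        pooled = [F.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)
```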

6.2 Results and Discussions

The overall results are given in Table 5. Compared to the CNN baseline, our model performs better under most of the metrics. Note that the CNN baseline is equivalent to our model using the individual semantic layer alone; this result therefore indicates that the interactive semantic and similarity matching layers are effective. On the other hand, removing the individual semantic layer, i.e., “Ours (w/o indiv-semantic)”, has little effect on the full model. When removing the similarity matching layer, the recall on hard questions drops dramatically to 48.70 (a drop of 1.85), which indicates that “Ours (w/o sim-match)” cannot recognize sufficient distracting information from the context, so that some hard cases are mistaken as “easy”. This testifies to the usefulness of the matching layer.

7 Conclusions and Future Work

We presented a novel task, difficulty-aware question generation for reading comprehension, which, to the best of our knowledge, has never been studied before. We proposed a two-step approach to tackle this task: the difficulty estimator estimates the difficulty of the potential questions that could be asked, and the question generator generates difficulty-aware questions. We also prepared the first dataset for this task, and extensive experiments show that our framework can solve this task reasonably well. For future work, one direction is to couple the two steps more tightly so that the generation signal also helps the difficulty prediction. Another direction is to explore generating multiple questions for different aspects of one sentence.

Footnotes

  1. This work was done when Yifan Gao and Jianan Wang were interns at Tencent AI Lab.
  2. https://github.com/HKUST-KnowComp/R-Net
  3. https://github.com/allenai/bi-att-flow
  4. It is worth mentioning that there are some more powerful reading comprehension systems on the SQuAD leaderboard, but these systems are not open sourced.
  5. We divide the original SQuAD questions into 9 splits, and use a 9-fold strategy to label each single split with 7 splits as the training data and the last split as the validation data.
  6. https://nlp.stanford.edu/projects/glove/

References

  1. Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. 2017. Variational attention for sequence-to-sequence models. CoRR, abs/1712.08207.
  2. Lee Becker, Sumit Basu, and Lucy Vanderwende. 2012. Mind the gap: Learning to choose gaps for question generation. In HLT-NAACL.
  3. Guy Danon and Mark Last. 2017. A syntactic approach to domain-specific automatic question generation. CoRR, abs/1712.09827.
  4. Xinya Du and Claire Cardie. 2017. Identifying where to focus in reading comprehension for neural question generation. In EMNLP, pages 2067–2073.
  5. Xinya Du and Claire Cardie. 2018. Harvesting paragraph-level question-answer pairs from wikipedia. In Association for Computational Linguistics (ACL).
  6. Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In ACL.
  7. Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1631–1640.
  8. Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 140–149.
  9. Michael Heilman and Noah A. Smith. 2010. Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 609–617.
  10. Wenpeng Hu, Bing Liu, Jinwen Ma, Dongyan Zhao, and Rui Yan. 2018. Aspect-based question generation. In ICLR Workshop.
  11. Zhenya Huang, Qi Liu, Enhong Chen, Hongke Zhao, Mingyong Gao, Si Wei, Yu Su, and Guoping Hu. 2017. Question difficulty prediction for reading problems in standard tests. In AAAI.
  12. Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
  13. Igor Labutov, Sumit Basu, and Lucy Vanderwende. 2015. Deep questions without deep understanding. In ACL.
  14. David Lindberg, Fred Popowich, John C. Nesbit, and Philip H. Winne. 2013. Generating natural language questions to support learning on-line. In ENLG.
  15. Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
  16. Karen Mazidi and Rodney D. Nielsen. 2014. Linguistic considerations in automatic question generation. In ACL.
  17. Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. 2016. Generating natural questions about an image. ACL.
  18. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
  19. Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1073–1083.
  20. Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. ICLR.
  21. Iulian Serban, Alberto García-Durán, Çağlar Gülçehre, Sungjin Ahn, A. P. Sarath Chandar, Aaron C. Courville, and Yoshua Bengio. 2016. Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. ACL.
  22. Linfeng Song, Zhiguo Wang, Wael Hamza, Yue Zhang, and Daniel Gildea. 2018. Leveraging context information for natural question generation. In NAACL Short.
  23. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
  24. Yu Su, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, and Xifeng Yan. 2016. On generating characteristic-rich question sets for qa evaluation. In EMNLP.
  25. Sandeep K Subramanian, Tong Wang, Xingdi Yuan, and Adam Trischler. 2017. Neural models for key phrase detection and question generation. CoRR, abs/1706.04560.
  26. Saku Sugawara, Yusuke Kido, Hikaru Yokono, and Akiko Aizawa. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In ACL.
  27. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.
  28. Lucy Vanderwende. 2007. Answering and questioning for machine reading. In AAAI Spring Symposium: Machine Reading.
  29. Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 189–198.
  30. Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Saizheng Zhang, Sandeep Subramanian, and Adam Trischler. 2017. Machine comprehension by text-to-text neural question generation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 15–25, Vancouver, Canada. Association for Computational Linguistics.
  31. Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. CoRR, abs/1704.01792.