Improving Machine Reading Comprehension with General Reading Strategies


Reading strategies have been shown to improve comprehension levels, especially for readers lacking adequate prior knowledge. Just as the process of knowledge accumulation is time-consuming for human readers, it is resource-demanding to impart rich general domain knowledge into a language model via pre-training Radford et al. (2018); Devlin et al. (2018). Inspired by reading strategies identified in cognitive science, and given limited computational resources (a pre-trained model and a fixed number of training instances), we propose three simple domain-independent strategies aimed at improving non-extractive machine reading comprehension (MRC): (i) Back and Forth Reading, which considers both the original and reverse order of an input sequence; (ii) Highlighting, which adds a trainable embedding to the text embedding of tokens that are relevant to the question and candidate answers; and (iii) Self-Assessment, which generates practice questions and candidate answers directly from the text in an unsupervised manner.

By fine-tuning a pre-trained language model Radford et al. (2018) with our proposed strategies on the largest existing general domain multiple-choice MRC dataset, RACE, we obtain an absolute increase in accuracy over the previous best result achieved by the same pre-trained model fine-tuned on RACE without the use of strategies. We further fine-tune the resulting model on a target task, leading to new state-of-the-art results on six representative non-extractive MRC datasets from different domains (i.e., ARC, OpenBookQA, MCTest, MultiRC, SemEval-2018, and ROCStories). These results indicate the effectiveness of the proposed strategies and the versatility and general applicability of our fine-tuned models that incorporate them.

1 Introduction

Recently we have seen increased interest in machine reading comprehension (MRC) Rajpurkar et al. (2016); Choi et al. (2018); Kočiskỳ et al. (2018); Reddy et al. (2018). In this paper, we mainly focus on non-extractive MRC Richardson et al. (2013); Khashabi et al. (2018); Ostermann et al. (2018); Clark et al. (2018). Given a reference document or corpus, answering questions requires diverse reading skills, and the majority of candidate answers are non-extractive (Section 2.2). Compared to extractive MRC tasks (Section 2.1), the performance of machine readers on these tasks more accurately indicates the comprehension ability of machines in realistic settings Lai et al. (2017).

Similar to the process of knowledge accumulation for human readers, imparting massive amounts of general domain knowledge from external corpora into a high-capacity language model is time-consuming and resource-demanding. For example, it takes about one month to pre-train a multi-layer transformer Liu et al. (2018) on eight GPUs over 7,000 books Radford et al. (2018). A very recent study Devlin et al. (2018) reports pre-training an even deeper transformer on TPUs for four days over the same book corpus plus English Wikipedia (about one year to train on eight of the most advanced GPUs, such as P100s), which is almost non-reproducible considering the tremendous computational resources required.

The utilization of reading strategies has been shown to be effective in improving the comprehension levels of human readers, especially those who lack adequate prior knowledge of the topic of a text Sheorey and Mokhtari (2001); McNamara et al. (2004). From a practical viewpoint, given a limited number of training instances and a pre-trained model, which we can regard as a human reader with fixed prior knowledge, can we also apply domain-independent strategies to improve the reading comprehension levels of machine readers?

Inspired by reading strategies of human readers identified in cognitive science research Sheorey and Mokhtari (2001); Mokhtari and Sheorey (2002); Mokhtari and Reichard (2002), based on an existing pre-trained transformer Radford et al. (2018) (Section 3.1), we propose three corresponding domain-independent strategies as follows.

  • Back and Forth Reading (“I go back and forth in the text to find relationships among ideas in it.”):
    considering both the original and reverse order of an input sequence (Section 3.2)

  • Highlighting (“I highlight information in the text to help me remember it.”):
    adding a trainable embedding to the text embedding of tokens that are relevant to the question and candidate answers (Section 3.3)

  • Self-Assessment (“I ask myself questions I like to have answered in the text, and then I check to see if my guesses about the text are right or wrong.”):
    generating practice questions and their associated span-based candidate answers from the existing reference documents (Section 3.4)

By fine-tuning a pre-trained transformer Radford et al. (2018) with our proposed strategies on RACE Lai et al. (2017), the largest general domain multiple-choice machine reading comprehension dataset, which is collected from language exams (Section 4.2), we obtain an accuracy at the passing-performance level, an absolute improvement over the previous best result achieved by the same pre-trained transformer fine-tuned on RACE without the use of strategies. We further fine-tune the resulting high-performing model on a target task. Experiments show that our method achieves new state-of-the-art results on six representative non-extractive machine reading comprehension datasets that require a range of skills such as commonsense and multiple-sentence reasoning (i.e., ARC Clark et al. (2018), OpenBookQA Mihaylov et al. (2018), MCTest Richardson et al. (2013), MultiRC Khashabi et al. (2018), SemEval-2018 Ostermann et al. (2018), and ROCStories Mostafazadeh et al. (2016)) (Section 4.3). These results indicate the effectiveness of our proposed strategies and the versatility and generality of our fine-tuned models that incorporate these strategies.

2 Task Introduction

We roughly categorize machine reading comprehension tasks into two groups: extractive (Section 2.1) and non-extractive (Section 2.2) based on the expected answer types.

| | RACE | ARC | OpenBookQA | MCTest | MultiRC | SemEval-2018 Task 11 | ROCStories |
|---|---|---|---|---|---|---|---|
| construction method | exams | exams | exams | crowd. | crowd. | crowd. | crowd. |
| sources of documents | general | science | science | stories | 7 domains | narrative text | stories |
| average # of answer options | 4.0 | 4.0 | 4.0 | 4.0 | 5.4 | 2.0 | 2.0 |
| # of documents | 27,933 | 14M | 1,326 | 660 | 871 | 13,939 | 3,742 |
| # of questions | 97,687 | 7,787 | 5,957 | 2,640 | 9,872 | 2,119 | |

Table 1: Statistics of multiple-choice machine reading comprehension datasets. Some values come from Reddy et al. (2018), Kočiskỳ et al. (2018), and Lai et al. (2017) (crowd.: crowdsourcing; for ARC, each sentence/claim is regarded as a document Clark et al. (2018); non-extractive answers are correct answer options that are not text snippets from reference documents).

2.1 Extractive MRC

Recently several large-scale extractive machine reading comprehension datasets have been constructed Hermann et al. (2015); Hill et al. (2016); Onishi et al. (2016); Chen and Choi (2016); Mostafazadeh et al. (2016); Bajgar et al. (2016); Nguyen et al. (2016); Joshi et al. (2017); Ma et al. (2018), such as SQuAD Rajpurkar et al. (2016) and NewsQA Trischler et al. (2017). Given a reference document and a question, the expected answer is a short span from the document. Considering the answer type limitations in these datasets, answers in datasets such as MS MARCO Nguyen et al. (2016), SearchQA Dunn et al. (2017), and NarrativeQA Kočiskỳ et al. (2018) are human generated based on given documents Reddy et al. (2018); Choi et al. (2018). Since annotators tend to copy spans as answers directly, the majority of answers are still extractive Reddy et al. (2018). For example, of questions in SQuAD Reddy et al. (2018) and of questions in NarrativeQA expect extractive answers Kočiskỳ et al. (2018). Given informative factoid questions, state-of-the-art attention-based neural models Wang et al. (2018b); Hu et al. (2018) designed for extractive MRC tasks have already achieved very high performance based on the local context.

2.2 Multiple-Choice MRC Datasets

For multiple-choice MRC datasets, given a question and a reference document/corpus, multiple answer options are provided, and there is at least one correct answer option. We primarily discuss the non-extractive datasets, in which answer options are not restricted to extractive text spans. Building a multiple-choice dataset by crowdsourcing (e.g., MCTest Richardson et al. (2013), MultiRC Khashabi et al. (2018), and SemEval-2018 Task 11 Ostermann et al. (2018)) involves extensive human involvement in designing questions and answer options. Besides crowdsourcing, datasets such as RACE Lai et al. (2017), ARC Clark et al. (2018), and OpenBookQA Mihaylov et al. (2018) are collected from language or science examinations designed by educational experts Penas et al. (2014); Shibuki et al. (2014); Tseng et al. (2016), which aim to test the comprehension level of human participants. Compared to questions in extractive MRC tasks, besides surface matching, there are various types of complicated questions such as math word problems, summarization, logical reasoning, and sentiment analysis, requiring advanced reading skills such as reasoning over multiple sentences and the use of prior world knowledge. We can also adopt more objective evaluation criteria such as accuracy Clark et al. (2016); Lai et al. (2017). As such datasets are relatively difficult to construct or collect, most existing ones are small in size, which hinders the utilization of state-of-the-art deep neural models. In this paper, we explore how to make full use of the limited resources to improve machine reading comprehension.

We summarize six popular representative multiple-choice MRC datasets in Table 1. As shown in the table, most of the correct answer options are non-extractive. Except for MultiRC, there is exactly one correct answer option for each question. For ARC and OpenBookQA, a reference corpus is provided instead of a single reference document associated with each question.

Here we give a formal task definition. Let $d$, $q$, and $O = \{o_1, o_2, \ldots, o_n\}$ denote a reference document, a question, and the associated set of candidate answer options, respectively. Given $d$ and $q$, the task is to select the correct answer option(s) from $O$.

3 Approach

Figure 1: Framework Overview. Strategies 1, 2, and 3 refer to back and forth reading (Section 3.2), highlighting (Section 3.3), and self-assessment (Section 3.4), respectively.

In this section, we first introduce a pre-trained transformer used as our neural reader (Section 3.1) and then elaborate our strategies for enhancements. For convenience, we borrow the names of the three reading strategies mentioned in the introduction for use as the names of our strategies: back and forth reading (Section 3.2), highlighting (Section 3.3), and self-assessment (Section 3.4).

3.1 Framework Overview

We use the OpenAI fine-tuned transformer language model (OFT) Radford et al. (2018) as our neural reader. It adapts a pre-trained multi-layer transformer Vaswani et al. (2017); Liu et al. (2018) language model to a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens $x^1, \ldots, x^m$ along with a label $y$, by maximizing:

$$L(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m) + \lambda \cdot L_{\mathrm{LM}}(\mathcal{C}) \quad (1)$$

where $L_{\mathrm{LM}}(\mathcal{C})$ is the likelihood from the language model, $\lambda$ is the weight of the language model objective, and $P(y \mid x^1, \ldots, x^m)$ is obtained via a linear classification layer over the final transformer block's activation of the language model. For multiple-choice MRC tasks, $x^1, \ldots, x^m$ come from the concatenation of a start token, document, question, a delimiter token, answer option, and an end token; $y$ indicates whether an answer option is correct. We refer the reader to Radford et al. (2018) for more details.
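As a concrete illustration of this objective (a sketch only; the variable names are ours, and the real implementation operates on batched tensors inside the transformer), the combined loss can be written as a classification negative log-likelihood plus a weighted language-modeling term:

```python
import numpy as np

def combined_loss(option_logits, label, token_log_probs, lm_weight=0.5):
    """Sketch of the fine-tuning objective: classification NLL over
    answer options plus lm_weight times the language-modeling NLL."""
    # Classification term: log-softmax over the per-option scores.
    z = option_logits - option_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    cls_nll = -log_probs[label]
    # Auxiliary language-modeling term: mean token-level NLL.
    lm_nll = -token_log_probs.mean()
    return cls_nll + lm_weight * lm_nll
```

Maximizing the objective above corresponds to minimizing this loss.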

Apart from placing delimiters to separate the document, question, and answer option from one another, the original OFT framework pays little attention to the task structure of MRC tasks. Inspired by previous research on human reading strategies, with limited resources and the pre-trained transformer, we propose three strategies to improve machine reading comprehension. We show the whole framework in Figure 1.

3.2 Back and Forth Reading

For simplicity, we represent the input sequence of the original OFT as $[s; d; q; \$; o; e]$, where $s$, $\$$, and $e$ represent the start token, delimiter token, and end token, respectively, and $d$, $q$, and $o$ denote the document, question, and answer option. In our framework, we consider both the original order $[s; d; q; \$; o; e]$ and its reverse order $[e; o; \$; q; d; s]$. The token order within $d$, $q$, and $o$ is still preserved. We train two OFTs that use the original and the reverse sequence as input respectively, and then ensemble the two models. We discuss other similar pairs of input sequences in the experiments (Section 4.4).
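A minimal sketch of this strategy (the token IDs and the exact placement of the special tokens in the reversed sequence are our own assumptions for illustration; ensembling by averaging logits follows Section 4.1):

```python
def build_input_pair(doc_ids, q_ids, opt_ids, start=1, delim=2, end=3):
    """Build the original-order and reverse-order input sequences for one
    (document, question, answer option) triple. Only the order of the
    segments is reversed; token order inside each segment is preserved."""
    original = [start] + doc_ids + q_ids + [delim] + opt_ids + [end]
    reverse = [end] + opt_ids + [delim] + q_ids + doc_ids + [start]
    return original, reverse

def ensemble_logits(logits_a, logits_b):
    """Ensemble the two models by averaging their per-option logits."""
    return [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]
```

One model is trained on each order, and their per-option logits are averaged at prediction time.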

3.3 Highlighting

In the original OFT, the text embedding of a document is independent of the context of questions and answer options. Inspired by highlights used in human reading, we aim to make the document encoding aware of the context of a given (question $q$, answer option $o$) pair. We focus on content words (nouns, verbs, adjectives, adverbs, numerals, and foreign words) in questions and answer options since they appear to provide more useful information Baroni and Zamparelli (2010); Mirza and Bernardi (2013).

Formally, we let $C$ denote the set of part-of-speech (POS) tags of content words. Let $d_1, d_2, \ldots, d_n$ denote the sequence of text embeddings of document $d$, where $d_i$ is the text embedding of the $i$-th token in $d$. Given $d$ and a ($q$, $o$) pair, we define the highlight embedding $h_i$ for the $i$-th token in $d$ as:

$$h_i = \begin{cases} d_i + \ell^{+} & \text{if the POS tag of the $i$-th token belongs to $C$ and the token appears in $q$ or $o$} \\ d_i + \ell^{-} & \text{otherwise} \end{cases} \quad (2)$$

where $\ell^{+}$ and $\ell^{-}$ are two trainable vectors of the same dimension as $d_i$.

Following the above definition, the sequence of highlight embeddings $h_1, h_2, \ldots, h_n$ is of the same length as the sequence of text embeddings of $d$. We replace the text embeddings of $d$ in the original OFT with $h_1, h_2, \ldots, h_n$ when we encode a document. More specifically, we use the concatenation of $e_s$, $h_1, \ldots, h_n$, $E_q$, $e_\$$, $E_o$, and $e_e$ as the new input to the pre-trained transformer (Section 3.1), where $e_s$, $e_\$$, and $e_e$ denote the embeddings of the start token, delimiter token, and end token, respectively, and $E_q$ and $E_o$ represent the sequences of text embeddings of $q$ and $o$, respectively.
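A minimal sketch of the highlight-embedding computation (the tag subset and array names here are illustrative, not the paper's full configuration; the full POS tagset appears in Section 4.1):

```python
import numpy as np

CONTENT_TAGS = {"NN", "NNS", "VB", "JJ", "RB", "CD", "FW"}  # illustrative subset

def highlight_embeddings(doc_tokens, doc_tags, doc_emb, qa_tokens, l_pos, l_neg):
    """Add l_pos to the text embedding of each document token whose POS tag
    is a content-word tag and which also appears in the question or answer
    option; add l_neg to every other token. doc_emb: (n, dim) array;
    l_pos, l_neg: (dim,) trainable vectors."""
    qa_vocab = set(qa_tokens)
    out = np.empty_like(doc_emb)
    for i, (tok, tag) in enumerate(zip(doc_tokens, doc_tags)):
        highlighted = tag in CONTENT_TAGS and tok in qa_vocab
        out[i] = doc_emb[i] + (l_pos if highlighted else l_neg)
    return out
```

In training, the two vectors would be updated by backpropagation along with the rest of the model.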

| Approach | # of Ensemble Models | RACE-Middle | RACE-High | RACE-All |
|---|---|---|---|---|
| Previous SOTA: | | | | |
| BiAttention MRU Tay et al. (2018) | 9 | 60.2 | 50.3 | 53.3 |
| OFT Radford et al. (2018) | - | 62.9 | 57.4 | 59.0 |
| Baselines (Our Implementations): | | | | |
| OFT | - | 60.9 | 57.8 | 58.7 |
| OFT | 2 | 62.6 | 58.4 | 59.6 |
| OFT | 9 | 63.5 | 59.3 | 60.6 |
| Single Strategy: | | | | |
| Self-Assessment (SA) | - | 63.2 | 59.2 | 60.4 |
| Highlighting (HL) | - | 67.4 | 61.5 | 63.2 |
| Back and Forth Reading (BF) | 2 | 67.3 | 60.7 | 62.6 |
| Strategy Combination: | | | | |
| SA + HL | - | 69.2 | 61.5 | 63.8 |
| SA + HL + BF | 2 | 70.9 | 63.2 | 65.4 |
| SA + HL + BF | 9 | 72.0 | 64.5 | 66.7 |
| Amazon Turker Lai et al. (2017) | - | 85.1 | 69.4 | 73.3 |

Table 2: Accuracy (%) on the test set of RACE.

3.4 Self-Assessment

In previous work Radford et al. (2018), the original pre-trained transformer is directly fine-tuned on an end task. Inspired by the self-assessment reading strategy, we propose a simple method to generate questions and their associated multiple span-based answer options, which cover the content of multiple sentences from a reference document. By first fine-tuning the pre-trained model on these "practice" instances, we aim to make the resulting fine-tuned model more aware of the input structure and better able to integrate information across the multiple sentences required to answer a given question.

Concretely, we randomly generate no more than questions and associated answer options based on each document from the end task (i.e., RACE in this paper). We describe the steps as follows.

  • Input: a document from the end task, which still serves as the reference document.

  • Output: a question and four answer options associated with the reference document.

  • Randomly pick no more than sentences from the document, and concatenate these sentences together.

  • Randomly pick no more than non-overlapping spans from the concatenation of sentences, each of which randomly contains no more than word tokens. We remove the selected spans, which are regarded as the correct answer option, from the concatenated sentences and use the remaining sentences as the question.

  • Generate three distractors (i.e., wrong answer options) by randomly replacing spans in the correct answer option with randomly picked spans from the reference document.

The four thresholds above (the maximum numbers of questions per document, sentences picked, spans removed, and word tokens per span) are used to control the number and difficulty level of the generated questions.
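The generation steps above can be sketched as follows (the threshold names n_sent, n_span, and span_len are illustrative stand-ins for the paper's unspecified hyperparameters, and the span-picking logic is simplified):

```python
import random

def generate_instance(doc_sents, n_sent=2, n_span=2, span_len=3, seed=0):
    """Sketch of the self-assessment generation steps: pick sentences,
    remove random spans to form the question, use the removed spans as
    the correct option, and build three distractors."""
    rng = random.Random(seed)
    # Step 1: randomly pick sentences and concatenate them.
    picked = rng.sample(doc_sents, min(n_sent, len(doc_sents)))
    tokens = " ".join(picked).split()
    # Step 2: pick non-overlapping spans; the removed spans form the
    # correct answer option, the remaining tokens form the question.
    starts = []
    for _ in range(n_span):
        s = rng.randrange(0, max(1, len(tokens) - span_len))
        if all(abs(s - t) >= span_len for t in starts):
            starts.append(s)
    answer_spans, removed = [], set()
    for s in sorted(starts):
        answer_spans.append(" ".join(tokens[s:s + span_len]))
        removed.update(range(s, s + span_len))
    question = " ".join(t for i, t in enumerate(tokens) if i not in removed)
    correct = " ; ".join(answer_spans)
    # Step 3: build three distractors by replacing one answer span with a
    # random span drawn from the whole reference document.
    all_tokens = " ".join(doc_sents).split()
    distractors = []
    for _ in range(3):
        s = rng.randrange(0, max(1, len(all_tokens) - span_len))
        fake = " ".join(all_tokens[s:s + span_len])
        repl = answer_spans[rng.randrange(len(answer_spans))]
        distractors.append(correct.replace(repl, fake, 1))
    return question, [correct] + distractors
```

Each generated instance is then formatted exactly like an end-task instance for the first fine-tuning stage.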

Target: ARC

| Approach | Source | Ensemble | Accuracy (Easy / Challenge) |
|---|---|---|---|
| Previous SOTA: IR Clark et al. (2018) | - | - | 62.6 / 20.3 |
| Previous SOTA: ET-RR Ni et al. (2018) | ✓ | - | N/A / 36.6 |
| Our Baseline: OFT | ✓ | - | 57.0 / 38.2 |
| Our Baseline: OFT | ✓ | 2 | 57.1 / 38.4 |
| Our Approach: Strategies | ✓ | - | 66.6 / 40.7 |
| Our Approach: Strategies | ✓ | 2 | 68.9 / 42.3 |

Target: OpenBookQA

| Approach | Source | Ensemble | Accuracy |
|---|---|---|---|
| Previous SOTA: Odd-One-Out Solver Mihaylov et al. (2018) | - | - | 50.2 |
| Our Baseline: OFT | ✓ | - | 52.0 |
| Our Baseline: OFT | ✓ | 2 | 52.8 |
| Our Approach: Strategies | ✓ | - | 55.2 |
| Our Approach: Strategies | ✓ | 2 | 55.8 |

Target: MCTest

| Approach | Source | Ensemble | Accuracy (MC160 / MC500) |
|---|---|---|---|
| Previous SOTA: Finetuned QACNN Chung et al. (2018) | ✓ | - | 76.4 / 68.7 |
| Previous SOTA: Finetuned QACNN Chung et al. (2018) | ✓ | - | 73.8 / 72.3 |
| Our Baseline: OFT | ✓ | - | 65.4 / 61.5 |
| Our Baseline: OFT | ✓ | 2 | 65.8 / 61.0 |
| Our Approach: Strategies | ✓ | - | 80.0 / 78.7 |
| Our Approach: Strategies | ✓ | 2 | 81.7 / 82.0 |

Target: MultiRC

| Approach | Source | Ensemble | F1m / F1a / EM |
|---|---|---|---|
| Previous SOTA: LR Khashabi et al. (2018) | - | - | 66.5 / 63.2 / 11.8 |
| Our Baseline: OFT | ✓ | - | 69.3 / 67.2 / 15.2 |
| Our Baseline: OFT | ✓ | 2 | 70.3 / 67.7 / 16.5 |
| Our Approach: Strategies | ✓ | - | 71.5 / 69.2 / 22.6 |
| Our Approach: Strategies | ✓ | 2 | 73.1 / 70.5 / 21.8 |

Target: SemEval-2018 Task 11

| Approach | Source | Ensemble | Accuracy |
|---|---|---|---|
| Previous SOTA: TriAN Wang (2018) | - | - | 81.9 |
| Previous SOTA: TriAN Wang (2018) | - | 9 | 84.0 |
| Previous SOTA: HMA Chen et al. (2018) | - | - | 80.9 |
| Previous SOTA: HMA Chen et al. (2018) | - | 7 | 84.1 |
| Our Baseline: OFT | ✓ | - | 88.0 |
| Our Baseline: OFT | ✓ | 2 | 88.6 |
| Our Approach: Strategies | ✓ | - | 88.8 |
| Our Approach: Strategies | ✓ | 2 | 89.5 |

Target: ROCStories

| Approach | Source | Ensemble | Accuracy |
|---|---|---|---|
| Previous SOTA: OFT Radford et al. (2018) | - | - | 86.5 |
| Our Baseline: OFT | ✓ | - | 87.1 |
| Our Baseline: OFT | ✓ | 2 | 87.5 |
| Our Approach: Strategies | ✓ | - | 88.0 |
| Our Approach: Strategies | ✓ | 2 | 88.3 |

Table 3: Performance (%) on the test sets of ARC, OpenBookQA, MCTest, SemEval-2018, and ROCStories and the development set of MultiRC (F1m: macro-average F1; F1a: micro-average F1; EM: exact match accuracy). Approaches marked with ✓ in the Source column use RACE as the source task, except that ET-RR Ni et al. (2018) uses essential terms Khashabi et al. (2017) and Finetuned QACNN Chung et al. (2018) uses MovieQA Tapaswi et al. (2016).
| Approach | ARC (Easy / Challenge) | OpenBookQA | MCTest (MC160 / MC500) | MultiRC (F1m / F1a / EM) | SemEval | ROCStories | Average |
|---|---|---|---|---|---|---|---|
| OFT | 54.0 / 30.3 | 50.0 | 58.8 / 52.0 | 69.3 / 66.2 / 11.9 | 87.3 | 86.7 | 53.9 |
| OFT (ensemble) | 53.9 / 30.7 | 50.0 | 60.0 / 54.0 | 69.3 / 66.5 / 13.1 | 88.0 | 87.0 | 54.6 |
| Strategies | 61.9 / 35.0 | 54.2 | 67.5 / 64.7 | 68.8 / 67.4 / 16.2 | 87.6 | 87.4 | 59.3 |
| Strategies (ensemble) | 63.1 / 35.4 | 55.0 | 70.8 / 64.8 | 69.7 / 67.9 / 16.9 | 88.1 | 88.1 | 60.3 |

Table 4: Performance (%) on the test sets of ARC, OpenBookQA, MCTest, SemEval-2018 Task 11, and ROCStories and the development set of MultiRC using the target data only (i.e., without data flows 1 and 2 boxed in Figure 1). All metrics are accuracy except for MultiRC's (F1m: macro-average F1; F1a: micro-average F1; EM: exact match accuracy).

4 Experiment

4.1 Experiment Settings

For most of the hyperparameters, we follow the work of Radford et al. (2018). We use the same preprocessing procedure and the released pre-trained transformer. We first fine-tune the original pre-trained model on the automatically generated instances (Section 3.4) (data flow 1 boxed in Figure 1). In this stage, the instances are generated from the reference documents Lai et al. (2017) in the training and development sets of RACE. We then adapt the model to the large-scale general domain MRC dataset RACE (data flow 2 boxed in Figure 1). Finally, we adapt the resulting fine-tuned model to one of the aforementioned six out-of-domain datasets (data flow 3 boxed in Figure 1). When we adapt the model to different datasets, we keep the language model weight fixed and ensemble models by averaging logits after the linear layer. The set of informative POS tags (Section 3.3) is {NN, NNP, NNPS, NNS, VB, VBD, VBG, VBN, VBP, VBZ, JJ, JJR, JJS, RB, RBR, RBS, CD, FW}. The two trainable highlight vectors (Section 3.3) are initialized randomly.

4.2 Evaluation on RACE

In Table 2, we report the accuracy of our methods together with the two previous top-ranked methods on RACE. Since RACE (i.e., RACE-All) consists of two subtasks, RACE-Middle collected from middle school exams and RACE-High collected from high school exams, we also report the accuracy of our methods on both of them.

Our single and ensemble models outperform the previous state-of-the-art (SOTA) by a large margin. A single strategy, self-assessment or highlighting, improves over the single-model OFT baseline (58.7%) by 1.7% and 4.5% in accuracy, respectively. Using the back and forth reading strategy, which involves two models, gives a 3.0% improvement in accuracy over the ensemble of two original OFTs (59.6%). Strategy combination further boosts the performance. By combining self-assessment and highlighting, our single model achieves a significant improvement in accuracy over the single-model OFT baseline (63.8% vs. 58.7%). We apply all three strategies by ensembling two such single models that read an input sequence in either the original or the reverse order, leading to a 5.8% improvement in accuracy over the ensemble of two original OFTs (65.4% vs. 59.6%).

4.3 Adaptation to Other Non-Extractive Machine Reading Comprehension Tasks

We follow the same philosophy of transferring the knowledge from a high-performing model pre-trained on a large-scale supervised data of a source task to a target task, in which only a relatively small number of training instances are available. In our experiment, we regard RACE as the source task since it contains the largest amount of general domain non-extractive questions (Table 1).

In our experiment, we regard five representative standard machine reading comprehension datasets from multiple domains as the target tasks: ARC, OpenBookQA, MCTest, MultiRC, and SemEval-2018 Task 11. Some modifications are required to apply our method to these tasks given their different structures. In ARC and OpenBookQA, there is no reference document associated with each question. Instead, a reference corpus is provided, which consists of unordered science-related sentences relevant to questions. We therefore use Lucene McCandless et al. (2010) to retrieve the top-ranked sentences from the reference corpus, using the non-stopwords in a question and its answer options as the query, and concatenate the retrieved sentences to form a reference document. In the MultiRC dataset, a question can have more than one correct answer option. Therefore, we use a sigmoid function instead of a softmax at the final layer (Figure 1) and regard the task as a binary (i.e., correct or incorrect) classification problem over each (document, question, answer option) instance. We also adapt our method to a non-conventional multiple-choice MRC dataset, ROCStories, which aims at choosing the correct ending to a four-sentence story from two answer options Mostafazadeh et al. (2016). Since no explicit questions are provided, we leave the question sequence empty.
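The MultiRC adaptation amounts to scoring each (document, question, answer option) instance independently with a sigmoid and accepting every option whose probability exceeds a threshold; a minimal sketch (the threshold value is our assumption):

```python
import math

def select_options(option_logits, threshold=0.5):
    """Independent binary decision per answer option: apply a sigmoid to
    each option's score and mark every option above the threshold as
    correct, so zero, one, or several options can be selected."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in option_logits]
    return [p > threshold for p in probs]
```

Unlike a softmax over options, this allows the model to select multiple correct answers for a single question.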

We investigate the effectiveness of our method in different settings. We first fine-tune the original OFT with strategies on RACE and then fine-tune the resulting model on one of the six target tasks (see Table 3). Since the test set of MultiRC is not publicly available, we report the performance of the model that achieves the highest micro-average F1 on the development set. For the other tasks, we select the model that achieves the best accuracy on the development set. When the OFT baseline is fine-tuned on RACE without the use of strategies, it already outperforms the previous state-of-the-art (SOTA) on four of the six datasets (OpenBookQA, MultiRC, SemEval-2018 Task 11, and ROCStories). By using the strategies (Section 3) during the fine-tuning stage on RACE, we further improve the performance of the baseline, leading to new SOTA results on all six datasets. We notice that, even without fine-tuning on the target data (i.e., removing data flow 3 in Figure 1), our method already achieves strong performance on ARC Challenge, MCTest, and MultiRC compared to the previous state-of-the-art.

To further investigate the contribution of the strategies, we compare our approach with the original OFT without using the extra labeled data from RACE (i.e., keeping only data flow 3 boxed in Figure 1). As shown in Table 4, both our single and ensemble models consistently outperform the OFT baseline. We obtain a substantial improvement in average accuracy over the baseline across all the datasets, with especially significant gains on ARC, OpenBookQA, and MCTest.

4.4 Further Discussions on Strategies

Back and Forth Reading We notice that the input order difference between the two ensembled models is likely to yield performance gains. Besides ensembling two models that use the original and the reversed input sequences respectively, we also investigate other reverse or almost-reverse pairs. Ensembling such pairs also achieves better results than the ensemble of two original OFTs on the RACE dataset (59.6% in Table 2).

Highlighting We try two variants of the highlight embedding definition (Equation 2 in Section 3.3) that consider the content of questions only or answer options only. Experiments show that using such partial information yields a decrease in accuracy compared to the result achieved by considering the content words in both a question and its answer options (63.2% in Table 2).

Self-Assessment We explore alternative approaches to question generation. For example, we use Wikipedia articles from SQuAD Rajpurkar et al. (2016) instead of the general domain documents from the end task RACE, and generate the same number of questions as we generate using RACE, following the same steps described in Section 3.4. Experiments show that this method also improves the accuracy over the OFT baseline.

Since self-assessment can to some extent be regarded as a data augmentation method, we also investigate other unsupervised question generation methods: sentence shuffling Ding and Zhou (2018) and paraphrasing based on back-translation Yu et al. (2018). Our experiments demonstrate that neither of them results in performance improvements.

5 Related Work

5.1 Methods for Multiple-Choice MRC

Here we primarily discuss methods applied to large-scale datasets such as RACE. Researchers employ a variety of attention mechanisms Chen et al. (2016); Dhingra et al. (2017) and improve performance by adding an elimination module Parikh et al. (2018), applying hierarchical attention strategies Zhu et al. (2018); Wang et al. (2018a), using reinforcement learning to determine the choice among multiple attention strategies Xu et al. (2018), or applying a new compositional encoder Tay et al. (2018). However, these methods seldom take rich external knowledge (beyond pre-trained word embeddings) or document structure into consideration. In this paper, we instead investigate strategies based on an existing pre-trained transformer Radford et al. (2018) (Section 3.1), which leverages rich linguistic knowledge from an external corpus and achieves state-of-the-art performance on a wide range of natural language tasks.

5.2 Transfer Learning for Question Answering and MRC

Transfer learning techniques have been successfully applied to machine reading comprehension Chung et al. (2018); Golub et al. (2017) and question answering Min et al. (2017); Wiese et al. (2017). Compared to previous work, we simply fine-tune our model on the source data and then further fine-tune the entire model on the target data. The investigation of strategies such as varying the pre-training/fine-tuning data size, adding additional parameters or an L2 loss, combining different datasets for training, and fine-tuning only part of the parameters is beyond the scope of this work.

5.3 Data Augmentation for MRC Without Using External Datasets

Previous methods augment the training data by randomly reordering words or shuffling sentences Ding and Zhou (2018); Li and Zhou (2018). Question generation and paraphrasing methods have also been explored Yang et al. (2017); Yuan et al. (2017) on extractive MRC, requiring a large amount of training data or limited by the number of training instances Yu et al. (2018). In comparison, our problem (i.e., question and answer options) generation method does not rely on any existing questions in the training set, and we focus on generating problems involving the content of multiple sentences in a reference document.

6 Conclusions and Future Work

Inspired by previous research on using reading strategies to improve comprehension levels of human readers, we propose three strategies – back and forth reading, highlighting, and self-assessment – based on a pre-trained transformer, aiming at improving machine reading comprehension using limited resources.

By applying the three strategies, we obtain an accuracy of 66.7%, a 7.7% absolute improvement over the previous state-of-the-art fine-tuned transformer on the RACE dataset (Table 2). By first fine-tuning the pre-trained transformer on RACE with the strategies, the resulting model significantly outperforms the same pre-trained transformer fine-tuned on RACE without the use of strategies, achieving new state-of-the-art results on six representative non-extractive machine reading comprehension datasets from multiple domains that require a diverse range of comprehension skills. These results consistently indicate the effectiveness of our proposed strategies and the general applicability of our fine-tuned model that incorporates these strategies.

In the future, we are interested in combining our strategies with other pre-trained models, generating more challenging problems which require more advanced skills such as summarization and sentiment analysis, and adapting our framework to more natural language processing tasks whose input is significantly different from that of the multiple-choice MRC tasks.


  1. Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. 2016. Embracing data abundance: Booktest dataset for reading comprehension. CoRR, cs.CL/1610.00956v1.
  2. Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the EMNLP, pages 1183–1193, Cambridge, MA.
  3. Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the ACL, pages 2358–2367, Berlin, Germany.
  4. Yu-Hsin Chen and Jinho D Choi. 2016. Character identification on multiparty conversation: Identifying mentions of characters in TV shows. In Proceedings of the SIGDial, pages 90–100, Los Angeles, CA.
  5. Zhipeng Chen, Yiming Cui, Wentao Ma, Shijin Wang, Ting Liu, and Guoping Hu. 2018. HFL-RC system at SemEval-2018 Task 11: Hybrid multi-aspects model for commonsense reading comprehension. CoRR, cs.CL/1803.05655v1.
  6. Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. Proceedings of the EMNLP.
  7. Yu-An Chung, Hung-yi Lee, and James Glass. 2018. Supervised and unsupervised transfer learning for question answering. In Proceedings of the NAACL-HLT, pages 1585–1594, New Orleans, LA.
  8. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, cs.CL/1803.05457v1.
  9. Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In Proceedings of the AAAI, pages 2580–2586, Phoenix, AZ.
  10. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, cs.CL/1810.04805v1.
  11. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text comprehension. In Proceedings of the ACL, pages 1832–1846, Vancouver, Canada.
  12. Peng Ding and Xiaobing Zhou. 2018. YNU Deep at SemEval-2018 Task 12: A BiLSTM model with neural attention for argument reasoning comprehension. In Proceedings of the SemEval, pages 1120–1123, New Orleans, LA.
  13. Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, cs.CL/1704.05179v3.
  14. David Golub, Po-Sen Huang, Xiaodong He, and Li Deng. 2017. Two-stage synthesis networks for transfer learning in machine comprehension. CoRR, cs.CL/1706.09789v3.
  15. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the NIPS, pages 1693–1701, Montreal, Canada.
  16. Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks principle: Reading children’s books with explicit memory representations. In Proceedings of the ICLR, Caribe Hilton, Puerto Rico.
  17. Minghao Hu, Yuxing Peng, Zhen Huang, Nan Yang, Ming Zhou, et al. 2018. Read+Verify: Machine reading comprehension with unanswerable questions. CoRR, cs.CL/1808.05759v4.
  18. Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, cs.CL/1705.03551v2.
  19. Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the NAACL-HLT, pages 252–262, New Orleans, LA.
  20. Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2017. Learning what is essential in questions. In Proceedings of the CoNLL, pages 80–89, Vancouver, Canada.
  21. Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
  22. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the EMNLP, pages 785–794, Copenhagen, Denmark.
  23. Yongbin Li and Xiaobing Zhou. 2018. Lyb3b at SemEval-2018 Task 11: Machine comprehension task using deep learning models. In Proceedings of the SemEval, pages 1073–1077, New Orleans, LA.
  24. Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the ICLR, Vancouver, Canada.
  25. Kaixin Ma, Tomasz Jurczyk, and Jinho D Choi. 2018. Challenging reading comprehension on daily conversation: Passage completion on multiparty dialog. In Proceedings of the NAACL-HLT, pages 2039–2048, New Orleans, LA.
  26. Michael McCandless, Erik Hatcher, and Otis Gospodnetic. 2010. Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Manning Publications Co., Greenwich, CT.
  27. Danielle S McNamara, Irwin B Levinstein, and Chutima Boonthum. 2004. iSTART: Interactive strategy training for active reading and thinking. Behavior Research Methods, Instruments, & Computers, 36(2):222–233.
  28. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the EMNLP, Brussels, Belgium.
  29. Sewon Min, Minjoon Seo, and Hannaneh Hajishirzi. 2017. Question answering through transfer learning from large fine-grained supervision data. In Proceedings of the ACL, pages 510–517, Vancouver, Canada.
  30. Paramita Mirza and Raffaella Bernardi. 2013. Ccg categories for distributional semantic models. In Proceedings of the RANLP, pages 467–474, Hissar, Bulgaria.
  31. Kouider Mokhtari and Carla A Reichard. 2002. Assessing students’ metacognitive awareness of reading strategies. Journal of Educational Psychology, 94(2):249.
  32. Kouider Mokhtari and Ravi Sheorey. 2002. Measuring ESL students’ awareness of reading strategies. Journal of Developmental Education, 25(3):2–11.
  33. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. In Proceedings of the NAACL-HLT, pages 839–849, San Diego, CA.
  34. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. CoRR, cs.CL/1611.09268v2.
  35. Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2018. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. CoRR, cs.CL/1808.09492v4.
  36. Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did What: A large-scale person-centered cloze dataset. In Proceedings of the EMNLP, pages 2230–2235, Austin, TX.
  37. Simon Ostermann, Michael Roth, Ashutosh Modi, Stefan Thater, and Manfred Pinkal. 2018. SemEval-2018 Task 11: Machine comprehension using commonsense knowledge. In Proceedings of the SemEval, pages 747–757, New Orleans, LA.
  38. Soham Parikh, Ananya Sai, Preksha Nema, and Mitesh M Khapra. 2018. ElimiNet: A model for eliminating options for reading comprehension with multiple choice questions.
  39. Anselmo Penas, Yusuke Miyao, Alvaro Rodrigo, Eduard H Hovy, and Noriko Kando. 2014. Overview of CLEF QA Entrance Exams Task 2014. In Proceedings of the CLEF, pages 1194–1200, Sheffield, UK.
  40. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. In Preprint.
  41. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the EMNLP, pages 2383–2392, Austin, TX.
  42. Siva Reddy, Danqi Chen, and Christopher D Manning. 2018. CoQA: A conversational question answering challenge. CoRR, cs.CL/1808.07042v1.
  43. Matthew Richardson, Christopher JC Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the EMNLP, pages 193–203, Seattle, WA.
  44. Ravi Sheorey and Kouider Mokhtari. 2001. Differences in the metacognitive awareness of reading strategies among native and non-native readers. System, 29(4):431–449.
  45. Hideyuki Shibuki, Kotaro Sakamoto, Yoshinobu Kano, Teruko Mitamura, Madoka Ishioroshi, Kelly Y Itakura, Di Wang, Tatsunori Mori, and Noriko Kando. 2014. Overview of the NTCIR-11 QA-Lab Task. In NTCIR.
  46. Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In Proceedings of the CVPR, pages 4631–4640, Las Vegas, NV.
  47. Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Multi-range reasoning for machine comprehension. CoRR, cs.CL/1803.09074v1.
  48. Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the RepL4NLP, pages 191–200, Vancouver, Canada.
  49. Bo-Hsiang Tseng, Sheng-Syun Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Towards machine comprehension of spoken content: Initial toefl listening comprehension test by machine. In Proceedings of the Interspeech, San Francisco, CA.
  50. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the NIPS, pages 5998–6008, Long Beach, CA.
  51. Liang Wang. 2018. Yuanfudao at SemEval-2018 Task 11: Three-way attention and relational knowledge for commonsense machine comprehension. In Proceedings of the SemEval, pages 758–762, New Orleans, LA.
  52. Shuohang Wang, Mo Yu, Shiyu Chang, and Jing Jiang. 2018a. A co-matching model for multi-choice reading comprehension. In Proceedings of the ACL, pages 1–6, Melbourne, Australia.
  53. Wei Wang, Ming Yan, and Chen Wu. 2018b. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the ACL, pages 1705–1714, Melbourne, Australia.
  54. Georg Wiese, Dirk Weissenborn, and Mariana Neves. 2017. Neural domain adaptation for biomedical question answering. In Proceedings of the CoNLL, pages 281–289, Vancouver, Canada.
  55. Yichong Xu, Jingjing Liu, Jianfeng Gao, Yelong Shen, and Xiaodong Liu. 2018. Dynamic fusion networks for machine reading comprehension. CoRR, cs.CL/1711.04964v2.
  56. Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William Cohen. 2017. Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the ACL, pages 1040–1050, Vancouver, Canada.
  57. Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In Proceedings of the ICLR, Vancouver, Canada.
  58. Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Saizheng Zhang, Sandeep Subramanian, and Adam Trischler. 2017. Machine comprehension by text-to-text neural question generation. In Proceedings of the RepL4NLP, pages 15–25, Vancouver, Canada.
  59. Haichao Zhu, Furu Wei, Bing Qin, and Ting Liu. 2018. Hierarchical attention flow for multiple-choice reading comprehension. In Proceedings of the AAAI, pages 6077–6084, New Orleans, LA.