Finding Generalizable Evidence by Learning to Convince Q&A Models


Ethan Perez    Siddharth Karamcheti
Rob Fergus    Jason Weston    Douwe Kiela    Kyunghyun Cho
New York University, Facebook AI Research, CIFAR Azrieli Global Scholar

We propose a system that finds the strongest supporting evidence for a given answer to a question, using passage-based question-answering (QA) as a testbed. We train evidence agents to select the passage sentences that most convince a pretrained QA model of a given answer, if the QA model received those sentences instead of the full passage. Rather than finding evidence that convinces one model alone, we find that agents select evidence that generalizes; agent-chosen evidence increases the plausibility of the supported answer, as judged by other QA models and humans. Given its general nature, this approach improves QA in a robust manner: using agent-selected evidence (i) humans can correctly answer questions with only 20% of the full passage and (ii) QA models can generalize to longer passages and harder questions.

1 Introduction

Figure 1: Evidence agents quote sentences from the passage to convince a question-answering judge model of an answer.

There is great value in understanding the fundamental nature of a question chalmers2015why. Distilling the core of an issue, however, is time-consuming. Finding the correct answer to a given question may require reading large volumes of text or understanding complex arguments. Here, we examine if we can automatically discover the underlying properties of problems such as question answering by examining how machine learning models learn to solve that task.

We examine this question in the context of passage-based question-answering (QA). Inspired by work in interpreting neural networks lei2016rationalizing, we have agents find a subset of the passage (i.e., supporting evidence) that maximizes a QA model’s probability of a particular answer. Each agent (one agent per answer) finds the sentences that a QA model regards as strong evidence for its answer, using either exhaustive search or learned prediction. Figure 1 shows an example.

To examine to what extent evidence is general and independent of the model, we evaluate if humans and other models find selected evidence to be valid support for an answer too. We find that, when provided with evidence selected by a given agent, both humans and models favor that agent’s answer over other answers. When human evaluators read an agent’s selected evidence in lieu of the full passage, humans tend to select the agent-supported answer.

Given that this approach appears to capture some general, underlying properties of the problem, we examine if evidence agents can be used to assist human QA and to improve generalization of other QA models. We find that humans can accurately answer questions on QA benchmarks, based on evidence for each possible answer, using only 20% of the sentences in the full passage. We observe a similar trend with QA models: using only selected evidence, QA models trained on short passages can generalize more accurately to questions about longer passages, compared to when the models use the full passage. Furthermore, QA models trained on middle-school reading comprehension questions generalize better to high-school exam questions by answering only based on the most convincing evidence instead of the full passage. Overall, our results suggest that learning to select supporting evidence by having agents try to convince a judge model of their designated answer improves QA in a general and robust way.

2 Learning to Convince Q&A Models

Figure 1 shows an overview of the problem setup. We aim to find the passage sentences that provide the most convincing evidence for each answer option, with respect to a given QA model (the judge). To do so, we are given a sequence of passage sentences $S = [S(1), \dots, S(m)]$, a question $Q$, and a sequence of answer options $A = [A(1), \dots, A(n)]$. We train a judge model with parameters $\phi$ to predict the correct answer index $i^*$ by maximizing $p_\phi(\text{answer} = i^* \mid S, Q, A)$.

Next, we assign each answer $A(i)$ to one evidence agent, $\text{Agent}(i)$. $\text{Agent}(i)$ aims to find evidence $E(i)$, a subsequence of passage sentences $S$ that the judge finds to support $A(i)$. For ease of notation, we use set notation to describe $E(i)$ and $S$, though we emphasize these are ordered sequences. $\text{Agent}(i)$ aims to maximize the judge’s probability on $A(i)$ when conditioned on $E(i)$ instead of $S$, i.e., $\arg\max_{E(i) \subseteq S} p_\phi(i \mid E(i), Q, A)$. We now describe three different settings of having agents select evidence, which we use in different experimental sections (§4-6).

Individual Sequential Decision-Making

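Since computing the optimal $E(i)$ directly is intractable, a single $\text{Agent}(i)$ can instead find a reasonable $E(i)$ by making $T$ sequential, greedy choices about which sentence to add to $E(i)$. In this setting, the agent ignores the actions of the other agents. At time $t$, $\text{Agent}(i)$ chooses the index $e_{i,t}$ of the sentence in $S$ such that:

$$e_{i,t} = \arg\max_{1 \le e' \le m} p_\phi(i \mid S(e'), E(i, t-1), Q, A) \tag{1}$$

where $E(i, t)$ is the subsequence of sentences in $S$ that $\text{Agent}(i)$ has chosen until time step $t$, i.e., $E(i, t) = \{S(e_{i,t})\} \cup E(i, t-1)$, with $E(i, 0) = \emptyset$ and $E(i) = E(i, T)$. It is a no-op to add a sentence that is already in the selected evidence $E(i, t-1)$. The individual decision-making setting is useful for selecting evidence to support one particular answer.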
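The greedy, per-agent selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Judge` callable interface, the `toy_judge` keyword scorer, and the example passage are all assumptions made up for this sketch, and the question and answer options are assumed to be baked into the judge.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical judge interface (an assumption for this sketch): given evidence
# sentences and an answer index, return the judge's probability of that answer.
Judge = Callable[[Sequence[str], int], float]


def greedy_evidence(passage: List[str], answer_idx: int, judge: Judge,
                    num_steps: int) -> List[int]:
    """Greedily choose sentence indices that raise the judge's probability."""
    chosen: List[int] = []
    for _ in range(num_steps):
        best_idx, best_prob = -1, float("-inf")
        for e in range(len(passage)):
            # Re-adding an already chosen sentence leaves the evidence unchanged.
            candidate = sorted(set(chosen + [e]))
            prob = judge([passage[k] for k in candidate], answer_idx)
            if prob > best_prob:
                best_idx, best_prob = e, prob
        if best_idx not in chosen:
            chosen.append(best_idx)
    return sorted(chosen)  # evidence is kept in passage order


def toy_judge(evidence: Sequence[str], answer_idx: int) -> float:
    """Stand-in judge: softmax over keyword counts for two made-up answers."""
    keywords = [{"cat"}, {"dog"}]
    words = set(" ".join(evidence).lower().split())
    exps = [math.exp(len(words & k)) for k in keywords]
    return exps[answer_idx] / sum(exps)


passage = ["The cat sat down", "A dog barked", "It rained all day"]
evidence_for_cat = greedy_evidence(passage, 0, toy_judge, num_steps=1)
evidence_for_dog = greedy_evidence(passage, 1, toy_judge, num_steps=1)
```

With the toy judge, each agent picks the one sentence that mentions its own keyword; a real judge would be a trained QA model queried once per candidate sentence.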

Competing Agents: Free-for-All

Alternatively, multiple evidence agents can compete at once to support unique answers, by each contributing part of the judge’s total evidence. Agent competition is useful as agents collectively select a pool of question-relevant evidence that may serve as a summary to answer the question. Here, each of $\text{Agent}(1)$, …, $\text{Agent}(n)$ finds evidence that would convince the judge to select its respective answer, $A(1)$, …, $A(n)$. $\text{Agent}(i)$ chooses a sentence $S(e_{i,t})$ by conditioning on all agents’ prior choices:

$$e_{i,t} = \arg\max_{1 \le e' \le m} p_\phi(i \mid S(e'), E(*, t-1), Q, A)$$

where $E(*, t-1) = \bigcup_{j=1}^{n} E(j, t-1)$.

Agents simultaneously select a sentence each, doing so sequentially for $T$ time steps, to jointly compose the final pool of evidence. We allow an agent to select a sentence previously chosen by another agent, but we do not keep duplicates in the pool of evidence. Conditioning on other agents’ choices is a form of interaction that may enable competing agents to produce a more informative total pool of evidence. More informative evidence may enable a judge to answer questions more accurately without the full passage.
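A minimal sketch of the free-for-all setting, under the same assumptions as any toy setup: the `Judge` interface, `toy_judge`, and the example passage are hypothetical stand-ins, not part of the original system. Within a time step every agent conditions on the pool from previous steps only, and duplicate selections are dropped.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical judge interface: (evidence sentences, answer index) -> probability.
Judge = Callable[[Sequence[str], int], float]


def free_for_all(passage: List[str], num_answers: int, judge: Judge,
                 num_steps: int) -> List[int]:
    """Each agent adds one sentence per step to a shared, duplicate-free pool."""
    pool: List[int] = []
    for _ in range(num_steps):
        picks = []
        for i in range(num_answers):
            def prob_with(e: int, i: int = i) -> float:
                # Condition on all agents' choices from earlier steps.
                candidate = sorted(set(pool + [e]))
                return judge([passage[k] for k in candidate], i)
            picks.append(max(range(len(passage)), key=prob_with))
        # Agents choose simultaneously, so the pool is updated after all picks.
        for e in picks:
            if e not in pool:
                pool.append(e)
    return sorted(pool)


def toy_judge(evidence: Sequence[str], answer_idx: int) -> float:
    """Stand-in judge: softmax over keyword counts for two made-up answers."""
    keywords = [{"cat"}, {"dog"}]
    words = set(" ".join(evidence).lower().split())
    exps = [math.exp(len(words & k)) for k in keywords]
    return exps[answer_idx] / sum(exps)


passage = ["The cat sat down", "A dog barked", "It rained all day"]
pool_after_one_step = free_for_all(passage, 2, toy_judge, num_steps=1)
```

Updating the pool only after every agent has picked is one way to model simultaneous selection; an alternating-turn variant would update it inside the inner loop.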

Competing Agents: Round Robin

Lastly, agents can compete round robin style, in which case we aggregate the outcomes of all $\binom{n}{2}$ pairs of answers competing. Any given $\text{Agent}(i)$ participates in $n-1$ rounds, each time contributing half of the sentences given to the judge. In each one-on-one round, two agents select a sentence each at once. They do so iteratively multiple times, as in the free-for-all setup. To aggregate pairwise outcomes and compute an answer $i$’s probability, we average its probability over all rounds involving $\text{Agent}(i)$:

$$\frac{1}{n-1} \sum_{j=1}^{n} \mathbb{1}(i \neq j)\, p_\phi(i \mid E(i) \cup E(j), Q, A)$$
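The round-robin aggregation can be illustrated with a short sketch. The `pair_prob` callable and the probabilities in `toy_rounds` are made-up placeholders standing in for the judge's output in each one-on-one round.

```python
from typing import Callable


def round_robin_probability(i: int, num_answers: int,
                            pair_prob: Callable[[int, int], float]) -> float:
    """Average answer i's probability over its rounds against every other answer.

    pair_prob(i, j) is assumed to return the judge's probability of answer i
    given the evidence selected in the round between agents i and j.
    """
    opponents = [j for j in range(num_answers) if j != i]
    return sum(pair_prob(i, j) for j in opponents) / len(opponents)


# Made-up per-round probabilities for answer 0 in a 3-answer question.
toy_rounds = {(0, 1): 0.6, (0, 2): 0.8}
avg_prob = round_robin_probability(0, 3, lambda i, j: toy_rounds[(i, j)])
```

Here answer 0 takes part in two rounds, and its aggregated probability is the mean of its two per-round probabilities.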

2.1 Judge Models

The judge model is trained on QA, and it is the model that the evidence agents need to convince. We aim to select diverse model classes, in order to (i) test the generality of the evidence produced by learning to convince different models and (ii) have a broad suite of models to evaluate the agent-chosen evidence. Each model class assigns every answer $A(i)$ a score $L(i)$, where the predicted answer is the one with the highest score. We use this score as a softmax logit to produce answer probabilities. Each model class computes $L(i)$ in a different manner. In what follows, we describe the various judge models we examine.
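As a small illustration of turning per-answer scores into answer probabilities with a softmax, consider the sketch below. The function name and the example scores are arbitrary choices for illustration.

```python
import math
from typing import List


def answer_probabilities(logits: List[float]) -> List[float]:
    """Softmax over per-answer scores (max subtracted for numerical stability)."""
    top = max(logits)
    exps = [math.exp(score - top) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Arbitrary example scores for a 4-option question.
probs = answer_probabilities([2.0, 0.5, 0.5, -1.0])
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The softmax preserves the ranking of the scores, so the predicted answer is still the highest-scoring one.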


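TFIDF

We define a function $\text{BoW}_{\text{TFIDF}}$ that embeds text into its corresponding TFIDF-weighted bag-of-words vector. We compute the cosine similarity of the embeddings for two texts X and Y:

$$\text{TFIDF}(X, Y) = \cos(\text{BoW}_{\text{TFIDF}}(X), \text{BoW}_{\text{TFIDF}}(Y))$$

We define two model classes that select the answer most similar to the input passage sentences: $L(i) = \text{TFIDF}(S, [Q; A(i)])$, and $L(i) = \text{TFIDF}(S, A(i))$.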
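A rough sketch of a TFIDF-similarity judge follows. It is only an approximation of the idea: the whitespace tokenizer, the smoothed IDF formula, and estimating IDF from the passage and answer texts alone (rather than from a training corpus) are all assumptions of this sketch, as are the function names.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """TFIDF-weighted bag-of-words vectors; IDF is estimated from `texts` only."""
    tokenized = [text.lower().split() for text in texts]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(tokenized)
    # Smoothed IDF (an assumption; other weighting schemes are possible).
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def tfidf_logits(sentences: List[str], question: str, options: List[str],
                 include_question: bool) -> List[float]:
    """Score each option by similarity to the passage, with or without the question."""
    targets = [f"{question} {a}" if include_question else a for a in options]
    vectors = tfidf_vectors([" ".join(sentences)] + targets)
    return [cosine(vectors[0], v) for v in vectors[1:]]


logits = tfidf_logits(["the cat chased a mouse"], "what did the cat chase",
                      ["a mouse", "a dog"], include_question=False)
```

The `include_question` flag switches between the two variants: comparing the passage against the question concatenated with the answer, or against the answer alone.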


fastText

We define a function $\text{BoW}_{\text{FT}}$ that computes the average bag-of-words representation of some text using fastText embeddings joulin2017bag. We use 300-dimensional fastText word vectors pretrained on Common Crawl. We compute the cosine similarity between the embeddings for two texts X and Y using:

$$\text{fastText}(X, Y) = \cos(\text{BoW}_{\text{FT}}(X), \text{BoW}_{\text{FT}}(Y))$$

This method has proven to be a strong baseline for evaluating the similarity between two texts perone2018evaluation. Using this function, we define a model class that selects the answer most similar to the input passage context: $L(i) = \text{fastText}(S, A(i))$.


BERT

$L(i)$ is computed using the multiple-choice adaptation of BERT devlin2019bert; radford2018improving; si2019bert, a pre-trained transformer network vaswani2018attention. We fine-tune all BERT parameters during training. This model predicts $L(i)$ using a trainable vector $v$ and BERT’s first token embedding: $L(i) = v^\top \cdot \text{BERT}([S; Q; A(i)])$.

We experiment with both the $\text{BERT}_{\text{BASE}}$ model (12 layers) and $\text{BERT}_{\text{LARGE}}$ (24 layers). For training details, see Appendix B.

2.2 Evidence Agents

In this section, we describe the specific models we use as evidence agents. The agents select sentences according to Equation 1, either exactly or via function approximation.

Search agent

$\text{Agent}(i)$ at time $t$ chooses the sentence $S(e_{i,t})$ that maximizes $p_\phi(i \mid S(e'), E(i, t-1), Q, A)$, after exhaustively trying each possible $e'$. Search agents that query TFIDF or fastText models maximize TFIDF or fastText scores directly (i.e., $L(i)$, rather than $p(i)$).

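Predicting | Loss | Target
Search | CE | $e_{i,t}$
$p(i)$ | MSE | $p_\phi(i \mid S(e'), E(i, t-1), Q, A)$
$\Delta p(i)$ | MSE | $p_\phi(i \mid S(e'), E(i, t-1), Q, A) - p_\phi(i \mid E(i, t-1), Q, A)$
Table 1: The loss functions and prediction targets for three learned agents. CE: cross entropy. MSE: mean squared error. $e'$ takes on integer values from 1 to $m$.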
Learned agent

We train a model to predict how a sentence would influence the judge’s answer, instead of directly evaluating answer probabilities at test time. This approach may be less prone to selecting sentences that exploit hard-to-predict quirks in the judge; humans may be less likely to find such sentences to be valid evidence for an answer (discussed in §4.1). We define several loss functions and prediction targets, shown in Table 1. In each forward pass, agents predict one scalar per passage sentence via end-of-sentence token positions. We optimize these predictions using Adam kingma2015adam on one loss from Table 1. For $p(i)$, we find it effective to simply predict the judge model at $t = 1$ and use this distribution for all time steps during inference. This trick speeds up training by enabling us to precompute prediction targets using the judge model, instead of querying it constantly during training.
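To make the prediction targets concrete, the sketch below precomputes three kinds of per-sentence targets from a judge. It assumes the three targets are (a) the index a search agent would pick, (b) the judge's probability after adding each sentence, and (c) the change in that probability relative to the current evidence; the `Judge` interface, `toy_judge`, and the passage are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical judge interface: (evidence sentences, answer index) -> probability.
Judge = Callable[[Sequence[str], int], float]


def prediction_targets(passage: List[str], answer_idx: int, judge: Judge,
                       prior: Sequence[int] = ()) -> Dict[str, object]:
    """Precompute per-sentence training targets by querying the judge once each."""
    base = judge([passage[k] for k in sorted(prior)], answer_idx)
    probs = []
    for e in range(len(passage)):
        candidate = sorted(set(prior) | {e})
        probs.append(judge([passage[k] for k in candidate], answer_idx))
    return {
        "search": max(range(len(probs)), key=probs.__getitem__),  # class label (CE)
        "p": probs,                                # regression target (MSE)
        "delta_p": [p - base for p in probs],      # regression target (MSE)
    }


def toy_judge(evidence: Sequence[str], answer_idx: int) -> float:
    """Stand-in judge: softmax over keyword counts for two made-up answers."""
    keywords = [{"cat"}, {"dog"}]
    words = set(" ".join(evidence).lower().split())
    exps = [math.exp(len(words & k)) for k in keywords]
    return exps[answer_idx] / sum(exps)


passage = ["The cat sat down", "A dog barked", "It rained all day"]
targets = prediction_targets(passage, 0, toy_judge)
```

Because the targets depend only on the judge, they can be computed once and cached before training the agent, which is the speed-up the paragraph above describes.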

We use $\text{BERT}_{\text{BASE}}$ for all learned agents. Learned agents predict the $\text{BERT}_{\text{BASE}}$ judge, as it is more efficient to compute than $\text{BERT}_{\text{LARGE}}$. Each agent $\text{Agent}(i)$ is assigned the answer $A(i)$ that it should support. We train one learned agent to find evidence for an arbitrary answer $i$. We condition $\text{Agent}(i)$ on $i$ using a binary indicator when predicting $L(i)$. We add the indicator to BERT’s first token segment indicator and embed it into vectors $\gamma$ and $\beta$; for each timestep’s features $x$ from BERT, we scale and shift $x$ element-wise: $(\gamma * x) + \beta$ (perez2018film; dumoulin2018feature-wise). See Appendix B for training details.
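The element-wise scale-and-shift conditioning can be sketched in a few lines. Plain Python lists stand in for tensors here, and the feature values and the `film` helper name are illustrative assumptions; in practice the scale and shift vectors would be learned embeddings.

```python
from typing import List


def film(features: List[List[float]], gamma: List[float],
         beta: List[float]) -> List[List[float]]:
    """Apply the same element-wise scale (gamma) and shift (beta) at every timestep."""
    return [[g * x + b for x, g, b in zip(step, gamma, beta)] for step in features]


# Two timesteps with 2-dimensional features (made-up numbers).
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

The same scale and shift vectors are reused across timesteps, so the conditioning adds only two vectors of parameters per indicator value.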

Notably, learning to convince a judge model does not require answer labels to a question. Even if the judge only learns from a few labeled examples, evidence agents can learn to model the judge’s behavior on more data and out-of-distribution data without labels.

3 Experimental Setup

3.1 Evaluating Evidence Agents

Evaluation Desiderata

An ideal evidence agent should be able to find evidence for its answer w.r.t. a judge, regardless (to some extent) of the specific answer it defends. To appropriately evaluate evidence agents, we need to use questions with more than one defensible, passage-supported answer per question. In this way, an agent’s performance will not depend disproportionately on the answer it is to defend, rather than its ability to find evidence.

Multiple-choice QA: RACE and DREAM

For our experiments, we use RACE lai2017race and DREAM sun2018dream, two multiple-choice, passage-based QA datasets. Both consist of reading comprehension exams for Chinese students learning English; teachers explicitly designed answer options to be plausible (even if incorrect), in order to test language understanding. Each question has 4 total answer options in RACE and 3 in DREAM. Exactly one option is correct. DREAM consists of 10K informal, dialogue-based passages. RACE consists of 100K formal, written passages (i.e., news, fiction, or well-written articles). RACE also divides into easier, middle school questions (29%) and harder, high school questions (71%).

Other datasets we considered

Multiple-choice passage-based QA tasks are well-suited for our purposes. Multiple-choice QA allows agents to support clear, dataset-curated possible answers. In contrast, sugawara2018what show that 5-20% of questions in extractive, span-based QA datasets have only one valid candidate option. For example, some “when” questions are about passages with only one date. sugawara2018what argue that multiple-choice datasets such as RACE do not have this issue, as answer candidates are manually created. In preliminary experiments on SQuAD rajpurkar2016squad, we found that agents could only learn to convince the judge model when supporting the correct answer (one answer per question).

3.2 Training and Evaluating Models

Judge Model | RACE | DREAM
Random | 25.0 | 33.3
$\text{TFIDF}(S, [Q; A(i)])$ | 32.6 | 44.4
$\text{TFIDF}(S, A(i))$ | 31.6 | 44.5
$\text{fastText}(S, A(i))$ | 30.4 | 38.4
$\text{BERT}_{\text{BASE}}$ | 65.4 | 61.0
$\text{BERT}_{\text{LARGE}}$ | 69.4 | 64.9
Human Adult* | 94.5 | 98.6
Table 2: RACE and DREAM test accuracy (%) of various judge models using the full passage. Our agents use these models to find evidence. The models cover a spectrum of QA ability. (*) reports ceiling accuracy from original dataset papers.

Our setup is not directly comparable to standard QA setups, as we aim to evaluate evidence rather than raw QA accuracy. However, each judge model’s accuracy is useful to know for analysis purposes. Table 2 shows model accuracies, which cover a broad range. BERT models significantly outperform word-based baselines (TFIDF and fastText), and $\text{BERT}_{\text{LARGE}}$ achieves the best overall accuracy. No model achieves the estimated human ceiling for either RACE lai2017race or DREAM sun2018dream.

Our code is available at We build off AllenNLP gardner2017allennlp using PyTorch paszke2017automatic. For all human evaluations, we use Amazon Mechanical Turk via ParlAI miller2017parlai. Appendix B describes preprocessing and training details.

4 Agents Select General Evidence

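How Often Human Selects Agent’s Answer (%)
Evidence Sentence Selection Method | RACE: Overall | RACE: Agent Answer is Right | RACE: Agent Answer is Wrong | DREAM: Overall | DREAM: Agent Answer is Right | DREAM: Agent Answer is Wrong
Baselines: No Sentence Given | 25.0 | 52.5 | 15.8 | 33.3 | 43.3 | 28.4
Baselines: Human Selection | 41.6 | 75.1 | 30.4 | 50.7 | 84.9 | 33.5
Search Agents querying $\text{TFIDF}(S, [Q; A(i)])$ | 33.5 | 69.6 | 21.5 | 41.7 | 68.8 | 28.1
Search Agents querying $\text{fastText}(S, A(i))$ | 37.1 | 74.2 | 24.7 | 41.5 | 75.6 | 24.5
Search Agents querying $\text{TFIDF}(S, A(i))$ | 38.0 | 71.4 | 26.9 | 43.4 | 75.2 | 27.6
Search Agents querying $\text{BERT}_{\text{BASE}}$ | 38.4 | 68.4 | 28.4 | 50.5 | 82.5 | 34.6
Search Agents querying $\text{BERT}_{\text{LARGE}}$ | 40.1 | 71.0 | 29.9 | 52.3 | 79.4 | 38.7
Learned Agents predicting Search | 40.0 | 71.0 | 29.7 | 49.1 | 78.3 | 34.6
Learned Agents predicting $p(i)$ | 42.0 | 74.6 | 31.1 | 50.0 | 77.3 | 36.3
Learned Agents predicting $\Delta p(i)$ | 41.1 | 73.2 | 30.4 | 48.2 | 76.5 | 34.0
Table 3: Human evaluation: Search Agents select evidence by querying the specified judge model, and Learned Agents predict the strongest evidence w.r.t. a judge model ($\text{BERT}_{\text{BASE}}$); humans then answer the question using the selected evidence sentence (without the full passage). Most agents do on average find evidence for their answer, right or wrong. Agents are more effective at supporting right answers.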

4.1 Human Evaluation of Evidence

Would evidence that convinces a model also be valid evidence to humans? On one hand, there is ample work suggesting that neural networks can learn patterns similar to those humans learn. Convolutional networks trained on ImageNet share similarities with the human visual cortex cadieu2014deep. In machine translation, attention learns to align foreign words with their native counterparts bahdanau2015neural. On the other hand, neural networks often do not behave as humans do. Neural networks are susceptible to adversarial examples, i.e., changes to the input which do or do not change the network’s prediction in surprising ways szegedy2014intriguing; jia2017adversarial; ribeiro2018semantically; alzantot2018generating. Convolutional networks rely heavily on texture geirhos2018imagenettrained, while humans rely on shape landau1998importance. Neural networks trained to recognize textual entailment can rely heavily on dataset biases gururangan2018annotation.

Human evaluation setup

We use human evaluation to assess how effectively agents select sentences that also make humans more likely to provide a given answer, when humans act as the judge. Humans answer based only on the question $Q$, answer options $A$, and a single passage sentence chosen by the agent as evidence for its answer option $A(i)$ (i.e., using the “Individual Sequential Decision-Making” scheme from §2). Appendix C shows the interface and instructions used to collect evaluations. For each of RACE and DREAM, we use 100 test questions and collect 5 human answers for each $(Q, A(i))$ pair for each agent. We also evaluate a human baseline for this task, where 3 annotators select the strongest supporting passage sentence for each $(Q, A(i))$ pair. We report the average results across 3 annotators.

Humans favor answers supported by evidence agents

Humans favor an agent’s answer when shown that agent’s selected evidence, as shown in Table 3 (Appendix D shows results by question type). Without receiving any passage sentences, humans are at random chance at selecting the agent’s answer (25% on RACE, 33% on DREAM), since agents are assigned an arbitrary answer. For all evidence agents, humans favor agent-supported answers more often than the baseline (33.5-42.0% on RACE and 41.5-52.3% on DREAM). For our best agents, the relative margin over the baseline is substantial. In fact, these agents select evidence that is comparable to human-selected evidence. For example, on RACE, humans select the target answer 41.6% of the time when provided with human-selected evidence, compared to 42.0% of the time when provided with evidence selected by the learned agent that predicts $p(i)$.

All agents support right answers more easily than wrong answers. On RACE, the learned agent that predicts $p(i)$ finds strong evidence more than twice as often for correct answers than for incorrect ones (74.6% vs. 31.1%). On RACE and DREAM both, BERT-based agents (search or learned agents) find stronger evidence than word-based agents do. Humans tend to find that BERT-based agents select valid evidence for an answer, right or wrong. On DREAM, word-based agents generally fail to find evidence for wrong answers compared to the no-sentence baseline (28.4% vs. 24.5% for a search-based fastText agent).

On RACE, learned agents that predict the $\text{BERT}_{\text{BASE}}$ judge outperform search agents that directly query the $\text{BERT}_{\text{BASE}}$ judge. This effect may occur if search agents find an adversarial sentence that unduly affects the judge’s answer but that humans do not find to be valid evidence. Appendix A shows one such example. Learned agents may have difficulty predicting such sentences, without directly querying the judge. Appendix E provides some analysis on why learned agents may find more general evidence than search agents do. Learned agents are most accurate at predicting evidence sentences when the sentences have a large impact on the judge model’s confidence in the target answer, and such sentences in turn are more likely to be found as strong evidence by humans. On DREAM, search agents and learned agents perform similarly, likely because DREAM has 14x less training data than RACE.

4.2 Model Evaluation of Evidence

Figure 2: On RACE, how often each judge selects an agent’s answer when given a single agent-chosen sentence. The black line divides learned agents (right) and search agents (left), with human evidence selection in the leftmost column. All agents find evidence that convinces judge models more often than a no-evidence baseline (25%). Learned agents predicting $p(i)$ or $\Delta p(i)$ find the most broadly convincing evidence.
Evaluating an agent’s evidence across models

Beyond human evaluation, we test how general agent-selected evidence is, by testing this evidence against various judge models. We expect evidence agents to most frequently convince the model they are optimized to convince, by nature of their direct training or search objective. The more similar models are, the more we expect evidence from one model to be evidence to another. To some extent, we expect different models to rely on similar patterns to answer questions. Thus, evidence agents should sometimes select evidence that transfers to any model. However, we would not expect agent evidence to transfer to other models if models only exploit method-specific patterns.

Experimental setup

Each agent selects one evidence sentence for each $(Q, A(i))$ pair. We test how often the judge selects an agent’s answer, when given this sentence, $Q$, and $A$. We evaluate on all $(Q, A(i))$ pairs in RACE’s test set. Human evaluations are on a 100 question subset of test.
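The metric in this setup, how often a judge picks the agent's assigned answer, can be sketched as below. The trial format (a list of judge probability vectors paired with the agent's answer index) and the numbers in `toy_trials` are made up for illustration.

```python
from typing import List, Sequence, Tuple


def agent_win_rate(trials: Sequence[Tuple[List[float], int]]) -> float:
    """Percentage of trials where the judge's top answer is the agent's answer.

    Each trial pairs the judge's answer probabilities (given the agent's
    evidence sentence) with the index of the answer the agent supports.
    """
    wins = 0
    for probs, agent_answer in trials:
        predicted = max(range(len(probs)), key=probs.__getitem__)
        wins += int(predicted == agent_answer)
    return 100.0 * wins / len(trials)


# Made-up judge outputs for four (question, answer) pairs with 4 options each.
toy_trials = [
    ([0.7, 0.1, 0.1, 0.1], 0),
    ([0.2, 0.5, 0.2, 0.1], 2),
    ([0.1, 0.1, 0.2, 0.6], 3),
    ([0.4, 0.3, 0.2, 0.1], 0),
]
rate = agent_win_rate(toy_trials)
```

With four answer options and arbitrarily assigned answers, a judge that ignores the evidence would land near 25% on this metric, which is the no-evidence baseline used for RACE.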


Figure 2 plots how often each judge selects an agent’s answer. Without any evidence, judge models are at random at choosing an agent’s assigned answer (25%). All agents find evidence that convinces judge models more often than the no-evidence baseline. Learned agents that predict $p(i)$ or $\Delta p(i)$ find the evidence most broadly considered convincing; other judge models select these agents’ supported answers over 46% of the time. These findings support that evidence agents find general structure despite aiming to convince specific methods with their distinct properties.

Notably, evidence agents are not uniformly convincing across judge models. All evidence agents are most convincing to the judge model they aim to convince; across any given agent’s row, an agent’s target judge model is the model which most frequently selects the agent’s answer. Search agents are particularly effective at finding convincing evidence w.r.t. their target judge model, given that they directly query this model. More broadly, similar models find similar evidence convincing. We find similar results for DREAM (Appendix F).

5 Evidence Agents Aid Generalization

We have shown that agents capture method-agnostic evidence representative of answering a question (the strongest evidence for various answers). We hypothesize that QA models can generalize better out of distribution to more challenging questions by exploiting evidence agents’ capability to understand the problem.

Throughout this section, using various train/test splits of RACE, we train a $\text{BERT}_{\text{BASE}}$ judge on easier examples (involving shorter passages or middle-school exams) and test its generalization to harder examples (involving longer passages or high-school exams). Judge training follows §2.1. We compare QA accuracy when the judge answers using (i) the full passage and (ii) only evidence sentences chosen by competing evidence agents. We report results using the round robin competing agent setup described in §2, as it resulted in higher generalization accuracy than free-for-all competition in preliminary experiments. Each competing agent selects sentences up to a fixed, maximum turn limit; we experiment with 3-6 turns per agent (6-12 total sentences for the judge), and we report the best result. We train learned agents (as described in §2.2) on the full RACE dataset without labels, so these agents can model the judge using more data and on out-of-distribution data.

For reference, we evaluate judge accuracy on a subsequence of randomly sampled sentences; we vary the number of sentences sampled from 6-12 and report the best result. As a lower bound, we train an answer-only model to evaluate how effectively the QA model is using the passage sentences it is given. As an upper bound, we evaluate our $\text{BERT}_{\text{BASE}}$ judge trained on all of RACE, requiring no out-of-distribution generalization.

5.1 Generalizing to Longer Passages

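Train Data | Sentence Selection | RACE, short passages | RACE, long passages | DREAM, long passages
All RACE | Full Passage | 64.7 | 60.0 | 71.2
Short RACE passages | None (Answer-only) | 36.1 | 40.2 | 38.5
Short RACE passages | Full Passage of Subset | 57.4 | 44.1 | 65.0
Short RACE passages | Random Sentences | 49.2 | 44.7 | 48.2
Short RACE passages | $\text{TFIDF}(S, [Q; A(i)])$ | 57.2 | 48.0 | 67.3
Short RACE passages | $\text{fastText}(S, A(i))$ | 57.7 | 50.2 | 64.2
Short RACE passages | $\text{TFIDF}(S, A(i))$ | 57.1 | 47.9 | 64.6
Short RACE passages | Search over $\text{BERT}_{\text{BASE}}$ | 56.7 | 49.6 | 68.9
Short RACE passages | Predict $\text{BERT}_{\text{BASE}}$ | 56.7 | 50.0 | 66.9
Table 4: Judge accuracy (%) by the number of sentences in the test passage. We train a judge on short RACE passages and test its generalization to long passages. The judge is more accurate on long passages when it answers based on only sentences chosen by competing agents (last 5 rows) instead of the full passage. BERT-based agents aid generalization even under test-time domain shift (from RACE to DREAM).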

We train a judge on RACE passages averaging 10 sentences long (all training passages have at most 12 sentences); this subset is only a fraction of RACE. We test the judge on RACE passages averaging 30 sentences long.


Table 4 shows the results. Using the full passage, the judge outperforms an answer-only BERT baseline by 3.9 points (44.1% vs. 40.2%). When answering using the smaller set of agent-chosen sentences, the judge outperforms the baseline by 10.0 points (50.2% vs. 40.2%), more than doubling its relative use of the passage. Both search and learned agents aid the judge model in generalizing to longer passages. The improved generalization is not simply a result of the judge using a shorter passage, as shown by the random sentence selection baseline (44.7%).
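The "more than doubling" claim follows from simple arithmetic on the Table 4 numbers, worked through in the snippet below. The variable names are ours; the three accuracies are the ones quoted in the paragraph above.

```python
# Accuracies (%) on long RACE passages, as quoted in the text.
answer_only = 40.2      # judge sees no passage sentences
full_passage = 44.1     # judge sees the full passage
agent_evidence = 50.2   # judge sees only agent-chosen sentences

# Gain over the answer-only baseline, in percentage points.
gain_full = full_passage - answer_only
gain_agent = agent_evidence - answer_only

# How many times larger the gain is with agent-chosen evidence.
relative_gain = gain_agent / gain_full
```

The gain over the answer-only baseline is roughly 3.9 points with the full passage and 10.0 points with agent-chosen sentences, a ratio of about 2.6.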

5.2 Generalizing Across Domains

We examine if evidence agents aid generalization even in the face of domain shift. We test the judge trained on short RACE passages on long passages from DREAM. We use the same evidence agents from the previous subsection; the learned agent is trained on RACE only, and we do not fine-tune it on DREAM to test its generalization to finding evidence in a new domain. DREAM passages consist entirely of dialogues, use more informal language and shorter sentences, and emphasize general world knowledge and commonsense reasoning sun2018dream. RACE passages are more formal, written articles (e.g. news or fiction).


Table 4 shows that BERT-based evidence agents aid generalization even under domain shift. The model shows notable improvements for RACE → DREAM transfer when it predicts from BERT-based search agent evidence rather than the full passage (68.9% vs. 65.0%). These results support that our best evidence agents capture something fundamental to the problem of QA, despite changes in e.g. content and writing style.

5.3 Generalizing to Harder Questions

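Train Data | Sentence Selection | School Level: Middle | School Level: High
All | Full Passage | 70.8 | 63.2
Middle School only | None (Answer-only) | 38.9 | 40.2
Middle School only | Full Passage of Subset | 66.2 | 50.7
Middle School only | Random Sentences | 54.8 | 47.0
Middle School only | $\text{TFIDF}(S, [Q; A(i)])$ | 65.1 | 50.4
Middle School only | $\text{fastText}(S, A(i))$ | 64.6 | 50.8
Middle School only | $\text{TFIDF}(S, A(i))$ | 64.9 | 51.0
Middle School only | Search over $\text{BERT}_{\text{BASE}}$ | 67.0 | 53.0
Middle School only | Predict $\text{BERT}_{\text{BASE}}$ | 67.3 | 51.9
Table 5: Generalizing to harder questions: We train a judge to answer questions with RACE’s Middle School exam questions only. We test its generalization to High School exam questions. The judge is more accurate when using evidence agent sentences (last 5 rows) rather than the full passage.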
Figure 3: Generalizing to harder questions by question type: We train a judge on RACE Middle School questions and test its generalization to RACE High School questions. To predict the answer, the judge uses either the full passage or evidence sentences chosen by a BERT-based search agent. The worse the judge does on a question category using the full passage, the better it does when using the agent-chosen sentences.

Using RACE, we train a judge on middle-school questions and test it on high-school questions.


Table 5 shows that the judge generalizes to harder questions better by using evidence from either search-based BERT agents (53.0%) or learned BERT agents (51.9%) compared to using the full passage directly (50.7%) or to search-based TFIDF and fastText agents (50.4%-51.0%). Figure 3 shows that the improved generalization comes from questions the model originally generalizes worse on. Simplifying the passage by providing key sentences may aid generalization by e.g. removing extraneous or distracting sentences from passages with more uncommon words or complex sentence structure. Such improvements come at the cost of accuracy on easier, word-matching questions, where it may be simpler to answer with the full passage as seen in training.

6 Evidence Agents Aid Human QA

As observed in §4.1, evidence agents more easily support right answers than wrong ones. Furthermore, evidence agents do aid QA models in generalizing systematically when all answer evidence sentences are presented at once. We hypothesize that when we combine all evidence sentences, humans prefer to choose the correct answer.

Human evaluation setup

Evidence agents compete in a free-for-all setup (§2), and the human acts as the judge. We evaluate how accurately humans can answer questions based only on agent sentences. Appendix C shows the annotation interface and instructions. We collect 5 human answers for each of the 100 test questions.

Humans can answer using evidence sentences alone

As shown in Table 6, humans correctly answer questions using many fewer sentences (3.3 vs. 18.2 on RACE, 2.4 vs. 12.2 on DREAM); they do so while maintaining roughly 90% of human QA accuracy on the full passage (73.2% vs. 82.3% on RACE, 83.8% vs. 93.0% on DREAM). Evidence agents, however, vary in how effectively they aid human QA, compared to answer-agnostic evidence selection. On DREAM, humans answer with 79.1% accuracy using the sentences most similar to the question alone (via fastText), while achieving lower accuracy when using the $\text{BERT}_{\text{LARGE}}$ search agent’s evidence (75.0%) and higher accuracy when using the $\text{BERT}_{\text{BASE}}$ search agent’s evidence (83.8%). We explain the discrepancy by examining how effective agents are at supporting right vs. wrong answers (Table 3 from §4.1); $\text{BERT}_{\text{BASE}}$ is more effective than $\text{BERT}_{\text{LARGE}}$ at finding evidence for right answers (82.5% vs. 79.4%) and less effective at finding evidence for wrong answers (34.6% vs. 38.7%).
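The "20% of the sentences" and "90% of the accuracy" figures can be checked directly from the numbers quoted above. The snippet below is just that arithmetic; the dictionary layout and names are our own.

```python
# (agent-selected, full passage) values quoted in the text.
sentences_read = {"RACE": (3.3, 18.2), "DREAM": (2.4, 12.2)}
human_accuracy = {"RACE": (73.2, 82.3), "DREAM": (83.8, 93.0)}

# Fraction of the passage humans read when shown only agent-selected sentences.
fraction_read = {name: subset / full
                 for name, (subset, full) in sentences_read.items()}

# Fraction of full-passage accuracy that humans retain.
accuracy_retained = {name: subset / full
                     for name, (subset, full) in human_accuracy.items()}
```

Humans read roughly 18% (RACE) and 20% (DREAM) of the sentences while retaining about 89% and 90% of their full-passage accuracy, respectively.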

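Sentences Shown | Selection Method | Human Acc. (%): RACE | Human Acc. (%): DREAM
Full Passage | Full Passage | 82.3 | 93.0
No Passage | Answer-only | 52.5 | 43.3
Subset (~20%) | Human Selection | 73.5 | 82.3
Subset (~20%), Answer-Free Selection | First Sentences | 61.8 | 68.5
Subset (~20%), Answer-Free Selection | $\text{TFIDF}(S, Q)$ | 69.2 | 77.5
Subset (~20%), Answer-Free Selection | $\text{fastText}(S, Q)$ | 69.7 | 79.1
Subset (~20%), Search Agent Selection | $\text{TFIDF}(S, [Q; A(i)])$ | 66.1 | 70.0
Subset (~20%), Search Agent Selection | $\text{TFIDF}(S, A(i))$ | 73.2 | 77.0
Subset (~20%), Search Agent Selection | $\text{fastText}(S, A(i))$ | 73.2 | 77.3
Subset (~20%), Search Agent Selection | $\text{BERT}_{\text{BASE}}$ | 69.9 | 83.8
Subset (~20%), Search Agent Selection | $\text{BERT}_{\text{LARGE}}$ | 72.4 | 75.0
Subset (~20%), Learned Agent Selection | Predicting Search | 66.5 | 80.0
Subset (~20%), Learned Agent Selection | Predicting $p(i)$ | 71.6 | 77.8
Subset (~20%), Learned Agent Selection | Predicting $\Delta p(i)$ | 65.7 | 81.5
Table 6: Human accuracy using evidence agent sentences: Each agent selects a sentence supporting its own answer. Humans answer the question given these agent-selected passage sentences only. Humans still answer most questions correctly, while reading many fewer passage sentences.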

7 Related Work

Here, we discuss further related work, beyond that discussed in §4.1 on (dis)similarities between patterns learned by humans and neural networks.

Evidence Extraction

Various papers have explored the related problem of extracting evidence or summaries to aid downstream QA. wang2018evidence-extraction concurrently introduced a neural model that extracts evidence specifically for the correct answer, as an intermediate step in a QA pipeline. Prior work uses similar methods to explain what a specific model has learned lei2016rationalizing; li2016understanding; yu2019learning. Others extract evidence to improve downstream QA efficiency over large amounts of text choi2017coarse; wang2018r3; wang2018evidence-aggregation. More broadly, extracting evidence can facilitate fact verification thorne2018fever and debate (e.g., IBM Project Debater).

Generic Summarization

In contrast, various papers focus primarily on summarization rather than QA, using downstream QA accuracy only as a reward to optimize generic (question-agnostic) summarization models arumae2018reinforced; arumae2019guiding; eyal2019question.


Evidence extraction can be viewed as a form of debate, in which multiple agents support different stances (irving2018ai; irving2019ai). chen2018cicero show that evidence-based debate improves the accuracy of crowdsourced labels, similar to our work which shows its utility in natural language QA.

8 Conclusion

We examined if it was possible to automatically distill general insights for passage-based question answering, by training evidence agents to convince a judge model of any given answer. Humans correctly answer questions while reading only 20% of the sentences in the full passage, showing the potential of our approach for assisting humans in question answering tasks. We examine how selected evidence affects the answers of humans as well as other QA models, and we find that agent-selected evidence is generalizable. We exploit these capabilities by employing evidence agents to facilitate QA models in generalizing to longer passages and out-of-distribution test sets of qualitatively harder questions.


EP was supported by the NSF Graduate Research Fellowship and ONR grant N00014-16-1-2698. KC thanks support from eBay and NVIDIA. We thank Adam Gleave, David Krueger, Geoffrey Irving, Katharina Kann, Nikita Nangia, and Sam Bowman for helpful conversations and feedback. We thank Jack Urbanek, Jason Lee, Ilia Kulikov, Ivanka Perez, Ivy Perez, and our Mechanical Turk workers for help with human evaluations.


Passage (DREAM)
W:   What changes do you think will take place in the next 50 years?
M:   I imagine that the greatest change will be the difference between humans and machines.
W:   What do you mean?
M:   I mean it will be harder to tell the difference between the human and the machine.
W:   Can you describe it more clearly?
M:   As science develops, it will be possible for all parts of one’s body to be replaced. A computer will work like the human brain. The computer can recognize one’s feelings, and act in a feeling way.
W:   You mean man-made human beings will be produced? Come on! That’s out of the question!
M:   Don’t get excited, please. That’s only my personal imagination!
W:   Go on, please. I won’t take it seriously.
M:   We will then be able to create a machine that is a copy of ourselves. We’ll appear to be alive long after we are dead.
W:   What a ridiculous idea!
M: It’s possible that a way will be found to put our spirit into a new body. Then, we can choose to live as long as we want.
W:   In that case, the world would be a hopeless mess!
Q: What are the two speakers talking about?
A. Computers in the future.
B. People’s imagination.
C. Possible changes in the future.
Table 7: An example from our best evidence agent on DREAM, a search agent. Each evidence agent has chosen a sentence (in color) that convinces the judge model to predict the agent’s designated answer with over 99% confidence.
Passage (RACE)
Who doesn’t love sitting beside a cosy fire on a cold winter’s night? Who doesn’t love to watch flames curling up a chimney? Fire is one of man’s greatest friends, but also one of his greatest enemies. Many big fires are caused by carelessness. A lighted cigarette thrown out of a car or train window or a broken bottle lying on dry grass can start a fire. Sometimes, though, a fire can start on its own. Wet hay can begin burning by itself. This is how it happens: the hay starts to rot and begins to give off heat which is trapped inside it. Finally, it bursts into flames. That’s why farmers cut and store their hay when it’s dry. Fires have destroyed whole cities. In the 17th century, a small fire which began in a baker’s shop burnt down nearly every building in London. Moscow was set on fire during the war against Napoleon. This fire continued burning for seven days. And, of course, in 64 A.D. a fire burnt Rome. Even today, in spite of modern fire-fighting methods, fire causes millions of pounds’ worth of damage each year both in our cities and in the countryside. It has been wisely said that fire is a good servant but a bad master.
Q: Many big fires are caused…
A. by cigarette     B. by their own     C. by dry grass     D. by people’s carelessness
Table 8: In this example, each answer’s agent has chosen a sentence (in color) that individually influenced a neural QA model to answer in its favor. When human evaluators answer the question using only one agent’s sentence, evaluators select the agent-supported answer. When humans read all 4 agent-chosen sentences together, they correctly answer “D”, without reading the full passage.
Passage (RACE)
Yueyang Tower lies in the west of Yueyang City, near the Dongting Lake. It was first built for soldiers to rest on and watch out. In the Three Kingdoms Period, Lu Su, General of Wu State, trained his soldiers here. In 716, Kaiyuan of Tang Dynasty, General Zhang Shuo was sent to defend at Yuezhou and he rebuilt it into a tower named South Tower, and then Yueyang Tower. In 1044, Song Dynasty, Teng Zijing was stationed at Baling Jun, the ancient name of Yueyang City. In the second year, he had the Yueyang Tower repaired and had poems by famous poets written on the walls of the tower. Fan Zhongyan, a great artist and poet, was invited to write the well - known poem about Yueyang Tower. In his A Panegyric of the Yueyang Tower, Fan writes: ”Be the first to worry about the troubles across the land, the last to enjoy universal happiness.” His words have been well - known for thousands of years and made the tower even better known than before. The style of Yueyang Tower is quite special. The main tower is 21.35 meters high with 3 stories, flying eave and wood construction, the helmet-roof of such a large size is a rarity among the ancient architectures in China. Entering the tower, you’ll see ”Dongting is the water of the world, Yueyang is the tower of the world”. Moving on, there is a platform that once used as the training ground for the navy of Three-Kingdom Period general Lu Su. To its south is the Huaifu Pavilion in honor of Du Fu. Stepping out of the Xiaoxiang Door, the Xianmei Pavilion and the Sanzui Pavilion can be seen standing on two sides. In the garden to the north of the tower is the tomb of Xiaoqiao, the wife of Zhou Yu.
Q: Yueyang Tower was once named…
A. South Tower ✓    B. Xianmei Tower     C. Sanzui Tower     D. Baling Tower
Table 9: An example where each answer’s search agent successfully influences the answerer to predict that agent’s answer; however, the supporting sentences for “B” and “C” are not evidence for the corresponding answers. These search agents have found adversarial examples in the passage that unduly influence the answerer. Thus, it can help to present the answerer model with evidence for all answers at once, so the model can weigh potentially adversarial evidence against valid evidence. In this case, the model correctly answers “A” when predicting based on all 4 agent-chosen sentences.
Passage (RACE)
A desert is a beautiful land of silence and space. The sun shines, the wind blows, and time and space seem endless. Nothing is soft. The sand and rocks are hard, and many of the plants even have hard needles instead of leaves. The size and location of the world’s deserts are always changing. Over millions of years, as climates change and mountains rise, new dry and wet areas develop. But within the last 100 yeas, deserts have been growing at a frightening speed. This is partly because of natural changes, but the greatest makers are humans. Humans can make deserts, but humans can also prevent their growth. Algeria Mauritania is planting a similar wall around Nouakchott, the capital. Iran puts a thin covering of petroleum on sandy areas and plants trees. The oil keeps the water and small trees in the land, and men on motorcycles keep the sheep and goats away. The USSR and India are building long canals to bring water to desert areas.
Q: Which of the following is NOT true?
A. The greatest desert makers are humans.     B. There aren’t any living things in the deserts.
C. Deserts have been growing quickly.     D. The size of the deserts is always changing.
Table 10: In this example, the answerer correctly predicts “B,” no matter the passage sentence (in color) a search agent provides. This behavior occurred in several cases where the question and answer options contained a strong bias in wording that cues the right answer. Statements including “all,” “never,” or “there aren’t any” are often false, which in this example signals the right answer. gururangan2018annotation find similar patterns in natural language inference data, where “no,” “never,” and “nothing” strongly signal that one statement contradicts another.

Appendix A Additional Evidence Agent Examples

We show additional examples of evidence agent sentence selections in Table 7 (DREAM), as well as Tables 8, 9, and 10 (RACE).

Appendix B Implementation Details

B.1 Preprocessing

We use the BERT tokenizer to tokenize the text for all methods (including TFIDF and fastText). To divide the passage into sentences, we use the following tokens as end-of-sentence markers: “.”, “?”, “!”, and the last passage token. For BERT, we use the required WordPiece subword tokenization schuster2012japanese. For TFIDF, we also use WordPiece tokenization to minimize the number of rare or unknown words. For consistency, this tokenization uses the same vocabulary as our BERT models do. FastText is trained to embed whole words directly, so we do not use subword tokenization.
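The sentence-splitting rule above can be sketched as follows. This is a minimal illustration, not our released code: the function name `split_into_sentences` is hypothetical, and a whitespace split stands in for the BERT tokenizer so that the example stays self-contained.

```python
from typing import List

# End-of-sentence markers named in the text; the last passage token also closes a sentence.
EOS_MARKERS = {".", "?", "!"}


def split_into_sentences(tokens: List[str]) -> List[List[str]]:
    """Group a tokenized passage into sentences, closing a sentence at each marker."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        current.append(token)
        if token in EOS_MARKERS:
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without a marker: the last passage token ends the sentence.
        sentences.append(current)
    return sentences


# Assumption: whitespace tokenization stands in for the BERT tokenizer.
tokens = "who does n't love fire ? it is warm . but dangerous".split()
sentences = split_into_sentences(tokens)
```

Here the final sentence has no closing punctuation, so it is closed by the last passage token.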

B.2 Training the Judge

Here we provide additional implementation details of the various judge models.

B.2.1 TFIDF

To limit the number of rare or unknown words, we use subword tokenization via the BERT WordPiece tokenizer. Using this tokenizer enables us to split sentences in an identical manner as for BERT so that results are comparable. For a given dataset, we compute inverse document frequencies for subword tokens using the entire corpus.
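As a rough sketch of the corpus-level statistic described above, the snippet below computes inverse document frequencies over subword tokens. The smoothed formula is an assumption (a common variant), since the exact weighting is not specified here; the function name and the toy corpus are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def inverse_document_frequencies(documents: Iterable[List[str]]) -> Dict[str, float]:
    """Return a smoothed IDF for each subword token over tokenized documents."""
    doc_count = 0
    doc_freq: Counter = Counter()
    for tokens in documents:
        doc_count += 1
        doc_freq.update(set(tokens))  # count each token at most once per document
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1.
    return {
        token: math.log((1 + doc_count) / (1 + df)) + 1.0
        for token, df in doc_freq.items()
    }


# Toy corpus of already-tokenized passages (illustrative, not from RACE or DREAM).
corpus = [["fire", "##s", "burn"], ["fire", "is", "hot"]]
idf = inverse_document_frequencies(corpus)
```

Tokens that occur in every document receive the minimum weight, while tokens that occur in few documents are weighted more heavily.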

B.2.2 BERT

Architecture and Hyperparameters

We use the uncased  pre-trained transformer. We sweep over BERT fine-tuning hyperparameters, using the following ranges: learning rate and batch size .

Segment Embeddings

BERT uses segment embeddings to indicate two distinct, contiguous sequences of input text. These segments are also separated by a special [SEP] token. The first segment is , and the second segment is [; ].

Truncating Long Passages

BERT can only process a maximum of 512 tokens at once. Thus, we truncate the ends of longer passages; we always include the full question and answer , as these are generally important in answering the question. We include the maximum number of passage tokens such that the entire input (i.e., or ) fits within 512 tokens.
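The truncation rule can be illustrated with the sketch below. It assumes three special tokens (one [CLS] and two [SEP]) are added to the input; that count, the function name, and the placeholder tokens are assumptions made for illustration.

```python
from typing import List

MAX_TOKENS = 512  # BERT's input limit, as stated in the text


def truncate_passage(
    passage: List[str],
    question: List[str],
    answer: List[str],
    num_special_tokens: int = 3,  # assumption: [CLS] plus two [SEP] tokens
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Keep the question and answer whole; drop tokens from the end of the passage."""
    budget = max_tokens - len(question) - len(answer) - num_special_tokens
    if budget < 0:
        raise ValueError("question and answer alone exceed the token limit")
    return passage[:budget]


# Placeholder tokens, used only to show the length arithmetic.
passage = ["p"] * 600
question = ["q"] * 10
answer = ["a"] * 20
kept = truncate_passage(passage, question, answer)
```

With these lengths, 479 passage tokens are kept, so the full input occupies exactly 512 positions.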

Training Procedure

We train for up to 10 epochs, stopping early if validation accuracy decreases after an epoch once (RACE) or 3 times (DREAM). For DREAM, we also decay the learning rate by whenever validation accuracy does not decrease after an epoch.
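A minimal sketch of this stopping rule follows. It assumes that a decrease is measured against the previous epoch and that decreases are counted cumulatively rather than consecutively; the text does not settle either point, and the function name is hypothetical. Learning rate decay is omitted.

```python
from typing import List


def epochs_until_stop(
    val_accuracies: List[float], patience: int, max_epochs: int = 10
) -> int:
    """Return how many epochs run before stopping, given per-epoch validation accuracy."""
    decreases = 0
    previous = None
    for epoch, accuracy in enumerate(val_accuracies[:max_epochs], start=1):
        if previous is not None and accuracy < previous:
            decreases += 1  # assumption: decreases are counted cumulatively
            if decreases >= patience:
                return epoch
        previous = accuracy
    return min(len(val_accuracies), max_epochs)


# Illustrative accuracy curves, not measured results.
race_epochs = epochs_until_stop([0.60, 0.70, 0.65, 0.80], patience=1)
dream_epochs = epochs_until_stop([0.60, 0.70, 0.65, 0.80, 0.75, 0.70], patience=3)
```

Under these assumptions, `patience=1` corresponds to the RACE setting and `patience=3` to the DREAM setting.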

B.3 Training Evidence Agents

We use the  architecture for all learned evidence agents. The training details are the same as for the BERT judge, with the exceptions listed below. Agents make sentence-level predictions via end-of-sentence token positions.
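One way to read sentence-level prediction via end-of-sentence positions is sketched below: per-token scores are gathered at the end-of-sentence positions and normalized over sentences. The softmax normalization, the scalar per-token scores, and all names here are assumptions for illustration, not a description of our exact output layer.

```python
import math
from typing import List


def sentence_distribution(
    token_scores: List[float], eos_positions: List[int]
) -> List[float]:
    """Softmax over the scores at end-of-sentence positions: one probability per sentence."""
    scores = [token_scores[i] for i in eos_positions]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical per-token scores for a six-token passage with two sentences.
token_scores = [0.1, 0.3, 2.0, -0.5, 0.0, 1.0]
eos_positions = [2, 5]  # indices of the end-of-sentence tokens
probs = sentence_distribution(token_scores, eos_positions)
chosen = max(range(len(probs)), key=probs.__getitem__)
```

The agent would then quote the sentence with the highest probability, here the first one.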


Training learned agents on RACE is expensive, due to the dataset size and number of answer options to make predictions for. Thus, for these agents only (not DREAM agents), we sweep over a limited range that works well: learning rate and batch size .

Training Procedure

We use early stopping based on validation loss instead of answering accuracy, since evidence agents do not predict the correct answer.

Appendix C Human Evaluation Details

Figure 4: Interface for humans to answer questions based on one agent-selected passage sentence only. In this example from DREAM, a learned agent supports the correct answer (B).
Figure 5: Interface for humans to answer questions based on agent-selected passage sentences only. Each answer’s evidence agent selects one sentence. These sentences are combined and shown to the human, in the order they appear in the passage. In this example from RACE, the agents are search-based, and the correct answer is B.

For all human evaluations, we filter out workers who perform poorly on a few representative examples of the evaluation task. We pay workers on average $15.48 per hour according to TurkerView. We require workers to be from predominantly English-speaking countries: Australia, Canada, Great Britain, New Zealand, or the U.S. We do not use results from workers who complete the evaluation significantly faster than other workers (i.e., less than a few seconds per question). To incentivize workers, we also offer a bonus for answering questions more accurately than the average worker. Figures 4 and 5 show two examples of our evaluation setup.

Appendix D Human Evaluation of Agent Evidence by Question Category

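How Often Human Selects Agent’s Answer (%)

RACE (School Level: Middle, High; Question Type: Word Match through Ambiguous)
Evidence Sentence Selection Method | Overall | Middle | High | Word Match | Paraphrase | Single-Sent. Reasoning | Multi-Sent. Reasoning | Ambiguous
Baselines: No Sentence | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0
Baselines: Human Selection | 38.1 | 46.4 | 39.5 | 44.6 | 41.3 | 41.7 | 41.7 | 38.5
Search Agents querying…: TFIDF | 33.5 | 36.5 | 32.2 | 35.0 | 36.1 | 31.8 | 34.2 | 32.7
Search Agents querying…: TFIDF | 38.0 | 41.8 | 36.4 | 44.8 | 39.9 | 38.4 | 35.2 | 31.1
Search Agents querying…: fastText | 37.1 | 40.3 | 35.7 | 38.2 | 37.9 | 38.1 | 36.2 | 34.4
Search Agents querying…: | 38.4 | 40.4 | 37.5 | 44.5 | 36.7 | 39.2 | 37.2 | 39.4
Search Agents querying…: | 40.1 | 44.5 | 38.3 | 41.3 | 38.8 | 39.9 | 42.0 | 39.0
Learned Agents predicting…: Search | 40.0 | 42.0 | 39.2 | 43.7 | 41.8 | 39.3 | 41.2 | 38.1
Learned Agents predicting…: | 42.0 | 44.3 | 41.0 | 47.0 | 43.6 | 42.3 | 41.9 | 34.3
Learned Agents predicting…: | 41.1 | 44.9 | 39.5 | 43.7 | 41.4 | 41.0 | 41.9 | 39.6

DREAM (Question Type: Common Sense through Summary)
Evidence Sentence Selection Method | Overall | Common Sense | Logic | Word-Match/Paraphrase | Summary
Baselines: No Sentence | 33.3 | 33.3 | 33.3 | 33.3 | 33.3
Baselines: Human Selection | 50.7 | 50.0 | 50.6 | 48.2 | 52.1
Search Agents querying…: TFIDF | 41.7 | 37.2 | 42.4 | 37.1 | 41.8
Search Agents querying…: TFIDF | 43.4 | 40.0 | 42.7 | 46.4 | 42.7
Search Agents querying…: fastText | 41.5 | 41.0 | 42.2 | 37.0 | 40.7
Search Agents querying…: | 50.5 | 48.2 | 50.6 | 52.1 | 50.2
Search Agents querying…: | 52.3 | 49.8 | 50.3 | 59.3 | 54.5
Learned Agents predicting…: Search | 49.1 | 44.6 | 49.9 | 47.9 | 45.9
Learned Agents predicting…: | 50.0 | 47.6 | 50.1 | 47.3 | 49.6
Learned Agents predicting…: | 48.2 | 45.5 | 47.1 | 55.5 | 47.2

Table 11: Human evaluations: Search Agents select evidence by querying the specified judge model, and Learned Agents predict the strongest evidence w.r.t. a judge model; humans then answer the question using the selected evidence sentence (without the full passage).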

We show a detailed breakdown of results from §4.1, where humans answer questions using an agent-chosen sentence. Table 11 shows how often humans select the agent-supported answer, broken down by question type. Agents that perform better generally do so across all categories. However, agents that incorporate neural models generally achieve larger gains over word-based methods on multi-sentence reasoning questions on RACE.

Appendix E Analysis

Figure 6: Learned agent validation accuracy at predicting the top sentence chosen by search over the judge (on RACE). The stronger evidence a judge model finds a sentence to be, the easier it is to predict as being an answer’s strongest evidence sentence in the passage. This effect holds regardless of the agent’s particular training objective.
Figure 7: We find the passage sentence that would best support an answer to a particular judge model (i.e., using a search agent). We plot the judge’s probability of the target answer given that sentence against how often humans also select that target answer given that same sentence. Humans tend to find a sentence to be strong evidence for an answer when the judge model finds it to be strong evidence.
Highly convincing evidence is easiest to predict

Figure 6 plots the accuracy of a search-predicting evidence agent at predicting the search-chosen sentence, as a function of the magnitude of that sentence’s effect on the judge’s probability of the target answer. The greater a sentence’s effect on the judge’s confidence, the more easily search-predicting agents predict it.
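An analysis of this kind can be sketched by binning examples by the search-chosen sentence’s effect and computing agent accuracy within each bin. The snippet is illustrative only: the records are made-up placeholders rather than our data, the effect is assumed to lie in [0, 1], and the function name and bin count are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_effect(
    records: List[Tuple[float, bool]], num_bins: int = 4
) -> Dict[int, float]:
    """Bin (effect, agent_was_correct) pairs by effect; return accuracy per non-empty bin."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for effect, correct in records:
        # Assumption: effect lies in [0, 1]; values outside are clamped to the edge bins.
        bin_index = max(0, min(int(effect * num_bins), num_bins - 1))
        totals[bin_index] += 1
        hits[bin_index] += int(correct)
    return {index: hits[index] / totals[index] for index in sorted(totals)}


# Placeholder records: (effect on judge probability, whether the agent matched search).
records = [(0.10, False), (0.20, True), (0.90, True), (0.95, True)]
per_bin = accuracy_by_effect(records)
```

Plotting the per-bin accuracies against the bin centers gives a curve of the same form as Figure 6.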

Strong evidence to a model tends to be strong evidence to humans

Figure 7 shows that sentences a judge model finds to be strong evidence also tend to be strong evidence to humans. Combined with the previous result, we can see that learned agents are more accurate at predicting sentences that humans find to be strong evidence.

Appendix F Model Evaluation of Evidence on DREAM

Figure 8: On DREAM, how often each judge selects an agent’s answer when given a single agent-chosen sentence. The black line divides learned agents (right) and search agents (left), with human evidence selection in the leftmost column. All agents find evidence that convinces judge models more often than a no-evidence baseline (33%). Learned agents predicting or find the most broadly convincing evidence.

Figure 8 shows how convincing various judge models find each evidence agent. Our findings on DREAM are similar to those from RACE in §4.2.
