Multi-Hop Paragraph Retrieval for Open-Domain Question Answering


Yair Feldman    Ran El-Yaniv
Department of Computer Science
Technion – Israel Institute of Technology
Haifa, Israel
{yairf11, rani}

This paper is concerned with the task of multi-hop open-domain Question Answering (QA). This task is particularly challenging since it requires the simultaneous performance of textual reasoning and efficient searching. We present a method for retrieving multiple supporting paragraphs, nested amidst a large knowledge base, which contain the necessary evidence to answer a given question. Our method iteratively retrieves supporting paragraphs by forming a joint vector representation of both a question and a paragraph. The retrieval is performed by considering contextualized sentence-level representations of the paragraphs in the knowledge source. Our method achieves state-of-the-art performance over two well-known datasets, SQuAD-Open and HotpotQA, which serve as our single- and multi-hop open-domain QA benchmarks, respectively. Code is available online.


1 Introduction

Textual Question Answering (QA) is the task of answering natural language questions given a set of contexts from which the answers to these questions can be inferred. This task, which falls under the domain of natural language understanding, has been attracting massive interest due to extremely promising results that were achieved using deep learning techniques. These results were made possible by the recent creation of a variety of large-scale QA datasets, such as TriviaQA (Joshi et al., 2017) and SQuAD (Rajpurkar et al., 2016). The latest state-of-the-art methods are even capable of outperforming humans on certain tasks (Devlin et al., 2018).

The basic and arguably the most popular task of QA is often referred to as Reading Comprehension (RC), in which each question is paired with a relatively small number of paragraphs (or documents) from which the answer can potentially be inferred. The objective in RC is to extract the correct answer from the given contexts or, in some cases, deem the question unanswerable (Rajpurkar et al., 2018). Most large-scale RC datasets, however, are built in such a way that the answer can be inferred using a single paragraph or document. This kind of reasoning is termed single-hop reasoning, since it requires reasoning over a single piece of evidence. A more challenging task, called multi-hop reasoning, is one that requires combining evidence from multiple sources (Talmor and Berant, 2018; Welbl et al., 2018; Yang et al., 2018). Figure 1 provides an example of a question requiring multi-hop reasoning. To answer the question, one must first infer from the first context that Alex Ferguson is the manager in question, and only then can the answer to the question be inferred with any confidence from the second context.

Question: The football manager who recruited David Beckham managed Manchester United during what timeframe?

Context 1: The 1995–96 season was Manchester United’s fourth season in the Premier League … Their triumph was made all the more remarkable by the fact that Alex Ferguson had drafted in young players like Nicky Butt, David Beckham, Paul Scholes and the Neville brothers, Gary and Phil.

Context 2: Sir Alexander Chapman Ferguson, CBE (born 31 December 1941) is a Scottish former football manager and player who managed Manchester United from 1986 to 2013. He is regarded by many players, managers and analysts to be one of the greatest and most successful managers of all time.

Figure 1: An example of a question and its answer contexts from the HotpotQA dataset requiring multi-hop reasoning and retrieval. The first reasoning hop is highlighted in green, the second hop in purple, and the entity connecting the two is highlighted in blue bold italics. In the first reasoning hop, one has to infer that the manager in question is Alex Ferguson. Without this knowledge, the second context cannot possibly be retrieved with confidence, as the question could refer to any of the club’s managers throughout its history. Therefore, an iterative retrieval is needed in order to correctly retrieve this context pair.

Another setting for QA is open-domain QA, in which questions are given without any accompanying contexts, and one is required to locate the contexts relevant to the questions in a large knowledge source (e.g., Wikipedia), and then extract the correct answer using an RC component. Interest in this task has recently resurged following the work of Chen et al. (2017), who used a TF-IDF based retriever to find potentially relevant documents, followed by a neural RC component that extracted the most probable answer from the retrieved documents. While this methodology performs reasonably well for questions requiring single-hop reasoning, its performance decreases significantly when used for open-domain multi-hop reasoning.

We propose a new approach to accomplishing this task, called iterative multi-hop retrieval, in which one iteratively retrieves the necessary evidence to answer a question. We believe this iterative framework is essential for answering multi-hop questions, due to the nature of their reasoning requirements.

Our main contributions are the following:

  • We propose a novel multi-hop retrieval approach, which we believe is imperative for truly solving the open-domain multi-hop QA task.

  • We show the effectiveness of our approach, which achieves state-of-the-art results in both single- and multi-hop open-domain QA benchmarks.

  • We also propose using sentence-level representations for retrieval, and show the possible benefits of this approach over paragraph-level representations.

While there are several works that discuss solutions for multi-hop reasoning (Dhingra et al., 2018; Zhong et al., 2019), to the best of our knowledge, this work is the first to propose a viable solution for open-domain multi-hop QA.

2 Task Definition

We define the open-domain QA task by a triplet (KS, q, a), where KS = {P_1, P_2, …, P_|KS|} is a background knowledge source and each P_i = (p_1, p_2, …, p_{l_i}) is a textual paragraph consisting of l_i tokens, q = (q_1, q_2, …, q_m) is a textual question consisting of m tokens, and a = (a_1, a_2, …, a_n) is a textual answer consisting of n tokens, typically a span of tokens in some P_i, or optionally a choice from a predefined set of possible answers. The objective of this task is to find the answer a to the question q using the background knowledge source KS. Formally speaking, our task is to learn a function φ such that a = φ(q, KS).

Single-Hop Retrieval

In the classic and most simple form of QA, questions are formulated in such a way that the evidence required to answer them may be contained in a single paragraph, or even in a single sentence. Thus, in the open-domain setting, it might be sufficient to retrieve a single relevant paragraph P using the information present in the given question q, and have a reading comprehension model extract the answer a from P. We call this task variation single-hop retrieval.

Multi-Hop Retrieval

In contrast to the single-hop case, there are types of questions whose answers can only be inferred by using at least two different paragraphs. The ability to reason with information taken from more than one paragraph is known in the literature as multi-hop reasoning (Welbl et al., 2018). In multi-hop reasoning, not only might the evidence be spread across multiple paragraphs, but it is often necessary to first read a subset of these paragraphs in order to extract the useful information from the other paragraphs, which might otherwise be understood as not completely relevant to the question. This situation becomes even more difficult in the open-domain setting, where one must first find an initial evidence paragraph in order to be able to retrieve the rest. This is demonstrated in Figure 1, where one can observe that the second context alone may appear to be irrelevant to the question at hand and the information in the first context is necessary to retrieve the second part of the evidence correctly.

We extend the multi-hop reasoning ability to the open-domain setting, referring to it as multi-hop retrieval, in which the evidence paragraphs are retrieved in an iterative fashion. We focus on this task and limit ourselves to the case where two iterations of retrieval are necessary and sufficient.

Figure 2: A high-level overview of our solution, MUPPET.

3 Methodology

Our solution, which we call MUPPET (multi-hop paragraph retrieval), relies on the following basic scheme consisting of two main components: (a) a paragraph and question encoder, and (b) a paragraph reader. The encoder is trained to encode paragraphs into d-dimensional vectors, and to encode questions into search vectors in the same vector space. Then, a maximum inner product search (MIPS) algorithm is applied to find the paragraphs most similar to a given question. Several algorithms exist for fast (and possibly approximate) MIPS, such as the one proposed by Johnson et al. (2017). The most similar paragraphs are then passed to the paragraph reader, which, in turn, extracts the most probable answer to the question.

It is critical that the paragraph encodings do not depend on the questions. This enables storing precomputed paragraph encodings and executing efficient MIPS when given a new search vector. Without this property, any new question would require the processing of the complete knowledge source (or a significant part of it).

To support multi-hop retrieval, we propose the following extension to the basic scheme. Given a question q, we first obtain its encoding using the encoder. Then, we transform it into a search vector q^s, which is used to retrieve the top-k relevant paragraphs using MIPS. In each subsequent retrieval iteration, we use the k paragraphs retrieved in the previous iteration to reformulate the search vector. This produces k new search vectors, q̃^s_1, …, q̃^s_k, where q̃^s_i is derived from the question and the i-th retrieved paragraph, which are used in the same manner as in the first iteration to retrieve the next top-k paragraphs, again using MIPS. This method can be seen as performing a beam search of width k in the encoded paragraphs’ space. A high-level view of the described solution is given in Figure 2.
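The iterative retrieval scheme can be sketched as follows. This is an illustrative NumPy sketch rather than our actual implementation: `mips_topk` performs exact inner-product search (a fast, possibly approximate index would be used in practice), and `reformulate` is a hypothetical stand-in for the encoder's reformulation component, which maps a retrieved paragraph to a new search vector.

```python
import numpy as np

def mips_topk(index_vecs, query, k):
    """Exact maximum inner product search: top-k rows of index_vecs by dot product."""
    scores = index_vecs @ query
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

def iterative_retrieve(index_vecs, q_search, reformulate, k=2, hops=2):
    """Beam-search-style retrieval of width k over `hops` iterations.

    `reformulate(paragraph_id)` stands in for the trained reformulation
    component: it maps a retrieved paragraph to a new search vector.
    """
    retrieved = []
    queries = [q_search]
    for _ in range(hops):
        hop_ids = []
        for q in queries:
            hop_ids.extend(mips_topk(index_vecs, q, k).tolist())
        retrieved.append(hop_ids)
        # one new search vector per paragraph retrieved in this hop
        queries = [reformulate(pid) for pid in hop_ids]
    return retrieved
```

In a full implementation, the new search vectors of a hop would compete jointly for the next top-k paragraphs rather than each retrieving k of its own.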

(a) Sentence Encoder
(b) Reformulation Component
Figure 3: Architecture of the main components of our paragraph and question encoder. (a) Our sentence encoder architecture. The model receives a series of tokens as input and produces a sequence of sentence representations. (b) Our reformulation component architecture. This layer receives contextualized representations of a question and a paragraph, and produces a reformulated representation of the question.

3.1 Paragraph and Question Encoder

We define our encoder model in the following way. Given a paragraph P consisting of k sentences (S_1, S_2, …, S_k) and m tokens (t_1, t_2, …, t_m), such that m = l_1 + l_2 + … + l_k, where l_i is the length of the i-th sentence, our encoder generates k respective d-dimensional encodings (s_1, s_2, …, s_k), one for each sentence. This is in contrast to previous work in paragraph retrieval in which only a single fixed-size representation is used for each paragraph (Lee et al., 2018; Das et al., 2019). The encodings are created by passing P through the following layers.

Word Embedding

We use the same embedding layer as the one suggested by Clark and Gardner (2018). Each token t is embedded into a vector using both character-level and word-level information. The word-level embedding, t_w, is obtained via pretrained word embeddings. The character-level embedding of a token with n_c characters (c_1, c_2, …, c_{n_c}) is obtained in the following manner: each character c_i is embedded into a fixed-size vector e_i. We then pass each token’s character embeddings through a one-dimensional convolutional neural network, followed by max-pooling over the filter dimension. This produces a fixed-size character-level representation for each token, t_c. Finally, we concatenate the word-level and character-level embeddings to form the final word representation, t_e = [t_w; t_c].
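The character-level branch of this embedding layer can be illustrated as follows. This is a NumPy sketch under assumed dimensions; in the real model the character embeddings and convolution filters are learned parameters, and the convolution is applied efficiently in batch.

```python
import numpy as np

def char_cnn_embedding(char_vecs, filters, width=3):
    """Character-level token embedding: a 1-D convolution over the character
    sequence followed by max-pooling over positions.

    char_vecs: (n_chars, char_dim) embeddings of one token's characters.
    filters:   (n_filters, width * char_dim) convolution filters.
    Returns a fixed-size (n_filters,) vector regardless of token length.
    """
    n_chars, char_dim = char_vecs.shape
    # pad so that even very short tokens yield at least one window
    pad = np.zeros((width - 1, char_dim))
    padded = np.vstack([pad, char_vecs, pad])
    windows = np.stack([padded[i:i + width].ravel()
                        for i in range(len(padded) - width + 1)])
    conv = windows @ filters.T            # (n_windows, n_filters)
    return conv.max(axis=0)               # max-pool over positions

def token_embedding(word_vec, char_vecs, filters):
    """Concatenate the word-level and character-level embeddings: [t_w; t_c]."""
    return np.concatenate([word_vec, char_cnn_embedding(char_vecs, filters)])
```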

Recurrent Layer

After obtaining the word representations, we use a bidirectional GRU (Cho et al., 2014) to process the paragraph and obtain the contextualized word representations, (h_1, h_2, …, h_m) = BiGRU(t_e^1, t_e^2, …, t_e^m).

Sentence-wise max-pooling

Finally, we chunk the contextualized representations of the paragraph tokens into their corresponding sentence groups, and apply max-pooling over the time dimension of each sentence group to obtain the paragraph’s d-dimensional sentence representations, (s_1, s_2, …, s_k). A high-level outline of the sentence encoder is shown in Figure 3(a), where we can see a series of tokens being passed through the aforementioned layers, producing sentence representations.

The encoding q of a question is computed similarly, such that q ∈ ℝ^d. Note that we produce a single vector for any given question; thus, the max-pooling operation is applied over all question words at once, disregarding sentence information.
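The sentence-wise pooling step can be illustrated with a minimal NumPy sketch, where `token_states` stands for the contextualized (BiGRU) token outputs:

```python
import numpy as np

def sentence_maxpool(token_states, sent_lengths):
    """Chunk contextualized token states into sentence groups and max-pool
    each group over the time dimension, yielding one d-vector per sentence."""
    reps, start = [], 0
    for length in sent_lengths:
        reps.append(token_states[start:start + length].max(axis=0))
        start += length
    return np.stack(reps)

def question_encoding(token_states):
    """Questions are pooled over all tokens at once, into a single vector."""
    return token_states.max(axis=0)
```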

Context: One of the most famous people born in Warsaw was Maria Skłodowska-Curie, who achieved international recognition for her research on radioactivity and was the first female recipient of the Nobel Prize. Famous musicians include Władysław Szpilman and Frédéric Chopin. Though Chopin was born in the village of Żelazowa Wola, about 60 km (37 mi) from Warsaw, he moved to the city with his family when he was seven months old. Casimir Pulaski, a Polish general and hero of the American Revolutionary War, was born here in 1745.

Question 1: What was Maria Curie the first female recipient of?

Question 2: How old was Chopin when he moved to Warsaw with his family?

Figure 4: An example from the SQuAD dataset of a paragraph that acts as the context for two different questions. Question 1 and its evidence (highlighted in purple) have little relation to question 2 and its evidence (highlighted in green). This motivates our method of storing sentence-wise encodings instead of a single representation for an entire paragraph.
Reformulation Component

The reformulation component receives a paragraph P and a question q, and produces a single reformulated vector q̃ ∈ ℝ^d. First, contextualized word representations are obtained using the same embedding and recurrent layers used for the initial encoding: (h^q_1, …, h^q_m) for q and (h^p_1, …, h^p_n) for P. We then pass the contextualized representations through a bidirectional attention layer, which we adopt from Clark and Gardner (2018). The attention between question word i and paragraph word j is computed as:

a_ij = w_1 · h^q_i + w_2 · h^p_j + w_3 · (h^q_i ⊙ h^p_j),

where w_1, w_2, w_3 ∈ ℝ^d are learned vectors and ⊙ denotes element-wise multiplication. For each question word, we compute the attended paragraph vector c_i:

c_i = Σ_j softmax_j(a_ij) · h^p_j.

A paragraph-to-question vector q_c is computed as follows:

q_c = Σ_i softmax_i(m_i) · h^q_i, where m_i = max_j a_ij.

We concatenate h^q_i, c_i, h^q_i ⊙ c_i and q_c ⊙ c_i and pass the result through a linear layer with ReLU activations to compute the final bidirectional attention vectors. We also use a residual connection in which we process these representations with a bidirectional GRU and another linear layer with ReLU activations. Finally, we sum the outputs of the two linear layers. As before, we derive the d-dimensional reformulated question representation q̃ using a max-pooling layer on the outputs of the residual layer. A high-level outline of the reformulation layer is given in Figure 3(b), where contextualized token representations of the question and of the paragraph are passed through the component’s layers to produce the reformulated question representation, q̃.
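The bidirectional attention step can be sketched as follows: a NumPy illustration of a Clark-and-Gardner-style attention as used in the reformulation component. The trainable linear/GRU layers that project the fused representation back to d dimensions are omitted here, so this sketch outputs a 4d-dimensional vector; the parameter vectors are assumed inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reformulate(q_states, p_states, w1, w2, w3):
    """Bidirectional attention sketch.

    q_states: (m, d) contextualized question tokens.
    p_states: (n, d) contextualized paragraph tokens.
    w1, w2, w3: (d,) learned vectors of the trilinear attention.
    Returns a single pooled vector (here 4d, since the learned
    projection layers of the full model are omitted).
    """
    m, d = q_states.shape
    # a_ij = w1·q_i + w2·p_j + w3·(q_i ⊙ p_j)
    a = (q_states @ w1)[:, None] + (p_states @ w2)[None, :] \
        + (q_states * w3) @ p_states.T
    c = softmax(a, axis=1) @ p_states        # attended paragraph vector per question word
    qc = softmax(a.max(axis=1)) @ q_states   # paragraph-to-question vector
    fused = np.concatenate(
        [q_states, c, q_states * c, np.tile(qc, (m, 1)) * c], axis=1)
    return fused.max(axis=0)                 # max-pool over question words
```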

Relevance Scores

Given the sentence representations (s_1, s_2, …, s_k) of a paragraph P and the encoding q of a question, the relevance score of P with respect to the question is calculated in the following way:

rel(q, P) = max_{1 ≤ i ≤ k} σ( [s_i; q; s_i ⊙ q] · w + b ),

where w ∈ ℝ^{3d} and b ∈ ℝ are learned parameters, [·; ·] denotes concatenation, and σ is the sigmoid function.

A similar max-pooling encoding approach, along with the scoring layer’s structure, was proposed by Conneau et al. (2017), who showed their efficacy on various sentence-level tasks. We find this sentence-wise formulation beneficial because it suffices for one sentence in a paragraph to be relevant to a question for the whole paragraph to be considered relevant. This allows more fine-grained representations for paragraphs and more accurate retrieval. An example of the benefits of using this kind of sentence-level model is given in Figure 4, where we see two questions answered by two different sentences. Our model allows each question to be similar only to parts of the paragraph, and not necessarily to all of it.

Search Vector Derivation

Recall that our retrieval algorithm is based on executing a MIPS in the paragraph encoding space. To derive such a search vector from the question encoding q, we write w = [w_s; w_q; w_sq] and observe that:

[s_i; q; s_i ⊙ q] · w = s_i · w_s + q · w_q + (s_i ⊙ q) · w_sq = s_i · (w_s + w_sq ⊙ q) + q · w_q.

Since q · w_q is constant for a given question and σ is monotone increasing, the ranking of paragraphs depends only on the inner product s_i · (w_s + w_sq ⊙ q). Therefore, the final search vector of a question q is q^s = w_s + w_sq ⊙ q. The same equations apply when predicting the relevance score for the second retrieval iteration, in which case q is swapped with the reformulated representation q̃.
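The search vector derivation can be checked numerically. The following NumPy sketch assumes a scoring layer of the form rel(q, P) = max_i σ([s_i; q; s_i ⊙ q] · w + b) and random parameters; ranking paragraphs by the maximum inner product with the search vector then yields the same order as ranking by the full sigmoid score.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relevance(sent_reps, q, w, b):
    """rel(q, P) = max_i sigmoid([s_i; q; s_i ⊙ q]·w + b), with w split into
    three d-sized chunks (w_s, w_q, w_sq)."""
    d = q.shape[0]
    w_s, w_q, w_sq = w[:d], w[d:2 * d], w[2 * d:]
    scores = sent_reps @ w_s + q @ w_q + (sent_reps * q) @ w_sq + b
    return sigmoid(scores).max()

def search_vector(q, w):
    """q^s = w_s + w_sq ⊙ q. Ranking paragraphs by max_i s_i·q^s is
    equivalent to ranking by relevance: q·w_q + b is constant for a fixed
    question, and sigmoid is monotone increasing."""
    d = q.shape[0]
    return w[:d] + w[2 * d:] * q
```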

Training and Loss Functions

Each training sample consists of a question and two paragraphs, (q, P^1, P^2), where P^1 corresponds to a paragraph retrieved in the first iteration, and P^2 corresponds to a paragraph retrieved in the second iteration using the reformulated representation q̃. P^1 is considered relevant if it constitutes one of the necessary evidence paragraphs to answer the question. P^2 is considered relevant only if P^1 and P^2 together constitute the complete set of evidence paragraphs needed to answer the question. Both iterations have the same form of loss functions, and the model is trained by optimizing the sum of the iterations’ losses.

Our training objective for each iteration is composed of two components: a binary cross-entropy loss function and a ranking loss function. The cross-entropy loss is defined as follows:

L_CE = −(1/N) Σ_{i=1}^{N} [ y_i log rel(q_i, P_i) + (1 − y_i) log(1 − rel(q_i, P_i)) ],

where y_i ∈ {0, 1} is a binary label indicating the true relevance of P_i to q_i in the iteration in which the loss is calculated, and N is the number of samples in the current batch.

The ranking loss is computed in the following manner. First, for each question q in a given batch, we find the mean of the scores given to its positive paragraphs, pos(q), and to its negative paragraphs, neg(q), averaging over the N^q_pos positive and N^q_neg negative samples for q, respectively. We then define the margin ranking loss (Socher et al., 2013) as

L_rank = (1/M) Σ_{j=1}^{M} max(0, γ − pos(q_j) + neg(q_j)),

where M is the number of distinct questions in the current batch, and γ is a margin hyperparameter. The final objective is the sum of the two losses:

L = L_CE + λ · L_rank,

where λ is a hyperparameter.
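The two loss components can be sketched over a toy batch as follows (a NumPy illustration; `question_ids` marks which batch samples share a question, and the scores are assumed to already be sigmoid outputs in (0, 1)):

```python
import numpy as np

def bce_loss(scores, labels):
    """Binary cross-entropy over a batch of relevance scores in (0, 1)."""
    scores = np.clip(scores, 1e-7, 1 - 1e-7)
    return -np.mean(labels * np.log(scores) + (1 - labels) * np.log(1 - scores))

def margin_ranking_loss(scores, labels, question_ids, gamma=0.5):
    """For each distinct question, compare the mean score of its positive
    paragraphs against the mean score of its negatives with margin gamma."""
    losses = []
    for qid in np.unique(question_ids):
        mask = question_ids == qid
        pos = scores[mask & (labels == 1)].mean()
        neg = scores[mask & (labels == 0)].mean()
        losses.append(max(0.0, gamma - pos + neg))
    return float(np.mean(losses))

def total_loss(scores, labels, question_ids, gamma=0.5, lam=1.0):
    """L = L_CE + lambda * L_rank (gamma and lambda are hyperparameters)."""
    return bce_loss(scores, labels) + lam * margin_ranking_loss(
        scores, labels, question_ids, gamma)
```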

We note that we found it slightly beneficial to incorporate pretrained ELMo (Peters et al., 2018) embeddings in our model. For more details on the implementation and training process, please refer to Appendix C.

3.2 Paragraph Reader

The paragraph reader receives as input a question q and a paragraph P and extracts the most probable answer span for q from P. We use the S-norm model proposed by Clark and Gardner (2018). A detailed description of the model is given in Appendix A.


An input sample for the paragraph reader consists of a question q and a single context P. We optimize the same negative log-likelihood function used in the S-norm model for the span start boundaries:

L_start = − log ( ( Σ_{j ∈ P_q} Σ_{k ∈ A_j} e^{s^j_k} ) / ( Σ_{j ∈ P_q} Σ_{i=1}^{n_j} e^{s^j_i} ) ),

where P_q is the set of paragraphs paired with the same question q, A_j is the set of tokens that start an answer span in the j-th paragraph, n_j is the number of tokens in the j-th paragraph, and s^j_i is the score given to the i-th token in the j-th paragraph. The same formulation is used for the span end boundaries, so that the final objective function is the sum of the two: L = L_start + L_end.
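The shared-norm objective can be illustrated as follows: a NumPy sketch in which the start scores of all paragraphs paired with one question are normalized jointly, rather than per paragraph.

```python
import numpy as np

def shared_norm_loss(span_scores, answer_starts):
    """Negative log-likelihood with the softmax normalized jointly over all
    paragraphs paired with the same question (the S-norm objective).

    span_scores:   list of 1-D arrays, token start-scores per paragraph.
    answer_starts: list of index lists, tokens starting a gold answer span
                   in the corresponding paragraph (may be empty).
    """
    all_scores = np.concatenate(span_scores)
    shift = all_scores.max()  # for numerical stability
    log_z = np.log(np.sum(np.exp(all_scores - shift))) + shift
    gold = np.concatenate([s[idx] for s, idx in zip(span_scores, answer_starts)])
    # log of the summed probability mass on all gold start tokens
    log_gold = np.log(np.sum(np.exp(gold - shift))) + shift
    return float(log_z - log_gold)
```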

4 Experiments and Results

We test our approach on two datasets, and measure end-to-end QA performance using the standard exact match (EM) and F1 metrics, as well as the metrics proposed by Yang et al. (2018) for the HotpotQA dataset (see Appendix B).

4.1 Datasets


HotpotQA

Yang et al. (2018) introduced a dataset of Wikipedia-based questions, which require reasoning over multiple paragraphs to find the correct answer. The dataset also includes hard supervision on sentence-level supporting facts, which encourages the model to give explainable answer predictions. Two benchmark settings are available for this dataset: (1) a distractor setting, in which the reader is given a question as well as a set of paragraphs that includes both the supporting facts and irrelevant paragraphs; (2) a full wiki setting, which is an open-domain version of the dataset. We use this dataset as our benchmark for the multi-hop retrieval setting. Several extensions must be added to the reader from Section 3.2 in order for it to be suitable for the HotpotQA dataset. A detailed description of our proposed extensions is given in Appendix B.


SQuAD-Open

Chen et al. (2017) decoupled the questions from their corresponding contexts in the original SQuAD dataset (Rajpurkar et al., 2016), and formed an open-domain version of the dataset by defining an entire Wikipedia dump to be the background knowledge source from which the answer to each question should be extracted. We use this dataset to test the effectiveness of our method in a classic single-hop retrieval setting.

4.2 Experimental Setup

Search Hyperparameters

For our experiments in the multi-hop setting, we used a beam width of 8 in the first retrieval iteration. In all our experiments, unless stated otherwise, the reader is fed the top 45 paragraphs, through which it reasons independently and finds the most probable answers. In addition, we found it beneficial to limit the search space of our MIPS retriever to a subset of the knowledge source, determined by a TF-IDF heuristic retriever. We define n_i to be the size of the search space for retrieval iteration i. As we will see, there is a trade-off in choosing various values of n_i: a large value offers the possibility of higher recall, whereas a small value introduces less noise in the form of irrelevant paragraphs.
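The two-stage search-space restriction can be sketched as follows (an illustrative NumPy sketch; `tfidf_scores` stands for the heuristic retriever's scores over the whole collection, which would in practice come from a sparse index):

```python
import numpy as np

def restricted_mips(para_vecs, tfidf_scores, q_search, n_i, k):
    """Restrict the MIPS search space to the n_i paragraphs best ranked by a
    TF-IDF heuristic, then retrieve the top-k of those by inner product.
    Returns indices into the full paragraph collection."""
    candidates = np.argsort(-tfidf_scores)[:n_i]   # TF-IDF shortlist
    inner = para_vecs[candidates] @ q_search       # MIPS within the shortlist
    return candidates[np.argsort(-inner)[:k]]
```

A larger n_i raises the recall ceiling of the shortlist at the cost of admitting more irrelevant paragraphs into the dense search.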

Knowledge Sources

For HotpotQA, our knowledge source is the same Wikipedia version used by Yang et al. (2018), which consists of the first paragraphs of all Wikipedia articles. (It has recently come to our attention that during our work, some details of this Wikipedia version have changed. Due to time limitations, we use the initial version description.) For SQuAD-Open, we use the same Wikipedia dump used by Chen et al. (2017). For both knowledge sources, the TF-IDF based retriever we use for search space reduction is the one proposed by Chen et al. (2017), which uses bigram hashing and TF-IDF matching. We note that in the HotpotQA Wikipedia version each document is a single paragraph, while in SQuAD-Open, the full Wikipedia documents are used.

Setting | Method | Answer EM | Answer F1 | Sup Fact EM | Sup Fact F1 | Joint EM | Joint F1
distractor | Baseline (Yang et al., 2018) | 44.44 | 58.28 | 21.95 | 66.66 | 11.56 | 40.86
distractor | Our Reader | 51.56 | 65.32 | 44.54 | 75.27 | 28.68 | 54.08
full wiki | Baseline (Yang et al., 2018) | 24.68 | 34.36 | 5.28 | 40.98 | 2.54 | 17.73
full wiki | TF-IDF + Reader | 27.55 | 36.58 | 10.75 | 42.45 | 7.00 | 21.47
full wiki | MUPPET (sentence-level) | 30.20 | 39.43 | 16.57 | 46.13 | 11.38 | 26.55
full wiki | MUPPET (paragraph-level) | 31.07 | 40.42 | 17.00 | 47.71 | 11.76 | 27.62
Table 1: Primary results for HotpotQA (dev set). At the top of the table, we compare our paragraph reader to the baseline model of Yang et al. (2018) (as of writing this paper, no other published results are available). At the bottom, we compare end-to-end performance in the full wiki setting. TF-IDF + Reader refers to using the TF-IDF based retriever without our MIPS retriever. MUPPET (sentence-level) refers to our approach with sentence-level representations, and MUPPET (paragraph-level) to our approach with paragraph-level representations. For both sentence- and paragraph-level results, the same search-space sizes were used.
Method | EM | F1
DrQA (Chen et al., 2017) | 28.4 | -
DrQA (multitask) (Chen et al., 2017) | 29.8 | -
R^3 (Wang et al., 2018a) | 29.1 | 37.5
DS-QA (Lin et al., 2018) | 28.7 | 36.6
Par. Ranker + Full Agg. (Lee et al., 2018) | 30.2 | -
Minimal (Min et al., 2018) | 34.7 | 42.6
Multi-step (Das et al., 2019) | 31.9 | 39.2
BERTserini (Yang et al., 2019) | 38.6 | 46.1
TF-IDF + Reader | 34.6 | 41.6
MUPPET (sentence-level) | 39.3 | 46.2
MUPPET (paragraph-level) | 35.6 | 42.5
Table 2: Primary results for SQuAD-Open.

4.3 Results

Primary Results

Tables 1 and 2 show our main results on the HotpotQA and SQuAD-Open datasets, respectively. In the HotpotQA distractor setting, our paragraph reader greatly improves on the baseline reader, increasing the joint EM and F1 scores by 17.12 (148%) and 13.22 (32%) points, respectively. In the full wiki setting, we compare three methods of retrieval: (1) TF-IDF, in which only the TF-IDF heuristic is used and the reader is fed all possible paragraph pairs from the top-ranked paragraphs; (2) sentence-level, in which we use MUPPET with sentence-level encodings; (3) paragraph-level, in which we use MUPPET with paragraph-level encodings (no sentence information). We can see that both MUPPET variants significantly outperform the naïve TF-IDF retriever, indicating the efficacy of our approach. As of writing this paper (March 5, 2019), we are placed second on the HotpotQA full wiki setting (test set) leaderboard. For SQuAD-Open, our sentence-level method establishes state-of-the-art results, improving the current non-BERT (Devlin et al., 2018) state of the art by 4.6 (13%) EM and 3.6 (8%) F1 points, respectively. This shows that our encoder can be useful not only for multi-hop questions, but also for single-hop questions.

Retrieval Recall Analysis

We analyze the performance of the TF-IDF retriever for HotpotQA in Figure 5(a). We can see that the retriever often succeeds in retrieving at least one of the gold paragraphs for a question within the top-ranked paragraphs, but frequently fails at retrieving both gold paragraphs. This demonstrates the necessity of an efficient multi-hop retrieval approach to aid or replace classic information retrieval methods.

Effect of Narrowing the Search Space

(a) TF-IDF retrieval results
(b) SQuAD-Open
(c) HotpotQA
Figure 5: Various results based on the TF-IDF retriever. (a) Retrieval results of the TF-IDF heuristic retriever on HotpotQA. At Least One @ k is the number of questions for which at least one of the paragraphs containing the supporting facts is retrieved in the top-k paragraphs. Potentially Perfect @ k is the number of questions for which both of the paragraphs containing the supporting facts are retrieved in the top-k paragraphs. (b) and (c) Performance analysis on the SQuAD-Open and HotpotQA datasets, respectively, as more documents/paragraphs are retrieved by the TF-IDF heuristic retriever. Note that for SQuAD-Open each document contains several paragraphs, and the reader is fed the top-ranked TF-IDF paragraphs from within the documents in the search space.

In Figures 5(b) and 5(c), we show the performance of our method as a function of the size of the search space of the last retrieval iteration. For SQuAD-Open, the TF-IDF retriever initially retrieves a set of documents, which are then split into paragraphs to form the search space. Each search space of the top-n paragraphs limits the potential recall of the model to that of the top-n paragraphs retrieved by the TF-IDF retriever. This proves to be suboptimal for very small values of n, as the performance of the TF-IDF retriever alone is not good enough. Our models, however, fail to benefit from increasing the search space indefinitely, hinting that they are not as robust to noise as we would want them to be.

Effectiveness of Sentence-Level Encodings

Our method proposes using sentence-level encodings for paragraph retrieval. We test the significance of this approach in Figures 5(b) and 5(c). While sentence-level encodings seem to be vital for improving state-of-the-art results on SQuAD-Open, the same cannot be said about HotpotQA. We hypothesize that this is a consequence of the way the datasets were created. In SQuAD, each paragraph serves as the context of several questions, as shown in Figure 4. This leads to questions being asked about facts less essential to the gist of the paragraph, which therefore would not be encapsulated in a single paragraph representation. In HotpotQA, however, most of the paragraphs in the training set serve as the context of at most one question.

5 Related Work

Chen et al. (2017) first introduced the use of neural methods to the task of open-domain QA using a textual knowledge source. They proposed DrQA, a pipeline approach with two components: a TF-IDF based retriever, and a multi-layer neural network that was trained to find an answer span given a question and a paragraph. In an attempt to improve the retrieval of the TF-IDF based component, many existing works have used Distant Supervision (DS) to further re-rank the retrieved paragraphs (Htut et al., 2018; Yan et al., 2018). Wang et al. (2018a) used reinforcement learning to train a re-ranker and an RC component in an end-to-end manner, and showed its advantage over the use of DS alone. Min et al. (2018) trained a sentence selector and demonstrated the effectiveness of reading minimal contexts instead of complete documents. As DS can often lead to wrong labeling, Lin et al. (2018) suggested a denoising method for alleviating this problem. While these methods have proved to increase performance in various open-domain QA datasets, their re-ranking approach is limited in the number of paragraphs it can process, as it requires the joint reading of a question with all possible paragraphs. This is in contrast to our approach, in which all paragraph representations are precomputed to allow efficient large-scale retrieval. There are some works that adopted a similar precomputation scheme. Lee et al. (2018) learned an encoding function for questions and paragraphs and ranked paragraphs by their dot-product similarity with the question. Many of their improvements, however, can be attributed to the incorporation of answer aggregation methods as suggested by Wang et al. (2018b) in their model, which enhanced their results significantly. Seo et al. (2018) proposed phrase-indexed QA (PI-QA), a new formulation of the QA task that requires the independent encoding of answers and questions. The question encodings are then used to retrieve the correct answers by performing MIPS. 
This is more of a challenge task than a solution for open-domain QA. A recent work by Das et al. (2019) proposed a new framework for open-domain QA that employs a multi-step interaction between a retriever and a reader. This interactive framework is used to refine a question representation so that retrieval becomes more accurate. Their method is complementary to ours: the interactive framework enhances retrieval performance for single-hop questions, and does not handle the multi-hop domain.

Another line of work reminiscent of our method is the one of Memory Networks (Weston et al., 2015). Memory Networks consist of an array of cells, each capable of storing a vector, and four modules (input, update, output and response) that allow the manipulation of the memory for the task at hand. Many variations of Memory Networks have been proposed, such as end-to-end Memory Networks (Sukhbaatar et al., 2015), Key-Value Memory Networks (Miller et al., 2016), and Hierarchical Memory Networks (Chandar et al., 2016).

6 Concluding Remarks

We present MUPPET, a novel method for multi-hop paragraph retrieval, and show its efficacy in both single- and multi-hop QA datasets. One difficulty in the open-domain multi-hop setting is the lack of supervision, a difficulty that in the single-hop setting is alleviated to some extent by using distant supervision. We hope to tackle this problem in future work to allow learning more than two retrieval iterations. An interesting improvement to our approach would be to allow the retriever to automatically determine whether or not more retrieval iterations are needed. A promising direction could be a multi-task approach, in which both single- and multi-hop datasets are learned jointly. We leave this for future work.


Acknowledgments

This research was partially supported by the Israel Science Foundation (grant No. 710/18).


Appendix A Paragraph Reader

In this section we describe in detail the reader mentioned in Section 3.2. The paragraph reader receives as input a question q and a paragraph P and extracts the most probable answer span for q from P. We use the shared-norm model presented by Clark and Gardner (2018), which we refer to as S-norm. The model’s architecture is quite similar to the one we used for the encoder. First, we process q and P separately to obtain their contextualized token representations, in the same manner as used in the encoder. We then pass the contextualized representations through a bidirectional attention layer similar to the one defined in the reformulation layer of the encoder, with the only difference being that the roles of the question and the paragraph are switched. As before, we further pass the bidirectional attention representations through a residual connection, this time using a self-attention layer between the bidirectional GRU and the linear layer. The self-attention mechanism is similar to the bidirectional attention layer, only now it is between the paragraph and itself; question-to-paragraph attention is not used, and we set a_ij = −∞ if i = j. The summed outputs of the residual connection are passed to the prediction layer. The inputs to the prediction layer are passed through a bidirectional GRU followed by a linear layer that predicts the answer span start scores. The hidden states of that GRU are concatenated with the input and passed through another bidirectional GRU and linear layer to predict the answer span end scores.


An input sample for the paragraph reader consists of a question $Q$ and a single context $P$. We optimize the same negative log-likelihood function used in the S-norm model for the span start boundaries:

$$\mathcal{L}_{start} = -\log \left( \frac{\sum_{p \in P_Q} \sum_{i \in A_p} e^{s_i^p}}{\sum_{p \in P_Q} \sum_{i=1}^{N_p} e^{s_i^p}} \right)$$

where $P_Q$ is the set of paragraphs paired with the same question $Q$, $A_p$ is the set of tokens that start an answer span in the $p$-th paragraph, $N_p$ is the number of tokens in the $p$-th paragraph, and $s_i^p$ is the score given to the $i$-th token in the $p$-th paragraph. The same formulation is used for the span end boundaries, so that the final objective function is the sum of the two: $\mathcal{L}_{span} = \mathcal{L}_{start} + \mathcal{L}_{end}$.
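As a concrete illustration, the shared-norm span-start objective can be sketched in a few lines of pure Python. This is a minimal sketch, not our TensorFlow implementation; the function and argument names are illustrative, and scores are assumed to be given as one list of token scores per paragraph.

```python
import math

def shared_norm_loss(scores, answer_starts):
    """Shared-norm span-start loss over all paragraphs of one question.

    scores        -- list of lists; scores[p][i] is the start score of token i
                     in paragraph p (a hypothetical layout).
    answer_starts -- list of index lists; answer_starts[p] holds the tokens
                     that begin a gold answer span in paragraph p.
    """
    # The softmax normalizer is shared across all paragraphs of the question.
    log_z = math.log(sum(math.exp(s) for par in scores for s in par))
    # Total unnormalized probability mass assigned to gold start tokens.
    gold = sum(math.exp(scores[p][i])
               for p, idx in enumerate(answer_starts) for i in idx)
    return log_z - math.log(gold)
```

The key property of the shared normalization is visible here: the denominator sums over tokens of all paragraphs paired with the question, so scores remain comparable across paragraphs.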

Appendix B Paragraph Reader Extension for HotpotQA

HotpotQA presents the challenge of predicting not only answer spans but also yes/no answers; it is a combination of span-based and multiple-choice questions. In addition, one is required to make the answer predictions explainable by predicting the supporting facts leading to the answer. We extend the paragraph reader from Section 3.2 to support these predictions in the following manner.

Yes/No Prediction

We argue that one can decide whether the answer to a given question should be span-based or yes/no-based without looking at any context at all. Therefore, we first create a fixed-size vector representing the question by max-pooling over the question's states from the first bidirectional GRU. We pass this representation through a linear layer that predicts whether this is a yes/no-based or a span-based question. If span-based, we predict the answer span from the context using the original span prediction layer. If yes/no-based, we encode the question-aware context representations into a fixed-size vector by max-pooling over the outputs of the residual self-attention layer. As before, we then pass this vector through a linear layer to predict a yes/no answer.
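The two max-pool-then-classify steps above can be sketched as follows. This is a minimal pure-Python sketch under the assumption of list-of-vectors inputs; the function and parameter names are illustrative and not taken from our implementation.

```python
def predict_answer_type(question_states, w_type, b_type):
    """Decide span-based vs. yes/no-based from the question alone.

    question_states -- list of hidden-state vectors from the question's
                       first bidirectional GRU (one vector per token).
    w_type, b_type  -- weights and bias of a linear classifier (hypothetical).
    Returns True when the logit favors a yes/no-based question.
    """
    q_vec = [max(col) for col in zip(*question_states)]   # max-pool over time
    logit = sum(q * w for q, w in zip(q_vec, w_type)) + b_type
    return logit > 0.0

def predict_yes_no(context_states, w_ans, b_ans):
    """Max-pool the question-aware context representations and classify."""
    c_vec = [max(col) for col in zip(*context_states)]
    logit = sum(c * w for c, w in zip(c_vec, w_ans)) + b_ans
    return "yes" if logit > 0.0 else "no"
```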

Supporting Fact Prediction

As a context's supporting facts for a question are at the sentence level, we encode the question-aware context representations into fixed-size sentence representations by passing the outputs of the residual self-attention layer through another bidirectional GRU, and then max-pooling over the sentence groups of the GRU's outputs. Each sentence representation is then passed through a multilayer perceptron with a single ReLU-activated hidden layer to predict whether the sentence is indeed a supporting fact.
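The sentence-level pooling and scoring can be sketched as follows (a pure-Python sketch; the sentence-span bookkeeping, layout of the weight matrices, and all names are assumptions for illustration, not our implementation):

```python
def sentence_representations(token_states, sentence_spans):
    """Max-pool token-level BiGRU outputs into one vector per sentence.

    token_states   -- list of per-token vectors (outputs of the extra BiGRU).
    sentence_spans -- list of (start, end) token offsets per sentence,
                      end exclusive.
    """
    return [[max(col) for col in zip(*token_states[s:e])]
            for s, e in sentence_spans]

def supporting_fact_score(sent_vec, w1, b1, w2, b2):
    """Single-hidden-layer MLP with ReLU; returns the supporting-fact logit."""
    hidden = [max(0.0, sum(x * w for x, w in zip(sent_vec, row)) + b)
              for row, b in zip(w1, b1)]
    return sum(h * w for h, w in zip(hidden, w2)) + b2
```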


An input sample for the paragraph reader consists of a question $Q$ and a single context $P$. Nevertheless, as HotpotQA requires multiple paragraphs to answer a question, we define $P$ to be the concatenation of these paragraphs.

Our objective function comprises four loss functions, corresponding to the four possible predictions of our model. For the span-based prediction we use $\mathcal{L}_{span}$, as before. We use a similar negative log-likelihood loss for the answer type prediction (whether the answer should be span-based or yes/no-based) and for a yes/no answer prediction:

$$\mathcal{L}_{type} = -\log \left( \frac{\sum_{p \in P_Q} e^{\sigma_{true}^p}}{\sum_{p \in P_Q} \left( e^{\sigma_{binary}^p} + e^{\sigma_{span}^p} \right)} \right), \qquad \mathcal{L}_{yes/no} = -\log \left( \frac{\sum_{p \in P_Q} e^{\tau_{true}^p}}{\sum_{p \in P_Q} \left( e^{\tau_{yes}^p} + e^{\tau_{no}^p} \right)} \right)$$

where $P_Q$ is the set of paragraphs paired with the same question $Q$; $\sigma_{binary}^p$, $\sigma_{span}^p$ and $\sigma_{true}^p$ are the likelihood scores of the $p$-th question-paragraph pair being a binary yes/no-based type, a span-based type, and its true type, respectively; and $\tau_{yes}^p$, $\tau_{no}^p$ and $\tau_{true}^p$ are the likelihood scores of the $p$-th question-paragraph pair having the answer ‘yes’, the answer ‘no’, and its true answer, respectively. For span-based questions, $\mathcal{L}_{yes/no}$ is defined to be zero, and vice versa.

For the supporting fact prediction, we use a binary cross-entropy loss on each sentence, $\mathcal{L}_{sf}$. The final loss function is the sum of these four objectives: $\mathcal{L} = \mathcal{L}_{span} + \mathcal{L}_{type} + \mathcal{L}_{yes/no} + \mathcal{L}_{sf}$.

During inference, the supporting facts prediction is taken only from the paragraph from which the answer is predicted.


Three sets of metrics were proposed by Yang et al. (2018) to evaluate performance on the HotpotQA dataset. The first set of metrics focuses on evaluating the answer span, using the exact match (EM) and F1 metrics, as suggested by Rajpurkar et al. (2016). The second set focuses on the explainability of the models, by evaluating the supporting facts directly using the EM and F1 metrics on the set of supporting fact sentences. The final set combines the evaluation of answer spans and supporting facts as follows. For each example, given its precision and recall on the answer span, $(P_{ans}, R_{ans})$, and on the supporting facts, $(P_{sup}, R_{sup})$, the joint F1 is calculated as

$$P_{joint} = P_{ans} \cdot P_{sup}, \qquad R_{joint} = R_{ans} \cdot R_{sup}, \qquad \mathrm{Joint\ F1} = \frac{2\, P_{joint} R_{joint}}{P_{joint} + R_{joint}}.$$
The joint EM is 1 only if both tasks achieve an exact match and otherwise 0. Intuitively, these metrics penalize systems that perform poorly on either task. All metrics are evaluated example-by-example, and then averaged over examples in the evaluation set.
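The joint metric is straightforward to compute; a minimal sketch (argument names are illustrative) in which the precisions and recalls of the answer and supporting-fact tasks are multiplied before taking the harmonic mean:

```python
def joint_f1(p_ans, r_ans, p_sup, r_sup):
    """Joint F1 for one example: multiply per-task precisions and recalls,
    then take the harmonic mean of the products."""
    p = p_ans * p_sup
    r = r_ans * r_sup
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

Note that a score of zero on either task zeroes out the joint F1, which is exactly the penalization described above.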

Appendix C Implementation Details

We use the Stanford CoreNLP toolkit (Manning et al., 2014) for tokenization. We implement all our models using TensorFlow.

Architecture Details

For the word-level embeddings, we use the GloVe 300-dimensional embeddings pretrained on the 840B Common Crawl corpus (Pennington et al., 2014). For the character-level embeddings, we use 20-dimensional character embeddings, and use a 1-dimensional CNN with 100 filters of size 5, with a dropout (Srivastava et al., 2014) rate of 0.2.

For the encoder, we also concatenate ELMo (Peters et al., 2018) embeddings, with a dropout rate of 0.5, with the token representations from the output of the embedding layer to form the final token representations, before processing them through the first bidirectional GRU. We use the ELMo weights pretrained on the 5.5B dataset. To speed up computations, we cache the context-independent token representations of all tokens that appear at least once in the titles of the HotpotQA Wikipedia version, or at least five times in the entire Wikipedia version. Words not in this vocabulary are given a fixed OOV vector. We use a learned weighted average of all three ELMo layers. Variational dropout (Gal and Ghahramani, 2016), in which the same dropout mask is applied at each time step, is applied to the inputs of all recurrent layers with a dropout rate of 0.2. We set the encoding size to be .
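The distinguishing property of variational dropout is that one mask is sampled per sequence and reused at every time step, rather than resampled per step. A minimal pure-Python sketch of the training-time behavior (names and list-of-vectors layout are illustrative assumptions):

```python
import random

def variational_dropout(x, rate=0.2, rng=None):
    """Drop the same features at every time step (Gal & Ghahramani, 2016).

    x -- list of per-time-step feature vectors (one sequence).
    Kept features are scaled by 1/(1 - rate), as in inverted dropout.
    """
    rng = rng or random.Random(0)
    dim = len(x[0])
    # Sample one mask per feature and reuse it across all time steps.
    mask = [0.0 if rng.random() < rate else 1.0 / (1.0 - rate)
            for _ in range(dim)]
    return [[v * m for v, m in zip(step, mask)] for step in x]
```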

For the paragraph reader used for HotpotQA, we use a state size of 150 for the bidirectional GRUs. The size of the hidden layer in the MLP used for supporting fact prediction is set to 150 as well. Here again, variational dropout with a rate of 0.2 is applied to the inputs of all recurrent layers and attention mechanisms. The reader used for SQuAD is the shared-norm model trained on the SQuAD dataset by Clark and Gardner (2018).

Training Details

We train all our models using the Adadelta optimizer (Zeiler, 2012) with a learning rate of 1.0.

SQuAD-Open: The training data is gathered as follows. For each question in the original SQuAD dataset, the original paragraph given as the question’s context is considered as the single relevant (positive) paragraph. We gather 12 irrelevant (negative) paragraphs for each question in the following manner:

  • The three paragraphs with the highest TF-IDF similarity to the question in the same SQuAD document as the relevant paragraph (excluding the relevant paragraph). The same method is applied to retrieve the three paragraphs most similar to the relevant paragraph.

  • The two paragraphs with the highest TF-IDF similarity to the question from the set of all first paragraphs in the entire Wikipedia (excluding the relevant paragraph’s article). The same method is applied to retrieve the two paragraphs most similar to the relevant paragraph.

  • Two randomly sampled paragraphs from the entire Wikipedia.

Questions that contain only stop-words are dropped, as they are most likely too dependent on the original context and not suitable for the open-domain setting. In each epoch, a question appears as a training sample four times: once with the relevant paragraph, and three times with randomly sampled irrelevant paragraphs.
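The per-epoch sample construction above can be sketched as follows. This is an illustrative sketch, not the paper's code: the function and argument names are hypothetical, and we assume the three negatives are drawn without replacement from the question's pool of 12 irrelevant paragraphs.

```python
import random

def epoch_samples(question, positive, negatives, rng=None):
    """Build the four training samples one question contributes per epoch:
    one (question, relevant paragraph) pair labeled 1, and three
    (question, irrelevant paragraph) pairs labeled 0."""
    rng = rng or random.Random(0)
    samples = [(question, positive, 1)]
    samples += [(question, p, 0) for p in rng.sample(negatives, 3)]
    return samples
```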

We train with a batch size of 45, and do not use the ranking loss, setting its weight to zero in Equation (2). We limit the length of the paragraphs to 600 tokens.

HotpotQA: The paragraphs used for training the encoder are the gold and distractor paragraphs supplied in the original HotpotQA training set. As mentioned in Section 3.1, each training sample consists of a question and two paragraphs, where the first paragraph corresponds to a paragraph retrieved in the first iteration, and the second to a paragraph retrieved in the second iteration. For each question, we create the following sample types:

  1. Gold: The two paragraphs are the two gold paragraphs of the question. Both are considered positive.

  2. First gold, second distractor: The first paragraph is one of the gold paragraphs and is considered positive, while the second can be a random paragraph from the training set, the same paragraph as the first, or one of the distractors, with probabilities 0.05, 0.1 and 0.85, respectively. The second paragraph is considered negative.

  3. First distractor, second gold: The first paragraph is either one of the distractors or a random paragraph from the training set, with probabilities 0.9 and 0.1, respectively. The second paragraph is one of the gold paragraphs. Both paragraphs are considered negative.

  4. All distractors: Both paragraphs are sampled from the question’s distractors, and both are considered negative.

  5. Gold from another question: A gold paragraph pair taken from another question; both paragraphs are considered negative.

The use of these sample types is motivated as follows. Sample type 1 is the only one that contains purely positive examples, and hence is mandatory. Sample type 2 is necessary to allow the model to learn a valuable reformulation, which does not assign a relevance score based solely on the first paragraph. Sample type 3 is complementary to type 2; it allows the model to learn that a paragraph pair is irrelevant if the first paragraph is irrelevant, regardless of the second. Sample type 4 is used for random negative sampling, which is the most common case of all. Sample type 5 is used to guarantee that the model does not determine relevance based solely on the paragraph pair, but also based on the question.

In each training batch, we include three samples for each question in the batch: a single gold sample (type 1), and two samples from the other four types, with sample probabilities of 0.35, 0.35, 0.25 and 0.05, respectively.
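The per-question sampling scheme can be sketched as follows. This is an illustrative sketch with hypothetical names; we assume the stated probabilities 0.35, 0.35, 0.25 and 0.05 map to sample types 2 through 5 in order.

```python
import random

def question_sample_types(rng=None):
    """Pick the three sample types one question contributes to a batch:
    one mandatory gold sample (type 1) plus two types drawn from
    types 2-5 with the stated probabilities."""
    rng = rng or random.Random(0)
    others = rng.choices([2, 3, 4, 5], weights=[0.35, 0.35, 0.25, 0.05], k=2)
    return [1] + others
```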

We use a batch size of 75 (25 unique questions). We set the margin to be in Equation (1) and in Equation (2), for both prediction iterations. We limit the length of the paragraphs to 600 tokens.

HotpotQA Reader: The reader receives a question and a concatenation of a paragraph pair as input. Each training batch consists of three samples with three different paragraph pairs for each question: a single gold pair, consisting of the two gold paragraphs of the question, and two paragraph pairs randomly sampled from the set of the question’s distractors and one of its gold paragraphs. We label as a correct answer span every text span that has an exact match with the ground-truth answer, even in the distractor paragraphs. We use a batch size of 75 (25 unique questions), and limit the length of the paragraphs (before concatenation) to 600 tokens.
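The exact-match span-labeling step can be sketched at the token level as follows (an illustrative sketch; it assumes both the context and the answer are already tokenized):

```python
def answer_spans(tokens, answer_tokens):
    """Return (start, end) indices of every exact token-level match of the
    ground-truth answer inside the context (end inclusive)."""
    n = len(answer_tokens)
    return [(i, i + n - 1) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == answer_tokens]
```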
