# Making Neural QA as Simple as Possible but not Simpler

## Abstract

Recent development of large-scale question answering (QA) datasets triggered a substantial amount of research into end-to-end neural architectures for QA. Increasingly complex systems have been conceived without comparison to simpler neural baseline systems that would justify their complexity. In this work, we propose a simple heuristic that guides the development of neural baseline systems for the extractive QA task. We find that there are two ingredients necessary for building a high-performing neural QA system: first, the awareness of question words while processing the context and second, a composition function that goes beyond simple bag-of-words modeling, such as recurrent neural networks. Our results show that FastQA, a system that meets these two requirements, can achieve very competitive performance compared with existing models. We argue that this surprising finding puts results of previous systems and the complexity of recent QA datasets into perspective.

## 1 Introduction

Question answering is an important end-user task at the intersection of natural language processing (NLP) and information retrieval (IR). QA systems can bridge the gap between IR-based search engines and sophisticated intelligent assistants that enable a more directed information retrieval process. Such systems aim at finding precisely the piece of information sought by the user instead of documents or snippets containing the answer. A special form of QA, namely extractive QA, deals with the extraction of a direct answer to a question from a given textual context.

The creation of large-scale, extractive QA datasets [?] sparked research interest in the development of end-to-end neural QA systems. A typical neural architecture consists of an embedding-, encoding-, interaction- and answer layer [?]. Most such systems describe several innovations for the different layers of the architecture, with a special focus on developing a powerful interaction layer that aims at modeling word-by-word interaction between question and context.

Although a variety of extractive QA systems have been proposed, there is no competitive neural baseline. Most systems were built in what we call a top-down process that proposes a complex architecture and validates design decisions by an ablation study. Most ablation studies, however, remove only a single part of an overall complex architecture and therefore lack comparison to a reasonable neural baseline. This gap raises the question whether the complexity of current systems is justified solely by their empirical results.

Another important observation is the fact that seemingly complex questions might be answerable by simple heuristics. Let’s consider the following example:

Although it seems that evidence synthesis of multiple sentences is necessary to fully understand the relation between the answer and the question, answering this question is easily possible by applying a simple context/type matching heuristic. The heuristic aims at selecting answer spans that a) match the expected answer type (a time as indicated by “When”) and b) are close to important question words (“St. Kazimierz Church”). The actual answer “1688-1692” would easily be extracted by such a heuristic.

In this work, we propose to use the aforementioned context/type matching heuristic as a guideline for deriving simple neural baseline architectures for the extractive QA task. In particular, we develop a simple neural bag-of-words (BoW) baseline and a recurrent neural network (RNN) baseline, namely FastQA. Crucially, neither model makes use of a complex interaction layer; interaction between question and context is modeled only through computable features on the word level. FastQA's strong performance questions the necessity of the additional complexity, especially in the interaction layer, exhibited by recently developed models. We address this question by evaluating the impact of extending FastQA with an additional interaction layer (FastQAExt) and find that it does not lead to systematic improvements. Our contributions are the following: i) definition and evaluation of BoW- and RNN-based neural QA baselines guided by a simple heuristic; ii) bottom-up evaluation of our FastQA system with increasing architectural complexity, revealing that the awareness of question words and the application of an RNN are enough to reach state-of-the-art results; iii) a complexity comparison between FastQA and more complex architectures as well as an in-depth discussion of the usefulness of an interaction layer; iv) a qualitative analysis indicating that FastQA mostly follows our heuristic, which thus constitutes a strong baseline for extractive QA.

## 2 A Bag-of-Words Neural QA System

We begin by motivating our architectures through our proposed context/type matching heuristic: a) the type of the answer span should correspond to the expected answer type given by the question, and b) the correct answer should be surrounded by a context that fits the question or, more precisely, by many question words. Similar heuristics were frequently implemented explicitly in traditional QA systems, e.g., in the answer extraction step of [?]; in this work, however, our heuristic is merely used as a guideline for the construction of neural QA systems. In the following, we denote the hidden dimensionality of the model by $n$, the question tokens by $Q = (q_1, \ldots, q_{L_Q})$, and the context tokens by $X = (x_1, \ldots, x_{L_X})$.

### 2.1 Embedding

The embedding layer is responsible for mapping tokens to their corresponding $n$-dimensional representations. Typically this is done by mapping each word to its corresponding word embedding using an embedding matrix (lookup-embedding). Another approach is to embed each word by encoding its corresponding character sequence (char-embedding). In this work, we use a convolutional neural network with max-pooling over time for the char-embedding, as explored by [?], to which we refer the reader for additional details. Both approaches are combined via concatenation to form the final embedding.
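The combined lookup- and char-embedding described above can be sketched as follows in plain numpy. All dimensions, vocabularies and weights here are illustrative random placeholders, not the trained values of the original system:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 8                                              # hidden dimensionality (illustrative)
vocab = {"when": 0, "was": 1, "church": 2}
chars = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

E_word = rng.normal(size=(len(vocab), n))          # lookup-embedding matrix
E_char = rng.normal(size=(len(chars), n))          # character embeddings
W_conv = rng.normal(size=(5 * n, n))               # conv filter, width 5 (assumed width)

def char_embed(word):
    """Char-CNN embedding: convolve width-5 character windows, max-pool over time."""
    padded = word + "a" * max(0, 5 - len(word))    # pad short words for one full window
    cs = np.stack([E_char[chars.get(c, 0)] for c in padded])
    windows = [cs[i:i + 5].reshape(-1) @ W_conv for i in range(len(cs) - 4)]
    return np.max(windows, axis=0)                 # max-pooling over time

def embed(word):
    """Final embedding: concatenation of lookup- and char-embedding."""
    return np.concatenate([E_word[vocab[word]], char_embed(word)])

assert embed("church").shape == (2 * n,)
```

The concatenated output is what the subsequent layers consume; in the real model it would be projected back to $n$ dimensions by the encoder.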

### 2.2 Type Matching

For the BoW baseline, we extract the span in the question that refers to the expected lexical answer type (LAT) by extracting either the question word(s) (e.g., who, when, why, how, how many, etc.) or the first noun phrase of the question after the question words “what” or “which” (e.g., “what year did...”).1 This leads to a correct LAT for most questions. We encode the LAT by concatenating the embeddings of the first and last word together with the average embedding of all words within the LAT. The concatenated representation is further transformed by a fully-connected layer followed by a non-linearity into an $n$-dimensional LAT representation. Note that in the following we refer to a fully-connected layer by $\mathrm{FC}$, s.t. $\mathrm{FC}(u) = Wu + b$.
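The LAT extraction heuristic can be sketched as a small function. The noun-phrase detection of the original system would rely on a real POS tagger; a toy noun list stands in for it here, so this is only an illustrative approximation:

```python
QUESTION_WORDS = {"who", "when", "where", "why", "how", "what", "which"}

def extract_lat(tokens):
    """Heuristic LAT extraction (simplified sketch).

    Returns the question word(s), or, for "what"/"which" questions,
    the question word plus the first noun-like token that follows it.
    The toy noun list below is a stand-in for a POS tagger."""
    toy_nouns = {"year", "city", "team", "color"}
    t = [w.lower() for w in tokens]
    if t and t[0] in {"what", "which"}:
        for w in t[1:]:
            if w in toy_nouns:
                return [t[0], w]
        return [t[0]]
    if len(t) > 1 and t[0] == "how" and t[1] in {"many", "much"}:
        return t[:2]
    return t[:1]
```

For example, `extract_lat(["What", "year", "did", "it", "end"])` yields `["what", "year"]`, matching the “what year did...” example in the text.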

We similarly encode each potential answer span in the context, i.e., all spans with at most a specified maximum number of words (10 in this work), by concatenating the embeddings of the first and last word together with the average embedding of all words within the span. Because the surrounding context of a potential answer span can give important clues towards its type, for instance through nominal modifiers left of the span (e.g., “... president ...”) or through an apposition right of the span (e.g., “..., president of ...”), we additionally concatenate the average embeddings of the words to the left and to the right of the span, respectively. The concatenated span representation, which comprises in total five different embeddings, is further transformed by a fully-connected layer with a non-linearity into an $n$-dimensional span representation.
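The five-part span representation described above (first word, last word, span average, left-context average, right-context average) can be sketched directly from a matrix of word embeddings:

```python
import numpy as np

def span_representation(emb, s, e):
    """Concatenate the five embeddings that make up a span representation
    (sketch): first word, last word, average inside the span, average of
    the left context, and average of the right context."""
    def avg(block):
        return block.mean(axis=0) if len(block) else np.zeros(emb.shape[1])
    parts = [emb[s], emb[e], avg(emb[s:e + 1]), avg(emb[:s]), avg(emb[e + 1:])]
    return np.concatenate(parts)

emb = np.arange(24.0).reshape(6, 4)   # 6 context words, embedding size 4
assert span_representation(emb, 2, 3).shape == (5 * 4,)
```

In the full model this vector would then pass through the fully-connected layer with a non-linearity mentioned above.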

Finally, the concatenation of the LAT representation, the span representation and their element-wise product serves as input to a feed-forward neural network with one hidden layer which computes the type score for each span.

### 2.3 Context Matching

In order to account for the number of question words surrounding an answer span as a measure of question-to-answer-span match (context match), we introduce two word-in-question features. They are computed for each context word and explained in the following.

binary The binary word-in-question ($wiq_b$) feature is $1$ for context tokens that are part of the question and $0$ otherwise, i.e., $wiq_b(x_i) = \mathbb{1}(x_i \in Q)$, where $\mathbb{1}$ denotes the indicator function.
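The binary feature amounts to a one-line membership test per context token:

```python
def wiq_binary(context_tokens, question_tokens):
    """Binary word-in-question feature: 1.0 if the context token also
    occurs in the question, 0.0 otherwise (the indicator function)."""
    q = set(question_tokens)
    return [1.0 if x in q else 0.0 for x in context_tokens]
```

Note that the paper's BoW model computes this on lemmas with stopwords removed (Section 6.1); the raw-token version here is the simpler FastQA variant.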

weighted The weighted word-in-question feature $wiq_w$ for context word $x_i$ is defined in Eq. ?, where Equation 1 defines a basic similarity score between $q_j$ and $x_i$ based on their word embeddings. It is motivated on the one hand by the intuition that question tokens which rarely appear in the context are more likely to be important for answering the question, and on the other hand by the fact that question words might occur as morphological variants, synonyms or related words in the context. The latter can be captured (softly) by using word embeddings instead of the words themselves, whereas the former is captured by the softmax operation in Eq. ?, which ensures that infrequent occurrences of words are weighted more heavily.
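A numpy sketch of this feature, under the assumption that the similarity score is a learned weighting of the element-wise product of the two embeddings (one plausible instantiation consistent with the description; `v` is a trainable vector, random or fixed here for illustration):

```python
import numpy as np

def wiq_weighted(ctx_emb, q_emb, v):
    """Weighted word-in-question feature (sketch).

    sim[i, j] scores context word i against question word j; a softmax
    over context positions i (per question word j) weights infrequent
    matches more heavily, and the normalized scores are summed over
    the question words."""
    sim = np.einsum("id,jd->ij", ctx_emb * v, q_emb)   # assumed: v^T (x_i * q_j)
    sim = sim - sim.max(axis=0, keepdims=True)         # numerically stable softmax
    p = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)
    return p.sum(axis=1)                               # one wiq_w score per context word
```

Because each softmax column sums to one, the feature values over the whole context always sum to the number of question words, which makes the "rare matches get more weight" behaviour explicit.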

A derivation that connects $wiq_w$ with the term frequencies (a prominent information retrieval measure) of a word in the question and the context, respectively, is provided in Appendix A.

Finally, for each answer span we compute the average $wiq_b$ and $wiq_w$ scores of token-windows of several sizes to the left and to the right of the respective span. This results in one score per combination of feature kind, window size and direction (left/right); these scores are weighted by trainable scalar parameters and summed to compute the context-matching score.
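The window-averaging step can be sketched as follows; the window sizes and the uniform weights are illustrative placeholders, not the trained values of the original system:

```python
import numpy as np

def context_match_score(wiq, s, e, windows=(5, 10, 20), weights=None):
    """Context-matching score for the span [s, e] (sketch).

    Averages a word-in-question feature over token-windows of several
    sizes to the left and right of the span, then combines the averages
    with trainable scalar weights (uniform here for illustration)."""
    feats = []
    for w in windows:
        left = wiq[max(0, s - w):s]
        right = wiq[e + 1:e + 1 + w]
        feats.append(np.mean(left) if len(left) else 0.0)
        feats.append(np.mean(right) if len(right) else 0.0)
    feats = np.asarray(feats)
    if weights is None:
        weights = np.ones_like(feats)
    return float(feats @ weights)
```

In the full model this score would be computed once per feature kind ($wiq_b$ and $wiq_w$) and added to the type score.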

The final score for each span is the sum of the type- and the context-matching score. The model is trained to minimize the softmax cross-entropy loss over the scores of all spans.

## 3 FastQA

Although our BoW baseline closely models our intended heuristic, it has several shortcomings. First of all, it cannot capture the compositionality of language, which makes the detection of sensible answer spans harder. Furthermore, the semantics of a question is dramatically reduced to a BoW representation of its expected answer type and the scalar word-in-question features. Finally, answer spans are restricted to a certain length.

To account for these shortcomings we introduce another baseline which relies on a single bi-directional recurrent neural network (BiRNN) followed by an answer layer that separates the prediction of the start and the end of the answer span. Prior work [?] demonstrated that BiRNNs are powerful at recognizing named entities, which makes them a sensible choice for context encoding to allow for improved type matching. Context matching can similarly be achieved with a BiRNN by informing it of the locations of question tokens appearing in the context through our word-in-question features. It is important to recognize that our model should implicitly learn to capture the heuristic, but is not limited by it.

On an abstract level, our RNN-based model, called FastQA, consists of three basic layers, namely the embedding-, encoding- and answer layer. Embeddings are computed as explained in Section 2.1. The other two layers are described in detail in the following. An illustration of the basic architecture is provided in Figure 1.

### 3.1 Encoding

In the following, we describe the encoding of the context which is analogous to that of the question.

To allow for interaction between the two embeddings described in Section 2.1, they are first projected jointly to an $n$-dimensional representation (Eq. 2) and further transformed by a single highway layer (Eq. ?) similar to [?].

Because we want the encoder to be aware of the question words, we feed the binary and the weighted word-in-question features of Section 2.3 in addition to the embedded context words as input. The complete input to the encoder is therefore defined as follows:

This input is fed to a bidirectional RNN, and its output is again projected to allow for interaction between the features accumulated in the forward and backward RNN (Eq. 3). In preliminary experiments we found LSTMs [?] to perform best.

We initialize the projection matrix with $[I_n; I_n]$, where $I_n$ denotes the $n$-dimensional identity matrix. It follows that the projected output is the sum of the outputs from the forward- and backward-LSTM at the beginning of training.

As mentioned before, we utilize the same encoder parameters for both question and context, except for the projection matrix, which is not shared. However, both projection matrices are initialized the same way, s.t. the context and question encodings are identical at the beginning of training. Finally, to be able to use the same encoder for both question and context, we fix the two word-in-question features to $1$ for the question.
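The encoder can be sketched end-to-end in numpy. A vanilla RNN stands in for the LSTM, the highway layer is omitted, and all sizes and weights are illustrative; the point is the wiq-feature input and the $[I_n; I_n]$ projection initialization described above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6                                    # hidden dimensionality (illustrative)

def rnn(inputs, W, U):
    """Minimal vanilla RNN (stand-in for the LSTM of the paper)."""
    h, out = np.zeros(n), []
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
        out.append(h)
    return np.stack(out)

def encode(emb, wiq_b, wiq_w, W, U, B):
    """Encoder sketch: embeddings plus the two wiq features are fed to a
    bidirectional RNN; forward and backward outputs are concatenated and
    projected by B, initialized to [I; I] so the projection starts as
    their sum."""
    x = np.concatenate([emb, wiq_b[:, None], wiq_w[:, None]], axis=1)
    fwd = rnn(x, W, U)
    bwd = rnn(x[::-1], W, U)[::-1]
    return np.concatenate([fwd, bwd], axis=1) @ B

d = 4                                    # embedding size (illustrative)
W = rng.normal(size=(n, d + 2)) * 0.1    # +2 input dims for the wiq features
U = rng.normal(size=(n, n)) * 0.1
B = np.vstack([np.eye(n), np.eye(n)])    # [I_n; I_n] initialization

emb = rng.normal(size=(5, d))
H = encode(emb, np.ones(5), np.ones(5), W, U, B)
assert H.shape == (5, n)
```

For the question, the same `encode` would be called with both wiq inputs fixed to ones, as stated above.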

After encoding the context and the question, we first compute a weighted, $n$-dimensional question representation (Eq. 4). Note that this representation is context-independent and as such only computed once, i.e., there is no additional word-by-word interaction between context and question.

The probability distribution for the start location of the answer is computed by a 2-layer feed-forward neural network with a rectified-linear (ReLU) activated hidden layer as follows:

The conditional probability distribution for the end location, conditioned on the start location, is computed similarly by a feed-forward neural network with a hidden layer as follows:

The overall probability of predicting an answer span $(s, e)$ is $p(s, e) = p_{start}(s) \cdot p_{end}(e \mid s)$. The model is trained to minimize the cross-entropy loss of the predicted span probability.

Beam-search During prediction time, we compute the answer span with the highest probability by employing beam-search with beam-size $k$. This means that ends for the top-$k$ starts are predicted, and the span with the highest overall probability is returned as the final answer.
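The beam-search step above can be sketched as follows, given a start distribution and a function that returns the conditional end distribution for a start position:

```python
import numpy as np

def beam_search_span(p_start, p_end_given_start, k=5):
    """Beam-search answer extraction (sketch): take the top-k start
    positions, predict an end for each, and return the span with the
    highest overall probability p(s) * p(e | s)."""
    top_starts = np.argsort(p_start)[::-1][:k]
    best, best_p = None, -1.0
    for s in top_starts:
        p_end = p_end_given_start(s)
        valid = p_end[s:]                      # the end must not precede the start
        e = s + int(np.argmax(valid))
        p = p_start[s] * p_end[e]
        if p > best_p:
            best, best_p = (int(s), e), p
    return best, best_p
```

With `k=1` this reduces to greedy decoding; a larger beam can recover spans whose start is only the second- or third-most probable, which is exactly the effect reported in Section 7.1.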

## 4 Comparison to Prior Architectures

Many neural architectures for extractive QA have been conceived very recently. Most of these systems can be broken down into four basic layers for which individual innovations were proposed. A high-level illustration of these systems is shown in Figure 2. In the following, we compare our system in more detail with existing models.

Embedder The embedder is responsible for embedding a sequence of tokens into a sequence of $n$-dimensional states. Our proposed embedder (Section 2.1) is very similar to existing ones used, for example, in [?].

Encoder Embedded tokens are further encoded by some form of composition function. A prominent type of encoder is the (bi-directional) recurrent neural network (RNN), which is also used in this work. Feeding additional word-level features similar to ours is rarely done, with the exception of [?].

Interaction Layer Most research focused on the interaction layer which is responsible for word-by-word interaction between context and question. Different ideas were explored such as attention [?], co-attention [?], bi-directional attention flow [?], multi-perspective context matching [?] or fine-grained gating [?]. All of these ideas aim at enriching the encoded context with weighted states from the question and in some cases also from the context. These are gathered individually for each context state, concatenated with it and serve as input to an additional RNN. Note that this layer is omitted completely in FastQA and therefore constitutes the main simplification over previous work.

Answer Layer Finally, most systems predict the start and the end of the answer span with separate networks. Their complexity ranges from a single fully-connected layer [?] to convolutional neural networks [?] or recurrent, deep Highway-Maxout-Networks [?]. We instead use a simple 2-layer feed-forward network and introduce beam-search to extract the most likely answer span.

## 5 FastQA Extended

To explore the necessity of the interaction layer and to be architecturally comparable to existing models we extend FastQA with an additional interaction layer (FastQAExt). In particular, we introduce representation fusion to enable the exchange of information in between passages of the context (intra-fusion), and between the question and the context (inter-fusion). Representation fusion is defined as the weighted addition between a state, i.e., its $n$-dimensional representation, and its respective co-representation. For each context state its corresponding co-representation is retrieved via attention from the rest of the context (intra) or the question (inter), respectively, and “fused” into its own representation. For the sake of brevity we describe technical details of this layer in Appendix B, because this extension is not the focus of this work but merely serves as a representative of the more complex architectures described in Section 4.

## 6 Experimental Setup

We conduct experiments on the following datasets.

NewsQA The NewsQA dataset [?] contains the answerable questions from a larger set of collected questions. The dataset is built from CNN news stories that were originally collected by [?].

Performance on the SQuAD and NewsQA datasets is measured in terms of exact match (accuracy) and a mean, per-answer token-based F1 measure, originally proposed by [?], which also accounts for partial matches.
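A sketch of the token-based F1 measure (in the style of the SQuAD evaluation, without its answer-normalization details such as article and punctuation stripping):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Per-answer token-based F1 (sketch): precision and recall over the
    multisets of answer tokens, so partial overlaps still receive credit."""
    p, g = prediction.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "saint kazimierz church" against the gold answer "kazimierz church" yields a precision of 2/3 and a recall of 1, i.e., an F1 of 0.8 rather than an exact-match score of 0.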

### 6.1 Implementation Details

BoW Model The BoW model is trained on spans up to length 10 to keep the computation tractable, which leads to an upper bound on the achievable accuracy on SQuAD and NewsQA. As pre-processing steps, we lowercase all inputs and tokenize them using spaCy. The binary word-in-question feature is computed on lemmas provided by spaCy and restricted to alphanumeric words that are not stopwords. Throughout all experiments we use a fixed hidden dimensionality, dropout at the input embeddings with the same mask for all words [?], and fixed word embeddings from GloVe [?]. We employed Adam [?] for optimization, with an initial learning rate that was halved whenever the F1 measure on the development set dropped between epochs. We used fixed-size mini-batches.

FastQA The pre-processing of FastQA is slightly simpler than that of the BoW model. We tokenize the input on whitespace (exclusive) and non-alphanumeric characters (inclusive). The binary word-in-question feature is computed on the words as they appear in the context. Throughout all experiments we use a fixed hidden dimensionality, variational dropout at the input embeddings with the same mask for all words [?], and fixed word embeddings from GloVe [?]. We employed Adam [?] for optimization, with an initial learning rate that was halved whenever the F1 measure on the development set dropped between checkpoints. Checkpoints occurred after a fixed number of mini-batches.

Cutting Context Length Because NewsQA contains examples with very long contexts (up to more than 1500 tokens), we cut overly long contexts in order to train our models efficiently. We ensure that at least one, and ideally all, answers are still present in the remaining tokens. Note that this restriction is only employed during training.
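One way to implement this cutting strategy is to slide a fixed-size window over the context and keep the window that retains the most answer spans; `max_len` below is an illustrative value, and the exact procedure of the original system may differ:

```python
def cut_context(tokens, answer_spans, max_len=400):
    """Training-time context cutting (sketch).

    Keeps the max_len-token window containing the most answer spans, so
    at least one answer survives whenever any span fits in a window."""
    if len(tokens) <= max_len:
        return tokens, answer_spans
    best_off, best_kept = 0, -1
    for off in range(0, len(tokens) - max_len + 1):
        kept = [(s, e) for s, e in answer_spans
                if off <= s and e < off + max_len]
        if len(kept) > best_kept:
            best_off, best_kept = off, len(kept)
    spans = [(s - best_off, e - best_off) for s, e in answer_spans
             if best_off <= s and e < best_off + max_len]
    return tokens[best_off:best_off + max_len], spans
```

At evaluation time the full context is used, matching the note above that the restriction applies only during training.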

## 7 Results

### 7.1 Model Component Analysis

Table 1 shows the individual contributions of each model component that was incrementally added to a plain BiLSTM model without features, character embeddings and beam-search. We see that the most crucial performance boost stems from the introduction of either one of our word-in-question features. However, all other extensions also achieve notable F1 improvements. Beam-search slightly improves results, which shows that the most probable start is not necessarily the start of the best answer span.

In general, these results are interesting in many ways. For instance, it is surprising that a simple binary feature like $wiq_b$ can have such a dramatic effect on the overall performance. We believe the reason is that an encoder without any knowledge of the actual question has to account for every possible question that might be asked, i.e., it has to keep track of the entire context around each token in its recurrent state. An informed encoder, on the other hand, can selectively keep track of question-related information. It can further abstract over concrete entities to their respective types, because it is rarely the case that many entities of the same type occur in the question. For example, if a person is mentioned in the question, the context encoder only needs to remember that the “question person” was mentioned, but not the concrete name of the person.

Another interesting finding is that additional character-based embeddings have a notable effect on the overall performance, as was already observed by [?]. We see further improvements when employing representation fusion to allow for more interaction. This shows that a more sophisticated interaction layer can help. However, the differences are not substantial, indicating that this extension does not offer any systematic advantage.

### 7.2 Comparing to State-of-the-Art

Our neural BoW baseline achieves good results on both datasets (Tables 1 and 3). For instance, it outperforms a feature-rich logistic-regression baseline on the SQuAD development set (Table 1) and nearly reaches the BiLSTM baseline system (i.e., FastQA without character embeddings and features). This shows that more than half of all questions in SQuAD, and more than a third in NewsQA, are (partially) answerable by a very simple neural BoW baseline. However, the F1 gap to state-of-the-art systems is quite large, which indicates that employing composition functions more complex than averaging, such as the RNNs in FastQA, is indeed necessary to achieve good performance.

Results presented in Tables 2 and 3 clearly demonstrate the strength of the FastQA system. It is very competitive with previously established state-of-the-art results on the two datasets and even improves on those for NewsQA. This is quite surprising considering the simplicity of FastQA, and it puts existing systems and the complexity of the datasets, especially SQuAD, into perspective. Our extended version FastQAExt achieves even slightly better results, outperforming all results reported prior to the submission of our model on the very competitive SQuAD benchmark.

In parallel to this work, [?] introduced a very similar model to FastQA which relies on a few more hand-crafted features and a 3-layer encoder instead of the single layer used in this work. These changes result in slightly better performance, which is in line with the observations in this work.

### 7.3 Do we need additional interaction?

In order to answer this question, we compare FastQA, a system without a complex word-by-word interaction layer, to representative models that have an interaction layer, namely FastQAExt and the Dynamic Coattention Network (DCN) [?]. We measured both the time- and space-complexity of FastQAExt and a reimplementation of the DCN in relation to FastQA and found that FastQA is about twice as fast as the other two systems and requires less memory than both FastQAExt and the DCN.

In addition, we looked for systematic advantages of FastQAExt over FastQA by comparing SQuAD examples from the development set that were answered correctly by FastQAExt and incorrectly by FastQA (FastQAExt wins) against the converse set (FastQA wins). We studied the average question and answer length as well as the question types for these two sets but could not find any systematic difference. The same observation was made when manually comparing the kind of reasoning needed to answer a certain question. This finding aligns with the marginal empirical improvements, especially on NewsQA, between the two systems, indicating that FastQAExt seems to generalize slightly better but does not offer a particular, systematic advantage. Therefore, we argue that the additional complexity introduced by the interaction layer is not necessarily justified by the incremental performance improvements presented in Section 7.2, especially when memory or run-time constraints exist.

### 7.4 Qualitative Analysis

Besides our empirical evaluations, this section provides a qualitative error inspection of predictions on the SQuAD development dataset. We analyse errors made by the FastQA system in detail and highlight basic abilities that are missing to reach human-level performance.

We found that most errors stem from a lack of either syntactic understanding or fine-grained semantic distinction between lexemes with similar meanings. Other error types are mostly related to annotation preferences, e.g., the predicted answer is good but a better, more specific one exists, or to ambiguities within the question or context.

A prominent type of mistake is a lack of fine-grained understanding of certain answer types (Ex. 1). Another source of error is the lack of co-reference resolution and context-sensitive binding of abbreviations (Ex. 2). We also find that the model sometimes struggles to capture basic syntactic structure, especially with respect to nested sentences, where important separators like punctuation and conjunctions are ignored (Ex. 3).

A manual examination reveals that a large share of mistakes can directly be attributed to the plain application of our heuristic. A similar analysis reveals that most analyzed positive cases are covered by our heuristic as well. We therefore believe that our model, and judging from their empirical results other models as well, mostly learns a simple context/type matching heuristic.

This finding is important because it reveals that an extractive QA system does not have to solve the complex reasoning types that were used to classify SQuAD instances [?] in order to achieve current state-of-the-art results.

## 8 Related Work

The creation of large-scale cloze datasets such as the DailyMail/CNN dataset [?] or the Children's Book Corpus [?] paved the way for the construction of end-to-end neural architectures for reading comprehension. A thorough analysis by [?], however, revealed that DailyMail/CNN was too easy and still quite noisy. New datasets were constructed to eliminate these problems, including SQuAD [?], NewsQA [?] and MsMARCO [?].

Previous question answering datasets such as MCTest [?] and TREC-QA [?] were too small to successfully train end-to-end neural architectures such as the models discussed in Section 4 and required different approaches. Traditional statistical QA systems (e.g., [?]) relied on linguistic pre-processing pipelines and extensive exploitation of external resources, such as knowledge bases, for feature engineering. Other paradigms include template matching and passage retrieval [?].

## 9 Conclusion

In this work, we introduced a simple context/type matching heuristic for extractive question answering which serves as a guideline for the development of two neural baseline systems. Especially FastQA, our RNN-based system, turns out to be an efficient neural baseline architecture for extractive question answering. It combines two simple ingredients necessary for building a currently competitive QA system: a) the awareness of question words while processing the context and b) a composition function that goes beyond simple bag-of-words modeling. We argue that this important finding puts the results of previous, more complex architectures as well as the complexity of recent QA datasets into perspective. In the future, we want to extend the FastQA model to address the linguistically motivated error types of Section 7.4.

## Acknowledgments

We thank Sebastian Riedel, Philippe Thomas, Leonhard Hennig and Omer Levy for comments on an early draft of this work as well as the anonymous reviewers for their insightful comments. This research was supported by the German Federal Ministry of Education and Research (BMBF) through the projects ALL SIDES (01IW14002), BBDC (01IS14013E), and Software Campus (01IS12050, sub-project GeNIE).

## A Weighted Word-in-Question to Term Frequency

In this section we explain the connection between the weighted word-in-question feature $wiq_w$ (Section 2.3) defined in Eq. ? and the term frequency (tf) of a word occurring in the question and context, respectively. To facilitate this analysis, we repeat the equations at this point:

Let us assume that we re-define the similarity score of Equation 1 as follows:

Given the new (discrete) similarity score, we can derive the following equation for the $wiq_w$ feature of a context word. Note that we refer to the term frequencies of a word in the context and question by $tf_x$ and $tf_q$, respectively.

Our derived formula shows that the $wiq_w$ feature of a context word would become a simple combination of its term frequencies within the context and the question if our similarity score were redefined as in Equation 5. Note that this holds true for any finite value chosen in Equation 5.

## B Representation Fusion

### B.1 Intra-Fusion

It is well known that recurrent neural networks have a limited ability to model long-term dependencies. This limitation is mainly due to the information bottleneck that is posed by the fixed size of the internal RNN state. Hence, it is hard for our proposed baseline model to answer questions that require synthesizing evidence from different text passages. Such passages are typically connected via co-referent entities or events. Consider the following example from the NewsQA dataset [?]:

To correctly answer this question, the representation of “Rochester, New York” should contain the information that it refers to Brittanee Drexel. This connection can, for example, be established through the mention of “mother” and its co-referent mention “mom”. Fusing information from the context representation of these mentions allows crucial information about Brittanee Drexel to flow close to the correct answer. We enable the model to find co-referring mentions via a normalized similarity measure (Eq. 6). For each context state we retrieve its co-state via attention (Eq. ?) and finally fuse the representation of each state with its respective co-state representation via a gated addition (Eq. ?). We call this procedure associative representation fusion.

We initialize the similarity measure s.t. it is initially identical to the dot-product between hidden states.

We further introduce recurrent representation fusion to sequentially propagate information gathered by associative fusion between neighbouring tokens, e.g., between the representation of “mother”, which contains additional information about Brittanee Drexel, and the representations of “Rochester, New York”. This is achieved via a recurrent backward fusion (Eq. 7) followed by a recurrent forward fusion (Eq. ?).
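Associative intra-fusion can be sketched in numpy as follows. The scalar sigmoid gate and the exact parameterization of the similarity are assumptions chosen to match the description (gated addition; similarity initialized to the dot-product); the weights here are untrained placeholders:

```python
import numpy as np

def fuse(state, co_state, w_gate):
    """Gated representation fusion (sketch): a scalar sigmoid gate computed
    from both representations interpolates between state and co-state."""
    g = 1.0 / (1.0 + np.exp(-(w_gate @ np.concatenate([state, co_state]))))
    return g * state + (1.0 - g) * co_state

def associative_fusion(states, W_sim):
    """Intra-fusion sketch: each state attends to the *rest* of the context
    via a learned similarity (W_sim = I reproduces the dot-product
    initialization), retrieves its co-state as the attention-weighted
    average, and fuses the two representations."""
    sim = states @ W_sim @ states.T
    np.fill_diagonal(sim, -np.inf)                  # exclude the state itself
    sim = sim - sim.max(axis=1, keepdims=True)      # stable softmax per row
    a = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    co = a @ states
    w_gate = np.zeros(2 * states.shape[1])          # untrained gate: g = 0.5
    return np.stack([fuse(s, c, w_gate) for s, c in zip(states, co)])
```

Because only existing representations are recombined, the output has the same shape as the input, in line with the note below that fusion computes no new features.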

Note that during representation fusion no new features are computed; existing representations are merely combined with each other.

### B.2 Inter-Fusion

The representation fusion between question and context states is very similar to the intra-fusion procedure. It is applied on top of the context representations after intra-fusion has been employed. Associative fusion is performed via attention weights between question states and context states (Eq. 8). The co-state is computed for each context state accordingly (Eq. ?).

Note that because the normalization is applied over all context tokens for each question word, the attention weights will be close to zero for most context positions, and therefore the corresponding co-states will be close to zero-vectors. Thus, only question-related context states receive a non-empty co-state. The rest of inter-fusion follows the same procedure as intra-fusion, and the resulting context representations serve as input to the answer layer.

In contrast to existing interaction layers which typically combine representations retrieved through attention by concatenation and feed them as input to an additional RNN (e.g., LSTM), our approach can be considered a more light-weight version of interaction.

Although our system was designed to answer questions by extracting answers from a given context, it can also be employed for generative question answering datasets such as the Microsoft Machine Reading Comprehension dataset (MsMARCO) [?]. MsMARCO contains real-world queries from the Bing search engine and human-generated answers that are based on relevant web documents. Because we focus on extractive question answering in this work, we limit the training queries to those whose answers are directly extractable from the given web documents; a majority of all queries fall into this category. Evaluation, however, is performed on the entire development and test sets, respectively, which makes it impossible to answer the subset of Yes/No questions properly. For the sake of simplicity, we concatenate all given paragraphs and treat them as a single document. Since most queries in MsMARCO are lower-cased, we also lower-case the context. The official scoring measures of MsMARCO for generative models are ROUGE-L and BLEU-1. Even though our model is extractive, we use our extracted answers as if they were generated.

The results are shown in Table 4. The strong performance of our purely extractive system on the generative MsMARCO dataset is notable. It shows that answers to Bing queries can mostly be extracted directly from web documents without the need for a more complex generative approach. However, since this was only an initial experiment on generative QA using an extractive system, and the methodology used for training, pre- and post-processing on this dataset by the other models, especially [?], is unclear, the comparability to the other QA systems is limited.

### Footnotes

1. More complex heuristics can be employed here but for the sake of simplicity we chose a very simple approach.