Stochastic Answer Networks for SQuAD 2.0


Xiaodong Liu, Wei Li, Yuwei Fang, Aerin Kim, Kevin Duh and Jianfeng Gao
Microsoft Research, Redmond, WA, USA
Johns Hopkins University, Baltimore, MD, USA

This paper presents an extension of the Stochastic Answer Network (SAN), one of the state-of-the-art machine reading comprehension models, that can judge whether a question is answerable. The extended SAN contains two components, a span detector and a binary classifier for judging whether the question is unanswerable, and both components are jointly optimized. Experiments show that SAN achieves results competitive with the state of the art on the Stanford Question Answering Dataset (SQuAD) 2.0. To facilitate research in this field, we release our code:


1 Background

Teaching machines to read and comprehend a given passage/paragraph and answer its corresponding questions is a challenging task. It is also one of the long-term goals of natural language understanding, with important applications such as building intelligent agents for conversation and customer service support. In a real-world setting, it is necessary to judge whether a given question is answerable from the available knowledge, and then to generate a correct answer for questions whose answer can be inferred from the passage, or an empty answer (for an unanswerable question) otherwise.


Figure 1: Examples from SQuAD v2.0. The first question is answerable: its answer, highlighted in blue, can be found in the paragraph. The second question is unanswerable; its plausible answer is highlighted in red.

In contrast to many existing MRC systems (Wang and Jiang, 2016; Liu et al., 2018b; Yu et al., 2018; Seo et al., 2016; Shen et al., 2017), which extract answers by finding a sub-string in the passage/paragraph, we propose a model that not only extracts answers but also predicts whether such an answer should exist. Using a multi-task learning approach (c.f. Liu et al., 2015), we extend the Stochastic Answer Network (SAN) (Liu et al., 2018b) answer span detector for MRC with a classifier that determines whether the question is unanswerable. The unanswerable classifier is a pair-wise classification model (Liu et al., 2018a) which predicts a label indicating whether a given passage-question pair is unanswerable. The two models share the same lower layers to reduce the number of parameters, and have separate top layers for their different tasks (the span detector and the binary classifier).


Figure 2: Architecture of the proposed model for reading comprehension. It includes two components: a span detector (the upper-left SAN answer module) and an unanswerable classifier (the upper-right module). It contains two sets of layers: the shared layers, comprising a lexicon encoding layer, a contextual encoding layer and a memory generation layer; and the task-specific layers, comprising the SAN answer module for span detection and a binary classifier determining whether the question is unanswerable. The model is learned jointly.

Our model is simple and intuitive, yet effective. Without relying on a large pre-trained language model (ELMo) (Peters et al., 2018), the proposed model achieves results competitive with the state of the art on the Stanford Question Answering Dataset (SQuAD) 2.0.

The contributions of this work are summarized as follows. First, we propose a simple yet effective model for MRC that handles unanswerable questions and is optimized jointly. Second, our model achieves competitive results on SQuAD v2.0.

2 Model

Machine reading comprehension is a task which takes a question $Q$ and a passage/paragraph $P$ as inputs, and aims to find an answer span $A$ in $P$. We assume that if the question is answerable, the answer $A$ exists in $P$ as a contiguous text string; otherwise, $A$ is an empty string indicating an unanswerable question. Note that to handle unanswerable questions, we append a dummy text string NULL at the end of each passage/paragraph. Formally, the answer is formulated as $A = \langle a_{begin}, a_{end} \rangle$, the begin and end token indices of the span in $P$. In the case of unanswerable questions, $A$ points to the last token of the passage (the NULL token).
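The NULL-token convention above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's released code; the token string and helper names are hypothetical.

```python
# Hypothetical sketch of the NULL-token convention for unanswerable questions.
NULL = "<NULL>"

def prepare_passage(tokens):
    """Append the dummy NULL token so every passage has an 'unanswerable' slot."""
    return tokens + [NULL]

def unanswerable_span(tokens):
    """Unanswerable questions point both span indices at the final NULL token."""
    last = len(prepare_passage(tokens)) - 1
    return (last, last)

passage = ["The", "cat", "sat", "on", "the", "mat"]
print(prepare_passage(passage)[-1])   # <NULL>
print(unanswerable_span(passage))     # (6, 6)
```

With this convention the span detector alone can express "no answer" by pointing at the appended token, which is what the SAN baseline in Section 4 relies on.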

Our model is a variation of SAN (Liu et al., 2018b), as shown in Figure 2. The main difference is an additional binary classifier that judges whether the question is unanswerable. Roughly, the model includes two sets of layers: the shared layers and the task-specific layers. The shared layers are almost identical to the lower layers of SAN: a lexicon encoding layer, a contextual layer and a memory generation layer. On top of them, there are different answer modules for the different tasks. We employ the SAN answer module for the span detector and a one-layer feed-forward neural network for the binary classification task. This can also be viewed as multi-task learning (Caruana, 1997; Liu et al., 2015; Xu et al., 2018). We briefly describe the model from the ground up as follows; detailed descriptions can be found in (Liu et al., 2018b).

Lexicon Encoding Layer. We map the symbolic/surface features of $Q$ and $P$ into neural space via word embeddings (300-dim GloVe (Pennington et al., 2014) vectors), 16-dim part-of-speech (POS) tagging embeddings, 8-dim named-entity embeddings and 4-dim hard-rule features (three matching features determined by the original word, lower-cased form, and lemma, respectively, plus one term frequency feature). Note that we use small embedding sizes for POS and NER to reduce model size; they mainly serve as coarse-grained word clusters. Additionally, we use question-enhanced passage word embeddings, which can be viewed as a soft matching between questions and passages. Finally, we use two separate two-layer position-wise feed-forward networks (FFN) (Vaswani et al., 2017; Liu et al., 2018b) to map both question and passage encodings into the same dimension. As a result, we obtain the final lexicon embeddings of the tokens in $Q$ as a matrix $E^q$, and of the tokens in $P$ as $E^p$.
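The feature concatenation and position-wise FFN can be sketched as below. The dimensions mirror the text (300-dim GloVe, 16-dim POS, 8-dim NER, 4-dim rules); the hidden size and random weights are stand-ins, not the paper's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the paper's description; HIDDEN is an assumed value.
GLOVE, POS, NER, RULE, HIDDEN = 300, 16, 8, 4, 128
IN_DIM = GLOVE + POS + NER + RULE  # 328 concatenated features per token

def position_wise_ffn(x, w1, w2):
    """Two-layer FFN applied independently to each token (position-wise)."""
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU between the two layers

n_tokens = 5
features = rng.standard_normal((n_tokens, IN_DIM))  # per-token feature rows
w1 = rng.standard_normal((IN_DIM, HIDDEN)) * 0.01   # random stand-in weights
w2 = rng.standard_normal((HIDDEN, HIDDEN)) * 0.01

E = position_wise_ffn(features, w1, w2)
print(E.shape)  # (5, 128): every token mapped into the shared dimension
```

Because the FFN is position-wise, the same weights are applied to every token, so passages and questions of any length can share it.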

Contextual Encoding Layer. A shared two-layer BiLSTM is used on top to encode the contextual information of both passages and questions. To avoid overfitting, we concatenate pre-trained 600-dimensional CoVe vectors (McCann et al., 2017), trained on a German-English machine translation dataset, with the aforementioned lexicon embeddings as the input of the contextual encoding layer, and also with the output of the first contextual encoding layer as the input of its second layer. Thus, we obtain the final representation of the contextual encoding layer as a concatenation of the outputs of the two BiLSTM layers: $H^q$ for questions and $H^p$ for passages.

Memory Generation Layer. In this layer, we generate a working memory by fusing information from both passages $H^p$ and questions $H^q$. Dot-product attention (Vaswani et al., 2017) is used to compute the similarity score between passages and questions as:

$C = \mathrm{dropout}\left(f_{attn}\left(\hat{H}^q, \hat{H}^p\right)\right) \quad (1)$

Note that $\hat{H}^q$ and $\hat{H}^p$ are transformed from $H^q$ and $H^p$ by a one-layer neural network $\mathrm{ReLU}(Wx)$, respectively. A question-aware passage representation is computed as $U^p = \mathrm{concat}(H^p, H^q C)$. After that, we use the method of (Lin et al., 2017) to apply self attention to the passage:

$\hat{U}^p = U^p \, \mathrm{drop_{diag}}\left(f_{attn}\left(U^p, U^p\right)\right) \quad (2)$

where $\mathrm{drop_{diag}}$ means that we drop only the diagonal elements of the similarity matrix (i.e., attention of a token with itself). At last, $U^p$ and $\hat{U}^p$ are concatenated and passed through a BiLSTM to form the final memory: $M = \mathrm{BiLSTM}\left(\left[U^p; \hat{U}^p\right]\right)$.
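The cross-attention and diagonal-dropped self-attention above can be sketched numerically. This is a minimal interpretation with random stand-in encodings; masking the diagonal with negative infinity before the softmax is one way to realize the drop_diag trick.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(a, b):
    """Scaled dot-product similarity followed by a row-wise softmax."""
    return softmax(a @ b.T / np.sqrt(a.shape[-1]))

d, m, n = 8, 4, 6                 # hidden size, question length, passage length
Hq = rng.standard_normal((m, d))  # stand-in question encodings
Hp = rng.standard_normal((n, d))  # stand-in passage encodings

# Cross attention: question-aware passage representation.
C = attn(Hp, Hq)                  # (n, m) passage-to-question weights
Up = np.concatenate([Hp, C @ Hq], axis=-1)   # (n, 2d)

# Self attention with the diagonal masked so a token never attends to itself.
S = Up @ Up.T
np.fill_diagonal(S, -np.inf)
Up_hat = softmax(S) @ Up          # (n, 2d)

memory_input = np.concatenate([Up, Up_hat], axis=-1)
print(memory_input.shape)         # (6, 32): this would feed the final BiLSTM
```

In the actual model the concatenation is then passed through a BiLSTM to produce the memory M; here we stop at its input to keep the sketch dependency-free.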

Span detector. We adopt a multi-turn answer module for the span detector (Liu et al., 2018b). Formally, at time step $t$ in the range $\{0, 1, \dots, T-1\}$, the state is defined by $s_t = \mathrm{GRU}(s_{t-1}, x_t)$. The initial state $s_0$ is the summary of the question: $s_0 = \sum_j \alpha_j H^q_j$, where $\alpha_j = \frac{\exp(w \cdot H^q_j)}{\sum_{j'} \exp(w \cdot H^q_{j'})}$. Here, $x_t$ is computed from the previous state $s_{t-1}$ and the memory $M$: $x_t = \sum_j \beta_j M_j$ and $\beta_j = \mathrm{softmax}(s_{t-1} W M)$. Finally, a bilinear function is used to find the begin and end points of answer spans at each reasoning step $t$:

$P_t^{begin} = \mathrm{softmax}\left(s_t W^{begin} M\right), \quad P_t^{end} = \mathrm{softmax}\left(s_t W^{end} M\right)$

The final prediction is the average over the time steps: $P^{begin} = \frac{1}{T} \sum_t P_t^{begin}$, and likewise for $P^{end}$. We randomly apply dropout at the step level at each time step during training, as done in (Liu et al., 2018b).
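The multi-step averaging with stochastic step dropout can be sketched as follows. The per-step score matrix and the keep probability are stand-ins; the paper does not specify the dropout rate here.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, n = 5, 6   # number of reasoning steps, passage length
# Stand-in per-step begin scores, as if produced by the bilinear scorer.
step_scores = rng.standard_normal((T, n))
step_probs = np.stack([softmax(s) for s in step_scores])  # (T, n)

# Training-time stochastic dropout: randomly drop whole reasoning steps.
keep = rng.random(T) > 0.4                  # assumed keep probability
kept = step_probs[keep] if keep.any() else step_probs

# The prediction is the average over the surviving steps.
p_begin = kept.mean(axis=0)
print(p_begin.shape)  # (6,) -- still a valid distribution over positions
```

Averaging distributions preserves normalization, so the result can be used directly for span selection without renormalizing.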

Unanswerable classifier. We adopt a one-layer neural network as our unanswerable binary classifier:

$P_u = \mathrm{sigmoid}\left(W \, m^{sum}\right) \quad (3)$

where $m^{sum}$ is the summary of the memory: $m^{sum} = \sum_j \alpha_j M_j$ with $\alpha_j = \frac{\exp(w \cdot M_j)}{\sum_{j'} \exp(w \cdot M_{j'})}$. $P_u$ denotes the probability that the question is unanswerable.
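A numerical sketch of the classifier head, with random stand-in memory and weights (the variable names are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, n = 8, 6
M = rng.standard_normal((n, d))   # memory: one row per passage token
w = rng.standard_normal(d)        # attention vector for the summary
W = rng.standard_normal(d) * 0.1  # one-layer classifier weights

alpha = softmax(M @ w)            # attention weights over memory rows
m_sum = alpha @ M                 # attention-weighted summary of the memory
p_unanswerable = sigmoid(W @ m_sum)  # probability question is unanswerable
print(0.0 < p_unanswerable < 1.0)    # True
```

The summary reuses the same attention-pooling pattern as the span detector's initial state, so the classifier adds only a single weight vector on top of the shared memory.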

Objective. The objective function of the joint model has two parts:

$\mathcal{L}_{joint} = \mathcal{L}_{span} + \lambda \, \mathcal{L}_{classifier} \quad (4)$

Following (Wang and Jiang, 2016), the span loss function is defined as:

$\mathcal{L}_{span} = -\left(\log P^{begin}_{a_{begin}} + \log P^{end}_{a_{end}}\right) \quad (5)$

The objective function of the binary classifier is defined as:

$\mathcal{L}_{classifier} = -\left(y \log P_u + (1 - y)\log\left(1 - P_u\right)\right) \quad (6)$

where $y$ is a binary variable: $y = 1$ indicates the question is unanswerable and $y = 0$ denotes that it is answerable.
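The joint objective is straightforward to compute from the two heads' outputs. A minimal sketch with toy probabilities (function names are illustrative):

```python
import numpy as np

def span_loss(p_begin, p_end, begin, end):
    """Span loss: negative log-likelihood of the gold begin/end positions."""
    return -(np.log(p_begin[begin]) + np.log(p_end[end]))

def classifier_loss(p_u, y):
    """Binary cross-entropy on the unanswerable label y (1 = unanswerable)."""
    return -(y * np.log(p_u) + (1 - y) * np.log(1 - p_u))

def joint_loss(p_begin, p_end, begin, end, p_u, y, lam=1.0):
    """Joint objective: span loss plus lambda-weighted classifier loss."""
    return span_loss(p_begin, p_end, begin, end) + lam * classifier_loss(p_u, y)

# Toy example: gold span (1, 2) on a 4-token passage, answerable question.
p_begin = np.array([0.1, 0.7, 0.1, 0.1])
p_end = np.array([0.1, 0.1, 0.7, 0.1])
loss = joint_loss(p_begin, p_end, begin=1, end=2, p_u=0.2, y=0)
print(loss > 0)  # True: both terms are non-negative
```

Setting lam=1.0 matches the lambda value used in the experiments; the analysis section notes that increasing it trades span accuracy for classifier accuracy.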

3 Experiment

3.1 Setup

We evaluate our system on the SQuAD 2.0 dataset (Rajpurkar et al., 2018), a new MRC dataset combining the Stanford Question Answering Dataset (SQuAD) 1.0 (Rajpurkar et al., 2016) with additional unanswerable question-answer pairs. There are around 100K answerable pairs and around 53K unanswerable questions. The dataset contains about 23K passages drawn from approximately 500 Wikipedia articles. All questions and answers are obtained by crowd-sourcing. Two evaluation metrics are used: Exact Match (EM) and Macro-averaged F1 score (F1) (Rajpurkar et al., 2018).

3.2 Implementation details

We use the spaCy tool to tokenize both passages and questions, and to generate lemma, part-of-speech and named-entity tags. The word embeddings are initialized with pre-trained 300-dimensional GloVe vectors (Pennington et al., 2014). A 2-layer BiLSTM is used to encode the contextual information of both questions and passages. The hidden size of our model is chosen by a greedy search over candidate values. During training, Adamax (Kingma and Ba, 2014) is used as our optimizer. The mini-batch size is set to 32. The learning rate is initialized to 0.002 and is halved every 10 epochs. The dropout rate is set to 0.1. To prevent overfitting, we also randomly set 0.5% of the words in both passages and questions to unknown words during training; we use a special token unk to indicate a word that does not appear in GloVe. $\lambda$ in Eq 4 is set to 1.
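The word-dropout regularizer described above can be sketched as below; the rate and the unk token follow the text, while the helper name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def word_dropout(tokens, rate=0.005, unk="unk"):
    """Randomly replace a small fraction of tokens with unk during training,
    mimicking the 0.5% word-dropout regularizer described in the text."""
    return [unk if rng.random() < rate else t for t in tokens]

tokens = ["the", "quick", "brown", "fox"] * 250   # 1000 toy tokens
dropped = word_dropout(tokens)
print(len(dropped) == len(tokens))  # True: only token identity changes
```

Since the unk embedding is also used for genuinely out-of-vocabulary words at test time, this dropout teaches the model to rely on context around unknown tokens.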

4 Results

We investigate the effectiveness of the proposed joint model. To do so, the same shared layers/architecture are employed in the following variants of the proposed model:

  1. SAN: the standard SAN model (Liu et al., 2018b), trained using Eq 5. (To handle unanswerable questions, a NULL string is appended at the end of the passage.)

  2. Joint SAN: the proposed joint model, trained using Eq 4.

  3. Joint SAN + Classifier: the proposed joint model (Eq 4), which also uses the output of the unanswerable binary classifier. (We set the answer to an empty string if the output probability of the classifier is larger than 0.5.)
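The Joint SAN + Classifier decision rule in variant 3 amounts to a simple override; a sketch with an illustrative function name:

```python
def final_answer(span_text, p_unanswerable, threshold=0.5):
    """Override the extracted span with an empty answer when the classifier
    is confident the question is unanswerable (threshold from the text)."""
    return "" if p_unanswerable > threshold else span_text

print(final_answer("Denver Broncos", 0.9))  # ''
print(final_answer("Denver Broncos", 0.1))  # 'Denver Broncos'
```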

Single model: EM F1
SAN 67.89 70.68
Joint SAN 69.27 72.20
Joint SAN + Classifier 69.54 72.66
Table 1: Performance on the SQuAD 2.0 development dataset.

The results in terms of EM and F1 are summarized in Table 1. Joint SAN outperforms the SAN baseline by a large margin, e.g., 67.89 vs. 69.27 (+1.38) EM and 70.68 vs. 72.20 (+1.52) F1, demonstrating the effectiveness of joint optimization. Incorporating the output of the classifier into Joint SAN yields a further slight improvement, e.g., 72.20 vs. 72.66 (+0.46) F1. Analyzing the results, we find that in most cases where our model extracts a NULL-string answer, the classifier also predicts the question as unanswerable with high probability.

SQuAD 2.0 development dataset (EM / F1)
BNA 59.8 62.6
DocQA 61.9 64.8
R.M-Reader 66.9 69.1
R.M-Reader + Verifier 68.5 71.5
Joint SAN 69.3 72.2
SQuAD 2.0 development dataset + ELMo (EM / F1)
DocQA 65.1 67.6
R.M-Reader + Verifier 72.3 74.8
SQuAD 2.0 test dataset (EM / F1)
BNA 59.2 62.1
DocQA 59.3 62.3
DocQA + ELMo 63.4 66.3
R.M-Reader 71.7 74.2
Joint SAN 68.7 71.4
Table 2: Comparison with published results in the literature. Some results are extracted from (Rajpurkar et al., 2018) and others from (Hu et al., 2018); for one entry it is unclear which model variant is used. We only evaluated Joint SAN in the submission.

Table 2 reports a comparison with published results in the literature (for the full results, please refer to the official SQuAD leaderboard). Our model achieves the state of the art on the development dataset in the setting without a pre-trained large language model (ELMo). Compared with the much more complicated R.M-Reader + Verifier model, which includes several components, our model still outperforms it by 0.7 F1. Furthermore, we observe that ELMo gives a large performance boost, e.g., 2.8 F1 points for DocQA. This encourages us to incorporate ELMo into our model in future work.

Analysis. To better understand our model, we analyze the accuracy of the classifier in the joint model. It obtains 75.3% classification accuracy on the development set with a threshold of 0.5. By increasing the value of $\lambda$ in Eq 4, the classification accuracy reaches 76.8%, but the final results of our model improve only slightly (+0.2 F1). This shows that it is important to balance the two components: the span detector and the unanswerable classifier.

5 Conclusion

To sum up, we proposed a simple yet effective model based on SAN. We showed that the joint learning algorithm boosts performance on SQuAD 2.0. We would also like to incorporate ELMo into our model in future work.


We thank Yichong Xu, Shuohang Wang and Sheng Zhang for valuable discussions and comments. We also thank Robin Jia for the help on SQuAD evaluations.


  • Caruana (1997) Rich Caruana. 1997. Multitask learning. Machine learning 28(1):41–75.
  • Hu et al. (2018) Minghao Hu, Yuxing Peng, Zhen Huang, Nan Yang, Ming Zhou, et al. 2018. Read + Verify: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1808.05759.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
  • Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 .
  • Liu et al. (2018a) Xiaodong Liu, Kevin Duh, and Jianfeng Gao. 2018a. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888 .
  • Liu et al. (2015) Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 912–921.
  • Liu et al. (2018b) Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1694–1704.
  • McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. arXiv preprint arXiv:1708.00107 .
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1532–1543.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). volume 1, pages 2227–2237.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, pages 784–789.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2383–2392.
  • Seo et al. (2016) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603 .
  • Shen et al. (2017) Yelong Shen, Xiaodong Liu, Kevin Duh, and Jianfeng Gao. 2017. An empirical analysis of multiple-turn reasoning strategies in reading comprehension tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). volume 1, pages 957–966.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. pages 5998–6008.
  • Wang and Jiang (2016) Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905 .
  • Xu et al. (2018) Yichong Xu, Xiaodong Liu, Yelong Shen, Jingjing Liu, and Jianfeng Gao. 2018. Multi-task learning for machine reading comprehension. arXiv:1809.06963 .
  • Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. In International Conference on Learning Representations.