MaP: A Matrix-based Prediction Approach to Improve Span Extraction in Machine Reading Comprehension
Abstract
Span extraction is an essential problem in machine reading comprehension. Most existing algorithms predict the start and end positions of an answer span in the given context by generating two probability vectors. In this paper, we propose a novel approach that extends the probability vector to a probability matrix. Such a matrix can cover more start-end position pairs. Specifically, for each possible start index, the method always generates an end probability vector. In addition, we propose a sampling-based training strategy to address the computational cost and memory issues in the matrix training phase. We evaluate our method on SQuAD 1.1 and three other question answering benchmarks. Leveraging the highly competitive models BERT and BiDAF as backbones, our proposed approach achieves consistent improvements on all datasets, demonstrating its effectiveness.
1 Introduction
Machine reading comprehension (MRC), which requires the machine to answer comprehension questions based on a given passage of text, has been studied extensively in the past decades Liu et al. (2019). Due to the emergence of various large-scale datasets (e.g., SQuAD Rajpurkar et al. (2016) and MS MARCO Nguyen et al. (2016)) and the advancement of pre-trained models (e.g., ELMo Peters et al. (2018), BERT Devlin et al. (2019), and XLNet Yang et al. (2019)), remarkable progress has been made recently in this area. Among various MRC tasks, span extraction is one of the essential ones. Given the context and question, the span extraction task is to extract the most plausible span of text from the corresponding context as a candidate answer. Although there exist unanswerable cases beyond span extraction, the span-based task is still fundamental and significant in the MRC field.
Previous methods used to predict the start and end positions of an answer span can be divided into two categories. The first one generates the start position and end position independently. We refer to this category as the independent approach. It can be written as $P(s, e \mid H) = P(s \mid H)\,P(e \mid H)$, where $H = f(\mathcal{P}, \mathcal{Q})$, and $s$ and $e$ denote the start and end positions, respectively. $H$ is the hidden representation, based on which $P(s \mid H)$ and $P(e \mid H)$ usually share features. The other one constructs a dependent route from the start position when predicting the end position. We refer to this category as the conditional approach. It can be formalized as $P(s, e \mid H) = P(s \mid H)\,P(e \mid s, H)$. This category usually reuses the predicted position information (e.g., $P(s \mid H)$) to assist in the subsequent prediction. The difference between these two approaches is that the conditional approach considers the relationship between the start and end positions, while the independent approach does not. In the literature, AMANDA Kundu and Ng (2018b), QANet Yu et al. (2018), and SE-BERT Keskar et al. (2019) can be regarded as independent approaches, where the probabilities of the start and end positions are calculated separately with different representations. DCN Xiong et al. (2017), R-NET Wang et al. (2017), BiDAF Seo et al. (2017), Match-LSTM Wang and Jiang (2017), S-Net Tan et al. (2018), and SDNet Zhu et al. (2018) belong to the conditional approach.
The conditional approach empirically has an advantage over the independent approach. However, the output distributions of previous conditional approaches are two probability vectors, which ignore many possible start-end pairs. As an extension, every possible start (or end) position should have its own end (or start) probability vector. Thus, the output conditional probabilities form a matrix.
Based on the above consideration, we propose a Matrix-based Prediction approach (MaP) in this paper. As shown in Figure 1, the key point is to consider as many probabilities as possible in the training and inference phases. Specifically, we calculate a conditional probability matrix instead of a probability vector to expand the choices of start-end pairs. Because a matrix contains many more values than a vector, the training phase of MaP faces a big challenge: high computational cost and memory issues when the input sequence is long. For instance, the matrix contains $512 \times 512 = 262{,}144$ probability values if the sequence length is 512. Therefore, we propose a sampling-based training strategy to speed up the training and reduce the memory cost.
The main contributions of our work are fourfold.

A novel conditional approach is proposed to address the limitation of the probability vector generated by the vector-based conditional approach. It increases the likelihood of hitting the ground-truth start and end positions.

A sampling-based training strategy is proposed to overcome the computation and memory issues in the training phase of the matrix-based conditional approach.

An ensemble approach over both the start-to-end and end-to-start directions of the conditional probability is investigated to improve the accuracy of the answer span.

We evaluate our strategy on SQuAD 1.1 and three other question answering benchmarks. The matrix-based conditional approach is implemented on top of BERT and BiDAF, two highly competitive models, to test the generalization of our strategy. The consistent improvements on all datasets demonstrate its effectiveness.
2 Methodology
In this section, we first give the problem definition. Then we introduce a typical vector-based conditional approach. Next, we introduce our matrix-based conditional approach and the sampling-based training strategy. Finally, an ensemble approach over both the start-to-end and end-to-start directions of the conditional probability is discussed.
2.1 Problem Statement
Given the passage $\mathcal{P} = \{p_1, p_2, \dots, p_n\}$ and the question $\mathcal{Q} = \{q_1, q_2, \dots, q_m\}$, the span extraction task needs to extract a continuous subsequence $\{p_s, \dots, p_e\}$ ($1 \le s \le e \le n$) from the passage as the right answer to the question, where $n$ and $m$ are the lengths of the passage and question, respectively, and $s$ and $e$ are the start and end positions in the passage. Usually, the prediction objective is to maximize the conditional probability $P(s, e \mid \mathcal{P}, \mathcal{Q})$.
2.2 A Typical Vector-based Approach
We summarize a typical implementation of the vector-based conditional approach in Figure 2. The previously mentioned R-NET, BiDAF, Match-LSTM, S-Net, and SDNet can be regarded as such implementations. Its backbone is the Pointer Network proposed by Vinyals et al. (2015). The interactive representation between the given question and passage is calculated as follows,

$H = f(\mathcal{P}, \mathcal{Q}) \in \mathbb{R}^{n \times d}$  (1)

where $f$ is a neural network, e.g., Match-LSTM, QANet, BERT, or XLNet, and $d$ is the dimension size of the representation. After generating the interactive representation, the next step is to predict the answer span.
The main architecture of the span prediction is an RNN. For instance, an LSTM is used in Wang and Jiang (2017), and a GRU is adopted in Tan et al. (2018); Zhu et al. (2018). Take the hidden representation of the end position as an example, which is calculated as follows,

$c^s = (p^s)^\top H$  (2)
$h^e = \mathrm{RNN}(h^s, c^s)$  (3)

where $p^s \in \mathbb{R}^n$ is the start probability vector and $l$ is the dimension size of $h^e$. Then the end probability vector $p^e$ can be calculated using $h^e$ as follows,

$p^e = \mathrm{softmax}\big(w^\top \tanh(V H^\top + W \cdot \mathrm{rep}_n(h^e))\big)$  (4)

where $\mathrm{rep}_n$ is an operation that generates a matrix by repeating the vector on its left $n$ times, and $V \in \mathbb{R}^{l \times d}$, $W \in \mathbb{R}^{l \times l}$, and $w \in \mathbb{R}^{l}$ are parameters to be learned.
The calculation of $p^s$ is similar to that of $p^e$. The key is to obtain the hidden state $h^s$. One choice is to use an attention approach to condense the question representation into a vector. The process is as follows,

$\beta = \mathrm{softmax}\big(w_q^\top \tanh(W_q H_q^\top)\big)$  (5)
$h^s = \beta H_q$  (6)

where $H_q \in \mathbb{R}^{m \times d}$ is the representation corresponding to $\mathcal{Q}$, and $W_q$ and $w_q$ are parameters.
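Read together, Eqs. (1)-(6) describe a small pipeline: pool the question into an initial state, point at a start, condense the passage with the start distribution, and point at an end. The following is a minimal pure-Python sketch of that flow; a plain dot-product score stands in for the $w^\top \tanh(\cdot)$ pointer scoring and a simple average stands in for the RNN cell, so all helper names and shapes here are illustrative assumptions, not the original implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def vector_based_conditional(H, Hq):
    """Vector-based conditional span prediction (start-to-end direction).

    H  : passage representation, n x d (Eq. 1)
    Hq : question representation, m x d
    Returns (p_start, p_end), two length-n probability vectors.
    """
    # Eqs. (5)-(6): condense the question into a single vector h0 by attention
    # (row-sum scoring is a stand-in for the tanh attention).
    beta = softmax([sum(row) for row in Hq])
    h0 = [sum(b * row[t] for b, row in zip(beta, Hq)) for t in range(len(Hq[0]))]

    # Start distribution conditioned on h0 (dot product stands in for Eq. 4's scoring).
    p_start = softmax([dot(h0, Hi) for Hi in H])

    # Eq. (2): condense the passage with p_start; stand-in for Eq. (3)'s RNN cell.
    c = [sum(p * Hi[t] for p, Hi in zip(p_start, H)) for t in range(len(H[0]))]
    h_end = [(a + b) / 2 for a, b in zip(h0, c)]

    # Eq. (4): end distribution conditioned on h_end.
    p_end = softmax([dot(h_end, Hi) for Hi in H])
    return p_start, p_end
```

Note how the single vector `p_start` is the only "condition" carried into the end prediction, which is exactly the limitation the matrix-based approach removes.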
There is a vast number of works on MRC. However, most of these works focus on the design of $f$ and generate the answer span based on the vector-based conditional approach. In this paper, we expand the probability vector to a probability matrix so that many more possibilities can be covered. It is also a natural choice because every start (or end) position should have its own end (or start) probability vector.
2.3 Matrix-based Conditional Approach
As described above, implementations of the vector-based conditional approach share a unified and important step: creating a 'condition'. Take the forward direction (the 'condition' constructed from the start position to the end position) of the vector-based conditional approach as an example: the 'condition' is the probability vector $p^s$. The end probability vector $p^e$ cannot be calculated until $p^s$ is generated. However, there is only one end probability vector, whatever the start position is. In this paper, we keep the 'condition' step but propose calculating an individual end distribution for each start position. Specifically, the probability matrix $M \in \mathbb{R}^{n \times n}$ is calculated as follows,

$M_i = \mathrm{softmax}\big(w^\top \tanh(W\,[H ; \mathrm{rep}_n(\mathrm{sel}_i(H))]^\top)\big)$  (7)

where $M_i$ denotes the $i$-th row of $M$, $[\cdot\,;\cdot]$ is a concatenation operation, $\mathrm{rep}_n$ is an operation that generates a matrix by repeating the vector on its left $n$ times, $\mathrm{sel}_i(H)$ means to choose the $i$-th row from the matrix $H$, and $W$ and $w$ are parameters. Figure 3 illustrates the calculation process of Eq. (7).
Although the calculation is brief and can cover more probabilities than the vector-based approach, it raises a serious concern about computation cost and memory occupation. The main computation cost comes from the matrix multiplication between $W$ and the concatenated representation in Eq. (7), in total $n$ such computations for $M$. The number of probabilities is also $n$ times bigger than in the vector-based conditional approach. This further causes out-of-memory (OOM) issues, especially with a big $n$, because intermediate gradient values need to be cached in the training phase. We propose a sampling-based training strategy to solve the above issues.
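To make the cost concrete: a naive implementation of Eq. (7) runs one full softmax over all $n$ end positions for every one of the $n$ start positions, which is what the sampling strategy below avoids. A toy sketch, with a plain dot product standing in for the $w^\top \tanh(W[\cdot\,;\cdot])$ scoring (the scoring choice is our simplification, not the paper's parameterization):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def conditional_matrix(H):
    """End-probability matrix M of Eq. (7): row i is the end distribution
    conditioned on start position i.  With sequence length n this produces
    n * n probabilities (e.g., 512 * 512 = 262,144)."""
    n = len(H)
    M = []
    for i in range(n):  # one conditional end vector per start position
        scores = [sum(a * b for a, b in zip(H[j], H[i])) for j in range(n)]
        M.append(softmax(scores))
    return M
```

Each row is a valid distribution, so the matrix strictly generalizes the single end vector of the vector-based approach.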
2.4 Sampling-based Training Strategy
In order to train the probability matrix effectively, we propose a sampling-based strategy in the training phase. Given the hyper-parameter $k$, we first choose the indexes of the top $k-1$ possibilities from $p^s$,

$\mathcal{I}' = \mathrm{topk}(p^s_{\setminus s^*},\, k-1)$  (8)

where $\mathrm{topk}$ is an operation used to get the indexes of the top values of its first argument, $p^s_{\setminus s^*}$ contains all but the $s^*$-th value of $p^s$, and $s^*$ is the ground-truth start position used as the supervised information in the training phase. Then, the index $s^*$ must be merged into $\mathcal{I}'$,

$\mathcal{I} = \mathcal{I}' \cup \{s^*\}$  (9)

where $\mathcal{I}$ contains $k$ indexes.
Eqs. (8) and (9) promise that the sampled start probabilities must contain, and only contain, one target probability which we need to train in each iteration. The target probability is the $s^*$-th value of $p^s$, and the bigger it is, the better.
After sampling the start probability vector, the computation cost of $M$ decreases. For each $i \in \mathcal{I}$, executing Eq. (7) repeatedly generates a sampling-based end probability matrix. Note that this sampling-based matrix is a part of the original $M$. We refer to it as $\hat{M}$, with $\hat{M} \in \mathbb{R}^{k \times n}$. It still poses a big computation-cost and memory-occupation issue for a long sequence. So, we carry out operations similar to Eq. (8) and Eq. (9) for each row of $\hat{M}$, using $e^*$ instead of $s^*$, where $e^*$ is the ground-truth end position. Finally, the sampling-based matrix $\hat{M} \in \mathbb{R}^{k \times k}$ is generated. It is small enough to train compared with $M \in \mathbb{R}^{n \times n}$. Figure 4 shows the sampling results colored with a yellow background on the left and the corresponding ground-truth matrix on the right.
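The sampling of Eqs. (8)-(9), applied first to $p^s$ and then per row of $\hat{M}$, can be sketched as follows; the function names are illustrative:

```python
def sample_indices(p, gold, k):
    """Eqs. (8)-(9): indexes of the top-(k-1) probabilities in p, excluding
    the ground-truth index, then merged with the ground-truth index."""
    ranked = sorted((i for i in range(len(p)) if i != gold),
                    key=lambda i: p[i], reverse=True)
    return sorted(ranked[:k - 1] + [gold])

def sample_matrix(M, start_ids, gold_end, k):
    """Build the k x k sub-matrix M_hat: for each sampled start row, keep the
    row's own top-(k-1) end columns plus the ground-truth end column."""
    rows = []
    for i in start_ids:
        cols = sample_indices(M[i], gold_end, k)
        rows.append([M[i][j] for j in cols])
    return rows
```

By construction every sampled row and the sampled start set contain exactly one ground-truth entry, which is what the loss below needs.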
2.5 Training
In the training phase, the objective function is to minimize the cross-entropy error averaged over the start and end positions,

$\mathcal{L}^s = -\frac{1}{N} \sum_{i=1}^{N} \mathrm{onehot}(s^*_i)^\top \log p^s_i$  (10)
$\mathcal{L}^e = -\frac{1}{N} \sum_{i=1}^{N} \mathrm{flat}(Y_i)^\top \log \mathrm{flat}(\hat{M}_i)$  (11)
$\mathcal{L} = \frac{1}{2}(\mathcal{L}^s + \mathcal{L}^e)$  (12)

where $N$ is the number of data, $\mathrm{onehot}(s^*)$ means the one-hot vector of $s^*$, $Y$ means a zero matrix with a value of 1 in row $s^*$ and column $e^*$, and $\mathrm{flat}$ is a row-wise flatten operation. The flatten operation makes the loss function on the matrix-based distribution similar to that on the vector-based distribution.
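A sketch of the sampled loss of Eqs. (10)-(12) for a single example, with the row-wise flatten done explicitly; the argument names are illustrative:

```python
import math

def sampled_loss(p_start, start_ids, gold_start, M_hat, gold_cell):
    """Cross-entropy of Eqs. (10)-(12) on the sampled distributions.

    p_start   : sampled start probabilities (length k)
    start_ids : passage indexes the k entries correspond to
    gold_start: ground-truth start index in the passage
    M_hat     : sampled k x k end matrix
    gold_cell : (row, col) of the ground-truth end inside M_hat
    """
    # Eq. (10): start loss is the negative log-probability at the gold index.
    loss_s = -math.log(p_start[start_ids.index(gold_start)])
    # Eq. (11): flatten M_hat row-wise; the target Y is one-hot at gold_cell.
    flat = [v for row in M_hat for v in row]
    r, c = gold_cell
    loss_e = -math.log(flat[r * len(M_hat[0]) + c])
    # Eq. (12): average the two terms.
    return 0.5 * (loss_s + loss_e)
```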
With the sampling-based training strategy, only a limited number of end probabilities can be trained in each iteration. The extreme situation is that $k$ equals $n$, which makes the whole probability matrix calculated every time. As argued above, this is almost impossible due to time and memory limitations. However, there remains the question of what makes the sampling strategy work. The following content gives an explanation based on gradient back-propagation.
The gradient of the cross-entropy loss with respect to the predicted logits $z$ is,

$\frac{\partial \mathcal{L}}{\partial z_j} = p_j - y_j$  (13)

where $p_j$ is a probability whose value is between 0 and 1 (exclusive) and $y_j \in \{0, 1\}$ is the target. Thus the gradient $p_j - 1$ at the ground-truth position is negative, and the gradient $p_j$ at the other positions is positive in most cases. As the parameter update usually follows $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$ and the learning rate $\eta$ is a positive value, the probability at the ground-truth position should go up, and the probabilities at the other sampled positions should go down.
Figure 5 illustrates the sampling-based training process, where the parameter $k$ is set to 5. It means that, besides the ground truth (red background), the extra top-4 probabilities (blue background) will be chosen for calculation. As the iteration goes from #1 to #3, the probability at the ground-truth position goes up, and those at the sampled top-4 positions go down. Such a sampling-based training approach has the same goal as training on the whole probabilities, and thus should achieve similar results.
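The argument can be checked numerically: with the softmax cross-entropy gradient $p - y$ of Eq. (13), one descent step on toy logits raises the ground-truth probability and lowers the rest. A small illustration (the logits and learning rate are arbitrary):

```python
import math

def softmax(z):
    m = max(z)
    es = [math.exp(v - m) for v in z]
    s = sum(es)
    return [e / s for e in es]

def grad(z, gold):
    """Eq. (13): d(loss)/d(logit_j) = p_j - y_j for softmax + cross-entropy."""
    p = softmax(z)
    return [pj - (1.0 if j == gold else 0.0) for j, pj in enumerate(p)]

# One gradient-descent step directly on the logits (illustration only).
z = [1.0, 0.5, 0.2, 0.1, 0.0]
gold = 2
g = grad(z, gold)
z_new = [zj - 0.5 * gj for zj, gj in zip(z, g)]  # learning rate 0.5
p_old, p_new = softmax(z), softmax(z_new)
```

The gradient is negative only at the gold index, so the update pushes the gold probability up and every other sampled probability down, exactly as in Figure 5.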
2.6 Ensemble for Inference
In the inference phase, the vector-based conditional approach usually searches the span by computing $p^s_i \times p^e_j$ under the condition $i \le j$, and chooses the pair $(i, j)$ with the highest product as the output. The matrix-based conditional approach follows the same idea, but the probability is calculated as $p^s_i \times M_{i,j}$ instead of $p^s_i \times p^e_j$. Here, $p^s_i$ is the $i$-th probability of $p^s$, and $M_{i,j}$ is the probability in row $i$, column $j$ of $M$.
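The inference rule can be sketched directly: maximize $p^s_i \times M_{i,j}$ over $i \le j$. The `max_len` cap below is our illustrative addition, a common practical constraint rather than something the paper specifies:

```python
def best_span(p_start, M, max_len=30):
    """Matrix-based inference: pick (i, j) maximising p_start[i] * M[i][j]
    subject to i <= j and an optional maximum answer length."""
    n = len(p_start)
    best, best_p = (0, 0), -1.0
    for i in range(n):
        for j in range(i, min(n, i + max_len)):
            score = p_start[i] * M[i][j]
            if score > best_p:
                best, best_p = (i, j), score
    return best, best_p
```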
The above inference strategy only involves one direction, e.g., the start-to-end direction (generate the start position first, then the end position), which is the case in most previous works. An ensemble of both the start-to-end and end-to-start directions is a good choice to improve the performance. The difference in the end-to-start direction is that Eqs. (7)-(12) are repeated in the opposite direction. In other words, $p^s$ is replaced by $p^e$, and the matrix $M^{s \to e}$ is replaced by $M^{e \to s}$. In total, there are two groups of probabilities, $p^s_i \times M^{s \to e}_{i,j}$ and $p^e_j \times M^{e \to s}_{j,i}$. In this paper, we design an ensemble strategy which first chooses the top $k$ pairs with the highest probabilities in the start-to-end direction as $C_1$, then chooses the top $k$ pairs with the highest probabilities in the end-to-start direction as $C_2$. Note that some pairs in $C_1$ and $C_2$ may denote the same span. If there are such duplicate elements, we prune them away from $C_2$. Then, we choose the pair with the highest probability in $C_1 \cup C_2$.
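One way to realize this ensemble is sketched below, under two assumptions of ours: the backward matrix is indexed as $M^{e \to s}[\mathrm{end}][\mathrm{start}]$, and duplicate spans are dropped from $C_2$ while their $C_1$ copies are kept:

```python
def top_pairs(score_fn, n, k):
    """Top-k (start, end) pairs with start <= end under a scoring function."""
    pairs = [(i, j) for i in range(n) for j in range(i, n)]
    pairs.sort(key=score_fn, reverse=True)
    return pairs[:k]

def ensemble_span(p_s, M_fwd, p_e, M_bwd, k=5):
    """Two-direction ensemble: top-k spans of the start-to-end scores (C1)
    and of the end-to-start scores (C2, duplicates of C1 pruned), then the
    overall argmax.  M_bwd is indexed [end][start]."""
    n = len(p_s)
    fwd = lambda ij: p_s[ij[0]] * M_fwd[ij[0]][ij[1]]
    bwd = lambda ij: p_e[ij[1]] * M_bwd[ij[1]][ij[0]]
    c1 = top_pairs(fwd, n, k)
    c2 = [ij for ij in top_pairs(bwd, n, k) if ij not in c1]
    scored = [(fwd(ij), ij) for ij in c1] + [(bwd(ij), ij) for ij in c2]
    return max(scored)[1]
```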
The overall training procedure of MaP is summarized in Algorithm 1.
3 Experiments
In this section, we conduct experiments to evaluate the effectiveness of the proposed MaP.
Table 1: EM and F1 on the development sets of four benchmarks, comparing span-extraction strategies on three backbones.

Models      Strategy          SQuAD          NewsQA         HotpotQA       Natural Questions
                              EM     F1      EM     F1      EM     F1      EM     F1
BERT-Base   InD               81.24  88.38   52.59  67.12   59.01  75.69   67.31  78.96
            MaP               81.78  88.59   52.66  66.50   59.82  75.81   67.68  78.99
            MaP (ensemble)    82.12  88.63   53.06  67.37   60.55  76.12   68.21  79.09
BERT-Large  InD               84.05  90.85   54.46  69.61   62.26  78.18   69.44  80.93
            MaP               84.50  90.89   54.84  68.73   63.19  78.99   69.56  80.49
            MaP (ensemble)    84.79  90.89   55.29  69.98   63.70  79.25   69.91  81.22
BiDAF       VCP               68.57  78.23   44.04  58.07   47.31  62.42   56.95  68.79
            MaP               68.85  78.06   44.19  58.65   50.25  65.21   57.04  68.87
            MaP (ensemble)    69.55  78.91   44.25  58.91   51.45  66.74   57.21  69.08
3.1 Datasets
We first evaluate our strategy on SQuAD 1.1, a reading comprehension benchmark. This benchmark suits our evaluation better than its augmented version, SQuAD 2.0, because its questions always have a corresponding answer in the given passages. We also evaluate our strategy on three other datasets from the MRQA 2019 Shared Task: NewsQA Trischler et al. (2017), HotpotQA Yang et al. (2018), and Natural Questions Kwiatkowski et al. (2019).
Table 2: The number of training and development examples in each dataset.

Dataset             Training   Development
SQuAD 1.1           86,588     10,507
NewsQA              74,160     4,212
HotpotQA            72,928     5,904
Natural Questions   104,071    12,836
3.2 Baselines
To validate the effectiveness and generalization of our proposed strategy for span extraction, we implement it with two strong backbones, BERT and BiDAF. Specifically, we borrow their main bodies, except the top layer, and add the proposed strategy on top to perform span extraction on different datasets. More tests on other models, e.g., XLNet Yang et al. (2019) and SpanBERT Joshi et al. (2019), and on other datasets are left as future work.

BERT: BERT is an empirically powerful language model, which obtained state-of-the-art results on eleven natural language processing tasks Devlin et al. (2019). Its original implementation of span prediction belongs to the independent approach. Both BERT-base and BERT-large with uncased pre-trained weights are used in the comparison, to investigate the effect of language-model capacity on span extraction with different prediction approaches.

BiDAF: BiDAF is used as a baseline of the vector-based conditional approach Seo et al. (2017). Its use of a multi-stage hierarchical process and a bi-directional attention flow mechanism makes its representation powerful.
Four strategies of span extraction are involved in our comparison: InD denotes the independent approach; VCP is the vector-based conditional approach; MaP is our matrix-based conditional approach calculated in the start-to-end direction; MaP (ensemble) denotes the ensemble of both directions of the matrix-based conditional approach. InD is compared with MaP and MaP (ensemble) on BERT, and VCP is compared with MaP and MaP (ensemble) on BiDAF.
3.3 Experimental Settings
We implement BERT and BiDAF following their official settings for a fair comparison. For BERT, we train for 3 epochs with a batch size of 32. The max sequence length is 384 for SQuAD 1.1 and 512 for the other datasets, and a sliding window of size 128 is used for all datasets if the sequence is longer than the max length. For BiDAF, we keep all original settings except that we use the ADAM optimizer Kingma and Ba (2015) in the training phase instead of AdaDelta Zeiler (2012), for stable performance. Following Rajpurkar et al. (2016), we evaluate the results using Exact Match (EM) and macro-averaged F1 score. The sampling parameter $k$ is set to 20 for our strategy. We implement our models in Python using the pytorch-transformers library.
3.4 Main Results
The results of our strategies as well as the baselines are shown in Table 1. All these values come from the evaluation on the development set of each dataset, because the test sets are withheld. Our strategy achieves a consistent improvement over the independent approach and the vector-based conditional approach. As we can observe, MaP (ensemble) wins 16 out of 16 comparisons in the BERT-base and BERT-large groups. This proves that the ensemble of both directions is helpful for span extraction. In the BiDAF group, MaP (ensemble) is also the best on all datasets compared with VCP. This shows the robustness of our matrix-based conditional approach across language models. The fact that MaP (without the ensemble) wins 12 out of 12 in EM but only 8 out of 12 in F1 against its baselines demonstrates that the matrix-based conditional approach is especially capable of predicting a clean answer span that matches human judgment exactly. We suppose the reason is that the additional start-end position pairs considered in the probability matrix enhance the interaction and constraint between the start and end, and thus make MaP perform more consistently in EM than in F1.
3.5 Strategy Analysis
Figure 6 shows how the performance changes with respect to the answer length on HotpotQA. We can see that the matrix-based conditional approach works better than the vector-based conditional approach as the span decreases in length. Since short answers account for a high proportion of all answer spans, the matrix-based conditional approach is better suited for the answer span task. This observation also supports the ensemble of both directions. MaP (ensemble), combining MaP's advantage on short answers and VCP's advantage on long answers, can get a better result than either of them.
We investigate the impact of $k$, which is used to choose the top probabilities in the training phase. The results are shown in Figure 7. With the increase of $k$, the EM and F1 show a downtrend. The best performance happens at $k = 20$. We conjecture that choosing more probabilities makes the training difficult and brings extra noise to the candidate positions. For example, if $k$ is set to 30, the number of candidate probabilities will be $30 \times 30 = 900$, which is larger than the sequence length 512 in the vector-based conditional approach.
We analyze the convergence of the sampling-based training strategy on SQuAD 1.1. Since the effectiveness of the sampling-based training strategy has been shown for MaP, we conduct a further experiment under the VCP to prove its generalization. Figure 8 demonstrates the results. As expected, the sampling-based training strategy optimizes the model as well as training on the whole probabilities does. However, it costs more training steps to reach the same loss compared with standard training. Overall, the sampling-based training strategy is well suited to the training of the matrix-based conditional approach.
4 Related Work
Machine reading comprehension is an important topic in the NLP community. More and more neural network models have been proposed to tackle this problem, including DCN Xiong et al. (2017), R-NET Wang et al. (2017), BiDAF Seo et al. (2017), Match-LSTM Wang and Jiang (2017), S-Net Tan et al. (2018), SDNet Zhu et al. (2018), QANet Yu et al. (2018), and HAS-QA Pang et al. (2019). Among various MRC tasks, span extraction is a typical task that extracts a span of text from the corresponding passage as the answer to a given question. It can well overcome the weakness that single words or entities are not sufficient to answer questions Liu et al. (2019).
Previous models proposed for span extraction mostly focus on the design of the architecture, especially on the representation of the question and passage and the interaction between them. There are few works devoted to the top-level design of the span output, which refers to the generation of probabilities from the representation. We divide previous top-level designs into two categories, the independent approach and the conditional approach. The independent approach predicts the start and end positions in the given passage independently Kundu and Ng (2018a); Yu et al. (2018). Although the independent approach makes a simple assumption, it works well when the input features are strong enough, e.g., when combined with BERT Devlin et al. (2019), XLNet Yang et al. (2019), or SpanBERT Joshi et al. (2019). Nevertheless, since there is a dependency relationship between the start and end positions, the conditional approach has an advantage over the independent approach.
A typical work on the conditional approach comes from Wang and Jiang (2017). They proposed two different models based on the Pointer Network. One is the sequence model, which produces a sequence of answer tokens as the final output, and the other is the boundary model, which produces only the start token and the end token of the answer. Their experimental results demonstrate that the boundary model (span extraction) is superior to the sequence model on both EM and F1. R-NET Wang et al. (2017), BiDAF Seo et al. (2017), S-Net Tan et al. (2018), and SDNet Zhu et al. (2018) share the same output layer and inference procedure as the boundary model of Wang and Jiang (2017). Lee et al. (2016) presented an architecture that builds fixed-length representations of all spans in the passage with a recurrent network to address the answer extraction task. Its computation cost is determined by the max length of the possible span and the sequence length. The experimental results show an improvement in EM compared with endpoint prediction, which independently predicts the two endpoints of the answer span.
However, previous works on the conditional approach are always based on a probability vector. In this paper, we investigate another possibility: a matrix-based conditional approach. Besides, a well-matched training strategy is proposed for our approach, and the forward and backward conditional probabilities are also integrated to improve the performance.
5 Conclusion
In this paper, we first investigate different approaches to span extraction in MRC. To improve the current vector-based conditional approach, we propose a matrix-based conditional approach. Considering the dependencies between the start and end positions of the answer span more carefully allows their values to be predicted better. We also propose a sampling-based training strategy to make the training of the matrix-based conditional approach feasible. The final experimental results on a wide range of datasets demonstrate the effectiveness of our approach and training strategy.
Acknowledgments
This work was supported by the National Key R&D Program of China (2019YFB2101802) and the Sichuan Key R&D Project (2020YFG0035).
Footnotes
 We classify BiDAF as a conditional approach per its official implementation: https://github.com/allenai/bi-att-flow
 https://github.com/mrqa/MRQA-Shared-Task-2019
 https://github.com/huggingface/pytorch-transformers
 https://github.com/allenai/allennlp
References
 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.
 Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv:1907.10529.
 Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Unifying question answering and text classification via span extraction. arXiv:1904.09286.
 Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
 Souvik Kundu and Hwee Tou Ng. 2018a. A nil-aware answer extraction framework for question answering. In EMNLP, pages 4243–4252.
 Souvik Kundu and Hwee Tou Ng. 2018b. A question-focused multi-factor attention network for question answering. In AAAI, pages 5828–5835.
 Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. TACL, 7:453–466.
 Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv:1611.01436.
 Shanshan Liu, Xin Zhang, Sheng Zhang, Hui Wang, and Weiming Zhang. 2019. Neural machine reading comprehension: Methods and trends. arXiv:1907.01118.
 Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, co-located with NIPS 2016, volume 1773.
 Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Lixin Su, and Xueqi Cheng. 2019. HAS-QA: Hierarchical answer spans model for open-domain question answering. In AAAI, pages 6875–6882.
 Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT, pages 2227–2237.
 Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392.
 Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
 Chuanqi Tan, Furu Wei, Nan Yang, Weifeng Lv, and Ming Zhou. 2018. S-Net: From answer extraction to answer generation for machine reading comprehension. In AAAI.
 Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Rep4NLP@ACL, pages 191–200.
 Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In NIPS, pages 2692–2700.
 Shuohang Wang and Jing Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. In ICLR.
 Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In ACL, pages 189–198.
 Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In ICLR.
 Liu Yang and Lijing Song. 2019. Contextual aware joint probability model towards question answering system. arXiv:1904.08109.
 Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237.
 Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, pages 2369–2380.
 Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.
 Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv:1212.5701.
 Chenguang Zhu, Michael Zeng, and Xuedong Huang. 2018. SDNet: Contextualized attention-based deep network for conversational question answering. arXiv:1812.03593.