PubMedQA: A Dataset for Biomedical Research Question Answering
We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances. Each PubMedQA instance is composed of (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion. PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions. Our best performing model, multi-phase fine-tuning of BioBERT with long answer bag-of-word statistics as additional supervision, achieves 68.1% accuracy, compared to single human performance of 78.0% accuracy and majority-baseline of 55.2% accuracy, leaving much room for improvement. PubMedQA is publicly available at https://pubmedqa.github.io.
A long-term goal of natural language understanding is to build intelligent systems that can reason and infer over natural language. The question answering (QA) task, in which models learn how to answer questions, is often used as a benchmark for quantitatively measuring the reasoning and inferring abilities of such intelligent systems.
While many large-scale annotated general domain QA datasets have been introduced rajpurkar2016squad; lai2017race; kovcisky2018narrativeqa; yang2018hotpotqa; kwiatkowski2019natural, the largest annotated biomedical QA dataset, BioASQ tsatsaronis2015overview, has fewer than 3k training instances, most of which are simple factual questions. Some works have proposed automatically constructed biomedical QA datasets pampari2018emrqa; pappas2018bioread; kim2018pilot, which are much larger. However, questions in these datasets are mostly factoid, and their answers can be extracted from the contexts without much reasoning.
In this paper, we aim at building a biomedical QA dataset which (1) has substantial instances with some expert annotations and (2) requires reasoning over the contexts to answer the questions. For this, we turn to PubMed (https://www.ncbi.nlm.nih.gov/pubmed/), a search engine providing access to over 25 million references of biomedical articles. We found that around 760k articles in PubMed use questions as their titles. Among them, the abstracts of about 120k articles are written in a structured style, meaning they have subsections such as “Introduction”, “Results” etc. Conclusive parts of the abstracts, often in “Conclusions”, are the authors’ answers to the question title. Other abstract parts can be viewed as the contexts for giving such answers. This pattern perfectly fits the scheme of QA, but modeling it as abstractive QA, where models learn to generate the conclusions, will result in an extremely hard task due to the variability of writing styles.
Interestingly, more than half of the question titles of PubMed articles can be briefly answered by yes/no/maybe, which is significantly higher than the proportions of such questions in other datasets, e.g.: just 1% in Natural Questions kwiatkowski2019natural and 6% in HotpotQA yang2018hotpotqa. Instead of using conclusions to answer the questions, we explore answering them with yes/no/maybe and treat the conclusions as a long answer for additional supervision.
To this end, we present PubMedQA, a biomedical QA dataset for answering research questions using yes/no/maybe. We collected all PubMed articles with question titles, and manually labeled 1k of them for cross-validation and testing. An example is shown in Fig. 1. The remaining yes/no/maybe-answerable QA instances compose the unlabeled subset, which can be used for semi-supervised learning. Further, we automatically convert statement titles of 211.3k PubMed articles to questions and label them with yes/no answers using a simple heuristic. These artificially generated instances can be used for pre-training. Unlike other QA datasets in which questions are asked by crowd-workers for existing contexts rajpurkar2016squad; yang2018hotpotqa; kovcisky2018narrativeqa, in PubMedQA contexts are generated to answer the questions and both are written by the same authors. This consistency ensures that contexts are perfectly related to the questions, thus making PubMedQA an ideal benchmark for testing scientific reasoning abilities.
As an attempt to solve PubMedQA and provide a strong baseline, we fine-tune BioBERT lee2019biobert on different subsets in a multi-phase style with additional supervision of long answers. Though this model generates decent results and vastly outperforms other baselines, it’s still much worse than the single-human performance, leaving significant room for future improvements.
2 Related Works
Expert-annotated biomedical QA datasets are limited by scale due to the difficulty of annotations. In 2006 and 2007, TREC (https://trec.nist.gov/) held QA challenges on a genomics corpus hersh2006trec; hersh2007trec, where the task is to retrieve relevant documents for 36 and 38 topic questions, respectively. QA4MRE penas2013qa4mre included a QA task about Alzheimer’s disease morante2012machine. This dataset has 40 QA instances and the task is to answer a question related to a given document using one of five answer choices. The QA task of BioASQ tsatsaronis2015overview has phases of (a) retrieving question-related documents and (b) using related documents as contexts to answer yes/no, factoid, list or summary questions. BioASQ 2019 has a training set of 2,747 QA instances and a test set of 500 instances.
Several large-scale automatically collected biomedical QA datasets have been introduced: emrQA pampari2018emrqa is an extractive QA dataset for electronic medical records (EHR) built by re-purposing existing annotations on EHR corpora. BioRead pappas2018bioread and BMKC kim2018pilot both collect cloze-style QA instances by masking biomedical named entities in sentences of research articles and using other parts of the same article as context.
Datasets such as HotpotQA yang2018hotpotqa, Natural Questions kwiatkowski2019natural, ShARC saeidi2018interpretation and BioASQ tsatsaronis2015overview contain yes/no questions as well as other types of questions. BoolQ clark2019boolq specifically focuses on naturally occurring yes/no questions, and those questions are shown to be surprisingly difficult to answer. We add a “maybe” choice in PubMedQA to cover uncertain instances.
Typical neural approaches to answering yes/no questions involve encoding both the question and context, and decoding the encoding to a class output, which is similar to the well-studied natural language inference (NLI) task. Recent breakthroughs of pre-trained language models like ELMo peters2018deep and BERT devlin2018bert show significant performance improvements on NLI tasks. In this work, we use domain specific versions of them to set baseline performance on PubMedQA.
3 PubMedQA Dataset
3.1 Data Collection
PubMedQA is split into three subsets: labeled, unlabeled and artificially generated. They are denoted as PQA-L(abeled), PQA-U(nlabeled) and PQA-A(rtificial), respectively. We show the architecture of PubMedQA dataset in Fig. 2.
|Statistic||PQA-L||PQA-U||PQA-A|
|Number of QA pairs||1.0k||61.2k||211.3k|
|Prop. of yes (%)||55.2||–||92.8|
|Prop. of no (%)||33.8||–||7.2|
|Prop. of maybe (%)||11.0||–||0.0|
|Avg. question length||14.4||15.0||16.3|
|Avg. context length||238.9||237.3||238.0|
|Avg. long answer length||43.2||45.9||41.0|
Collection of PQA-L and PQA-U:
PubMed articles which have i) a question mark in the titles and ii) a structured abstract with conclusive part are collected and denoted as pre-PQA-U. Now each instance has 1) a question which is the original title 2) a context which is the structured abstract without the conclusive part and 3) a long answer which is the conclusive part of the abstract.
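The instance layout described above can be sketched as a small data structure. This is only an illustration: the `Instance` class, the `build_instance` helper and the set of section labels treated as conclusive are hypothetical names chosen for this example, not part of the released dataset tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed labels that mark the conclusive part of a structured abstract.
CONCLUSIVE_LABELS = {"CONCLUSION", "CONCLUSIONS"}

@dataclass
class Instance:
    question: str      # original article title
    context: str       # structured abstract without the conclusive part
    long_answer: str   # conclusive part of the abstract

def build_instance(title: str, sections: List[Tuple[str, str]]) -> Optional[Instance]:
    """Return an instance, or None if the article does not qualify."""
    if "?" not in title:
        return None  # condition i): the title must contain a question mark
    context = [text for label, text in sections
               if label.upper() not in CONCLUSIVE_LABELS]
    conclusion = [text for label, text in sections
                  if label.upper() in CONCLUSIVE_LABELS]
    if not context or not conclusion:
        return None  # condition ii): a structured abstract with a conclusive part
    return Instance(title, " ".join(context), " ".join(conclusion))
```

An article whose title lacks a question mark, or whose abstract has no conclusive section, is simply skipped by this sketch.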
Two annotators (both qualified M.D. candidates) labeled 1k instances from pre-PQA-U with yes/no/maybe to build PQA-L using Algorithm 1. Annotator 1 does not need to do much reasoning to annotate since the long answer is available. We denote this as the reasoning-free setting. However, annotator 2 cannot use the long answer, so reasoning over the context is required for annotation. We denote this as the reasoning-required setting. Note that the annotation process might assign wrong labels when both annotator 1 and annotator 2 make the same mistake, but considering human performance in §5.1, such an error rate could be as low as 1% (roughly half of the product of the two annotator error rates). 500 randomly sampled PQA-L instances are used for 10-fold cross validation and the remaining 500 instances constitute the PubMedQA test set.
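The 1% estimate can be checked with a short calculation. The sketch below plugs in the single-annotator accuracies reported in §5.1 (90.4% and 78.0%) and assumes the two annotators err independently. The variable names are illustrative only.

```python
# Single-annotator accuracies from Section 5.1.
acc_reasoning_free = 0.904      # annotator 1, sees the long answer
acc_reasoning_required = 0.780  # annotator 2, sees only the context

err_1 = 1 - acc_reasoning_free
err_2 = 1 - acc_reasoning_required

# Probability that both annotators are wrong, assuming independent errors.
both_wrong = err_1 * err_2

# The text takes roughly half of this product as the label error rate.
estimated_label_error = both_wrong / 2
print(f"{estimated_label_error:.4f}")
```

The product of the two error rates is about 2.1%, so half of it is close to the 1% quoted in the text.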
Further, we include the unlabeled instances in pre-PQA-U with yes/no/maybe answerable questions to build PQA-U. For this, we use a simple rule-based method which removes all questions starting with interrogative words (i.e. wh-words) or involving selections from multiple entities. This results in over 93% agreement with annotator 1 in identifying the questions that can be answered by yes/no/maybe.
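A filter in the spirit of this rule-based method might look as follows. The paper does not list its exact rules, so the wh-word list, the handling of colons and the use of "or" as a selection signal are all assumptions made for this sketch.

```python
import re

# Assumed list of interrogative words; the paper does not enumerate them.
WH_PATTERN = re.compile(r"(what|which|who|whom|whose|when|where|why|how)\b")

def is_yes_no_maybe_answerable(question: str) -> bool:
    """Heuristic: reject wh-questions and questions that ask for a choice."""
    q = question.strip().lower()
    # Titles often read "Topic: question?", so inspect the part after the last colon.
    clause = q.rsplit(":", 1)[-1].strip()
    if WH_PATTERN.match(clause):
        return False
    # A standalone "or" is used as a crude signal of a selection among entities.
    if re.search(r"\bor\b", clause):
        return False
    return True
```

Applied to the two unanswerable examples in Appendix A, this sketch rejects both, while a title such as “Does ibuprofen increase perioperative blood loss during hip arthroplasty?” passes.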
|Original Statement Title||Converted Question||Label||%|
|Spontaneous electrocardiogram alterations predict ventricular fibrillation in Brugada syndrome.||Do spontaneous electrocardiogram alterations predict ventricular fibrillation in Brugada syndrome?||yes||92.8|
|Liver grafts from selected older donors do not have significantly more ischaemia reperfusion injury.||Do liver grafts from selected older donors have significantly more ischaemia reperfusion injury?||no||7.2|
Collection of PQA-A:
Motivated by the recent successes of large-scale pre-training from ELMo peters2018deep and BERT devlin2018bert, we use a simple heuristic to collect many noisily-labeled instances to build PQA-A for pre-training. Towards this end, we use PubMed articles with 1) a statement title which has POS tagging structures of NP-(VBP/VBZ) (obtained with the Stanford CoreNLP parser manning-EtAl:2014:P14-5) and 2) a structured abstract including a conclusive part. The statement titles are converted to questions by simply moving or adding copulas (“is”, “are”) or auxiliary verbs (“does”, “do”) to the front and further revising for coherence (e.g.: adding a question mark). We generate the yes/no answer according to negation status of the VB. Several examples are shown in Table 2. We collected 211.3k instances for PQA-A, of which 200k randomly sampled instances are for training and the remaining 11.3k instances are for validation.
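The conversion heuristic can be illustrated with a simplified sketch. It assumes the title has already been split into a subject noun phrase, the VBP/VBZ verb, its POS tag, the remainder and the verb lemma by an upstream tagger and lemmatizer. The function name and its argument layout are hypothetical, and only the plain "not" negation is handled.

```python
from typing import Tuple

def statement_to_question(subject: str, verb: str, tag: str,
                          rest: str, lemma: str) -> Tuple[str, str]:
    """Convert an NP-(VBP/VBZ) statement title into (question, yes/no label).

    subject: leading noun phrase; verb: the VBP/VBZ token; tag: "VBP" or "VBZ";
    rest: text after the verb without the final period; lemma: base form of verb.
    """
    negated = rest.startswith("not ")
    if negated:
        rest = rest[len("not "):]
    # Naive decapitalization; it would wrongly lowercase proper nouns.
    subj = subject[0].lower() + subject[1:]
    if verb in ("is", "are", "do", "does"):
        # Copulas and auxiliaries are moved to the front unchanged.
        question = f"{verb.capitalize()} {subj} {rest}?"
    else:
        # Other verbs need an added auxiliary plus the base verb form.
        aux = "Does" if tag == "VBZ" else "Do"
        question = f"{aux} {subj} {lemma} {rest}?"
    return question, ("no" if negated else "yes")
```

On the two statement titles of Table 2, this sketch reproduces the converted questions and labels shown there.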
We show the basic statistics of three PubMedQA subsets in Table 1.
PubMed abstracts are manually annotated by medical librarians with Medical Subject Headings (MeSH, https://www.nlm.nih.gov/mesh), which is a controlled vocabulary designed to describe the topics of biomedical texts. We use MeSH terms to represent abstract topics, and visualize their distribution in Fig. 3. Nearly all instances are human studies and they cover a wide variety of topics, including retrospective, prospective, and cohort studies, different age groups, and healthcare-related subjects like treatment outcome, prognosis and risk factors of diseases.
|Question Type||%||Example Questions|
|Does a factor influence the output?||36.5||Does reducing spasticity translate into functional benefit?|
|Does ibuprofen increase perioperative blood loss during hip arthroplasty?|
|Is a therapy good/necessary?||26.0||Should circumcision be performed in childhood?|
|Is external palliative radiotherapy for gallbladder carcinoma effective?|
|Is a statement true?||18.0||Sternal fracture in growing children: A rare and often overlooked fracture?|
|Xanthogranulomatous cholecystitis: a premalignant condition?|
|Is a factor related to the output?||18.0||Can PRISM predict length of PICU stay?|
|Is trabecular bone related to primary stability of miniscrews?|
|Reasoning Type||%||Example Snippet in Context|
|Inter-group comparison||57.5||[…] Postoperative AF was significantly lower in the Statin group compared with the Non-statin group (16% versus 33%, p=0.005). […]|
|Interpreting subgroup statistics||16.5||[…] 57% of patients were of lower socioeconomic status and they had more health problems, less functioning, and more symptoms […]|
|Interpreting (single) group statistics||16.0||[…] A total of 4 children aged 5-14 years with a sternal fracture were treated in 2 years, 2 children were hospitalized for pain management and […]|
|Text Interpretations of Numbers||%||Example Snippet in Context|
|Existing interpretations of numbers||75.5||[…] Postoperative AF was significantly lower in the Statin group compared with the Non-statin group (16% versus 33%, p=0.005). […]|
|No interpretations (numbers only)||21.0||[…] 30-day mortality was 12.4% in those aged <70 years and 22% in those >70 years (p<0.001). […]|
|No numbers (texts only)||3.5||[…] The halofantrine therapeutic dose group showed loss and distortion of inner hair cells and inner phalangeal cells […]|
Question and Reasoning Types:
We sampled 200 examples from PQA-L and analyzed the types of questions and types of reasoning required to answer them, which are summarized in Table 3. Various types of questions have been asked, including causal effects, evaluations of therapies, relatedness, and whether a statement is true. Besides, PubMedQA also covers several different reasoning types: most (57.5%) involve comparing multiple groups (e.g.: experiment and control), and others require interpreting statistics of a single group or its subgroups. Reasoning over quantitative contents is required in nearly all (96.5%) of them, which is expected due to the nature of biomedical research. 75.5% of contexts have text descriptions of the statistics while 21.0% only have the numbers. We use a Sankey diagram to show the proportional relationships between corresponding question type and reasoning type, as well as corresponding reasoning type and whether there are text interpretations of numbers in Fig. 4.
3.3 Evaluation Settings
The main metrics of PubMedQA are accuracy and macro-F1 on PQA-L test set using question and context as input. We denote prediction using question and context as a reasoning-required setting, because under this setting answers are not directly expressed in the input and reasoning over the contexts is required to answer the question. Additionally, long answers are available at training time, so generation or prediction of them can be used as an auxiliary task in this setting.
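For reference, the two metrics can be computed as in the following minimal sketch. The function names are illustrative, and the convention of scoring a class with no gold or predicted instances as F1 = 0 is an assumption of this example.

```python
from typing import List

LABELS = ("yes", "no", "maybe")

def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores over yes/no/maybe."""
    scores = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro-F1 weights the three classes equally, a model that always predicts “yes” can reach a moderate accuracy while its macro-F1 stays low, which is why both metrics are reported.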
A parallel setting, where models can use question and long answer to predict yes/no/maybe answer, is denoted as reasoning-free setting since yes/no/maybe are usually explicitly expressed in the long answers (i.e.: conclusions of the abstracts). Obviously, it’s a much easier setting which can be exploited for bootstrapping PQA-U.
4.1 Fine-tuning BioBERT
We fine-tune BioBERT lee2019biobert on PubMedQA as a baseline. BioBERT is initialized with BERT devlin2018bert and further pre-trained on PubMed abstracts and PMC (https://www.ncbi.nlm.nih.gov/pmc/) articles. As expected, it vastly outperforms BERT in various biomedical NLP tasks. We denote the original transformer weights of BioBERT as $\theta_0$.
While fine-tuning, we feed PubMedQA questions and contexts (or long answers), separated by the special [SEP] token, to BioBERT. The yes/no/maybe labels are predicted from the special [CLS] embedding using a softmax function. The cross-entropy loss between the predicted and true label distributions is denoted as $\mathcal{L}_{\text{QA}}$.
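The classification step can be made concrete with a small, framework-free sketch. It assumes that three logits (for yes, no and maybe) have already been produced from the [CLS] embedding, for instance by a linear layer. That layer and the function names are assumptions of this example, not a description of the authors' code.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Turn raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def qa_loss(logits: List[float], gold_index: int) -> float:
    """Cross-entropy between the softmax output and a one-hot gold label."""
    return -math.log(softmax(logits)[gold_index])
```

With uniform logits the loss equals log 3, the entropy of a uniform guess over the three labels, and it approaches zero as the gold class dominates.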
4.2 Long Answer as Additional Supervision
Under reasoning-required setting, long answers are available in training but not inference phase. We use them as an additional signal for training: similar to ma2018bag regularizing neural machine translation models with binary bag-of-word (BoW) statistics, we fine-tune BioBERT with an auxiliary task of predicting the binary BoW statistics of the long answers, also using the special [CLS] embedding. We minimize binary cross-entropy loss of this auxiliary task:
$$\mathcal{L}_{\text{BoW}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ b_i \log \hat{b}_i + (1 - b_i) \log (1 - \hat{b}_i) \right]$$
where $b_i$ and $\hat{b}_i$ are ground-truth and predicted probability of whether token $i$ is in the long answers (i.e.: $b_i \in \{0, 1\}$ and $\hat{b}_i \in [0, 1]$), and $N$ is the BoW vocabulary size. The total loss is:
$$\mathcal{L} = \mathcal{L}_{\text{QA}} + \beta \mathcal{L}_{\text{BoW}}$$
where $\mathcal{L}_{\text{QA}}$ is the yes/no/maybe cross-entropy loss of §4.1. In reasoning-free setting which we use for bootstrapping, the regularization coefficient $\beta$ is set to 0 because long answers are directly used as input.
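The auxiliary objective can be written out as a short sketch. It assumes the binary cross-entropy is averaged over the BoW vocabulary and uses `beta` as this example's name for the regularization coefficient. The clamping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math
from typing import List

def bow_loss(target: List[int], predicted: List[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over the BoW vocabulary.

    target[i] is 1 if vocabulary token i occurs in the long answer, else 0;
    predicted[i] is the model's probability for that token.
    """
    total = 0.0
    for b, b_hat in zip(target, predicted):
        b_hat = min(max(b_hat, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += b * math.log(b_hat) + (1 - b) * math.log(1.0 - b_hat)
    return -total / len(target)

def total_loss(qa: float, bow: float, beta: float) -> float:
    """QA cross-entropy plus the weighted auxiliary BoW loss."""
    return qa + beta * bow
```

Setting `beta` to 0 recovers the plain QA loss, which corresponds to the reasoning-free setting described above.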
4.3 Multi-phase Fine-tuning Schedule
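Since PQA-A and PQA-U have different properties from the ultimate test set of PQA-L, BioBERT is fine-tuned in a multi-phase style on different subsets. Fig. 5 shows the architecture of this training schedule. We use $q$, $c$, $a$, $l$ to denote question, context, long answer and yes/no/maybe label of instances, respectively. Their source subsets are indexed by the superscripts of A for PQA-A, U for PQA-U and L for PQA-L.
Phase I Fine-tuning on PQA-A:
PQA-A is automatically collected, and its questions and labels are artificially generated. As a result, questions of PQA-A might differ a lot from those of PQA-U and PQA-L, and it only has yes/no labels with a very imbalanced distribution (92.8% yes vs. 7.2% no). Despite these drawbacks, PQA-A has substantial training instances so models could still benefit from it as a pre-training step.
Thus, in Phase I of multi-phase fine-tuning, we initialize BioBERT with $\theta_0$ (the original BioBERT weights), and fine-tune it on PQA-A using question and context as input:
$$\theta_{\text{I}} \leftarrow \operatorname{argmin}_{\theta} \mathcal{L}\left(\text{BioBERT}_{\theta}(q^{A}, c^{A}),\, l^{A},\, a^{A}\right) \tag{1}$$
where $\mathcal{L}$ is the total loss of §4.2.
Phase II Fine-tuning on Bootstrapped PQA-U:
To fully utilize the unlabeled instances in PQA-U, we exploit the easiness of reasoning-free setting to pseudo-label these instances with a bootstrapping strategy: first, we initialize BioBERT with $\theta_0$, and fine-tune it on PQA-A using question and long answer (reasoning-free),
$$\theta_{B_1} \leftarrow \operatorname{argmin}_{\theta} \mathcal{L}\left(\text{BioBERT}_{\theta}(q^{A}, a^{A}),\, l^{A}\right) \tag{2}$$
then we further fine-tune $\text{BioBERT}_{\theta_{B_1}}$ on PQA-L, also under the reasoning-free setting:
$$\theta_{B_2} \leftarrow \operatorname{argmin}_{\theta} \mathcal{L}\left(\text{BioBERT}_{\theta}(q^{L}, a^{L}),\, l^{L}\right) \tag{3}$$
We pseudo-label PQA-U instances using the most confident predictions of $\text{BioBERT}_{\theta_{B_2}}$ for each class. Confidence is simply defined by the corresponding softmax probability and then we label a subset which has the same proportions of yes/no/maybe labels as those in the PQA-L:
$$l^{U}_{\text{pseudo}} \leftarrow \text{BioBERT}_{\theta_{B_2}}(q^{U}, a^{U}) \tag{4}$$
In phase II, we fine-tune $\text{BioBERT}_{\theta_{\text{I}}}$ on the bootstrapped PQA-U using question and context (under reasoning-required setting):
$$\theta_{\text{II}} \leftarrow \operatorname{argmin}_{\theta} \mathcal{L}\left(\text{BioBERT}_{\theta}(q^{U}, c^{U}),\, l^{U}_{\text{pseudo}},\, a^{U}\right) \tag{5}$$
Final Phase Fine-tuning on PQA-L:
In the final phase, we fine-tune $\text{BioBERT}_{\theta_{\text{II}}}$ on PQA-L:
$$\theta_{\text{F}} \leftarrow \operatorname{argmin}_{\theta} \mathcal{L}\left(\text{BioBERT}_{\theta}(q^{L}, c^{L}),\, l^{L},\, a^{L}\right) \tag{6}$$
Final predictions on instances of PQA-L validation and test sets are made using $\text{BioBERT}_{\theta_{\text{F}}}$:
$$l^{L}_{\text{pred}} = \text{BioBERT}_{\theta_{\text{F}}}(q^{L}, c^{L}) \tag{7}$$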
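The pseudo-labeling step of the bootstrapping strategy can be sketched as a confidence-ranked selection with per-class quotas. The function name, the `budget` parameter (how many instances to keep in total) and the input format are assumptions made for this example. The source only states that the most confident predictions per class are kept in the same label proportions as PQA-L.

```python
from typing import Dict, List, Tuple

def select_pseudo_labels(probs: List[Dict[str, float]],
                         proportions: Dict[str, float],
                         budget: int) -> List[Tuple[int, str]]:
    """Pick the most confident predictions per class, matching target proportions.

    probs[i] maps each label to the softmax probability for instance i.
    Returns (instance index, pseudo-label) pairs.
    """
    # Group instances by predicted (argmax) label together with their confidence.
    by_label: Dict[str, List[Tuple[float, int]]] = {label: [] for label in proportions}
    for i, p in enumerate(probs):
        label = max(p, key=p.get)
        by_label[label].append((p[label], i))
    selected: List[Tuple[int, str]] = []
    for label, share in proportions.items():
        quota = int(budget * share)
        ranked = sorted(by_label[label], reverse=True)  # highest confidence first
        selected.extend((i, label) for _, i in ranked[:quota])
    return selected
```

With the PQA-L proportions from Table 1 (55.2% yes, 33.8% no, 11.0% maybe) passed as `proportions`, the selected subset would mirror the label distribution of the labeled data.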
4.4 Compared Models
Majority:
The majority (about 55%) of the instances have the label “yes”. We use a trivial baseline denoted as Majority where we simply predict “yes” for all instances, regardless of the question and context.
Shallow Features:
For each instance, we include the following shallow features: 1) TF-IDF statistics of the question 2) TF-IDF statistics of the context/long answer and 3) sum of IDF of the overlapping non-stop words between the question and the context/long answer. To allow multi-phase fine-tuning, we apply a feed-forward neural network on the shallow features instead of using a logistic classifier.
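The third shallow feature, the summed IDF of overlapping non-stop words, can be illustrated as follows. The IDF variant (plain log of N over document frequency), the tokenization and the stop-word set are assumptions of this sketch, since the text does not specify them.

```python
import math
from typing import Dict, List, Set

def idf_table(documents: List[List[str]]) -> Dict[str, float]:
    """Plain IDF, log(N / document frequency), over tokenized documents."""
    n = len(documents)
    df: Dict[str, int] = {}
    for doc in documents:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1
    return {token: math.log(n / count) for token, count in df.items()}

def overlap_idf(question: List[str], context: List[str],
                idf: Dict[str, float], stop_words: Set[str]) -> float:
    """Sum of IDF over non-stop words shared by the question and the context."""
    shared = (set(question) & set(context)) - stop_words
    return sum(idf.get(token, 0.0) for token in shared)
```

Rare words shared between question and context contribute more to this feature than common ones, which gives a rough lexical relatedness signal.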
BiLSTM:
We simply concatenate the question and context/long answer with learnable segment embeddings appended to the biomedical word2vec embeddings Pyysalo2013DistributionalSR of each token. The concatenated sentence is then fed to a biLSTM, and the final hidden states of the forward and backward network are used for classifying the yes/no/maybe label.
ESIM with BioELMo:
Following the state-of-the-art recurrent architecture of NLI peters2018deep, we use pre-trained biomedical contextualized embeddings BioELMo jin2019probing for word representations. Then we apply the ESIM model chen2016enhanced, where a biLSTM is used to encode the question and context/long answer, followed by an attentional local inference layer and a biLSTM inference composition layer. After pooling, a softmax output unit is applied for predicting the yes/no/maybe label.
4.5 Compared Training Schedules
Final Phase Only:
Under this setting, we train models only on PQA-L. It’s an extremely low-resource setting where there are only 450 training instances in each fold of cross-validation.
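The figure of 450 training instances per fold follows from splitting the 500 cross-validation instances into 10 folds. A minimal sketch of such a split is given below. The function name, the shuffling seed and the assumption that the instance count is divisible by the number of folds are choices made for this example.

```python
import random
from typing import List, Tuple

def ten_fold_splits(n: int = 500, folds: int = 10,
                    seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return (train indices, held-out indices) for each fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    size = n // folds  # assumes n is divisible by folds
    splits = []
    for k in range(folds):
        held_out = indices[k * size:(k + 1) * size]
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        splits.append((train, held_out))
    return splits
```

Each fold holds out 50 instances and trains on the other 450.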
Phase I + Final Phase:
Under this setting, we skip the training on bootstrapped PQA-U. Models are first fine-tuned on PQA-A, and then fine-tuned on PQA-L.
Phase II + Final Phase:
Under this setting, we skip the training on PQA-A. Models are first fine-tuned on bootstrapped PQA-U, and then fine-tuned on PQA-L.
Single-phase:
Instead of training a model sequentially on different splits, under single-phase training setting we train the model on the combined training set of all PQA splits: PQA-A, bootstrapped PQA-U and PQA-L.
5.1 Human Performance
Human performance is measured during the annotation: As shown in Algorithm 1, annotations of annotator 1 and annotator 2 are used to calculate reasoning-free and reasoning-required human performance, respectively, against the discussed ground truth labels. Human performance on the test set of PQA-L is shown in Table 4. We only test single-annotator performance due to limited resources. kwiatkowski2019natural show that an ensemble of annotators perform significantly better than single-annotator, so the results reported in Table 4 are the lower bounds of human performance. Under reasoning-free setting where the annotator can see the conclusions, a single human achieves 90.4% accuracy and 84.2% macro-F1. Under reasoning-required setting, the task becomes much harder, but it’s still possible for humans to solve: a single annotator can get 78.0% accuracy and 72.2% macro-F1.
|Setting||Accuracy (%)||Macro-F1 (%)|
|Reasoning-Free||90.4||84.2|
|Reasoning-Required||78.0||72.2|
5.2 Main Results
|Model||Final Phase Only||Single-phase||Phase I + Final||Phase II + Final||Multi-phase|
|ESIM w/ BioELMo||53.90||32.40||61.28||42.99||61.96||43.32||60.34||44.38||62.08||45.75|
|ESIM w/ BioELMo||53.96||31.07||62.68||43.59||63.72||47.04||60.16||45.81||63.72||47.90|
We report the test set performance of different models and training schedules in Table 5. In general, multi-phase fine-tuning of BioBERT with additional supervision outperforms other baselines by large margins, but the results are still much worse than just single-human performance.
Comparison of Models:
A trend of BioBERT > ESIM w/ BioELMo > BiLSTM > shallow features > majority holds across different training schedules on both accuracy and macro-F1. Fine-tuned BioBERT is better than the state-of-the-art recurrent model of ESIM w/ BioELMo, probably because BioELMo weights are fixed while all BioBERT parameters can be fine-tuned, which lets the model benefit more from pre-training.
Comparison of Training Schedules:
Multi-phase fine-tuning setting gets 5 out of 9 model-wise best accuracy/macro-F1. Due to lack of annotated data, training only on the PQA-L (final phase only) generates similar results as the majority baseline. In phase I + Final setting where models are pre-trained on PQA-A, we observe significant improvements on accuracy and macro-F1 and some models even achieve their best accuracy under this setting. This indicates that a hard task with limited training instances can be at least partially solved by pre-training on a large automatically collected dataset when the tasks are similarly formatted.
Improvements are also observed in phase II + Final setting, though less significant than those of phase I + Final. As expected, multi-phase fine-tuning schedule is better than single-phase, due to different properties of the subsets.
Despite its simplicity, the auxiliary task of long answer BoW prediction clearly improves the performance: most results (28/40) are better with such additional supervision than without.
5.3 Intermediate Results
In this section we show the intermediate results of multi-phase fine-tuning schedule.
|Model||w/o A.S.||w/ A.S.|
|ESIM w/ BioELMo||94.82||74.01||95.04||75.22|
|Model||Eq. 2||Eq. 3|
|ESIM w/ BioELMo||97.01||88.47||74.06||58.53|
|Model||w/o A.S.||w/ A.S.|
|ESIM w/ BioELMo||78.47||63.32||79.62||64.91|
Phase I:
Results are shown in Table 6. Phase I is fine-tuning on PQA-A using question and context. Since PQA-A is imbalanced due to its collection process, a trivial majority baseline gets 92.76% accuracy. Other models have better accuracy and especially macro-F1 than majority baseline. Fine-tuned BioBERT performs best.
Bootstrapping:
Results are shown in Table 7. Bootstrapping is a three-step process: fine-tuning on PQA-A, then on PQA-L and pseudo-labeling PQA-U. All three steps use question and long answer as input. As expected, models perform better in this reasoning-free setting than they do in reasoning-required setting (for PQA-A, Eq. 2 results in Table 7 are better than the performance in Table 6; for PQA-L, Eq. 3 results in Table 7 are better than the performance in Table 5).
Phase II:
Results are shown in Table 8. In Phase II, since each model is fine-tuned on its own pseudo-labeled PQA-U instances, results are not comparable between models. While the ablation study in Table 5 clearly shows that Phase II is helpful, performance in Phase II doesn’t necessarily correlate with final performance on PQA-L.
We present PubMedQA, a novel dataset aimed at biomedical research question answering using yes/no/maybe, where complex quantitative reasoning is required to solve the task. PubMedQA has substantial automatically collected instances as well as the largest size of expert annotated yes/no/maybe questions in biomedical domain. We provide a strong baseline using multi-phase fine-tuning of BioBERT with long answer as additional supervision, but it’s still much worse than just single human performance.
There are several interesting future directions to explore on PubMedQA, e.g.: (1) about 21% of PubMedQA contexts contain no natural language descriptions of numbers, so how to properly handle these numbers is worth studying; (2) we use binary BoW statistics prediction as a simple demonstration for additional supervision of long answers. Learning a harder but more informative auxiliary task of long answer generation might lead to further improvements.
Articles of PubMedQA are biased towards clinical study-related topics (described in Appendix B), so PubMedQA has the potential to assist evidence-based medicine, which seeks to make clinical decisions based on evidence of high quality clinical studies. Generally, PubMedQA can serve as a benchmark for testing scientific reasoning abilities of machine reading comprehension models.
We are grateful for the anonymous reviewers of EMNLP who gave us very valuable comments and suggestions.
Appendix A Yes/no/maybe Answerability
Not all naturally occurring question titles from PubMed are answerable by yes/no/maybe. The first step of annotating PQA-L (as shown in Algorithm 1) from pre-PQA-U is to manually identify questions that can be answered using yes/no/maybe. We labeled 1091 (about 50.2%) of 2173 question titles as unanswerable. For example, the following questions cannot be answered by yes/no/maybe:
“Critical Overview of HER2 Assessement in Bladder Cancer: What Is Missing for a Better Therapeutic Approach?” (wh- question)
“Otolaryngology externships and the match: Productive or futile?” (multiple choices)
Appendix B Over-represented Topics
Clinical study-related topics are over-represented in PubMedQA: we found proportions of MeSH terms like:
are significantly higher in the PubMedQA articles than those in the 200k most recent general PubMed articles (significance is assessed with a two-proportion z-test).
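For readers unfamiliar with the test, a two-proportion z-test can be computed as in the sketch below. It uses the standard pooled-variance formulation with a two-sided p-value. The function name is illustrative, and any counts passed to it in the usage note are made-up numbers, not statistics from PubMedQA.

```python
import math
from typing import Tuple

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: the two proportions are equal.

    x1 of n1 articles in the first collection carry a given MeSH term,
    x2 of n2 articles in the second collection carry it.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For instance, `two_proportion_z_test(60, 100, 40, 100)` compares hypothetical proportions of 60% and 40% and returns a positive z statistic together with its p-value.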
Appendix C Annotation Criteria
Strictly speaking, most yes/no/maybe research questions can be answered by “maybe” since there will always be some conditions where one statement is true and vice versa. However, the task will be trivial in this case. Instead, we annotate a question using “yes” if the experiments and results in the paper indicate it, so the answer is not universal but context-dependent.
Given a question like “Do patients benefit from drug X?”: certainly not all patients will benefit from it, but if there is a significant difference in an outcome between the experimental and control group, the answer will be “yes”. If there is not, the answer will be “no”.
“Maybe” is annotated when (1) the paper discusses conditions where the answer is True and conditions where the answer is False or (2) more than one intervention/observation/etc. is asked, and the answer is True for some but False for the others (e.g.: “Do Disease A, Disease B and/or Disease C benefit from drug X?”). To model uncertainty of the answer, we don’t strictly follow the logic calculations where such questions can always be answered by either “yes” or “no”.