HEAD-QA: A Healthcare Dataset for Complex Reasoning

David Vilares
Universidade da Coruña, CITIC
Departamento de Computación
Campus de Elviña s/n, 15071
A Coruña, Spain

Carlos Gómez-Rodríguez
Universidade da Coruña, CITIC
Departamento de Computación
Campus de Elviña s/n, 15071
A Coruña, Spain

We present head-qa, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) head-qa challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.

Contact: david.vilares@udc.es (David Vilares), carlos.gomez@udc.es (Carlos Gómez-Rodríguez)

1 Introduction

Recent progress in question answering (qa) has been led by neural models (Seo et al., 2016; Kundu and Ng, 2018), due to their ability to process raw texts. However, some authors (Kaushik and Lipton, 2018; Clark et al., 2018) have discussed the tendency of research to develop datasets and methods that accommodate the data-intensiveness and strengths of current neural methods.

This is the case for popular English datasets such as bAbI (Weston et al., 2015) or SQuAD (Rajpurkar et al., 2016, 2018), where some systems achieve near human-level performance (Hu et al., 2018; Xiong et al., 2017) and surface-level knowledge often suffices to answer. To counteract this, Clark et al. (2016) and Clark et al. (2018) have encouraged progress by developing multi-choice datasets that require reasoning. The questions are at the level of grade-school science, owing to the difficulty of collecting specialized questions. With a similar aim, Lai et al. (2017) released 100k questions and 28k passages intended for middle or high school Chinese students, and Zellers et al. (2018) introduced a dataset for common sense reasoning from a spectrum of daily situations.

However, this kind of dataset is scarce for complex domains like medicine: while challenges have been proposed in such domains, like textual entailment (Abacha et al., 2015; Abacha and Dina, 2016) or answering questions about specific documents and snippets (Nentidis et al., 2018), we know of no resources that require general reasoning on complex domains. The novelty of this work lies in this direction: we present a multi-choice qa task that combines the need for knowledge and reasoning with complex domains, and which takes humans years of training to answer correctly.

Question (medicine): A 13-year-old girl is operated on due to Hirschsprung illness at 3 months of age. Which of the following tumors is more likely to be present?
1. Abdominal neuroblastoma
2. Wilms tumor
3. Mesoblastic nephroma
4. Familial thyroid medullary carcinoma.

Question (pharmacology): The antibiotic treatment of choice for Meningitis caused by Haemophilus influenzae serogroup b is:
1. Gentamicin
2. Erythromycin
3. Ciprofloxacin
4. Cefotaxime

Question (psychology): According to research derived from the Eysenck model, there is evidence that extraverts, in comparison with introverts:
1. Perform better in surveillance tasks.
2. Have greater salivary secretion before the lemon juice test.
3. Have a greater need for stimulation.
4. Have less tolerance to pain.

Table 1: Samples from head-qa

We present head-qa, a multi-choice testbed of graduate-level questions about medicine, nursing, biology, chemistry, psychology, and pharmacology (see Table 1; these examples were translated to English by humans). The data is in Spanish, but we also include an English version. We then test models for open-domain and multi-choice qa, showing the complexity of the dataset and its utility to encourage progress in qa. head-qa and models can be found at http://aghie.github.io/head-qa/.

2 The head-qa corpus

The Ministerio de Sanidad, Consumo y Bienestar Social (https://www.mscbs.gob.es/), as a part of the Spanish government, announces examinations every year to apply for specialization positions in its public healthcare areas. The applicants must have a bachelor’s degree in the corresponding area (from 4 to 6 years), and they prepare for the exam for a year or more, as the vacancies are limited. The exams are used to discriminate among thousands of applicants, who will choose a specialization and location according to their mark (e.g., in medicine, to access a cardiology or gynecology position at a given hospital).

Category       Unsupervised   Supervised setting
               setting        Train    Dev      Test
Biology        1,132          452      226      454
Nursing        1,069          384      230      455
Pharmacology   1,139          457      225      457
Medicine       1,149          455      231      463
Psychology     1,134          453      226      455
Chemistry      1,142          456      228      458
Total          6,765          2,657    1,366    2,742
Table 2: Number of questions in head-qa

Category       Longest    Avg        Longest   Avg
               question   question   answer    answer
Biology        43         11.11      40        5.08
Nursing        187        29.03      94        9.54
Pharmacology   104        18.18      43        6.70
Medicine       308        55.29      85        9.31
Psychology     103        21.91      43        7.98
Chemistry      63         15.82      52        7.62
Table 3: Token statistics in head-qa

We use these examinations (from 2013 to present) to create head-qa. We consider questions involving the following healthcare areas: medicine (aka mir), pharmacology (fir), psychology (pir), nursing (eir), biology (bir), and chemistry (qir). Radiophysics exams are excluded, due to the difficulty of parsing their content (e.g. equations) from the pdf files. Some of the questions might be considered invalid after the exams; we remove those questions from the final dataset. Exams from 2013 and 2014 are multi-choice tests with five options, while the rest of them have just four. The questions mainly refer to technical matters, although some of them also consider social issues (e.g. how to deal with patients in stressful situations). A small percentage (14%) of the medicine questions refer to images that provide additional information to answer correctly. These are included as a part of the corpus, although we will not exploit them in this work. For clarity, Table 4 shows an example. Note that images often correspond to serious injuries and diseases; viewer discretion is advised. The quality of the images varies widely, but it is good enough that the pictures can be analyzed by humans in a printed version. Figure 1 has 1037x1033 pixels.

Figure 1: Image no 21 from MIR 2017
Question: Question linked to image no 21. A 38-year-old bank employee who has been periodically checked by her company is referred to us to assess the chest X-ray. The patient smokes 20 cigarettes/day from the age of 21. She says that during the last months, she is somewhat more tired than usual. The basic laboratory tests are normal except for an Hb of 11.4 g/dL. An electrocardiogram and forced spirometry are normal. What do you think is the most plausible diagnostic orientation?
1. Hodgkin’s disease.
2. Histoplasmosis type fungal infection.
3. Sarcoidosis.
4. Bronchogenic carcinoma.
Table 4: A question referring to Figure 1

We describe in detail the json structure of head-qa in Appendix A. We enumerate below the fields for a given sample:

  • The question id and the question’s content.

  • Path to the image referred to in the question (if any).

  • A list with the possible answers. Each answer is composed of the answer ID and its text.

  • The id of the right answer for that question.
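
As an illustration of how these fields fit together, the following is a minimal sketch that walks over a corpus laid out with the field names listed in Appendix A. The sample content, the exam name, and the value types (e.g. integer ids) are hypothetical and only serve to make the example self-contained; the image path field is omitted because its name is not given here.

```python
import json

# Hypothetical miniature corpus that follows the field names of Appendix A.
# The texts, the exam name and the value types are illustrative assumptions.
sample = json.loads("""
{
  "version": 1.0,
  "language": "en",
  "exams": [
    {
      "name": "hypothetical_exam",
      "year": 2016,
      "category": "medicine",
      "data": [
        {
          "qid": 1,
          "qtext": "Hypothetical question?",
          "ra": 2,
          "answers": [
            {"aid": 1, "atext": "First option"},
            {"aid": 2, "atext": "Second option"}
          ]
        }
      ]
    }
  ]
}
""")


def iter_questions(corpus):
    """Yield (category, qid, qtext, answers, right answer id) for every question."""
    for exam in corpus["exams"]:
        for item in exam["data"]:
            # Map each answer id to its text for easy lookup.
            answers = {a["aid"]: a["atext"] for a in item["answers"]}
            yield exam["category"], item["qid"], item["qtext"], answers, item["ra"]


questions = list(iter_questions(sample))
category, qid, qtext, answers, right_id = questions[0]
right_text = answers[right_id]
```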

Although all the approaches that we will be testing are unsupervised or distantly supervised, we additionally define official training, development and test splits, so future research with supervised approaches can be compared with the work presented here. For this supervised setting, we choose the 2013 and 2014 exams for the training set, 2015 for the development set, and the rest for testing. The statistics are shown in Tables 2 and 3. It is worth noting that a common practice to divide a dataset is to rely on randomized splits to avoid potential biases in the collected data. We decided not to follow this strategy for two reasons. First, the questions and the number of questions per area are designed by a team of healthcare experts who already try to avoid these biases. Second (and more relevant), random splits would impede comparison against official (and aggregated) human results.
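
The year-based split can be sketched as follows. This is only an illustration of the rule stated above; the function name and the exam records are hypothetical.

```python
def split_for_year(year: int) -> str:
    """Map an exam year to its split in the supervised setting."""
    if year in (2013, 2014):
        return "train"
    if year == 2015:
        return "dev"
    # "The rest" of the exams, i.e. those after 2015, are used for testing.
    return "test"


# Hypothetical exam records; only the year matters for the split.
exams = [
    {"name": "hypothetical_exam_a", "year": 2013},
    {"name": "hypothetical_exam_b", "year": 2015},
    {"name": "hypothetical_exam_c", "year": 2017},
]

splits = {"train": [], "dev": [], "test": []}
for exam in exams:
    splits[split_for_year(exam["year"])].append(exam["name"])
```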

Finally, we hope to increase the size of head-qa by including questions from future exams.

English version

head-qa is in Spanish, but we include a translation to English (head-qa-en) using the Google API, which we use to perform cross-lingual experiments. We evaluated the quality of the translation using a sample of 60 random questions and their answers. We relied on two fluent Spanish-English speakers to score the adequacy (how much meaning is preserved?) and on one native English speaker for the fluency (is the language in the output fluent?), following the scale by Koehn and Monz (2006). Both use a scale from 5 to 1. For adequacy: 5 (all meaning), 4 (most meaning), 3 (some meaning), 2 (little meaning), 1 (none). For fluency: 5 (flawless), 4 (good), 3 (non-native), 2 (disfluent), 1 (incomprehensible). The average scores for adequacy were 4.35 and 4.71 out of 5, i.e. most of the meaning is captured; and for fluency 4 out of 5, i.e. good. As a side note, the annotators observed that most names of diseases were successfully translated to English. On the negative side, the translator tended to struggle with elements such as molecular formulae, relatively common in chemistry questions. This particular issue is not only due to the automatic translation process, but also to the difficulty of correctly mapping these elements from PDF exams to plain text.

3 Methods


We represent head-qa as a list of tuples: $[(q_0, A_0), \dots, (q_N, A_N)]$, where $q_i$ is a question and $A_i = [a_{i0}, \dots, a_{im}]$ are the possible answers. We use $\hat{a}_{ik}$ to denote the predicted answer, ignoring indexes when not needed.

Kaushik and Lipton (2018) discuss the need to provide rigorous baselines that help better understand the improvement coming from future models, and also the need to avoid architectural novelty when introducing new datasets. For this reason, our baselines are based on state-of-the-art systems used in open-domain and multi-choice qa (Chen et al., 2017; Kembhavi et al., 2017; Khot et al., 2018; Clark et al., 2018).

3.1 Control methods

Given the complex nature of the task, we include three control methods:


random: Sampling $\hat{a} \sim \mathrm{Multinomial}(\phi)$, where $\phi$ is a random distribution.

blind$_x$: $\hat{a}_{ik} = a_{ix}\ \forall i$. Always choosing the $x$th option. Tests made by the examiners are not totally random (Poundstone, 2014), and right answers tend to occur more often in the middle options.

length: Choosing the longest answer (computed as the number of characters in the answer). Poundstone (2014) points out that examiners have to make sure that the right answer is totally correct, which might take more space.
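
The three control methods are simple enough to sketch in a few lines. The function names are hypothetical, and the uniform sampling used for the random control is an assumption standing in for the random distribution mentioned above. Each function returns the 0-based position of the chosen answer.

```python
import random


def random_baseline(answers, rng):
    """Pick an answer at random (uniform sampling is an assumption)."""
    return rng.randrange(len(answers))


def blind_baseline(answers, x):
    """Always pick the x-th option (x is 1-based), whatever the answers are."""
    return x - 1


def length_baseline(answers):
    """Pick the longest answer, measured in characters (first one wins ties)."""
    return max(range(len(answers)), key=lambda k: len(answers[k]))


# Answer options of the pharmacology sample in Table 1.
answers = ["Gentamicin", "Erythromycin", "Ciprofloxacin", "Cefotaxime"]
rng = random.Random(0)

random_choice = random_baseline(answers, rng)
blind_choice = blind_baseline(answers, 3)
length_choice = length_baseline(answers)
```

On this sample the length control picks "Ciprofloxacin" (13 characters), and the blind control with x = 3 picks the third option by construction.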

3.2 Strong multi-choice methods

We evaluate an information retrieval (ir) model for head-qa and cross-lingual models for head-qa-en. Following Chen et al. (2017), we use Wikipedia as our source of information ($\mathcal{D}$) for all the baselines; we downloaded Spanish and English Wikipedia dumps. We then extract the raw text and remove the elements that add some type of structure (headers, tables, …), using github.com/attardi/wikiextractor.

3.2.1 Spanish information retrieval

Let $(q_i, [a_{i0}, \dots, a_{im}])$ be a question with its possible answers. We first create a set of queries of the form $[q_i + a_{i0}, \dots, q_i + a_{im}]$, which will be sent separately to a search engine. In particular, we use the DrQA’s Document Retriever (Chen et al., 2017), which scores the relation between the queries and the articles as tf-idf weighted bag-of-word vectors, and also takes into account word order and bi-gram counting. The predicted answer is defined as $\hat{a}_{ik} = \arg\max_{k} \mathrm{score}(q_i + a_{ik}, \mathcal{D})$, where $\mathcal{D}$ is the Wikipedia collection, i.e. the answer in the query for which we obtained the highest document relevance. This is equivalent to the ir baselines by Clark et al. (2016, 2018).
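
The query construction and the argmax over answers can be illustrated with a small, self-contained sketch. It is not the DrQA Document Retriever: it assumes a plain unigram tf-idf with cosine similarity (no bi-grams, no hashing), and the toy documents, the question, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every token in the collection."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(text, idf):
    """Sparse tf-idf vector; tokens unseen in the collection are dropped."""
    counts = Counter(tokenize(text))
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}


def relevance(query, docs, idf):
    """Highest cosine similarity between the query and any document."""
    q = tfidf(query, idf)
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    best = 0.0
    for doc in docs:
        d = tfidf(doc, idf)
        d_norm = math.sqrt(sum(w * w for w in d.values()))
        if q_norm == 0.0 or d_norm == 0.0:
            continue
        dot = sum(w * d.get(tok, 0.0) for tok, w in q.items())
        best = max(best, dot / (q_norm * d_norm))
    return best


def ir_predict(question, answers, docs):
    """Build one query per answer (question + answer) and keep the best scored."""
    idf = build_idf(docs)
    scores = [relevance(question + " " + answer, docs, idf) for answer in answers]
    return max(range(len(answers)), key=scores.__getitem__)


# Toy collection and question, unrelated to the corpus.
docs = [
    "mars is often called the red planet",
    "venus is the second planet from the sun",
    "jupiter is the largest planet",
]
question = "which planet is known as the red planet"
answers = ["venus", "mars", "jupiter"]
predicted = ir_predict(question, answers, docs)
```

Each answer contributes only a few tokens to its query, so the question terms dominate the score; this is one reason why long questions with short answers can be hard to separate with this kind of baseline.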

3.2.2 Cross-lingual methods

Although some research on Spanish qa has been done in the last decade (Magnini et al., 2003; Vicedo et al., 2003; Buscaldi and Rosso, 2006; Kamateri et al., 2019), most recent work has been done for English, in part due to the larger availability of resources. On the one hand this is interesting because we hope head-qa will encourage research on multilingual question answering. On the other hand, we want to check how challenging the dataset is for state-of-the-art systems, usually available only for English. To do so, we use head-qa-en, as the adequacy and the fluency scores of the translation were high.

Cross-lingual Information Retrieval

The ir baseline, but applied to head-qa-en. We also use this baseline as an extrinsic way to evaluate the quality of the translation, expecting to obtain a performance similar to the Spanish ir model.

Multi-choice DrQA

(Chen et al., 2017) DrQA first returns the 5 most relevant documents for each question, relying on the information retrieval system described above. It then tries to find the exact span containing the right answer in those documents, using a document reader. For this, the authors rely on a neural network system inspired by the Attentive Reader (Hermann et al., 2015) that was trained on SQuAD (Rajpurkar et al., 2016). The original DrQA is intended for open-domain qa, focusing on factoid questions. To adapt it to a multi-choice setup, to select $\hat{a}$ we compare the selected span against all the answers and select the one that shares the largest percentage of tokens (we lemmatize and remove the stopwords as in Clark et al., 2018). However, we observed that many of the selected spans did not have any word in common with any of the answers. If this happens, we select the longest answer. Non-factoid questions (common in head-qa) are not given any special treatment.
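
The span-to-answer matching step can be sketched as below. Several details are assumptions: the overlap is computed as the fraction of an answer's content tokens that appear in the span, the stopword list is a tiny hypothetical one, lemmatization is omitted, and the spans are made-up reader outputs rather than real DrQA predictions.

```python
import re

# Tiny, hypothetical stopword list used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}


def content_tokens(text):
    """Lowercased word tokens without stopwords (no lemmatization here)."""
    return {tok for tok in re.findall(r"\w+", text.lower()) if tok not in STOPWORDS}


def pick_answer(span, answers):
    """Return the 0-based index of the answer that best overlaps with the span."""
    span_tokens = content_tokens(span)
    overlaps = []
    for answer in answers:
        tokens = content_tokens(answer)
        shared = len(tokens & span_tokens)
        overlaps.append(shared / len(tokens) if tokens else 0.0)
    if max(overlaps) == 0.0:
        # No word in common with any answer: fall back to the longest answer.
        return max(range(len(answers)), key=lambda k: len(answers[k]))
    return max(range(len(answers)), key=overlaps.__getitem__)


# Answer options of the psychology sample in Table 1.
answers = [
    "Perform better in surveillance tasks.",
    "Have greater salivary secretion before the lemon juice test.",
    "Have a greater need for stimulation.",
    "Have less tolerance to pain.",
]

by_overlap = pick_answer("a greater need for external stimulation", answers)
by_fallback = pick_answer("unrelated text", answers)
```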

Multi-choice BiDAF

(Clark et al., 2018) Similar to the multi-choice DrQA, but using a BiDAF architecture as the document reader (Seo et al., 2016). The way BiDAF is trained is also different: the reader was first trained on SQuAD and then further tuned on the science questions presented in Clark et al. (2018), using continued training. This system might select more than one answer as correct. If this happens, we follow a simple approach and select the longest one.

Multi-choice DGEM and Decompatt

(Clark et al., 2018) The models adapt the DGEM (Khot et al., 2018) and Decompatt (Parikh et al., 2016) entailment systems. They consider a set of hypotheses $h_{ik} = q_i + a_{ik}$, and each $h_{ik}$ is used as a query to retrieve a set of relevant sentences, $S_{ik}$. Then, an entailment score $\mathrm{entailment}(h_{ik}, s)$ is computed for every $h_{ik}$ and $s \in S_{ik}$, where $\hat{a}_{ik}$ is the answer inside the $h_{ik}$ that maximizes the score. If multiple answers are selected, we choose the longest one.
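
The selection logic shared by both entailment models can be sketched with pluggable components. The retrieval function and the entailment scorer below are hypothetical toy stand-ins (a fixed sentence list and a word-overlap ratio), not DGEM or Decompatt; only the hypothesis construction, the maximization, and the longest-answer tie-break follow the description above.

```python
from typing import Callable, Sequence


def entailment_predict(
    question: str,
    answers: Sequence[str],
    retrieve: Callable[[str], Sequence[str]],
    entail: Callable[[str, str], float],
) -> int:
    """Return the 0-based index of the answer whose hypothesis is best supported."""
    best_scores = []
    for answer in answers:
        hypothesis = f"{question} {answer}"
        sentences = retrieve(hypothesis)
        # Best entailment score over the retrieved sentences for this hypothesis.
        best_scores.append(
            max((entail(s, hypothesis) for s in sentences), default=float("-inf"))
        )
    top = max(best_scores)
    tied = [k for k, score in enumerate(best_scores) if score == top]
    # If several answers reach the top score, choose the longest one.
    return max(tied, key=lambda k: len(answers[k]))


def toy_retrieve(hypothesis):
    """Hypothetical retriever: always returns the same toy sentence."""
    return ["Mars is called the red planet"]


def toy_entail(premise, hypothesis):
    """Hypothetical scorer: fraction of hypothesis words found in the premise."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0


predicted = entailment_predict("The red planet is", ["Venus", "Mars"], toy_retrieve, toy_entail)
tie_break = entailment_predict("q", ["x", "yy"], toy_retrieve, lambda p, h: 0.5)
```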

4 Experiments


We use accuracy and a points metric (used in the official exams): a right answer counts 3 points and a wrong one subtracts 1 point. Note that as some exams have more choices than others, there is not a direct correspondence between accuracy and points (a given healthcare area might have better accuracy than another one, but worse points score).
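
Both metrics are straightforward to compute; a minimal sketch follows. The function names and the example predictions are hypothetical, and the sketch assumes every question receives a prediction.

```python
def accuracy(predicted, gold):
    """Percentage of questions answered correctly."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


def points(predicted, gold):
    """Exam-style score: +3 for each right answer, -1 for each wrong one."""
    return sum(3 if p == g else -1 for p, g in zip(predicted, gold))


# Hypothetical predictions for four questions (answer ids).
predicted = [1, 2, 3, 4]
gold = [1, 2, 4, 4]

acc = accuracy(predicted, gold)
pts = points(predicted, gold)
```

Here three of four answers are right, giving an accuracy of 75.0 and 3 * 3 - 1 = 8 points. Under this scheme, a uniform random guess is worth 0.25 * 3 - 0.75 * 1 = 0 points per question in expectation when there are four options, and 0.2 * 3 - 0.8 * 1 = -0.2 when there are five.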

Results (unsupervised setting)

Tables 5 and 6 show the accuracy and points scores for both head-qa and head-qa-en. The cross-lingual ir model even outperforms the Spanish one. This is another indicator that the translation is good enough to apply cross-lingual approaches. On the negative side, the approaches based on current neural architectures perform worse than the ir models.

     Model       bir     mir     eir     fir     pir     qir     Avg
ES   random      24.2    22.0    25.1    23.2    24.0    24.5    23.8
     blind_1     23.7    22.8    22.7    22.4    22.5    21.2    22.5
     blind_2     25.6    24.3    23.5    23.0    25.3    24.9    24.4
     blind_3     23.0    24.7    26.5    25.8    22.9    25.1    24.7
     blind_4     22.6    20.0    21.7    22.4    23.2    22.5    22.1
     length      26.9    24.9    28.6    28.7    30.6    29.0    28.1
     ir          34.5    26.5    32.7    35.5    34.2    34.2    32.9
EN   ir          37.9    30.3    32.6    38.7    34.7    33.7    34.6
     drqa        29.5    25.0    27.3    28.3    31.0    30.2    28.5
     bidaf       33.4    26.2    26.8    29.9    26.8    30.3    28.9
     dgem        31.7    25.7    28.7    29.9    28.5    30.3    29.1
     decompatt   30.6    23.6    27.9    27.2    28.3    27.6    27.5
Table 5: Accuracy on the head-qa and head-qa-en corpora (unsupervised setting)

     Model       bir     mir     eir     fir     pir     qir     Avg
ES   blind       -17.6   -2.6    16.6    7.4     -18.8   1.2     -2.3
     length      16.8    -1.0    32.6    33.8    50.8    36.4    28.2
     ir          86.4    14.2    67.0    95.4    82.8    84.4    71.7
EN   ir          116.8   48.6    67.8    125.0   87.6    79.6    87.6
     drqa        40.8    -0.2    20.6    29.8    54.0    47.6    32.1
     bidaf       75.6    11.0    15.8    44.4    16.6    48.6    35.3
     dgem        60.8    7.0     34.2    45.0    31.6    48.4    37.8
     decompatt   51.2    -13.0   27.8    20.2    30.0    23.6    23.3
Table 6: Points on the head-qa and head-qa-en corpora (unsupervised setting)

Results (supervised setting)

We show in Tables 7 and 8 the performance of the top models on the test split corresponding to the supervised setting.

     Model       bir     mir     eir     fir     pir     qir     Avg
ES   random      24.2    23.1    25.2    23.8    27.9    27.7    25.3
     blind       26.0    27.5    29.8    27.2    24.8    27.8    27.2
     length      32.4    27.0    32.8    30.2    30.5    30.1    30.5
     ir          36.5    26.3    36.0    40.3    35.9    36.2    35.2
EN   ir          39.8    33.3    36.4    42.2    35.7    36.0    37.2
     bidaf       36.5    26.6    27.7    29.3    28.1    34.1    30.3
     dgem        31.7    27.2    30.7    29.9    31.0    33.2    30.6
Table 7: Accuracy on the head-qa and head-qa-en corpora (supervised setting)

     Model       bir     mir     eir     fir     pir     qir     Avg
ES   random      -7.0    -17.5   2.5     -10.5   26.5    25.0    3.2
     blind       9.0     22.5    44.5    19.5    -1.5    25.0    19.8
     length      67.0    18.5    70.5    47.5    50.5    47.0    50.2
     ir          105.0   12.5    100.5   139.5   98.5    103.0   93.2
EN   ir          135.0   76.5    104.5   157.5   96.5    101.0   111.8
     bidaf       104.0   14.5    18.5    39.0    29.0    83.0    48.0
     dgem        61.0    20.5    52.5    45.5    54.5    75.0    51.5
Table 8: Points on the head-qa and head-qa-en corpora (supervised setting)

Medicine questions (mir) are the hardest ones to answer across the board. We believe this is due to the greater length of both the questions and the answers (see Table 3). This hypothesis is supported by the lower results on the nursing domain (eir), the category with the second longest questions and answers. In contrast, the categories for which we obtained the best results, such as pharmacology (fir) or biology (bir), have shorter questions and answers. While the best evaluated models surpass all control methods, their performance is still well behind human performance. We illustrate this in Table 9, comparing the performance (points score) of our best model against a summary of the human results on the 2016 exams (2016 was the annual examination for which we were able to find the most information). Also, the best performing model was a non-machine learning model based on standard information retrieval techniques. This reinforces the need for effective information extraction techniques that can later be used to perform complex reasoning with machine learning models.

              bir     mir     eir     fir     pir     qir
Avg 10 best   627.1   592.2   515.2   575.5   602.1   529.1
Pass mark     219.0   207.0   180.0   201.0   210.0   185.0
EN ir         168.0   124.0   77.0    132.0   62.0    93.0
Table 9: Human performance on the 2016 exams. The results are not strictly comparable, as the last 10 questions are considered as backup questions in the human exams, but still show how far the tested baselines are from human performance.

5 Conclusion

We presented a complex multi-choice dataset containing questions about medicine, nursing, biology, pharmacology, psychology and chemistry. Such questions correspond to examinations to access specialized positions in the Spanish healthcare system, and require specialized knowledge and reasoning to be answered. To check its complexity, we then tested different state-of-the-art models for open-domain and multi-choice questions. We show how they struggle with the challenge, being clearly surpassed by a non-machine learning model based on information retrieval. We hope this work will encourage research on designing more powerful qa systems that can carry out effective information extraction and reasoning.

We also believe there is room for alternative challenges in head-qa. In this work we have used it as a closed qa dataset (the potential answers are used as input to determine the right one). Nothing prevents using the dataset in an open setting, where the system is given no clue about the possible answers. This would also require considering whether widely used metrics such as bleu (Papineni et al., 2002) or exact match are appropriate for this particular problem.


This work has received support from the TELEPARES-UDC project (FFI2014-51978-C2-2-R) and the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, from Xunta de Galicia (ED431B 2017/01), and from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150). We thank Mark Anderson for his help with translation fluency evaluation.


Appendix A Appendices

Below we describe the fields of the json file used to represent head-qa.

 "version": 1.0
 "language": ["es","en"]
 "exams": A list of exams.
    "name": Cuaderno_YEAR_1_*IR_ACRONYM.
    "year": e.g. 2016.
    "category": [’medicine’,’biology’,
    "data": A list of questions/answers.
       "qid": The question ID, extracted
              from the original PDF exam
              (usually between 1 and 235).
       "qtext" : The text of the question.
       "ra" : The ID of the right answer.
       "answers": A list with the answer options.
          "aid": The answer ID (1 to 5).
          "atext": The text of the answer.