HEAD-QA: A Healthcare Dataset for Complex Reasoning


David Vilares
Universidade da Coruña, CITIC
Departamento de Computación
Campus de Elviña s/n, 15071
A Coruña, Spain
david.vilares@udc.es

Carlos Gómez-Rodríguez
Universidade da Coruña, CITIC
Departamento de Computación
Campus de Elviña s/n, 15071
A Coruña, Spain
carlos.gomez@udc.es
Abstract

We present head-qa, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) head-qa challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.


1 Introduction

Recent progress in question answering (qa) has been led by neural models (Seo et al., 2016; Kundu and Ng, 2018), due to their ability to process raw texts. However, some authors (Kaushik and Lipton, 2018; Clark et al., 2018) have discussed the tendency of research to develop datasets and methods that accommodate the data-intensiveness and strengths of current neural methods.

This is the case of popular English datasets such as bAbI (Weston et al., 2015) or SQuAD (Rajpurkar et al., 2016, 2018), where some systems achieve near human-level performance (Hu et al., 2018; Xiong et al., 2017) and surface-level knowledge often suffices to answer. To counteract this, Clark et al. (2016) and Clark et al. (2018) have encouraged progress by developing multi-choice datasets that require reasoning. Their questions match grade-school science, due to the difficulty of collecting specialized questions. With a similar aim, Lai et al. (2017) released 100k questions and 28k passages intended for middle or high school Chinese students, and Zellers et al. (2018) introduced a dataset for common sense reasoning about a spectrum of daily situations.

However, this kind of dataset is scarce for complex domains like medicine: while challenges have been proposed in such domains, like textual entailment (Abacha et al., 2015; Abacha and Dina, 2016) or answering questions about specific documents and snippets (Nentidis et al., 2018), we know of no resources that require general reasoning on complex domains. The novelty of this work lies in this direction: we present a multi-choice qa task that combines the need for knowledge and reasoning in a complex domain, and which takes humans years of training to answer correctly.

Contribution

We present head-qa, a multi-choice testbed of graduate-level questions about medicine, nursing, biology, chemistry, psychology, and pharmacology (see Table 1; the examples there were translated by humans to English). The data is in Spanish, but we also include an English version. We then test models for open-domain and multi-choice qa, showing the complexity of the dataset and its utility to encourage progress in qa. head-qa and models can be found at http://aghie.github.io/head-qa/.

2 head-qa

The Ministerio de Sanidad, Consumo y Bienestar Social (a ministry of the Spanish government) announces examinations every year to apply for specialization positions in its public healthcare areas. Applicants must hold a bachelor's degree in the corresponding area (4 to 6 years of study), and they typically prepare for the exam for a year or more, as the vacancies are limited. The exams are used to discriminate among thousands of applicants, who choose a specialization and location according to their mark (e.g., in medicine, to access a cardiology or gynecology position at a given hospital).

We use these examinations (from 2013 to the present) to create head-qa. We consider questions involving the following healthcare areas: medicine (aka mir), pharmacology (fir), psychology (pir), nursing (eir), biology (bir), and chemistry (qir). Radiophysics exams are excluded, due to the difficulty of parsing their content (e.g. equations) from the pdf files. Some questions might be considered invalid after the exams; we remove those questions from the final dataset. Exams from 2013 and 2014 are multi-choice tests with five options, while the rest have just four. The questions mainly refer to technical matters, although some also touch on social issues (e.g. how to deal with patients in stressful situations). A small percentage (14%) of the medicine questions refer to images that provide additional information needed to answer correctly. These images are included as a part of the corpus, although we do not exploit them in this work (note that they often depict serious injuries and diseases, so viewer discretion is advised). The quality of the images varies widely, but it is good enough for the pictures to be analyzed by humans in a printed version; for instance, Figure 1 has 1037x1033 pixels. For clarity, Table 4 shows an example:

Question linked to image no. 21. A 38-year-old bank employee who has been periodically checked by her company is referred to us to assess the chest X-ray. The patient has smoked 20 cigarettes/day since the age of 21. She says that during the last months she has been somewhat more tired than usual. The basic laboratory tests are normal except for an Hb of 11.4 g/dL. An electrocardiogram and forced spirometry are normal. What do you think is the most plausible diagnostic orientation?

1. Hodgkin's disease.
2. Histoplasmosis type fungal infection.
3. Sarcoidosis.
4. Bronchogenic carcinoma.

We describe in detail the json structure of head-qa in Appendix A. We enumerate below the fields for a given sample:

• The question ID and the question's content.

• The path to the image referred to in the question (if any).

• A list with the possible answers, each composed of an answer ID and its text.

• The ID of the right answer for that question.
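As a minimal sketch (not the official loader), a single sample can be represented and validated with the fields enumerated above; the concrete values below, including the image path, are hypothetical, and the real file additionally nests samples inside per-year exams:

```python
def check_sample(sample):
    """Validate that a sample dict carries the fields listed above."""
    assert "qid" in sample and "qtext" in sample
    answer_ids = {a["aid"] for a in sample["answers"]}
    # The right-answer ID must point at one of the listed options.
    assert sample["ra"] in answer_ids
    return True

sample = {
    "qid": 21,
    "qtext": "What do you think is the most plausible diagnostic orientation?",
    "image": "images/mir_2016_21.jpg",   # hypothetical path; None if no image
    "answers": [
        {"aid": 1, "atext": "Hodgkin's disease."},
        {"aid": 2, "atext": "Histoplasmosis type fungal infection."},
        {"aid": 3, "atext": "Sarcoidosis."},
        {"aid": 4, "atext": "Bronchogenic carcinoma."},
    ],
    "ra": 4,
}
check_sample(sample)
```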

Although all the approaches we test are unsupervised or distantly supervised, we additionally define official training, development and test splits, so that future research with supervised approaches can be compared with the work presented here. For this supervised setting, we choose the 2013 and 2014 exams for the training set, 2015 for the development set, and the rest for testing. The statistics are shown in Tables 2 and 3. It is worth noting that a common practice when dividing a dataset is to rely on randomized splits to avoid potential biases in the collected data. We decided not to follow this strategy for two reasons. First, the questions and the number of questions per area are designed by a team of healthcare experts who already try to avoid such biases. Second (and more relevant), random splits would impede comparison against official (and aggregated) human results.

Finally, we hope to increase the size of head-qa by including questions from future exams.

English version

head-qa is in Spanish, but we include a translation to English (head-qa-en) obtained with the Google API, which we use to perform cross-lingual experiments. We evaluated the quality of the translation on a sample of 60 random questions and their answers. We relied on two fluent Spanish-English speakers to score the adequacy (how much meaning is preserved, on a scale from 5 to 1: 5 all meaning, 4 most meaning, 3 some meaning, 2 little meaning, 1 none) and on one native English speaker for the fluency (how fluent the output language is, on a scale from 5 to 1: 5 flawless, 4 good, 3 non-native, 2 disfluent, 1 incomprehensible), following the scale by Koehn and Monz (2006). The average scores for adequacy were 4.35 and 4.71 out of 5, i.e. most of the meaning is captured; the fluency score was 4 out of 5, i.e. good. As a side note, the annotators observed that most names of diseases were successfully translated into English. On the negative side, the translator tended to struggle with elements such as molecular formulae, which are relatively common in chemistry questions (an issue due not only to the automatic translation process, but also to the difficulty of correctly mapping these elements from the pdf exams to plain text).

3 Methods

Notation

We represent head-qa as a list of tuples (q_i, A_i), where q_i is a question and A_i = {a_i1, ..., a_im} are its possible answers. We use â_i to denote the predicted answer, omitting the index i when not needed.

Kaushik and Lipton (2018) discuss the need to provide rigorous baselines that help better understand the improvement coming from future models, as well as the need to avoid architectural novelty when introducing new datasets. For this reason, our baselines are based on state-of-the-art systems used in open-domain and multi-choice qa (Chen et al., 2017; Kembhavi et al., 2017; Khot et al., 2018; Clark et al., 2018).

3.1 Control methods

Given the complex nature of the task, we include three control methods:

Random

Sampling â uniformly at random from the set of possible answers A.

Blind_x

â = a_x, i.e. always choosing the xth option. Tests made by the examiners are not totally random (Poundstone, 2014), and right answers tend to occur more often in the middle options.

Length

Choosing the longest answer, computed as the number of characters. Poundstone (2014) points out that examiners have to make sure that the right answer is totally correct, which might take more space.
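The three control methods can be sketched in a few lines of Python (a toy illustration of the above; the helper names and the example options are ours):

```python
import random

def random_baseline(answers, rng=random):
    # Random: sample one option uniformly at random.
    return rng.choice(answers)

def blind_baseline(answers, x):
    # Blind_x: always pick the x-th option (1-indexed), ignoring content.
    return answers[x - 1]

def length_baseline(answers):
    # Length: pick the longest answer, measured in characters.
    return max(answers, key=len)

answers = ["Hodgkin's disease.",
           "Histoplasmosis type fungal infection.",
           "Sarcoidosis.",
           "Bronchogenic carcinoma."]
length_baseline(answers)  # -> the longest option
```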

3.2 Strong multi-choice methods

We evaluate an information retrieval (ir) model for head-qa and cross-lingual models for head-qa-en. Following Chen et al. (2017), we use Wikipedia as our source of information for all the baselines (we downloaded Spanish and English Wikipedia dumps). We then extract the raw text and remove the elements that add some type of structure (headers, tables, …).

3.2.1 Spanish information retrieval

Given a question q with possible answers a_1, ..., a_m, we first create a set of queries of the form (q, a_j), which are sent separately to a search engine. In particular, we use DrQA's Document Retriever (Chen et al., 2017), which scores the relation between the queries and the articles as tf-idf weighted bag-of-words vectors, and also takes word order and bi-gram counting into account. The predicted answer is defined as â = a_ĵ, with ĵ = argmax_j score(q, a_j), i.e. the answer in the query for which we obtained the highest document relevance. This is equivalent to the ir baselines of Clark et al. (2016, 2018).
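This query-scoring scheme can be emulated with a tiny tf-idf retriever (a toy stand-in for DrQA's Document Retriever, without its bi-gram and word-order features; the tokenization and the example corpus are our own):

```python
import math
from collections import Counter

def tokenize(text):
    # Crude tokenization: lowercase, split on whitespace, strip punctuation.
    return [t.strip(".,;:?!") for t in text.lower().split()]

def tfidf_score(query_tokens, doc_tokens, doc_freq, n_docs):
    # tf-idf weighted bag-of-words relevance of one document for a query.
    tf = Counter(doc_tokens)
    return sum(tf[t] * math.log(n_docs / doc_freq[t])
               for t in set(query_tokens) if doc_freq[t])

def ir_predict(question, answers, corpus):
    # Send one query (question + answer) per option and return the
    # answer whose query reaches the highest document relevance.
    docs = [tokenize(d) for d in corpus]
    doc_freq = Counter(t for d in docs for t in set(d))
    def best_doc_score(answer):
        q = tokenize(question + " " + answer)
        return max(tfidf_score(q, d, doc_freq, len(docs)) for d in docs)
    return max(answers, key=best_doc_score)
```

For example, with a two-document toy corpus mentioning sarcoidosis and bronchogenic carcinoma, the query built from the carcinoma option accumulates the most tf-idf mass and is selected.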

3.2.2 Cross-lingual methods

Although some research on Spanish qa has been carried out in the last decade (Magnini et al., 2003; Vicedo et al., 2003; Buscaldi and Rosso, 2006; Kamateri et al., 2019), most recent work targets English, partly due to the larger availability of resources. On the one hand, this is interesting because we hope head-qa will encourage research on multilingual question answering. On the other hand, we want to check how challenging the dataset is for state-of-the-art systems, which are usually available only for English. To do so, we use head-qa-en, as the adequacy and fluency scores of the translation were high.

Cross-lingual Information Retrieval

The ir baseline, but applied to head-qa-en. We also use this baseline as an extrinsic way to evaluate the quality of the translation, expecting a performance similar to that of the Spanish ir model.

Multi-choice DrQA

(Chen et al., 2017) DrQA first returns the 5 most relevant documents for each question, relying on the information retrieval system described above. It then tries to find the exact span in those documents containing the right answer, using a document reader. For this, the authors rely on a neural network inspired by the Attentive Reader (Hermann et al., 2015) that was trained on SQuAD (Rajpurkar et al., 2016). The original DrQA is intended for open-domain qa, focusing on factoid questions. To adapt it to a multi-choice setup, we compare the selected span against all the answers and select the one that shares the largest percentage of tokens (we lemmatize and remove stopwords as in Clark et al. (2018)). We observed, however, that many of the selected spans did not have any word in common with any of the answers; if this happens, we select the longest answer. Non-factoid questions (common in head-qa) are not given any special treatment.
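The span-to-answer matching heuristic can be sketched as follows (a simplified version: we strip punctuation and filter a tiny hypothetical stopword list instead of lemmatizing):

```python
STOPWORDS = frozenset({"the", "a", "an", "of"})  # toy stopword list

def tokens(text):
    # Lowercase, split on whitespace, strip punctuation, drop stopwords.
    return {t.strip(".,;:?!") for t in text.lower().split()} - STOPWORDS

def select_answer(span, answers):
    # Pick the answer sharing the largest fraction of tokens with the
    # span predicted by the reader; if no answer overlaps at all,
    # fall back to the longest answer.
    span_toks = tokens(span)
    def overlap(answer):
        a_toks = tokens(answer)
        return len(span_toks & a_toks) / len(a_toks) if a_toks else 0.0
    best = max(answers, key=overlap)
    return best if overlap(best) > 0.0 else max(answers, key=len)
```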

Multi-choice BiDAF

(Clark et al., 2018) Similar to the multi-choice DrQA, but using a BiDAF architecture as the document reader (Seo et al., 2016). The way BiDAF is trained also differs: the reader is first trained on SQuAD and then further tuned to the science questions presented in Clark et al. (2018), using continued training. This system might select more than one answer as correct; if this happens, we follow a simple approach and select the longest one.

Multi-choice DGEM and Decompatt

(Clark et al., 2018) These models adapt the DGEM (Parikh et al., 2016) and Decompatt (Khot et al., 2018) entailment systems. They consider a set of hypotheses h_j = (q, a_j), and each h_j is used as a query to retrieve a set of relevant sentences, S_j. Then, an entailment score is computed for every s ∈ S_j and h_j, and â is the answer a_j inside the hypothesis that maximizes the score. If multiple answers are selected, we choose the longest one.
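The selection scheme shared by both entailment-based baselines can be abstracted as below; `retrieve` and `entail_score` are placeholder interfaces standing in for the real sentence retrieval and DGEM/Decompatt scoring components (hypothetical signatures, not the original code):

```python
def entailment_predict(question, answers, retrieve, entail_score):
    # Build one hypothesis per answer, retrieve supporting sentences
    # for it, and return the answer whose hypothesis is best entailed
    # by any retrieved sentence.
    best_answer, best_score = None, float("-inf")
    for answer in answers:
        hypothesis = question + " " + answer
        for sentence in retrieve(hypothesis):
            score = entail_score(sentence, hypothesis)
            if score > best_score:
                best_answer, best_score = answer, score
    return best_answer
```

Plugging in a dummy word-overlap `entail_score` is enough to exercise the control flow; the real systems replace it with a trained entailment model.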

4 Experiments

Metrics

We use accuracy and a points metric (used in the official exams): a right answer counts 3 points and a wrong one subtracts 1 point. Note that, as some exams have more choices than others, there is no direct correspondence between accuracy and points (a given healthcare area might have better accuracy than another one, but a worse points score).
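Both metrics are trivial to compute (our own helpers, not official code):

```python
def exam_points(n_right, n_wrong):
    # Official scoring: each right answer adds 3 points,
    # each wrong one subtracts 1.
    return 3 * n_right - n_wrong

def accuracy(n_right, n_total):
    return n_right / n_total

# With four options, uniform guessing breaks even in expectation:
# 3 * (1/4) - 1 * (3/4) = 0 points per question; with five options it
# loses points, which is why accuracy and points can rank areas differently.
```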

Results (unsupervised setting)

Tables 5 and 6 show the accuracy and points scores for both head-qa and head-qa-en. The cross-lingual ir model even obtains better performance than the Spanish one. This is another indicator that the translation is good enough to apply cross-lingual approaches. On the negative side, the approaches based on current neural architectures obtain lower performance.

Results (supervised setting)

We show in Tables 7 and 8 the performance of the top models on the test split corresponding to the supervised setting.

Discussion

Medicine questions (mir) are the hardest to answer across the board. We believe this is due to the greater length of both the questions and the answers (as shown in Table 3). This hypothesis is supported by the lower results on the nursing domain (eir), the category with the second longest questions and answers. On the contrary, the categories for which we obtained the best results, such as pharmacology (fir) or biology (bir), have shorter questions and answers. While the evaluated models surpass all control methods, their performance is still well behind human performance. We illustrate this in Table 9, comparing the performance (points score) of our best model against a summary of the human results on the 2016 exams (2016 was the annual examination for which we were able to find the most available information). Also, the best performing model was a non-machine-learning model based on standard information retrieval techniques. This reinforces the need for effective information extraction techniques that can later be used to perform complex reasoning with machine learning models.

5 Conclusion

We presented a complex multi-choice dataset containing questions about medicine, nursing, biology, pharmacology, psychology and chemistry. Such questions correspond to examinations to access specialized positions in the Spanish healthcare system, and require specialized knowledge and reasoning to be answered. To check its complexity, we then tested different state-of-the-art models for open-domain and multi-choice questions. We show how they struggle with the challenge, being clearly surpassed by a non-machine learning model based on information retrieval. We hope this work will encourage research on designing more powerful qa systems that can carry out effective information extraction and reasoning.

We also believe there is room for alternative challenges on head-qa. In this work we have used it as a closed qa dataset (the potential answers are used as input to determine the right one). Nothing prevents using the dataset in an open setting, where the system is given no clue about the possible answers. This would also require considering whether widely used metrics such as bleu (Papineni et al., 2002) or exact match are appropriate for this particular problem.

Acknowledgments

This work has received support from the TELEPARES-UDC project (FFI2014-51978-C2-2-R) and the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, from Xunta de Galicia (ED431B 2017/01), and from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150). We thank Mark Anderson for his help with translation fluency evaluation.

References

• Asma Ben Abacha and Demner-Fushman Dina. 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, volume 2016, page 310. American Medical Informatics Association.
• Asma Ben Abacha, Duy Dinh, and Yassine Mrabet. 2015. Semantic analysis and automatic corpus construction for entailment recognition in medical texts. In Conference on Artificial Intelligence in Medicine in Europe, pages 238–242. Springer.
• Davide Buscaldi and Paolo Rosso. 2006. Mining knowledge from Wikipedia for the question answering task. In Proceedings of the International Conference on Language Resources and Evaluation, pages 727–730.
• Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
• Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
• Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, pages 2580–2586.
• Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
• Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18).
• Eleni Kamateri, Theodora Tsikrika, Spyridon Symeonidis, Stefanos Vrochidis, Wolfgang Minker, and Yiannis Kompatsiaris. 2019. A test collection for passage retrieval evaluation of Spanish health-related resources. In European Conference on Information Retrieval, pages 148–154. Springer.
• Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015, Brussels, Belgium. Association for Computational Linguistics.
• Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4999–5007.
• Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Proceedings of AAAI.
• Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings on the Workshop on Statistical Machine Translation, pages 102–121, New York City. Association for Computational Linguistics.
• Souvik Kundu and Hwee Tou Ng. 2018. A nil-aware answer extraction framework for question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4243–4252, Brussels, Belgium. Association for Computational Linguistics.
• Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding Comprehension dataset from Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.
• Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Penas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. 2003. The multiple language question answering track at CLEF 2003. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 471–486. Springer.
• Anastasios Nentidis, Anastasia Krithara, Konstantinos Bougiatiotis, Georgios Paliouras, and Ioannis Kakadiaris. 2018. Results of the sixth edition of the BioASQ challenge. In Proceedings of the 6th BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering, pages 1–10. Association for Computational Linguistics.
• Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
• Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255. Association for Computational Linguistics.
• William Poundstone. 2014. Rock breaks scissors: a practical guide to outguessing and outwitting almost everybody. Hachette UK.
• Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
• Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
• Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
• José L Vicedo, Ruben Izquierdo, Fernando Llopis, and Rafael Munoz. 2003. Question answering in Spanish. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 541–548. Springer.
• Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.
• Caiming Xiong, Victor Zhong, and Richard Socher. 2017. DCN+: Mixed objective and deep residual coattention for question answering. arXiv preprint arXiv:1711.00106.
• Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Appendix A Appendices

Below we describe the fields of the json file used to represent head-qa.

{
  "version": 1.0,
  "language": ["es", "en"],
  "exams": A list of exams. Each exam has:
    "year": e.g. 2016.
    "category": "medicine", "biology", "nursing",
                "pharmacology", "chemistry" or "psychology".
    and a list of questions, each with:
    "qid": The question ID, extracted from the original
           pdf exam (usually between 1 and 235).
    "qtext": The text of the question.
    "ra": The ID of the right answer.
    and a list of answers, each with:
    "aid": The answer ID (1 to 5).
    "atext": The text of the answer.
}
