Claim Extraction in Biomedical Publications using Deep Discourse Model and Transfer Learning

Titipat Achakulvisut
Department of Bioengineering
University of Pennsylvania
Philadelphia, PA, USA

Chandra Bhagavatula
Allen Institute for Artificial Intelligence
Seattle, WA, USA

Daniel Acuna
School of Information Studies
Syracuse University
Syracuse, NY, USA

Konrad Kording
Department of Bioengineering
University of Pennsylvania
Philadelphia, PA, USA

Claims are a fundamental unit of scientific discourse. The exponential growth in the number of scientific publications makes automatic claim extraction an important problem for researchers who are overwhelmed by this information overload. An automated claim extraction system is useful for both manual and programmatic exploration of scientific knowledge. In this paper, we introduce an online claim extraction system and a dataset of 1,500 scientific abstracts from the biomedical domain, with expert annotations for each sentence indicating whether the sentence presents a scientific claim. We compare our proposed model with several baseline models, including rule-based and deep learning techniques. Our transfer learning approach with a fine-tuning step allows us to bootstrap from a large discourse-annotated dataset (Pubmed-RCT) and obtains an F1-score over 0.78 for claim detection while using a small annotated training set of 750 papers. We show that using this model pre-trained on the discourse prediction task improves the F1-score by over 14 absolute percentage points compared to a baseline model without discourse structure. We release publicly accessible tools for the discourse model and the claim detection model, along with an annotation tool. We discuss further applications beyond the biomedical literature.


 A Preprint

July 2, 2019

Keywords Biomedical Claim   Scientific Claim Extraction   Recurrent Neural Network   Transfer Learning

1 Introduction

Claims are a fundamental unit of scientific discourse. However, the exponential growth of publications has made it challenging for researchers to keep track of research papers introducing new claims. Automatic extraction of claims from scientific articles promises to alleviate some of these problems while also supporting other scientific processes such as knowledge exploration, efficient reading [8], and automated text summarization [26]. The recent launch of DARPA's Systematizing Confidence in Open Research and Evidence (SCORE) program, which targets automatic claim detection with confidence measurement, underscores the importance of the task in wider contexts.

Recently, Thorne et al. [27] released the FEVER dataset, which consists of factual claims from Wikipedia validated by crowd-workers. Annotating scientific claims, on the other hand, requires significant domain expertise, which makes it challenging to create a large dataset for training statistical models. Dernoncourt et al. [8] released the PubmedRCT discourse-tagging dataset, a large-scale dataset of sentences from 200k structured abstracts from Pubmed in which each sentence is labeled as coming from the background, introduction, method, result, or conclusion section. However, annotated datasets for the scientific domain remain limited, as presented in [6, 24, 19]. Since claim extraction and discourse tagging are related, there is an opportunity to apply techniques that exploit weak supervision and transfer learning [16].

In this paper, we introduce a scientific claim detection task with a dataset of 1,500 scientific papers with expertly annotated claims. This is the largest claim-annotated dataset specifically aimed at scientific documents. We build on a neural discourse tagging model based on a Bidirectional LSTM (Bi-LSTM) with a Conditional Random Field (CRF) [5] and transfer the representations to train a claim extraction model. Our fine-tuned model achieves an F1-score 47 percentage points higher than the rule-based method presented in previous research [24]. We show that pre-training the model on the Pubmed-RCT dataset allows our model to achieve an F1-score 14 percentage points higher than a model trained solely on the claim extraction dataset. We make the code for claim extraction, our dataset, and our annotation tool publicly available to the community.

Figure 1: Example of annotated claims for a paper in our dataset.

2 Dataset: An expertly annotated dataset of biomedical claims

We introduce a novel dataset of expertly annotated claims in biomedical paper abstracts. While there are multiple definitions of scientific claims proposed in previous literature, we follow previous definitions (e.g., [24]) to characterize a claim as a statement that either (1) declares something is better, (2) proposes something new, or (3) describes a new finding or a new cause-effect relationship. One abstract can have multiple claims.

Three annotators with biomedical domain expertise and fluency in English were selected for the task. The annotators were presented with instructions and a few examples before starting to annotate. The full instructions provided to experts on the annotation tool are detailed in Appendix A. An example of an annotated claim from an abstract is shown in Figure 1. The abstracts were sampled from the top 110 biomedical venues by number of abstracts in the MEDLINE database from 2008 to 2018, parsed using the Pubmed Parser library [28]. The dataset consists of 1,500 abstracts containing 11,702 sentences. The pairwise inter-annotator agreement (IAA) scores between the three annotators (Cohen's kappa, $\kappa$ [4, 2]) are 0.630, 0.575, and 0.678. The Fleiss's kappa between all annotators is 0.629. The final label for training the claim prediction model is computed as the majority vote between all three annotators, producing a total of 2,276 claim sentences. The relatively low IAA reflects the nature of the task, as previously reported by Lauscher et al. [18]. The distribution of the relative position of the claims identified through this process is shown in Figure 6 in Appendix A. We also compared the final labels with labels crowdsourced from annotators with no biomedical background and obtained $\kappa$ of 0.096. This shows that a great deal of background knowledge is required for the annotation, so annotating abstracts can be expensive because of this specialization.
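The agreement statistic and the majority-vote labeling described above can be illustrated with a small sketch. The label lists below are toy values invented for illustration (they are not taken from the dataset), and the function names are hypothetical; Fleiss's kappa is omitted for brevity.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both annotators label identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


def majority_vote(*annotations):
    """Per-sentence majority label across an odd number of annotators."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*annotations)]


# Toy labels (1 = claim, 0 = not a claim) for one six-sentence abstract.
ann_1 = [0, 0, 0, 1, 0, 1]
ann_2 = [0, 0, 0, 0, 0, 1]
ann_3 = [0, 1, 0, 1, 0, 1]

final_labels = majority_vote(ann_1, ann_2, ann_3)
kappa_12 = cohen_kappa(ann_1, ann_2)
```

With three annotators and binary labels, a majority always exists, so no tie-breaking rule is needed in this setting.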

3 The Discourse Prediction and Claim Extraction tasks

In this section, we introduce the discourse prediction task and the claim extraction task. Formally, an abstract is represented as a sequence of sentences $S = (s_1, \ldots, s_n)$. In the PubmedRCT dataset, each sentence $s_i$ is associated with a discourse type $d_i$, where $d_i \in$ {Objective, Introduction, Methods, Results, Conclusions}. The discourse prediction task is to predict the discourse types $(d_1, \ldots, d_n)$ for the sequence of sentences from a new abstract. The claim extraction task is to predict, for each sentence $s_i$ in a given sequence $S$, whether it is a claim or not (0/1 classification).

4 Discourse Prediction Model based on Structured Abstracts

To train a discourse prediction model, we use the PubmedRCT dataset [8], which contains discourse types associated with each sentence in 200,000 abstracts. We experiment with two main neural architectures: (i) a vanilla Bi-LSTM model and (ii) a Bi-LSTM with a CRF layer (Bi-LSTM CRF) as presented by Chen et al. [5]. The CRF layer takes the outputs at each timestep of the Bi-LSTM layer and uses them to jointly infer the most probable discourse types for all sentences in a given sequence. Transfer learning and fine-tuning are later applied to the pre-trained discourse models to train the claim extraction model.
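The joint inference performed by a linear-chain CRF layer is commonly carried out with Viterbi decoding. The following is a minimal sketch of that decoding step only, assuming per-sentence label scores (standing in for Bi-LSTM outputs) and a label-to-label transition score matrix. All numbers and the three-label set are made up for illustration, and this is not the AllenNLP implementation used in the paper.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions[t][k]: score of label k for sentence t (e.g. a Bi-LSTM output).
    transitions[j][k]: score of moving from label j to label k.
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            # Best previous label for ending at label k at position t.
            best_prev = max(range(n_labels), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emissions[t][k])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["Methods", "Results", "Conclusions"]  # illustrative subset
emissions = [[2.0, 1.0, 0.0], [0.5, 1.0, 1.2], [0.0, 2.0, 1.5]]
# Hypothetical transition scores that penalize moving "backwards" in an abstract.
transitions = [[0.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [-2.0, -2.0, 0.0]]

independent = [max(range(3), key=lambda k: row[k]) for row in emissions]
joint = viterbi(emissions, transitions)
```

In this toy case, labeling each sentence independently yields Methods, Conclusions, Results, whereas joint decoding yields Methods, Results, Results, because the transition scores discourage a Conclusions sentence being followed by a Results sentence.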

5 Claim Extraction Model

We experiment with the following three models for claim extraction.

5.1 Rule-based claim extraction

We implement a baseline model using the rule-based claim extraction algorithm presented by Sateli et al. [24]. It relies on part-of-speech patterns and a pre-defined set of keywords that signal claim statements. We re-implement the algorithm in Python with the spaCy library [15]; it achieves an F1-score similar to that reported in the original paper (Table 1).
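To give a flavor of how a keyword-trigger rule operates, here is a deliberately simplified sketch. The trigger phrases are hypothetical and are not the rule set of Sateli et al.; the part-of-speech patterns used by the original algorithm are omitted entirely.

```python
import re

# Hypothetical trigger phrases for illustration only.
CLAIM_TRIGGERS = [
    r"\bwe (show|demonstrate|propose|find|found|conclude)\b",
    r"\b(our|these) (results|findings|data) (show|suggest|indicate|demonstrate)\b",
    r"\b(outperforms?|better than|novel)\b",
]
PATTERN = re.compile("|".join(CLAIM_TRIGGERS), flags=re.IGNORECASE)


def is_claim_by_rule(sentence):
    """Flag a sentence as a claim if any trigger phrase occurs in it."""
    return PATTERN.search(sentence) is not None


examples = [
    "We show that transfer learning helps.",
    "Patients were recruited from three hospitals.",
    "These results suggest a novel mechanism.",
]
flags = [is_claim_by_rule(s) for s in examples]
```

A rule of this kind is easy to inspect but depends entirely on surface phrasing, which is one plausible reason such rules may not carry over between domains.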

5.2 Sentence embedding with discourse probability

We implement another baseline model using the sentence embedding technique presented by Arora et al. [1]. The sentence embedding is calculated as a weighted average of word embeddings, with weights that decrease with word frequency in MEDLINE abstracts. Then, the first principal component of the embeddings is subtracted to capture second-order information. Next, we concatenate the sentence embedding with the discourse probabilities calculated by the discourse prediction model. Finally, we apply regularized logistic regression to predict the probability of a claim.
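The embedding step can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions: the weight form a / (a + p(w)) and the default a = 1e-3 follow the general recipe of Arora et al., while the word vectors, word probabilities, and function names are toy values and hypothetical names rather than the MEDLINE statistics or code used in the paper. The concatenation with discourse probabilities and the logistic regression step are not shown.

```python
import math


def first_direction(rows, n_iter=100):
    """Top right singular vector of the row matrix, found by power iteration."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        # Multiply u by X^T X without forming the matrix explicitly.
        scores = [sum(r * c for r, c in zip(row, u)) for row in rows]
        nxt = [sum(s * row[i] for s, row in zip(scores, rows)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    return u


def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Frequency-weighted average vectors with the first principal direction removed."""
    dim = len(next(iter(word_vectors.values())))
    rows = []
    for tokens in sentences:
        known = [w for w in tokens if w in word_vectors]
        vec = [0.0] * dim
        for w in known:
            weight = a / (a + word_prob.get(w, 0.0))  # rarer words weigh more
            for i, x in enumerate(word_vectors[w]):
                vec[i] += weight * x
        rows.append([x / max(len(known), 1) for x in vec])
    u = first_direction(rows)
    out = []
    for row in rows:
        proj = sum(r * c for r, c in zip(row, u))
        out.append([r - proj * c for r, c in zip(row, u)])
    return out


# Toy 2-dimensional vectors and unigram probabilities (illustrative only).
word_vectors = {"tumor": [1.0, 0.0], "growth": [0.8, 0.6],
                "reduced": [0.0, 1.0], "the": [0.5, 0.5]}
word_prob = {"the": 0.05, "tumor": 0.001, "growth": 0.001, "reduced": 0.002}
sentences = [["the", "tumor", "growth"], ["the", "growth", "reduced"], ["tumor", "reduced"]]
embeddings = sif_embeddings(sentences, word_vectors, word_prob)
```

After the shared direction is removed, every sentence vector is orthogonal to it, so the remaining variation reflects what distinguishes sentences from each other rather than what they all have in common.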

5.3 Transfer Learning with fine-tuning from discourse prediction model

Figure 2: Schematic of the transfer learning with fine-tuning technique. The last layer of the discourse classification model is replaced by the claim extraction layer. Transfer learning and fine-tuning are applied to adapt the learned discourse classification structure.

First, the discourse Bi-LSTM and Bi-LSTM CRF models are trained on the PubMedRCT dataset using pre-trained word embeddings, selected from either GloVe word vectors [21] with 300 dimensions or PubMed pre-trained word vectors with 200 dimensions [20]. We train the discourse model with a batch size of 64, a reduce-on-plateau learning rate scheduler with a factor of 0.5, and the Adam optimizer [17] with a learning rate of 0.001 for all experiments. All models are implemented using the AllenNLP library [12]. Then, we apply transfer learning and fine-tuning to train the models on our expertly annotated dataset. The schematic of the training process can be found in Figure 2.
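The behavior of a reduce-on-plateau schedule with the stated initial learning rate (0.001) and factor (0.5) can be sketched as below. The `patience` value and the validation losses are assumptions made for illustration, since the text does not state them, and the class is a simplified stand-in rather than the scheduler from AllenNLP or PyTorch.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` when the metric stops improving."""

    def __init__(self, lr=0.001, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumption: not specified in the text
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = ReduceOnPlateau()
val_losses = [0.90, 0.70, 0.70, 0.71, 0.72, 0.60]  # made-up values
learning_rates = [scheduler.step(loss) for loss in val_losses]
```

Here the loss fails to improve for three consecutive epochs, which exceeds the assumed patience of two, so the learning rate is halved once.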

                                  Validation                   Test
Model                             Precision  Recall  F1-score  Precision  Recall  F1-score
Rule-based (Sateli et al.)        0.349      0.364   0.356     0.315      0.322   0.319
Last sentence as a claim          0.845      0.542   0.660     0.835      0.548   0.662
Sent Embedding                    0.605      0.641   0.623     0.624      0.674   0.648
Sent Embedding + Discourse        0.709      0.723   0.716     0.715      0.711   0.713
Bi-LSTM CRF only annotation data  0.778      0.521   0.624     0.701      0.609   0.652
Bi-LSTM CRF Conclusion as Claim   0.616      0.773   0.685     0.582      0.792   0.671
Transfer Learning (Pubmed)        0.735      0.723   0.729     0.723      0.730   0.727
Transfer Learning (GloVe)         0.738      0.765   0.751     0.762      0.729   0.745
Transfer Learning CRF (Pubmed)    0.840      0.764   0.800     0.887      0.685   0.773
Transfer Learning CRF (GloVe)     0.859      0.750   0.801     0.866      0.727   0.790

Table 1: Model evaluation on the claim annotation dataset. The Transfer Learning models start from the pre-trained discourse model for sentence prediction. The Transfer Learning CRF models start from the pre-trained discourse model with a CRF layer.

6 Evaluation and Results

We perform experiments using 50% of the corpus for training (750 articles), 25% for validation (375 articles), and 25% for testing (375 articles). We use precision, recall, and F1-score to evaluate the performance of the models (Table 1). We compare the performance of the rule-based model, predicting the last sentence as a claim, the sentence embedding model, the sentence embedding with a concatenation of discourse probabilities (Sent Embedding + Discourse), a Bi-LSTM with CRF layer trained only on the claim extraction dataset, a Bi-LSTM with CRF layer where we use the conclusion class as the true label for a claim, transfer learning from the vanilla Bi-LSTM discourse prediction model, and transfer learning from the Bi-LSTM with CRF layer.
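For reference, the sentence-level metrics and the "last sentence as a claim" baseline can be sketched as follows. The gold labels are toy values invented for illustration, and the function names are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for the positive (claim = 1) class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def last_sentence_baseline(n_sentences):
    """Predict only the final sentence of an abstract as a claim."""
    return [0] * (n_sentences - 1) + [1]


# Two toy abstracts with per-sentence gold labels (1 = claim).
gold_abstracts = [[0, 0, 1, 1], [0, 0, 0, 0, 1]]
gold = [label for abstract in gold_abstracts for label in abstract]
pred = [label for abstract in gold_abstracts
        for label in last_sentence_baseline(len(abstract))]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

On this toy input the baseline never produces a false positive but misses the claim that is not in the final position, mirroring the high-precision, lower-recall pattern that Table 1 reports for this baseline.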

The rule-based model presented by Sateli et al. [24] achieves an F1-score of 0.356 on the validation set and 0.319 on the test set, substantially lower than all other models. This suggests that the rule-based approach built for the CS domain does not transfer well to the biomedical domain. Predicting the last sentence as a claim gives high precision but relatively low recall, since claims can occur outside the last sentence. We observe that transfer learning with fine-tuning of the pre-trained Bi-LSTM CRF model from the discourse task achieves the best F1-score on the claim extraction task in the biomedical domain. The GloVe pre-trained word vectors give slightly better test performance than the Pubmed vectors. These results suggest that discourse information is highly relevant and improves performance substantially. The difference in F1-score between the best model without discourse and the model with pre-trained discourse is about 14 percentage points.

Error Analysis The best model correctly predicts all claims in the abstract for 60 percent of test articles. A further 31 percent of the articles have only one misclassified sentence. We now examine the remaining cases, where the model misclassified more than one sentence. We found two types of errors. In the first type, the model is unusually confident that the last sentence contains a claim. For example, for the article PMID: 25717311, the model wrongly predicts the sentence "Results are discussed in the context of developmental theories of visual processing and brain-based research on the role of visual skills in learning to read." as a claim when in reality it is just a result statement. In the second type, the model assigns low probability to claims in the middle of the abstract. For example, for the abstract PMID: 24806295, it fails to predict the following sentence as a claim: "… Our results also show that aMCI forms a boundary between normal aging and AD and represents a functional continuum between healthy aging and the earliest signs of dementia.", even though it correctly predicts the previous sentence as a claim. Investigating the cause of this pattern and making modifications to address these issues are areas of future work.

7 Related Work

Previous research has analyzed news [14, 23], social media [9], persuasive essays and scientific articles [25], and Wikipedia [11, 27] to extract claims and premises, a task called Argumentation Mining. However, the biomedical domain has received substantially less emphasis and has less data available than these other domains. In particular, claim extraction in the biomedical domain has been dominated by rule-based techniques [24] and classic machine learning techniques [19, 13], whereas modern Argumentation Mining relies on deep learning techniques such as those in [7, 22, 3], which show that learning from weak supervision can help improve several text prediction tasks.

Current datasets for claim extraction in the scientific domain are relatively small. Moreover, they are confined to specific fields such as computer science, computational linguistics, and chemistry. Liakata et al. [19] produced the CoreSC dataset containing 265 articles in physical chemistry and biochemistry. Fisas et al. [10] produced the Dr. Inventor dataset, which contains annotations of 40 computer graphics articles. Dasigi et al. [6] introduced a dataset of 75 articles from Pubmed for predicting discourse. Sateli and Witte [24] presented a dataset with annotations for claims and contributions using the full text of 30 articles from computer science domains. Biomedical literature may benefit from these previous developments, but it is unclear how well approaches from such domains translate to it.

8 Conclusion and Future work

Automatically extracting claims made in scientific articles is becoming increasingly important for information retrieval tasks. In this paper, we present a novel annotated dataset for claim extraction from the biomedical domain and several claim extraction models drawing on recent developments in weak supervision and transfer learning. We show that transfer learning helps with the claim extraction task. We also release the data, the code, and a web service demo of our work so that the community can improve upon it. Claim extraction is an important task to automate, and our work helps in this direction.

There are several improvements that we foresee being addressed in the future. Due to idiosyncrasies of the dataset, one problem is that our best model is slightly biased towards predicting the last few sentences of an abstract as claims. While this is mostly correct for biomedical papers, it might not match the structure of other fields, such as Computer Science or Social Science. Also, we are not certain why the word vectors learned from PubMed are no better than the GloVe vectors. Perhaps further specializing the word representations to the domain would help more. In the future, we will collect and annotate data for other domains, including Social Science and Computer Science, and improve our word vector representations.

Overall, given the substantial improvement we show here, we believe these models could be used in information extraction systems to support scholarly search engines such as Semantic Scholar or Pubmed. By making the dataset and code available to the community, we hope to invite other researchers to join the quest of analyzing scientific publications at scale.

9 Acknowledgement

This work was sponsored by the Systematizing Confidence in Open Research and Evidence (SCORE) project from the Defense Advanced Research Projects Agency (DARPA) and by the Allen Institute for Artificial Intelligence (Allen AI). We thank our annotators for their great effort.


  • [1] S. Arora, Y. Liang and T. Ma (2016) A simple but tough-to-beat baseline for sentence embeddings.
  • [2] R. Artstein and M. Poesio (2008) Inter-coder agreement for computational linguistics. Computational Linguistics 34 (4), pp. 555–596.
  • [3] I. Augenstein, S. Ruder and A. Søgaard (2018) Multi-task learning of pairwise sequence classification tasks over disparate label spaces. arXiv preprint arXiv:1802.09913.
  • [4] J. Carletta (1996) Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22, pp. 249–254.
  • [5] T. Chen, R. Xu, Y. He and X. Wang (2017) Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications 72, pp. 221–230.
  • [6] P. Dasigi, G. A. Burns, E. Hovy and A. de Waard (2017) Experiment segmentation in scientific discourse as clause-level structured prediction using recurrent neural networks. arXiv preprint arXiv:1702.05398.
  • [7] M. Dehghani, A. Severyn, S. Rothe and J. Kamps (2017) Learning to learn from weak supervision by full supervision. arXiv preprint arXiv:1711.11383.
  • [8] F. Dernoncourt and J. Y. Lee (2017) PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts. arXiv preprint arXiv:1710.06071.
  • [9] M. Dusmanu, E. Cabrio and S. Villata (2017) Argument mining on twitter: arguments, facts and sources. In EMNLP.
  • [10] B. Fisas, H. Saggion and F. Ronzano (2015) On the discoursive structure of computer graphics research papers. In Proceedings of The 9th Linguistic Annotation Workshop, pp. 42–51.
  • [11] D. Fréard, A. Denis, F. Détienne, M. Baker, M. Quignard and F. Barcellini (2010) The role of argumentation in online epistemic communities: the anatomy of a conflict in wikipedia. In Proceedings of the 28th Annual European Conference on Cognitive Ergonomics, pp. 91–98.
  • [12] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. Liu, M. Peters, M. Schmitz and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.
  • [13] Y. Guo, A. Korhonen and T. Poibeau (2011) A weakly-supervised approach to argumentative zoning of scientific documents. In EMNLP.
  • [14] I. Habernal, J. Eckle-Kohler and I. Gurevych (2014) Argumentation mining on the web from information seeking perspective. In ArgNLP.
  • [15] M. Honnibal and I. Montani (2017) SpaCy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear.
  • [16] J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 328–339.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [18] A. Lauscher, G. Glavaš and S. P. Ponzetto (2018) An argument-annotated corpus of scientific publications. In Proceedings of the 5th Workshop on Argument Mining, pp. 40–46.
  • [19] M. Liakata, S. Teufel, A. Siddharthan and C. R. Batchelor (2010) Corpora for the conceptualisation and zoning of scientific papers. In LREC.
  • [20] S. Moen and T. S. S. Ananiadou (2013) Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, pp. 39–43.
  • [21] J. Pennington, R. Socher and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • [22] A. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey and C. Ré (2018) Training complex models with multi-task weak supervision. CoRR abs/1810.02840.
  • [23] C. Sardianos, I. M. Katakis, G. Petasis and V. Karkaletsis (2015) Argument extraction from news. In ArgMining@HLT-NAACL.
  • [24] B. Sateli and R. Witte (2015) Semantic representation of scientific literature: bringing claims, contributions and named entities onto the linked open data cloud. PeerJ Computer Science 1, pp. e37.
  • [25] C. Stab, C. Kirschner, J. Eckle-Kohler and I. Gurevych (2014) Argumentation mining in persuasive essays and scientific articles from the discourse structure perspective. In ArgNLP, pp. 21–25.
  • [26] S. Teufel and M. Moens (2002) Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics 28 (4), pp. 409–445.
  • [27] J. Thorne, A. Vlachos, C. Christodoulopoulos and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.
  • [28] A. Titipat and D. Acuna (2015) Pubmed parser. GitHub.

Appendix A Appendix

We show an example screenshot of the output from the claim prediction tool in Figure 3. The first page of the annotation tool contains the description of the task. The second page contains the title and abstract of the sample publication to be annotated. The abstract is pre-split into sentences, which allows annotators to tag each sentence in the abstract. Screenshots of the annotation tool instructions and an annotation example are shown in Figures 4 and 5, respectively.

We report the distribution of the relative position of the final-label claims in Figure 6. Around 55.3 percent of the claims are located in the last sentence, and the rest are elsewhere in the abstract.

Figure 3: Screenshot of the output from the Discourse and Claim Prediction tool. A sample discourse and claim prediction for a CRISPR/Cas article published in Science (PMID: 23287718), using transfer learning of the Bi-LSTM CRF model with fine-tuning.
Figure 4: Screenshot of the instructions of the annotation tool. The annotators are presented with the task instructions, the definition of a claim, and examples of annotated documents before the task.
Figure 5: Screenshot of the annotation tool. The abstracts are sampled from the MEDLINE database, sentences are pre-split, and annotators can select which sentences are claims.
Figure 6: Distribution of the relative position of annotated claims in the dataset. Around 55.3 percent of the claims are located in the last sentence, and the rest are elsewhere in the abstract.