From POS tagging to dependency parsing for biomedical event extraction

Dat Quoc Nguyen    Karin Verspoor
School of Computing and Information Systems
The University of Melbourne, Australia
{dqnguyen, karin.verspoor}@unimelb.edu.au
Abstract

Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance.

In this paper, we perform an empirical study comparing state-of-the-art traditional feature-based and neural network-based models for two core NLP tasks of POS tagging and dependency parsing on two benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge, there is no recent work making such comparisons in the biomedical context; specifically, no detailed analysis of neural models on this data is available. We also perform a task-oriented evaluation to investigate the influence of these models on a downstream application: biomedical event extraction.

1 Introduction

The biomedical literature, as captured in the parallel repositories of PubMed (https://www.ncbi.nlm.nih.gov/pubmed/; abstracts) and PubMed Central (https://www.ncbi.nlm.nih.gov/pmc/; full text articles), is growing at a remarkable rate of over one million publications per year. Effort to catalog the key research results in these publications demands automation (Baumgartner et al., 2007). Hence, extraction of relations and events from the published literature has become a key focus of the biomedical natural language processing community.

Methods for information extraction typically make use of linguistic information, with a specific emphasis on the value of dependency parses. A number of linguistically-annotated resources, notably including the GENIA (Tateisi et al., 2005) and CRAFT (Verspoor et al., 2012) corpora, have been produced to support development and evaluation of natural language processing (NLP) tools over biomedical publications, based on the observation of the substantive differences between these domain texts and general English texts, as captured in resources such as the Penn Treebank (Marcus et al., 1993) that are standardly used for development and evaluation of syntactic processing tools. Recent work on biomedical relation extraction has highlighted the particular importance of syntactic information (Peng et al., 2017). Despite this, that work, and most other related work, has simply adopted a tool to analyze the syntactic characteristics of the biomedical texts without consideration of the appropriateness of the tool for these texts. A commonly used tool is the Stanford CoreNLP dependency parser (Chen and Manning, 2014), although domain-adapted parsers (e.g. McClosky and Charniak, 2008) are also sometimes used.

Prior work on the CRAFT treebank demonstrated substantial variation in the performance of syntactic processing tools for that data (Verspoor et al., 2012). Given the significant improvements in parsing performance in the last few years, thanks to renewed attention to the problem and exploration of neural methods, it is important to revisit whether the commonly used tools remain the best choices for syntactic analysis of biomedical texts. In this paper, we therefore investigate current state-of-the-art (SOTA) approaches to dependency parsing as applied to biomedical texts. We also present detailed results on the precursor task of part-of-speech (POS) tagging, since parsing depends heavily on POS tags. Finally, we study the impact of parser choice on biomedical event extraction, following the structure of the extrinsic parser evaluation shared task (EPE 2017) for biomedical event extraction (Björne et al., 2017). We find that differences in overall parser performance do not consistently explain differences in information extraction performance.

2 Experimental methodology

In this section, we present our empirical approach to evaluate different POS tagging and dependency parsing models on benchmark biomedical corpora.

2.1 Datasets

We use two biomedical corpora: GENIA (Tateisi et al., 2005) and CRAFT (Verspoor et al., 2012). GENIA includes abstracts from PubMed, while CRAFT includes full text publications. It has been observed that there are substantial linguistic differences between the abstracts and the corresponding full text publications (Cohen et al., 2010); hence it is important to consider both contexts when assessing BioNLP tools.

The GENIA corpus contains 18K sentences (486K words) from 1,999 Medline abstracts, which are manually annotated following the Penn Treebank (PTB) bracketing guidelines (Tateisi et al., 2005). On this treebank, we use the training, development and test split from McClosky (2010) (https://nlp.stanford.edu/~mcclosky/biomedical.html). We then use the Stanford constituent-to-dependency conversion toolkit (v3.5.1) to generate dependency trees with basic Stanford dependencies (de Marneffe and Manning, 2008). The OOV rate is 4.4% for both development and test sets.
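For reference, the conversion step can be reproduced with the converter class bundled in Stanford CoreNLP; the sketch below is a hedged example, assuming the CoreNLP 3.5.1 jar and a file of PTB-style bracketed GENIA trees are available locally (both file names are illustrative assumptions).

    # Hedged sketch: run the Stanford constituent-to-dependency converter
    # (basic dependencies, CoNLL-X output). Jar and file names are illustrative.
    import subprocess

    cmd = [
        "java", "-cp", "stanford-corenlp-3.5.1.jar",
        "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
        "-treeFile", "genia-test.mrg",   # PTB-style bracketed trees
        "-basic",                        # basic Stanford dependencies
        "-conllx",                       # emit CoNLL-X format
        "-keepPunct",                    # retain punctuation tokens
    ]
    with open("genia-test.conllx", "w") as out:
        subprocess.run(cmd, stdout=out, check=True)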

The CRAFT corpus includes 21K sentences (561K words) from 67 full-text biomedical journal articles (http://bionlp-corpora.sourceforge.net/CRAFT). These sentences are syntactically annotated using an extended PTB tag set. Given this extended set, the Stanford conversion toolkit is not suitable for generating dependency trees. Hence, a dependency treebank using the CoNLL 2008 dependencies (Surdeanu et al., 2008) was produced from the CRAFT treebank using ClearNLP (Choi and Palmer, 2012); we directly use this dependency treebank in our experiments. We use sentences from the first 6 files (PubMed IDs: 11532192–12585968) for development and sentences from the next 6 files (PubMed IDs: 12925238–15005800) for testing, while the remaining 55 files are used for training. The OOV rates are 6.6% and 6.3% for the development and test sets, respectively.
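The reported OOV rates are simply the proportion of development or test tokens whose word form does not occur in the training set; a minimal sketch of this computation is shown below (it assumes CoNLL-style files with one token per line and the word form in the second column; file names are illustrative).

    # Minimal sketch: OOV rate = share of dev/test word tokens unseen in the training data.
    # Assumes CoNLL-style files: tab-separated token lines, word form in column 2,
    # blank lines between sentences.
    def word_forms(conll_path):
        with open(conll_path, encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) > 1 and cols[0].isdigit():
                    yield cols[1]

    def oov_rate(train_path, eval_path):
        train_vocab = set(word_forms(train_path))
        eval_tokens = list(word_forms(eval_path))
        unseen = sum(1 for w in eval_tokens if w not in train_vocab)
        return 100.0 * unseen / len(eval_tokens)

    # e.g. oov_rate("craft-train.conll", "craft-test.conll")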

Table 1 gives an overview of the experimental datasets, while Table 2 details corpus statistics.

Dataset Training Dev. Test
GENIA 1701 / 15820 148 / 1361 150 / 1360
CRAFT 55 / 18644 6 / 1280 6 / 1786
Table 1: The number of files / sentences in each dataset.
GENIA dependency labels (%): advmod 2.3; amod 9.6; appos 1.2; aux 1.4; auxpass 1.5; cc 3.5; conj 3.9; dep 2.1; det 7.2; dobj 3.1; mark 1.1; nn 11.6; nsubj 4.1; nsubjpass 1.4; num 1.2; pobj 12.2; prep 12.3; punct 10.4; root 3.8
CRAFT dependency labels (%): ADV 4.0; AMOD 1.9; CONJ 3.6; COORD 3.2; DEP 1.0; LOC 1.7; NMOD 33.7; OBJ 2.8; P 18.4; PMOD 10.6; PRD 0.9; PRN 1.9; ROOT 3.9; SBJ 4.9; SUB 0.9; TMP 0.9; VC 2.4
POS tags (GENIA % / CRAFT %): CC 3.6/3.2; CD 1.6/4.0; DT 7.6/6.6; IN 12.9/11.3; JJ 10.1/7.6; NN 29.3/24.2; NNS 6.9/6.6; RB 2.5/2.4; TO 1.6/0.6; VB 1.1/1.1; VBD 2.1/2.2; VBG 1.0/1.1; VBN 3.1/3.8; VBP 1.4/1.1; VBZ 1.9/1.4
Sentence length (GENIA % / CRAFT %): 1-10 3.5/17.8; 11-20 31.0/23.1; 21-30 35.7/25.2; 31-40 19.4/17.5; 41-50 7.1/9.3; >50 3.3/7.1
Table 2: Statistics for the most frequent dependency labels, the POS tags common to both corpora, and sentence length (i.e. number of words in the sentence); occurrence proportions are given for GENIA and CRAFT, respectively. The proportions of dependency distances of 1, 2, 3, 4, 5 and >5 words between a dependent and its head are 43.1%, 20.0%, 10.7%, 6.1%, 3.7% and 16.4% on GENIA, and 48.2%, 18.4%, 9.0%, 5.5%, 3.5% and 15.5% on CRAFT, respectively.

2.2 POS tagging models

We compare SOTA feature-based and neural network-based models for POS tagging over both GENIA and CRAFT. We consider the following:

  • MarMoT is a well-known generic CRF framework as well as a leading POS and morphological tagger (Mueller et al., 2013) (http://cistern.cis.lmu.de/marmot).

  • NLP4J’s POS tagging model (NLP4J-POS) is a dynamic feature induction model that automatically optimizes feature combinations (Choi, 2016) (https://emorynlp.github.io/nlp4j/components/part-of-speech-tagging.html). NLP4J is the successor of ClearNLP.

  • BiLSTM-CRF is a sequence labeling model which extends a standard BiLSTM model with a CRF layer (Huang et al., 2015).

  • BiLSTM-CRF+CNN-char extends the BiLSTM-CRF model with character-level word embeddings. For each word token, its character-level word embedding is derived by applying a CNN to the word’s character sequence (Ma and Hovy, 2016).

  • BiLSTM-CRF+LSTM-char also extends the BiLSTM-CRF model with character-level word embeddings, which are derived by applying a BiLSTM to each word’s character sequence (Lample et al., 2016).

For the three BiLSTM-CRF-based sequence labeling models, we use a performance-optimized implementation (Reimers and Gurevych, 2017) (https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf). As detailed later in Section 3.1, we use NLP4J-POS to predict POS tags on the development and test sets and perform 20-way jackknifing (Koo et al., 2008) to generate POS tags on the training set for dependency parsing.
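The jackknifing procedure itself is straightforward: the training sentences are split into 20 folds, the tagger is trained on 19 folds and used to tag the held-out fold, so that every training sentence receives automatically predicted rather than gold POS tags. A minimal sketch is given below; train_tagger and tag are placeholders for the tagger's training and prediction calls, not part of any specific toolkit API.

    # Hedged sketch of k-way jackknifing (k=20) to produce predicted POS tags for the training set.
    # `train_tagger` and `tag` are placeholders for the tagger's train/predict functions.
    def jackknife_pos_tags(sentences, train_tagger, tag, k=20):
        fold_of = [i % k for i in range(len(sentences))]      # assign each sentence to a fold
        predicted = [None] * len(sentences)
        for fold in range(k):
            train_part = [s for s, f in zip(sentences, fold_of) if f != fold]
            held_out = [i for i, f in enumerate(fold_of) if f == fold]
            model = train_tagger(train_part)                  # train on the other k-1 folds
            tags = tag(model, [sentences[i] for i in held_out])
            for i, t in zip(held_out, tags):
                predicted[i] = t                              # predicted tags, aligned with input order
        return predicted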

2.3 Dependency parsers

Our second study assesses the performance of SOTA dependency parsers, as well as commonly used parsers, on biomedical texts. Prior work on the CRAFT treebank identified the domain-retrained ClearParser (Choi and Palmer, 2011), now part of the NLP4J toolkit (Choi et al., 2015), as a top-performing system for dependency parsing over that data. It remains the best performing non-neural model for dependency parsing. In particular, we compare the following parsers:

  • The Stanford neural network dependency parser (Stanford-NNdep) is a greedy transition-based parsing model which concatenates word, POS tag and arc label embeddings into a single vector, and then feeds this vector into a multi-layer perceptron with one hidden layer for transition classification (Chen and Manning, 2014) (https://nlp.stanford.edu/software/nndep.shtml).

  • NLP4J’s dependency parsing model (NLP4J-dep) is a transition-based parser with a selectional branching method that uses confidence estimates to decide when to employ a beam (Choi and McCallum, 2013) (https://emorynlp.github.io/nlp4j/components/dependency-parsing.html).

  • jPTDP v1 is a joint model for POS tagging and dependency parsing (https://github.com/datquocnguyen/jPTDP), which uses BiLSTMs to learn feature representations shared between POS tagging and dependency parsing (Nguyen et al., 2017). jPTDP can be viewed as an extension of the graph-based dependency parser bmstparser (Kiperwasser and Goldberg, 2016), replacing POS tag embeddings with LSTM-based character-level word embeddings. For jPTDP, we train with gold standard POS tags.

  • The Stanford “Biaffine” parser v1 extends bmstparser with biaffine classifiers to predict dependency arcs and labels (Dozat and Manning, 2017), obtaining the highest parsing result to date on the benchmark English PTB; a schematic of its arc scorer is given after this list. The Stanford biaffine parser v2 (Dozat et al., 2017) further extends v1 with LSTM-based character-level word embeddings, obtaining the best result in the CoNLL 2017 shared task on multilingual parsing (Zeman et al., 2017) (https://github.com/tdozat/Parser-v2). We use v2 in our experiments.
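For orientation, the core of the biaffine parser's arc scorer (following Dozat and Manning, 2017) can be summarized schematically as follows, omitting dimensions and the analogous label classifier:

    h_i^{(arc-dep)}  = MLP^{(arc-dep)}(r_i),    h_j^{(arc-head)} = MLP^{(arc-head)}(r_j)
    s_{i,j}^{(arc)}  = h_j^{(arc-head)T} U^{(arc)} h_i^{(arc-dep)} + h_j^{(arc-head)T} u^{(arc)}

where r_i is the BiLSTM output vector for word i, and the head of word i is predicted as the j maximizing s_{i,j}^{(arc)}.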

2.4 Implementation details

We use the training set to learn model parameters and tune the model hyper-parameters on the development set, then report final evaluation results on the test set. The metric for POS tagging is accuracy, while the metrics for dependency parsing are the labeled attachment score (LAS) and the unlabeled attachment score (UAS).
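Concretely, with gold and predicted (POS tag, head, label) triples aligned per token, all three metrics reduce to simple token-level counts; a minimal sketch is given below (handling of punctuation tokens is omitted).

    # Minimal sketch: tagging accuracy, UAS and LAS over aligned gold/predicted tokens.
    # Each token is a (pos, head_index, dep_label) tuple; punctuation filtering is omitted.
    def pos_accuracy(gold, pred):
        correct = sum(g[0] == p[0] for g, p in zip(gold, pred))
        return 100.0 * correct / len(gold)

    def attachment_scores(gold, pred):
        uas = sum(g[1] == p[1] for g, p in zip(gold, pred))
        las = sum(g[1] == p[1] and g[2] == p[2] for g, p in zip(gold, pred))
        return 100.0 * uas / len(gold), 100.0 * las / len(gold)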

For NLP4J-POS, the BiLSTM-CRF-based models, Stanford-NNdep, jPTDP and Stanford-Biaffine, which utilize pre-trained word embeddings, we employ 200-dimensional pre-trained word vectors from Chiu et al. (2016). These pre-trained vectors were obtained by training the Word2Vec skip-gram model (Mikolov et al., 2013) on a PubMed abstract corpus of 3 billion word tokens.
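Loading such vectors typically takes a single call; the sketch below uses gensim and assumes the vectors are stored locally in word2vec binary format (the file name is an illustrative assumption, not the distribution name used by Chiu et al. (2016)).

    # Hedged sketch: load 200-dimensional PubMed skip-gram vectors with gensim.
    # The file name is an illustrative assumption.
    from gensim.models import KeyedVectors

    word_vectors = KeyedVectors.load_word2vec_format("pubmed_skipgram_200d.bin", binary=True)
    if "protein" in word_vectors:
        vec = word_vectors["protein"]   # a 200-dimensional numpy array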

For traditional feature-based models MarMoT, NLP4J-POS and NLP4J-dep, we use their original pure Java implementations with default hyper-parameter settings.

For the BiLSTM-CRF-based models, we use the default hyper-parameters provided by Reimers and Gurevych (2017) with the following exceptions: for training, we use Nadam (Dozat, 2016) and run for 50 epochs. We perform a grid search over the number of BiLSTM layers and the number of LSTM units in each layer. Early stopping is applied when no performance improvement on the development set is obtained after 10 consecutive epochs.
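A schematic of this selection loop is sketched below; the candidate values are illustrative placeholders rather than the grids actually searched, and train_model/dev_score stand in for the toolkit's training and evaluation calls.

    # Schematic grid search with dev-set model selection; candidate values are placeholders.
    def select_bilstm_crf(train_model, dev_score, layer_candidates=(1, 2), unit_candidates=(100, 200)):
        best_config, best_score = None, float("-inf")
        for n_layers in layer_candidates:
            for n_units in unit_candidates:
                # max_epochs and patience mirror the settings above (50 epochs, early stopping after 10).
                model = train_model(n_layers=n_layers, n_units=n_units, max_epochs=50, patience=10)
                score = dev_score(model)
                if score > best_score:
                    best_config, best_score = (n_layers, n_units), score
        return best_config, best_score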

For Stanford-NNdep, we tune the wordCutOff threshold and the size of the hidden layer, and fix the other hyper-parameters at their default values.

For jPTDP, we use 50-dimensional character embeddings and fix the initial learning rate at 0.0005. We also fix the number of BiLSTM layers at 2 and tune the number of LSTM units in each layer. Other hyper-parameters are set to their default values.

For Stanford-Biaffine, we use the default hyper-parameter values (Dozat et al., 2017). These defaults can be considered near-optimal: they produced the highest scores on 57 test sets (including the English test sets) and the second-highest scores on 14 test sets, out of a total of 81 test sets across 45 languages, in the CoNLL 2017 shared task (Zeman et al., 2017).

3 Main results

3.1 POS tagging results

Table 3 presents POS tagging accuracy of each model on the test set, based on retraining of the POS tagging models on each biomedical corpus.

The penultimate row presents the result of the pre-trained Stanford tagging model english-bidirectional-distsim.tagger (Toutanova et al., 2003), trained on a larger corpus of about 40K sentences of English PTB WSJ text; given its newswire training data, it is unsurprising that this model produces lower accuracy than the retrained tagging models. The final row includes the published result of the GENIA POS tagger (Tsuruoka et al., 2005), trained on 90% of the GENIA corpus (cf. our 85% training set); it does not support a (re)training process.

Model GENIA CRAFT
MarMoT 98.61 97.07
jPTDP-v1 98.66 97.24
NLP4J-POS 98.80 97.43
BiLSTM-CRF 98.44 97.25
     + CNN-char 98.89 97.51
     + LSTM-char 98.85 97.56
Stanford tagger [*] 98.37 –
GENIA tagger [*] 98.49 –
Table 3: POS tagging accuracies on the test sets with gold tokenization. [*] denotes a result obtained with a pre-trained POS tagger.

In general, we find that the six retrained models produce competitive results. BiLSTM-CRF and MarMoT obtain the lowest scores on GENIA and CRAFT, respectively. jPTDP obtains a similar score to MarMoT on GENIA and a similar score to BiLSTM-CRF on CRAFT. In particular, MarMoT obtains accuracies of 98.61% and 97.07% on GENIA and CRAFT, which are about 0.2% and 0.4% absolute lower than NLP4J-POS, respectively. NLP4J-POS uses additional features based on Brown clusters (Brown et al., 1992) and pre-trained word vectors learned from a large external corpus, which provide useful extra information.

BiLSTM-CRF obtains accuracies of 98.44% on GENIA and 97.25% on CRAFT. Using character-level word embeddings yields about 0.5% and 0.3% absolute improvements over BiLSTM-CRF on GENIA and CRAFT, respectively, resulting in the highest accuracies on both experimental corpora. Note that for PTB, CNN-based character-level word embeddings (Ma and Hovy, 2016) provided only a 0.1% improvement over BiLSTM-CRF (Huang et al., 2015). The larger improvements on GENIA and CRAFT show that character-level word embeddings are particularly useful for capturing rare or unseen words in biomedical text. Character-level word embeddings are useful for morphologically rich languages (Plank et al., 2016; Nguyen et al., 2017), and although English is not morphologically rich, the biomedical domain contains a wide variety of morphological variants of domain-specific terminology (Liu et al., 2012). Words tagged incorrectly are largely associated with the gold tags NN, JJ and NNS; many are abbreviations, which are also out-of-vocabulary and typically difficult even for character-level word embeddings to capture.

On both GENIA and CRAFT, BiLSTM-CRF with character-level word embeddings obtains the highest accuracy scores, which are just 0.1% absolute higher than those of NLP4J-POS. In addition, we find that NLP4J-POS is about 30 times faster in training and testing. Hence, for the dependency parsing task, we use NLP4J-POS to perform 20-way jackknifing (Koo et al., 2008) to generate POS tags on the training data and to predict POS tags on the development and test sets.

System With punct. Without punct.
LAS UAS LAS UAS

GENIA

Pre-trained

BLLIP+Bio 88.38 89.92 88.76 90.49
Stanford-NNdep 86.79 88.13 87.43 88.91
Stanford-Biaffine-v1 84.72 87.89 84.94 88.45
Stanford-NNdep [*] 86.66 88.22 87.31 89.02
Stanford-Biaffine-v1 [*] 84.69 87.95 84.92 88.55

GENIA

Retrained

Stanford-Biaffine-v2 91.04 92.31 91.23 92.64
jPTDP-v1 90.01 91.46 90.27 91.89
NLP4J-dep 88.20 89.45 88.87 90.25
Stanford-NNdep 87.02 88.34 87.56 89.02

CRAFT

Retrained

Stanford-Biaffine-v2 90.41 92.02 90.77 92.67
jPTDP-v1 88.27 90.08 88.66 90.79
NLP4J-dep 86.98 88.85 87.62 89.80
Stanford-NNdep 84.76 86.64 85.59 87.81
Table 4: Parsing results with predicted POS tags and gold tokenization. "Without punct." refers to results computed excluding punctuation and other symbols. [*] denotes the use of the pre-trained Stanford tagger for predicting POS tags on the test set, instead of the retrained NLP4J-POS model. Score differences between the "retrained" parsers on both corpora are statistically significant according to McNemar's test.

3.2 Overall dependency parsing results

We present the LAS and UAS scores of the different parsing models in Table 4. The first five rows show parsing results on the GENIA test set for "pre-trained" constituent and dependency parsers. The first row shows scores of BLLIP+Bio, the BLLIP reranking constituent parser (Charniak and Johnson, 2005) with an improved self-trained biomedical parsing model (McClosky, 2010). We use the Stanford conversion toolkit (v3.5.1) to convert its output into dependency trees with the basic Stanford dependencies, and we use the same GENIA data split as McClosky (2010) used to train the improved self-trained biomedical parsing model, so parsing scores are comparable. Rows 2-3 present scores of the pre-trained Stanford NNdep and Biaffine v1 models with POS tags predicted by NLP4J-POS, while rows 4-5 present scores of these pre-trained models with POS tags predicted by the pre-trained Stanford tagger (Toutanova et al., 2003). Both pre-trained NNdep and Biaffine models were trained on a dependency treebank converted from the English PTB.

The remaining rows show parsing results of our retrained dependency parsing models. On GENIA, among the pre-trained models, BLLIP obtains the highest results. This model, unlike the other pre-trained models, was trained on GENIA, so this result is unsurprising. The pre-trained NNdep and Biaffine models show no significant performance differences irrespective of the source of POS tags.

Regarding the retrained models, on both GENIA and CRAFT, Stanford-Biaffine achieves the highest parsing results. Stanford-NNdep obtains the lowest scores; about 3.5% and 5% absolute lower than Stanford-Biaffine on GENIA and CRAFT, respectively. jPTDP is ranked second, obtaining about 1% and 2% lower scores than Stanford-Biaffine and 1.5% and 1% higher scores (without punctuation) than NLP4J-dep on GENIA and CRAFT, respectively.

Figure 1: LAS (F1) scores by sentence length and dependency distance.

3.3 Parsing result analysis

Here we present a detailed analysis of the parsing results obtained by the retrained models. For simplicity, the following more detailed analyses report LAS scores, computed without punctuation. Using UAS scores or computing with punctuation does not reveal any additional information.

3.3.1 Sentence length

Figure 1 presents LAS scores by sentence length in bins of 10 words. As expected, all parsers produce better results for shorter sentences on both corpora; longer sentences are likely to contain longer dependencies, which are typically harder to predict precisely. Scores drop by about 10% for sentences longer than 50 words, relative to shorter sentences. Exceptionally, on GENIA we find lower scores for the shortest sentences than for sentences of 11 to 20 words. This is probably because abstracts tend not to contain short sentences: (i) as shown in Table 2, the proportion of sentences in the first bin is very low at 3.5% on GENIA (cf. 17.8% on CRAFT), and (ii) sentences in the first bin on GENIA are relatively long, with an average length of 9 words (cf. 5 words in CRAFT).
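The binning behind Figure 1 can be reproduced with a few lines: group sentences into length bins of 10 words and compute LAS within each bin. A minimal sketch is shown below, assuming each sentence is a list of aligned (gold_head, gold_label, pred_head, pred_label) tuples.

    # Sketch: LAS per sentence-length bin (bins of 10 words, as in Figure 1).
    from collections import defaultdict

    def las_by_length(sentences, bin_size=10):
        correct, total = defaultdict(int), defaultdict(int)
        for sent in sentences:
            bin_id = (len(sent) - 1) // bin_size           # 0 -> 1-10 words, 1 -> 11-20 words, ...
            for gold_head, gold_label, pred_head, pred_label in sent:
                total[bin_id] += 1
                correct[bin_id] += (gold_head == pred_head and gold_label == pred_label)
        return {b: 100.0 * correct[b] / total[b] for b in sorted(total)}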

3.3.2 Dependency distance

Figure 1 also shows LAS (F1) scores corresponding to the dependency distance i - j between a dependent w_i and its head w_j, where i and j are the indices of these words in the sentence. Short dependencies are often modifiers of nouns, such as determiners, adjectives or pronouns modifying their direct neighbors, while longer dependencies typically represent modifiers of the root or the main verb (McDonald and Nivre, 2007). All parsers obtain higher scores for left dependencies than for right dependencies. This is not completely unexpected, as English is strongly head-initial. In addition, the gaps between the LSTM-based models (i.e. Stanford-Biaffine and jPTDP) and the non-LSTM models (i.e. NLP4J-dep and Stanford-NNdep) are larger for long dependencies than for shorter ones, as LSTM architectures can preserve long-range information (Graves, 2008).

Type Prop. LAS Prop. LAS Prop. LAS
advmod 7.0 93.92 4.2 90.91 4.7 87.10
amod 5.3 77.37 8.3 69.70 18.1 82.35
det 4.3 83.93 17.1 91.04 20.8 90.44
mark 15.1 95.63 10.9 93.02 6.3 95.12
nn 4.8 79.03 15.9 80.00 16.7 74.31
nsubj 28.2 93.79 18.7 93.88 15.3 96.00
nsubjpass 16.0 95.38 11.6 92.31 3.8 88.00
prep 12.2 94.27 6.6 96.15 2.6 88.24
Table 5: LAS (F1) scores of jPTDP on GENIA by frequent dependency labels, within three left dependency distance bins. "Prop." denotes the occurrence proportion of each label within a bin.

On both corpora, higher scores are also associated with shorter distances. There is one surprising exception: on GENIA, in some of the left dependency distance bins, Stanford-Biaffine and jPTDP obtain higher scores for longer distances. This may result from the structural characteristics of sentences in the GENIA corpus. Table 5 details the scores of jPTDP for the most frequent dependency labels in these left dependency bins. We find that amod and nn are the two most difficult dependency relations to predict. They appear much more frequently in two of these bins than in the third, explaining the higher overall score for that third bin.

3.3.3 Dependency label

Tables 6 and 7 present LAS scores for the most frequent dependency relation types on GENIA and CRAFT, respectively. In most cases, Stanford-Biaffine obtains the highest score for each relation type on both corpora, with the following exceptions: on GENIA, jPTDP gets the highest results for aux, dep and nn (as well as nsubjpass), while NLP4J-dep and NNdep obtain the highest scores for auxpass and num, respectively. On GENIA the labels associated with the highest average LAS scores are amod, aux, auxpass, det, dobj, mark, nsubj, nsubjpass, pobj and root, whereas on CRAFT they are NMOD, OBJ, PMOD, PRD, ROOT, SBJ, SUB and VC. These labels either correspond to short dependencies (e.g. aux, auxpass and VC), have strong lexical indications (e.g. det, pobj and PMOD), or occur very often (e.g. amod, nsubj, NMOD and SBJ).

The relation types with the lowest LAS scores are dep on GENIA and DEP, LOC, PRN and TMP on CRAFT; dep/DEP are very general labels, while LOC, PRN and TMP are among the least frequent labels. These types also show the largest variation in accuracy across parsers. In addition, the coordination-related labels cc, conj/CONJ and COORD show large variation across parsers. These nine relation labels generally correspond to long dependencies, so it is not surprising that the BiLSTM-based models Stanford-Biaffine and jPTDP produce much higher accuracies on these labels than the non-LSTM models NLP4J-dep and NNdep.

The remaining types are either relatively rare labels (e.g. appos, num and AMOD) or more frequent labels but with a varied distribution of dependency distances (e.g. advmod, nn, and ADV).

Type Biaffine jPTDP NLP4J NNdep Avg.
advmod 87.38 86.77 87.26 83.86 86.32
amod 92.41 92.21 90.59 90.94 91.54
appos 84.28 83.25 80.41 77.32 81.32
aux 98.74 99.28 98.92 97.66 98.65
auxpass 99.32 99.32 99.49 99.32 99.36
cc 89.90 86.38 82.21 79.33 84.46
conj 83.82 78.64 73.32 69.40 76.30
dep 40.49 41.72 40.04 31.66 38.48
det 97.16 96.68 95.46 95.54 96.21
dobj 96.49 95.87 94.90 92.18 94.86
mark 94.68 90.38 89.62 90.89 91.39
nn 90.07 90.25 88.22 88.97 89.38
nsubj 95.83 94.71 93.18 90.75 93.62
nsubjpass 95.56 95.56 92.05 90.94 93.53
num 89.14 85.97 90.05 90.27 88.86
pobj 97.04 96.54 96.54 95.13 96.31
prep 90.54 89.93 89.18 88.31 89.49
root 97.28 97.13 94.78 92.87 95.52
Table 6: LAS by the basic Stanford dependency labels on GENIA. “Avg.” denotes the averaged score of the four dependency parsers.
Type Biaffine jPTDP NLP4J NNdep Avg.
ADV 79.20 77.53 75.58 71.64 75.99
AMOD 86.43 83.45 85.00 82.98 84.47
CONJ 91.73 88.69 85.42 83.34 87.30
COORD 88.47 84.75 79.42 76.38 82.26
DEP 73.23 67.96 62.83 52.43 64.11
LOC 70.70 68.91 68.64 61.35 67.40
NMOD 92.55 91.19 90.77 90.04 91.14
OBJ 96.51 94.53 93.85 91.34 94.06
PMOD 96.30 94.85 94.52 93.44 94.78
PRD 93.96 90.11 92.49 90.66 91.81
PRN 62.11 61.30 49.26 46.96 54.91
ROOT 98.15 97.20 95.24 91.27 95.47
SBJ 95.87 93.03 91.82 90.11 92.71
SUB 95.18 91.81 91.81 89.64 92.11
TMP 78.76 68.81 65.71 59.73 68.25
VC 98.84 97.50 98.09 96.09 97.63
Table 7: LAS by the CoNLL 2008 dependency labels on CRAFT.

3.3.4 POS tag of the dependent

Table 8 analyzes the LAS scores by the most frequent POS tags (across the two corpora) of the dependent. Stanford-Biaffine achieves the highest scores on all these tags except TO, where the traditional feature-based model NLP4J-dep obtains the highest score (TO is a relatively rare tag in GENIA and, among the tags listed in Table 8, the least frequent in CRAFT). Among the listed tags, VBG is the least frequent in GENIA and the second least frequent in CRAFT, and it generally corresponds to longer dependency distances; it is therefore reasonable that the lowest scores on both corpora are obtained for VBG. The coordinating conjunction tag CC also often corresponds to long dependencies, resulting in the largest score ranges across parsers on both GENIA and CRAFT. The results for CC are consistent with those obtained for the dependency labels cc in Table 6 and COORD in Table 7, because these tags and labels are coupled to each other.

Type GENIA CRAFT
Biaffine jPTDP NLP4J NNdep Biaffine jPTDP NLP4J NNdep
CC 89.71 86.70 82.75 80.20 89.01 85.45 79.99 77.45
CD 81.83 79.30 79.78 79.30 88.03 85.17 84.22 79.77
DT 95.31 95.09 93.99 93.08 98.27 97.39 97.18 96.77
IN 90.57 89.50 88.41 87.58 81.79 79.32 78.43 75.97
JJ 90.17 89.35 88.30 87.76 94.24 92.91 92.50 91.70
NN 90.69 89.92 88.26 87.62 91.24 89.28 88.32 87.48
NNS 93.31 92.32 91.33 87.91 95.07 92.57 90.91 88.30
RB 88.31 86.92 87.73 84.61 84.41 81.98 82.13 76.99
TO 90.97 91.50 92.04 88.14 90.16 85.83 90.55 83.86
VB 89.68 87.84 85.09 83.49 98.86 98.86 98.67 96.38
VBD 94.60 93.85 90.97 90.34 94.74 93.21 90.03 86.86
VBG 82.67 79.47 79.20 72.27 85.51 81.33 81.15 75.57
VBN 91.42 90.53 88.02 85.51 93.22 91.24 90.25 88.04
VBP 94.46 93.88 92.54 90.63 93.54 91.18 88.98 84.09
VBZ 96.39 94.83 93.57 92.48 93.42 88.77 87.67 84.25
Table 8: LAS by POS tags.

On the remaining POS tags, we generally find similar patterns across parsers and corpora, with the exceptions of IN and VB: parsers produce 8+% higher scores for IN on GENIA than on CRAFT, and conversely 9+% lower scores for VB on GENIA than on CRAFT. This is because on GENIA, IN is coupled with the dependency label prep at a rate of 90% (so the corresponding LAS scores in Tables 8 and 6 are consistent), while on CRAFT IN is coupled with a more varied distribution of dependency labels, such as ADV at a rate of 20%, LOC at 14%, NMOD at 40% and TMP at 5%. Regarding VB, on CRAFT it is usually associated with a short dependency distance of 1 word (i.e. head and dependent words are next to each other) at a rate of 80%, and with a distance of 2 words at 15%, whereas on GENIA it is associated with longer dependency distances, at a rate of 17% for a distance of 1 word, 31% for a distance of 2 words and 34% for longer distances. Parsers therefore obtain much higher scores for VB on CRAFT than on GENIA.
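Such label distributions for a given dependent POS tag can be read directly off the treebank files; the sketch below assumes CoNLL-X-style columns (POS in column 5, dependency label in column 8) and should be adjusted if the file layout differs. The file name in the usage comment is illustrative.

    # Sketch: distribution of dependency labels for dependents carrying a given POS tag.
    # Assumes CoNLL-X-style columns (ID=1, FORM=2, POSTAG=5, HEAD=7, DEPREL=8); adjust if needed.
    from collections import Counter

    def label_distribution(conll_path, pos_tag):
        counts = Counter()
        with open(conll_path, encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) >= 8 and cols[0].isdigit() and cols[4] == pos_tag:
                    counts[cols[7]] += 1
        total = sum(counts.values())
        return {label: 100.0 * c / total for label, c in counts.most_common()}

    # e.g. label_distribution("genia-train.conllx", "IN")  # expected to be dominated by "prep" on GENIA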

ID Form Gold Prediction
POS H. DEP POS H. DEP
19 both CC 24 preconj CC 21 preconj
20 the DT 24 det DT 21 dep
21 POU(S) JJ 24 amod NN 18 pobj
22 and CC 21 cc CC 21 cc
23 POU(H) NN 21 conj NN 21 conj
24 domains NNS 18 pobj NNS 21 dep
23 the DT 26 det DT 27 det
24 Oct-1-responsive JJ 26 amod JJ 27 amod
25 octamer NN 26 nn NN 27 nn
26 sequence NN 22 pobj NN 27 nn
27 ATGCAAAT NN 26 dep NN 22 pobj
Table 9: Error examples. “H.” denotes the head of the current word.

3.3.5 Error analysis

We analyze parsing errors and find a few common error patterns across parsers. Table 9 presents two typical error types. The first is related to incorrect POS tag prediction. In the first example, the word token "domains" is the head of the phrase "both the POU(S) and POU(H) domains." There are also two OOV word tokens, "POU(S)" and "POU(H)", which abbreviate "POU-specific" and "POU homeodomain", respectively. NLP4J-POS (as well as all the other POS taggers) produced an incorrect tag of NN rather than adjective (JJ) for "POU(S)". As "POU(S)" is predicted to be a noun, all parsers incorrectly predict it to be the phrasal head.

The second error type occurs on noun phrases such as “the Oct-1-responsive octamer sequence ATGCAAAT” and “the herpes simplex virus Oct-1 coregulator VP16”, where the second to last noun (i.e. “sequence” and “coregulator”) is considered to be the phrasal head, rather than the last noun. However, such phrases are relatively rare and all parsers predict the last noun as the head.

4 Parser comparison on event extraction

We present an extrinsic evaluation of the four dependency parsers for the downstream task of biomedical event extraction.

4.1 Evaluation setup

Previously, Miwa et al. (2010) adopted the BioNLP 2009 shared task on biomedical event extraction (Kim et al., 2009) to compare the task-oriented performance of six "pre-trained" parsers with three different types of dependency representations. However, their evaluation setup requires the use of a currently unavailable event extraction system. Fortunately, the extrinsic parser evaluation (EPE 2017) shared task aimed to evaluate different dependency representations by comparing their performance on downstream tasks (Oepen et al., 2017), including a biomedical event extraction task (Björne et al., 2017). We thus follow the experimental setup used there, employing the Turku Event Extraction System (TEES; Björne et al., 2009) (https://github.com/jbjorne/TEES/wiki/EPE-2017) to assess the impact of parser differences on biomedical relation extraction.

EPE 2017 uses the BioNLP 2009 shared task dataset (Kim et al., 2009), which was derived from the GENIA treebank corpus (800, 150 and 260 abstract files used for BioNLP 2009 training, development and test, respectively; 678 of the 800 training, 132 of the 150 development and 248 of the 260 test files are included in the GENIA treebank training set). We only need to provide parses of the raw texts, using the pre-processed tokenized and sentence-segmented data provided by the EPE 2017 shared task. For the Stanford-Biaffine, NLP4J-dep and Stanford-NNdep parsers, which require predicted POS tags, we use the retrained NLP4J-POS model to generate POS tags. We then produce parses using the retrained dependency parsing models.

TEES is then trained for the BioNLP 2009 Task 1 using the training data, and is evaluated on the development data (gold event annotations are publicly available only for the training and development sets). To obtain test set performance, we use an online evaluation system: since the online evaluation system for the BioNLP 2009 shared task is currently not available, we employ the online evaluation system for the BioNLP 2011 shared task (Kim et al., 2011) (http://bionlp-st.dbcls.jp/GE/2011/eval-test/eval.cgi) with the "abstracts only" option. The score is reported using the approximate span & recursive evaluation strategy.

4.2 Impact of parsing on event extraction

Table 10 presents the intrinsic UAS and LAS (F1) scores on the pre-processed, segmented BioNLP 2009 development sentences (i.e. scores with respect to (w.r.t.) the predicted segmentation) that contain event interactions. These scores are higher than those presented in Table 4 because most of the BioNLP 2009 dataset is drawn from the GENIA treebank training set. Although gold event annotations for the BioNLP 2009 test set are not publicly available, it is likely that we would obtain similar intrinsic scores on the pre-segmented test sentences containing event interactions.

Table 11 compares parsers w.r.t. the EPE 2017 biomedical event extraction task. The first row presents the score of the Stanford&Paris team (Schuster et al., 2017), the highest official score obtained on the test set. Their system used the Stanford-Biaffine parser (v2) trained on a dataset combining PTB, Brown corpus, and GENIA treebank data. (The EPE 2017 shared task focused on evaluating different dependency representations in downstream tasks, not on comparing different parsers; each participating team therefore employed only one parser, either a dependency graph or a tree parser. Only the Stanford&Paris team (Schuster et al., 2017) employed GENIA data, obtaining the highest biomedical event extraction score.) The second row presents our score for the pre-trained BLLIP+Bio model; the remaining rows show scores using retrained parsing models.

Metric Biaffine jPTDP NLP4J NNdep
UAS 95.51 93.14 92.50 91.02
LAS 94.82 92.18 91.96 90.30
Table 10: UAS and LAS (F1) scores of re-trained models on the pre-segmented BioNLP-2009 development sentences which contain event interactions. Scores are computed on all tokens using the evaluation script from the CoNLL 2017 shared task (Zeman et al., 2017).
System Development Test
R P F R P F
Stanford&Paris 49.92 55.75 52.67 45.03 56.93 50.29
BLLIP+Bio 47.90 61.54 53.87 (52.35) 41.45 60.45 49.18 (49.19)

GENIA

Stanford-Biaffine-v2 50.53 56.47 53.34 (53.18) 43.87 56.36 49.34 (49.47)
jPTDP-v1 49.30 58.58 53.54 (52.08) 42.11 54.94 47.68 (48.88)
NLP4J-dep 51.93 55.15 53.49 (52.20) 45.88 55.53 50.25 (49.08)
Stanford-NNdep 46.79 60.36 52.71 (51.38) 40.16 59.75 48.04 (48.51)

CRAFT

Stanford-Biaffine-v2 49.47 57.98 53.39 (52.98) 42.08 58.65 49.00 (49.84)
jPTDP-v1 49.36 58.22 53.42 (52.01) 40.82 58.57 48.11 (49.57)
NLP4J-dep 48.91 53.13 50.93 (51.03) 41.95 51.88 46.39 (47.46)
Stanford-NNdep 46.34 56.83 51.05 (51.01) 38.87 59.64 47.07 (46.38)
Table 11: Biomedical event extraction results. F-scores in parentheses denote results for which TEES is trained without the dependency labels.

The results for parsers trained with the GENIA treebank (rows 1-6, Table 11) are generally higher than for parsers trained on CRAFT. This is logical because the BioNLP 2009 shared task dataset is a subset of the GENIA corpus. However, we find that the differences in intrinsic parsing results presented in Tables 4 and 10 do not consistently explain the differences in extrinsic biomedical event extraction performance, extending preliminary related observations in prior work (MacKinlay et al., 2013). Among the four dependency parsers trained on GENIA, Stanford-Biaffine, jPTDP and NLP4J-dep produce similar event extraction scores on the development set, while on the test set jPTDP and NLP4J-dep obtain the lowest and highest scores, respectively.

Table 11 also summarizes the results with the dependency structures only (i.e. results without dependency relation labels, obtained by replacing all predicted dependency labels with "UNK" before training TEES). In most cases, compared to using dependency labels, event extraction scores drop on the development set (except for NLP4J-dep trained on CRAFT), while they increase on the test set (except for NLP4J-dep trained on GENIA and Stanford-NNdep trained on CRAFT). Without dependency labels, better event extraction scores on the development set correspond to better scores on the test set. In addition, the differences in these event extraction scores without dependency labels are more consistent with the parsing performance differences than the scores with dependency labels.
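The unlabeled setting requires only a trivial rewrite of the parses before TEES training: every predicted dependency label is replaced by a single placeholder. A minimal sketch over CoNLL-X-style files (DEPREL in column 8) is shown below; the actual EPE 2017 interchange format differs, so this is an illustrative stand-in only.

    # Sketch: replace all dependency labels with "UNK" in a CoNLL-X-style file.
    # The EPE 2017 interchange format differs; this is an illustrative stand-in.
    def strip_labels(in_path, out_path):
        with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                cols = line.rstrip("\n").split("\t")
                if len(cols) >= 8 and cols[0].isdigit():
                    cols[7] = "UNK"                      # overwrite the DEPREL column
                fout.write("\t".join(cols) + "\n")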

These findings show that variations in dependency representations strongly affect event extraction performance. Some (predicted) dependency labels are likely to be particularly useful for extracting events, while others hurt performance. We leave analyzing those factors to future work.

5 Conclusion

We have presented a detailed empirical study comparing SOTA traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context (the retrained models are available at https://github.com/datquocnguyen/BioNLP). In general, the neural models outperform the feature-based models on the two benchmark biomedical corpora, GENIA and CRAFT. In particular, the BiLSTM-CRF-based models with character-level word embeddings produce the highest POS tagging accuracies, slightly better than NLP4J-POS, while the Stanford-Biaffine parsing model obtains significantly better results than the other parsing models.

We also investigate the influence of parser selection for a biomedical event extraction downstream task, and show that better intrinsic parsing performance does not always imply better extrinsic event extraction performance. Whether this pattern holds for other information extraction tasks is left as future work.

Acknowledgment

This work was supported by the ARC Discovery Project DP150101550 and ARC Linkage Project LP160101469.

References

  • Baumgartner et al. (2007) William Baumgartner, K. Cohen, Lynne Fox, George Acquaah-Mensah, and Lawrence Hunter. 2007. Manual curation is not sufficient for annotation of genomic databases. Bioinformatics, 23(13):i41–i48.
  • Björne et al. (2017) Jari Björne, Filip Ginter, and Tapio Salakoski. 2017. EPE 2017: The Biomedical Event Extraction Downstream Application. In Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation, pages 17–24.
  • Björne et al. (2009) Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, and Tapio Salakoski. 2009. Extracting complex biological events with rich graph-based feature sets. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pages 10–18.
  • Brown et al. (1992) Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based N-gram Models of Natural Language. Comput. Linguist., 18(4):467–479.
  • Charniak and Johnson (2005) Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 173–180.
  • Chen and Manning (2014) Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 740–750.
  • Chiu et al. (2016) Billy Chiu, Gamal Crichton, Anna Korhonen, and Sampo Pyysalo. 2016. How to Train good Word Embeddings for Biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 166–174.
  • Choi (2016) Jinho D. Choi. 2016. Dynamic Feature Induction: The Last Gist to the State-of-the-Art. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 271–281.
  • Choi and McCallum (2013) Jinho D. Choi and Andrew McCallum. 2013. Transition-based Dependency Parsing with Selectional Branching. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1052–1062.
  • Choi and Palmer (2011) Jinho D. Choi and Martha Palmer. 2011. Getting the most out of transition-based dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 687–692.
  • Choi and Palmer (2012) Jinho D. Choi and Martha Palmer. 2012. Guidelines for the CLEAR Style Constituent to Dependency Conversion. Technical report, Institute of Cognitive Science, University of Colorado Boulder.
  • Choi et al. (2015) Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 387–396.
  • Cohen et al. (2010) K.B. Cohen, H. Johnson, K. Verspoor, C. Roeder, and L. Hunter. 2010. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics, 11(1):492.
  • Dozat (2016) Timothy Dozat. 2016. Incorporating Nesterov Momentum into Adam. In Proceedings of the ICLR 2016 Workshop Track.
  • Dozat and Manning (2017) Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of the 5th International Conference on Learning Representations.
  • Dozat et al. (2017) Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford’s Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30.
  • Graves (2008) Alex Graves. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University Munich.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint, arXiv:1508.01991.
  • Kim et al. (2009) Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. 2009. Overview of BioNLP’09 Shared Task on Event Extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pages 1–9.
  • Kim et al. (2011) Jin-Dong Kim, Sampo Pyysalo, Tomoko Ohta, Robert Bossy, Ngan Nguyen, and Jun’ichi Tsujii. 2011. Overview of BioNLP Shared Task 2011. In Proceedings of BioNLP Shared Task 2011 Workshop, pages 1–6.
  • Kiperwasser and Goldberg (2016) Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Transactions of the Association for Computational Linguistics, 4:313–327.
  • Koo et al. (2008) Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple Semi-supervised Dependency Parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 595–603.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270.
  • Liu et al. (2012) Haibin Liu, Tom Christiansen, William A. Baumgartner, and Karin Verspoor. 2012. BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics, 3(1):3.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.
  • MacKinlay et al. (2013) Andrew MacKinlay, David Martinez, Antonio Jimeno Yepes, Haibin Liu, W John Wilbur, and Karin Verspoor. 2013. Extracting biomedical events and modifications using subgraph matching with noisy training data. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 35–44.
  • Marcus et al. (1993) Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • de Marneffe and Manning (2008) Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford Typed Dependencies Representation. In Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8.
  • McClosky (2010) David McClosky. 2010. Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. Ph.D. thesis, Department of Computer Science, Brown University.
  • McClosky and Charniak (2008) David McClosky and Eugene Charniak. 2008. Self-training for biomedical parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 101–104.
  • McDonald and Nivre (2007) Ryan McDonald and Joakim Nivre. 2007. Characterizing the Errors of Data-Driven Dependency Parsing Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 122–131.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
  • Miwa et al. (2010) Makoto Miwa, Sampo Pyysalo, Tadayoshi Hara, and Jun’ichi Tsujii. 2010. Evaluating Dependency Representations for Event Extraction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 779–787.
  • Mueller et al. (2013) Thomas Mueller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient Higher-Order CRFs for Morphological Tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332.
  • Nguyen et al. (2017) Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2017. A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 134–142.
  • Oepen et al. (2017) Stephan Oepen, Lilja Ovrelid, Jari Björne, Richard Johansson, Emanuele Lapponi, Filip Ginter, and Erik Velldal. 2017. The 2017 Shared Task on Extrinsic Parser Evaluation Towards a Reusable Community Infrastructure. In Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation, pages 1–16.
  • Peng et al. (2017) Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-Sentence N-ary Relation Extraction with Graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.
  • Plank et al. (2016) Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418.
  • Reimers and Gurevych (2017) Nils Reimers and Iryna Gurevych. 2017. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348.
  • Schuster et al. (2017) Sebastian Schuster, Eric De La Clergerie, Marie Candito, Benoit Sagot, Christopher D. Manning, and Djamé Seddah. 2017. Paris and Stanford at EPE 2017: Downstream Evaluation of Graph-based Dependency Representations. In Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation, pages 47–59.
  • Surdeanu et al. (2008) Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL 2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 159–177.
  • Tateisi et al. (2005) Yuka Tateisi, Akane Yakushiji, Tomoko Ohta, and Jun’ichi Tsujii. 2005. Syntax Annotation for the GENIA Corpus. In Proceedings of the Second International Joint Conference on Natural Language Processing: Companion Volume, pages 220–225.
  • Toutanova et al. (2003) Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180.
  • Tsuruoka et al. (2005) Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun’ichi Tsujii. 2005. Developing a robust part-of-speech tagger for biomedical text. In Advances in Informatics, pages 382–392.
  • Verspoor et al. (2012) Karin Verspoor, Kevin Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L. Johnson, Christophe Roeder, Jinho D. Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A. Baumgartner, Michael Bada, Martha Palmer, and Lawrence E. Hunter. 2012. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics, 13(1):207.
  • Zeman et al. (2017) Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, et al. 2017. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19.