Joint Extraction of Entities and Relations Based on a Novel Decomposition Strategy
Joint extraction of entities and relations aims to detect entity pairs along with their relations using a single model. Prior works typically solve this task in an extract-then-classify or unified labeling manner. However, these methods either suffer from redundant entity pairs or ignore the important inner structure in the process of extracting entities and relations. To address these limitations, we first decompose the joint extraction task into two interrelated subtasks, namely HE extraction and TER extraction. The former subtask distinguishes all head-entities that may be involved with target relations, and the latter identifies the corresponding tail-entities and relations for each extracted head-entity. Next, these two subtasks are further deconstructed into several sequence labeling problems based on our proposed span-based tagging scheme, which are conveniently solved by a hierarchical boundary tagger and a multi-span decoding algorithm. Owing to this decomposition strategy, our model can fully capture the semantic interdependency between different steps and reduce noise from irrelevant entity pairs. Experimental results show that our method outperforms previous work by 5.6%, 17.2% and 3.7% (F1 score), achieving a new state-of-the-art on three public datasets.
Extracting pairs of entities with relations from unstructured text is an essential step in automatic knowledge base construction, and an ideal extraction system should be capable of extracting overlapping relations (i.e., multiple relations that share a common entity) [19]. Traditional pipelined approaches first recognize entities, then choose a relation for every possible pair of extracted entities. Such a framework makes the task easy to conduct but ignores the underlying interactions between these two subtasks [8]. One improvement is to train them jointly by parameter sharing [10, 3, 15]. Although they show promising results, these extract-then-classify approaches still require explicit separate components for entity extraction and relation classification. As a result, their relation classifiers may be misled by redundant entity pairs [16, 2], since $N$ entities lead to roughly $N^2$ pairs, most of which belong to the NA (non-relation) class.
Rather than extracting entities and relations separately, Zheng et al. (2017) propose a unified labeling scheme that models the triplets directly with a kind of multi-part tag. Nevertheless, this model is not well suited to identifying overlapping relations. As an improvement, Dai et al. (2019) present PA-LSTM, which directly labels entities and relations according to query positions and achieves state-of-the-art results. However, according to our empirical study, methods of this kind ignore the inner structure, such as the dependencies among the head-entity, tail-entity and relation, because of the unified labeling-once process. A tail-entity and a relation should depend on a specific head-entity. In other words, if a model does not fully perceive the semantics of the head-entity, its extraction of the corresponding tail-entities and relations will be unreliable. In addition, to recognize overlapping relations, PA-LSTM has to conduct $n$ labeling-once processes for an $n$-word sentence, which makes it time-consuming and difficult to deploy.
For a complex NLP task, it is very common to decompose the task into different modules or processes, and a reasonable design is crucial for further progress [9, 20, 6]. Thus, in this paper, through analysis of the two kinds of methods above, we exploit the inner structure of joint extraction and propose a novel decomposition strategy, in which the task is decomposed hierarchically into several sequence labeling problems with partial labels capturing different aspects of the final task (see Figure 1). Starting with a sentence, we first distinguish all the candidate head-entities that may be involved with target relations, then label the corresponding tail-entities and relations for each extracted head-entity. We call the former subtask Head-Entity (HE) extraction and the latter Tail-Entity and Relation (TER) extraction. Such an extract-then-label (ETL) paradigm can be understood as decomposing the joint probability of triplet extraction into the conditional probabilities $p(h, r, t \mid S) = p(h \mid S)\, p(r, t \mid h, S)$, where $(h, r, t)$ is a triplet in sentence $S$. In this manner, our TER extractor is able to take the semantic and position information of the given head-entity into account when tagging tail-entities and relations, and naturally, one head-entity can interact with multiple tail-entities to form overlapping relations.
Next, inspired by extractive question answering, which identifies an answer span by predicting its start and end indices [14], we further decompose HE and TER extraction with a span-based tagging scheme. Specifically, for HE extraction, the entity type is labeled at the start and end positions of each head-entity. For TER extraction, we annotate the relation types at the start and end positions of all the tail-entities that have a relationship to a given head-entity. To enhance the association between boundary positions, we present a hierarchical boundary tagger, which labels the start and end positions separately in a cascade structure and decodes them together with a multi-span decoding algorithm. By this means, HE and TER extraction can be modeled in a unified span-based extraction framework, differentiated only by their prior knowledge and output label set. Overall, for a sentence with $m$ head-entities, the entire task is deconstructed into $2 + 2m$ sequence labeling subtasks, the first 2 for HE tagging and the other $2m$ for TER. Intuitively, the individual subtasks are significantly easier to learn, suggesting that, by being trained cooperatively with shared underlying representations, they can constrain the learning problem and achieve a better overall outcome.
We conduct experiments on three public datasets: NYT-single, NYT-multi and Wiki-KBP. The results show that our approach significantly outperforms previous work on both normal and overlapping relation extraction, increasing the SOTA F1 score on the three datasets to 59.0% (+5.6), 79.1% (+17.2) and 48.1% (+3.7), respectively.
In this section, we first introduce our tagging scheme, based on which the joint extraction task is transformed to several sequence labeling problems. Then we detail the hierarchical boundary tagger, which is the basic labeling module in our method. Finally, we move on to the entire extraction system.
Let us consider the head-entity (HE) extraction first. As discussed in the previous section, it is decomposed into two sequence labeling subtasks. The first sequence labeling subtask mainly focuses on identifying the start position of one head-entity. One token is labeled as the corresponding entity type if it is the start word, otherwise it is assigned the label “O” (Outside). In contrast, the second subtask aims to identify the end position of one head-entity and has a similar labeling process except the entity type is labeled for the token which is the end word.
For each identified head-entity, the tail-entity and relation (TER) extractor is also decomposed into two sequence labeling subtasks, which make use of span boundaries to extract tail-entities and predict relations simultaneously. The first sequence labeling subtask labels the relation type on the token that is the start word of the tail-entity, while the second subtask tags the end word of the tail-entity.
In Figure 1, we illustrate an example to demonstrate our tagging scheme. Based on the scheme, the words “Trump”, “United”, “States”, “New”, “City” and “Queens” are all related to the extracted results, so they receive our special tags. For example, the word “Trump” is both the first and the last word of the entity “Trump”, so its tag is PERSON in both the start and end tag sequences when tagging HE. For TER extraction, when the given head-entity is “Trump”, there are two tail-entities involved in a wanted relation, i.e., (“Trump”, President_Of, “United States”) and (“Trump”, Born_In, “New York City”), so “United” and “New” are labeled as President_Of and Born_In respectively in the start tag sequence. Similarly, in the end tag sequence, “States” and “City” are marked. All other words, which are irrelevant to the final result, are labeled as “O”.
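To make the tagging scheme concrete, the following minimal sketch builds the start and end tag sequences for the “Trump” triplets described above. The token list is a hypothetical stand-in for the sentence in Figure 1 (the exact sentence is not reproduced here), and the helper name `span_tags` is an illustrative assumption rather than part of the original system.

```python
from typing import List, Tuple

def span_tags(n: int, spans: List[Tuple[int, int, str]]) -> Tuple[List[str], List[str]]:
    """Build start/end tag sequences of length n.

    Each span is (start_index, end_index, label) with an inclusive end index.
    Tokens that are not a span boundary keep the "O" (Outside) tag.
    """
    sta = ["O"] * n
    end = ["O"] * n
    for s, e, label in spans:
        sta[s] = label  # label goes on the first word of the span
        end[e] = label  # and on the last word of the span
    return sta, end

# Hypothetical tokenization; only the entity words are taken from the text.
tokens = ["Trump", "was", "born", "in", "New", "York", "City",
          "and", "leads", "the", "United", "States"]

# HE extraction: "Trump" is a single-word PERSON entity.
he_sta, he_end = span_tags(len(tokens), [(0, 0, "PERSON")])

# TER extraction conditioned on the head-entity "Trump".
ter_sta, ter_end = span_tags(
    len(tokens),
    [(4, 6, "Born_In"), (10, 11, "President_Of")],
)

for tok, s, e in zip(tokens, ter_sta, ter_end):
    print(f"{tok:8s} start={s:13s} end={e}")
```

Note that a multi-word tail-entity such as “New York City” needs only two labeled positions, which is why no “BIES”-style position tags are required.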
Note that our tagging scheme is quite different from PA-LSTM [2]. For an $n$-word sentence, PA-LSTM builds $n$ different tag sequences according to $n$ different query positions, while our model tags the same sentence $2 + 2m$ times to recognize all overlapping relations, where $m$ is the number of head-entities and $m \ll n$. This means our model is more efficient. Besides, PA-LSTM uses “BIES” signs to indicate the position of tokens in the entity, while we only predict the start and end positions without losing the ability to extract multi-word entity mentions.
Hierarchical Boundary Tagger
According to our tagging scheme, we utilize a unified architecture to extract HE and TER. In this paper, we wrap this extractor into a general module named the hierarchical boundary tagger (abbreviated as HBT). For the sake of generality, we do not distinguish between head-entities and tail-entities, and they are collectively referred to as targets in this subsection. Formally, the probability of extracting a target $e$ with label $l$ (entity type for a head-entity or relation type for a tail-entity) from sentence $S$ is modeled as:
$$p(e, l \mid S) = p(s_e^l \mid S)\, p(e_e^l \mid s_e^l, S) \tag{1}$$
where $s_e^l$ is the start index of $e$ with label $l$ and $e_e^l$ is the end index. This decomposition indicates that there is a natural order among the tasks: predicting end positions may benefit from the prediction results of start positions, which motivates us to employ a hierarchical tagging structure. As shown in the right panel of Figure 2, we associate each layer with one task and take the tagging results as well as the hidden states from the low-level task as input to the high-level task. In this work, we choose BiLSTM [5] as the base encoder. Formally, the label of word $x_i$ when tagging the start position is predicted as Eq. 4.
$$h_i^{sta} = \mathrm{BiLSTM}_{sta}([h_i; a_i]) \tag{2}$$
$$P(y_i^{sta}) = \mathrm{Softmax}(W^{sta} h_i^{sta} + b^{sta}) \tag{3}$$
$$\mathrm{sta\_tag}(x_i) = \arg\max_k P(y_i^{sta} = k) \tag{4}$$
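where $h_i$ is an input token representation and $a_i$ is an input auxiliary information vector. When extracting head-entities, $a_i$ is a global representation learned from the entire sentence, which helps the model make more accurate predictions from a global perspective. For TER, $a_i$ is the concatenation of the global representation with a head-entity-related vector that indicates the position and semantic information of the given head-entity. Here we adopt $\mathrm{BiLSTM}_{sta}$ to fuse $h_i$ with $a_i$ into a single vector $h_i^{sta}$. Analogously, the end tag of $x_i$ can be calculated by Eq. 7.
$$h_i^{end} = \mathrm{BiLSTM}_{end}([h_i^{sta}; a_i; p_i^{se}]) \tag{5}$$
$$P(y_i^{end}) = \mathrm{Softmax}(W^{end} h_i^{end} + b^{end}) \tag{6}$$
$$\mathrm{end\_tag}(x_i) = \arg\max_k P(y_i^{end} = k) \tag{7}$$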
The difference between Eq. 2-4 and Eq. 5-7 is twofold. Firstly, we replace $h_i$ in Eq. 2 with $h_i^{sta}$ to make the model aware of the hidden states of start positions when predicting end positions. Secondly, inspired by the position encoding vectors used in Zeng et al. (2014), we feed the position embedding $p_i^{se}$ to the $\mathrm{BiLSTM}_{end}$ layer as an additional input. $p_i^{se}$ can be obtained by looking up $d_i^s$ in a trainable position embedding matrix, where
$$d_i^s = \begin{cases} i - s^*, & \text{if } s^* \text{ exists} \\ C, & \text{otherwise} \end{cases} \tag{8}$$
Here $s^*$ is the nearest start position before the current index, and $d_i^s$ is the relative distance between $x_i$ and $x_{s^*}$. When there is no start position before $x_i$, $s^*$ does not exist, and $d_i^s$ is assigned a constant $C$ that is normally set to the maximum sentence length. In this way, we explicitly limit the length of the extracted entity and teach the model that the end position cannot be in front of the start position. To prevent error propagation, we use the gold $d_i^s$ (the distance to the correct nearest start position) during training.
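The relative-distance feature of Eq. 8 can be sketched in a few lines. This is an illustrative sketch only: the function name `start_distances` is hypothetical, and it assumes that a token which is itself tagged as a start position counts as its own nearest start (distance 0), which is the reading needed for single-word entities such as “Trump”.

```python
from typing import List, Optional

def start_distances(sta_tag: List[str], max_len: int) -> List[int]:
    """Distance from each token to the nearest start position at or before it.

    Tokens with no preceding start position receive the constant max_len,
    mirroring the "otherwise" branch of Eq. 8.
    """
    distances: List[int] = []
    nearest: Optional[int] = None  # index of the most recent start tag, if any
    for i, tag in enumerate(sta_tag):
        if tag != "O":
            nearest = i
        distances.append(max_len if nearest is None else i - nearest)
    return distances

example = ["O", "PERSON", "O", "O", "LOCATION", "O"]
print(start_distances(example, max_len=100))  # [100, 0, 1, 2, 0, 1]
```

Each resulting integer would then index a row of the trainable position embedding matrix.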
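We define the training loss (to be minimized) of HBT as the sum of the negative log probabilities of the true start and end tags under the predicted distributions:
$$\mathcal{L}_{HBT} = -\frac{1}{n} \sum_{i=1}^{n} \left( \log P(y_i^{sta} = \hat{y}_i^{sta}) + \log P(y_i^{end} = \hat{y}_i^{end}) \right) \tag{9}$$
where $\hat{y}_i^{sta}$ and $\hat{y}_i^{end}$ are the true start and end tags of the $i$-th word, respectively, and $n$ is the length of the input sentence.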
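As a small worked illustration of this loss, the sketch below computes the negative log-likelihood of the gold start and end tags from per-token probability distributions. It is written in plain Python for clarity, the name `hbt_loss` is hypothetical, and it assumes the loss is averaged over the sentence length; dropping the final division gives the plain sum instead.

```python
import math
from typing import List

def hbt_loss(p_sta: List[List[float]], p_end: List[List[float]],
             gold_sta: List[int], gold_end: List[int]) -> float:
    """Negative log-likelihood of gold start/end tags, averaged over tokens.

    p_sta[i][k] and p_end[i][k] are the predicted probabilities that token i
    has start tag k and end tag k; gold_sta and gold_end hold gold tag ids.
    """
    n = len(gold_sta)
    total = 0.0
    for i in range(n):
        total += math.log(p_sta[i][gold_sta[i]])
        total += math.log(p_end[i][gold_end[i]])
    return -total / n  # assumption: average over the n tokens

# Two tokens, two tags each; every gold tag is predicted with probability 0.5.
probs = [[0.5, 0.5], [0.5, 0.5]]
print(hbt_loss(probs, probs, gold_sta=[0, 1], gold_end=[1, 0]))
```

With every gold tag at probability 0.5, each token contributes $2 \ln 2$, so the averaged loss is also $2 \ln 2$.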
At inference time, to adapt to the multi-target extraction task, we propose a multi-span decoding algorithm, as shown in Algorithm 1. For each input sentence $S$, we first initialize several variables (Lines 1-4) to assist with the decoding: (1) $n$ is defined as the length of $S$. (2) $\mathcal{R}$ is initialized as an empty set to record extracted targets and type tags. (3) $s^*$ is introduced to hold the nearest start position before the current index. (4) $P^s$ is initialized as a list of length $n$ with default value $C$ to save the position sequence $[d_1^s, \dots, d_n^s]$.
Next, we obtain the start tag sequence by Eq. 4 (Line 5) and compute $d_i^s$ for each token by Eq. 8 (Lines 6-10). On the basis of $P^s$, we can get the position embeddings $p_i^{se}$ by looking up the position embedding matrix (Line 11). Then the tag sequence of end positions can be computed by Eq. 7 (Line 12).
Now that all necessary preparations are in place, we start decoding sta_tag and end_tag. We first traverse sta_tag to find the start position of a target (Line 13). If the tag at the current index is not “O”, this position may be a start word (Line 14), and we traverse end_tag from this index to search for an end position that matches the found start position (Line 15). The matching criterion is that the tag of the end position is identical to that of the start position (Line 16); in that case the words between the two indices are considered a candidate target (Line 17), and the label of the start position (or end position) is taken as the tag of this target (Line 18). The extracted target along with its tag is then added to the set (Line 19), and the search in end_tag is terminated so that the traversal of sta_tag can continue to the next start position (Line 20). Once all the indices in sta_tag have been iterated, the decoding function ends by returning the record set (Line 21).
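The matching loop described above (Lines 13-21) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `multi_span_decode` is hypothetical, the two tag sequences are taken as given inputs instead of being produced by the tagger, and extracted targets are returned as (start, end, label) index triples.

```python
from typing import List, Set, Tuple

def multi_span_decode(sta_tag: List[str], end_tag: List[str]) -> Set[Tuple[int, int, str]]:
    """Pair each start tag with the nearest following end tag of the same label.

    Returns a set of (start_index, end_index, label), end index inclusive.
    """
    results: Set[Tuple[int, int, str]] = set()
    n = len(sta_tag)
    for i in range(n):
        if sta_tag[i] == "O":
            continue  # not a start position
        for j in range(i, n):  # an end can coincide with its start
            if end_tag[j] == sta_tag[i]:
                results.add((i, j, sta_tag[i]))
                break  # stop at the first matching end position
    return results

tokens = ["Trump", "was", "born", "in", "New", "York", "City",
          "and", "leads", "the", "United", "States"]
sta = ["O"] * 4 + ["Born_In"] + ["O"] * 5 + ["President_Of", "O"]
end = ["O"] * 6 + ["Born_In"] + ["O"] * 4 + ["President_Of"]

for s, e, label in sorted(multi_span_decode(sta, end)):
    print(label, " ".join(tokens[s:e + 1]))
```

A start tag with no matching end tag yields no target, so isolated boundary predictions are dropped instead of producing spurious spans.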
With the span-based tagging scheme and the above hierarchical boundary tagger, we propose an end-to-end neural architecture (Figure 2) to extract entities and overlapping relations jointly. Our model first encodes the $n$-word sentence using a shared BiLSTM encoder. Then, we build an HE extractor to extract head-entities. For each extracted head-entity, the TER extractor is triggered with this head-entity’s semantic and position information to detect the corresponding tail-entities and relations.
Given sentence $S = \{x_1, \dots, x_n\}$, we use a BiLSTM to incorporate information from both forward and backward directions:
$$h_i = \mathrm{BiLSTM}_{sha}(\mathbf{x}_i) \tag{10}$$
where $h_i$ is the hidden state at position $i$, and $\mathbf{x}_i$ is the word representation of $x_i$, which contains pre-trained embeddings and character-based word representations generated by running a CNN on the character sequence of $x_i$. We also employ part-of-speech (POS) embeddings to enrich $\mathbf{x}_i$.
The HE extractor aims to distinguish candidate head-entities and exclude irrelevant ones. We first concatenate $h_i$ and $g$ to get the feature vector $\tilde{x}_i = [h_i; g]$, where $g$ is a global contextual embedding computed by max pooling over all hidden states. In effect, $g$ works as the $a_i$ for each token in Eq. 2.
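The construction of the HE input features can be illustrated with a toy example. The sketch below uses plain Python lists in place of tensors; the function names `max_pool` and `he_features` and the hidden-state values are hypothetical and chosen only for illustration.

```python
from typing import List

def max_pool(hidden: List[List[float]]) -> List[float]:
    """Element-wise maximum over all token hidden states (the global vector)."""
    return [max(column) for column in zip(*hidden)]

def he_features(hidden: List[List[float]]) -> List[List[float]]:
    """Concatenate each token's hidden state with the shared global vector."""
    g = max_pool(hidden)
    return [h + g for h in hidden]

# Three tokens with 2-dimensional hidden states (made-up values).
hidden_states = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.9]]
print(max_pool(hidden_states))        # [0.5, 0.9]
print(he_features(hidden_states)[0])  # [0.1, -0.3, 0.5, 0.9]
```

Every token receives the same global vector, so the feature dimension doubles while the sequence length stays unchanged.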
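Moreover, we use $H_{HE} = \{\tilde{x}_1, \dots, \tilde{x}_n\}$ to denote all the word representations for HE extraction and subsequently feed $H_{HE}$ into one HBT to extract head-entities:
$$\mathcal{R}_{HE} = \mathrm{HBT}_{HE}(H_{HE}) \tag{11}$$
where $\mathcal{R}_{HE} = \{(h_j, type_{h_j})\}_{j=1}^{m}$ contains all the head-entities and corresponding entity type tags in $S$.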
Similar to the HE extractor, the TER extractor also uses the basic representation $h_i$ and the global vector $g$ as input features. However, simply concatenating $h_i$ and $g$ is not enough for detecting tail-entities and relations for a specific head-entity. The key information required to perform TER extraction includes: (1) the words inside the tail-entity; (2) the head-entity it depends on; (3) the context that indicates the relationship; (4) the distance between the tail-entity and the head-entity. Under these considerations, we propose the position-aware, head-entity-aware and context-aware representation $\bar{x}_i$. Given a head-entity $h$, we define $\bar{x}_i$ as follows:
$$\bar{x}_i = [h_i; g; h^h; p_i^{ht}] \tag{12}$$
where $h^h = [h_{s_h}; h_{e_h}]$ denotes the representation of head-entity $h$, in which $h_{s_h}$ and $h_{e_h}$ are the hidden states at the start and end indices of $h$ respectively. $p_i^{ht}$ is the position embedding that encodes the relative distance from the current word $x_i$ to $h$. Here $[g; h^h; p_i^{ht}]$ is the auxiliary feature vector for TER extraction, playing the role of $a_i$ in Eq. 2.
It is worth noting that at training time, $h$ is the gold head-entity, while at inference time we select head-entities one by one from $\mathcal{R}_{HE}$ to complete the extraction task.
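Formally, we take $H_{TER} = \{\bar{x}_1, \dots, \bar{x}_n\}$ as input to one HBT, and the output is $\mathcal{R}_{TER} = \{(t_o, rel_o)\}_{o=1}^{z}$, in which $t_o$ is the $o$-th extracted tail-entity and $rel_o$ is its relation tag with the given head-entity.
$$\mathcal{R}_{TER} = \mathrm{HBT}_{TER}(H_{TER}) \tag{13}$$
Then we can assemble triplets by combining $h$ and each $(t_o, rel_o)$ to form $\{(h, rel_o, t_o)\}_{o=1}^{z}$, which contains all triplets with head-entity $h$ in sentence $S$.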
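The inference-time control flow of the extract-then-label pipeline can be sketched as below. Both extractors are replaced by hypothetical lookup stubs (`fake_he` and `fake_ter`) so that the loop structure is runnable; in the real system each would be a trained HBT. All names here are illustrative assumptions.

```python
from typing import Callable, List, Tuple

HeadEntity = Tuple[str, str]     # (head-entity text, entity type)
TailRelation = Tuple[str, str]   # (tail-entity text, relation type)
Triplet = Tuple[str, str, str]   # (head, relation, tail)

def extract_triplets(
    sentence: str,
    he_extractor: Callable[[str], List[HeadEntity]],
    ter_extractor: Callable[[str, str], List[TailRelation]],
) -> List[Triplet]:
    """Run HE extraction once, then TER extraction once per head-entity."""
    triplets: List[Triplet] = []
    for head, _entity_type in he_extractor(sentence):
        for tail, relation in ter_extractor(sentence, head):
            triplets.append((head, relation, tail))
    return triplets

# Stub extractors standing in for the trained taggers (illustration only).
def fake_he(sentence: str) -> List[HeadEntity]:
    return [("Trump", "PERSON")]

def fake_ter(sentence: str, head: str) -> List[TailRelation]:
    table = {"Trump": [("United States", "President_Of"),
                       ("New York City", "Born_In")]}
    return table.get(head, [])

print(extract_triplets("(sentence text)", fake_he, fake_ter))
```

Because one head-entity can be paired with several tail-entities, overlapping relations that share a head fall out of this loop without any special handling.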
Training of Joint Extractor
Two learning signals are provided to train the model: $\mathcal{L}_{HE}$ for HE extraction and $\mathcal{L}_{TER}$ for TER extraction, both formulated as in Eq. 9. To share the input utterance across tasks and train them jointly, for each training instance we randomly select one head-entity from the gold head-entity set as the specified input of the TER extractor. We could also repeat each sentence many times to ensure that all triplets are utilized, but the experimental results show that this is not beneficial. Finally, the joint loss is given by:
$$\mathcal{L} = \lambda \mathcal{L}_{HE} + (1 - \lambda) \mathcal{L}_{TER} \tag{14}$$
where $\lambda$ is a weighting hyperparameter to balance the two components. The model is then trained with stochastic gradient descent. Optimizing Eq. 14 enables the extraction of head-entities, tail-entities, and relations to influence one another, such that errors in each component can be constrained by the others.
We conduct experiments on three benchmark datasets: (1) NYT-single is sampled from the New York Times corpus [13] and published by Ren et al. (2017). The training data is automatically labeled using distant supervision, while 395 sentences are annotated manually as test data, most of which contain a single triplet. (2) NYT-multi is published by Zeng et al. (2018) for testing overlapping relation extraction. They selected 5000 sentences from NYT-single as the test set and 5000 sentences as the validation set, and the remaining 56195 sentences are used as the training set. (3) Wiki-KBP is sampled from 780k Wikipedia articles and automatically labeled by Liu et al. (2017), while the test set is selected by Ren et al. (2017). Statistics of the datasets are shown in Table 1.
Following previous work, we use the F1 metric computed from Precision (Prec.) and Recall (Rec.) for evaluation. A triplet is marked correct when its relation type and both of its entities are correct. For NYT-single and Wiki-KBP, we create a validation set by randomly sampling 10% of the sentences from the test set, as previous studies [21, 2] did.
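The evaluation criterion above amounts to exact matching of (head, relation, tail) triplets. The following sketch computes precision, recall and F1 under the assumption that counts are micro-averaged over all sentences; the function name `micro_prf1` and the example triplets are illustrative only.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (head-entity, relation, tail-entity)

def micro_prf1(pred: List[Set[Triplet]], gold: List[Set[Triplet]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 with exact triplet matching."""
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    n_correct = sum(len(p & g) for p, g in zip(pred, gold))
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = [{("Trump", "Born_In", "New York City"),
         ("Trump", "President_Of", "United States")}]
pred = [{("Trump", "Born_In", "New York City"),
         ("Trump", "Born_In", "United States")}]  # second triplet has a wrong relation
print(micro_prf1(pred, gold))  # (0.5, 0.5, 0.5)
```

A triplet with the right entities but the wrong relation counts as both a false positive and a missed gold triplet, which is why precision and recall both drop in the example.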
Following popular choices and previous work, we use the 300-dimensional GloVe [12] vectors to initialize word embeddings. We randomly initialize the POS, character, and position embeddings with 30-dimensional vectors. The window size of the CNN for character-based word representations is set to 3, and the number of filters is 50. For the BiLSTM component in our system, we use a 1-layer network with hidden state size 100. Parameter optimization is performed using Adam [7] with learning rate 0.001 and batch size 64. Dropout is applied to embeddings and hidden states with a rate of 0.4. The loss weight $\lambda$ is chosen via grid search over a set of candidate values. To prevent the gradient explosion problem, we set the gradient clip-norm to 5. All hyper-parameters are tuned on the validation set. We run each experiment 5 times and report the average results.
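Table 1: Statistics of the datasets.
| Dataset | NYT-single | NYT-multi | Wiki-KBP |
|---|---|---|---|
| # Relation types | 24 | 24 | 14 |
| # Entity types | 3 | 3 | 3 |
| # Training sentences | 66,335 | 56,195 | 75,325 |
| # Test sentences | 395 | 5,000 | 289 |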
For comparison, we employ the following models as baselines: (1) NovelTagging [21] is the first proposed unified sequence tagger, which predicts both the entity type and the relation class for each word. (2) MultiDecoder [19] treats relation extraction as a seq2seq problem and uses dynamic decoders to extract relation triplets. (3) TME [16] first identifies all candidate entities, then performs relation extraction by ranking candidate relations with a translation mechanism; these two tasks are trained jointly. (4) PA-LSTM [2] tags entity and relation labels according to a query word position and achieves the recent state-of-the-art result on NYT-single and Wiki-KBP. (5) GraphRel [3] is the latest state-of-the-art method on NYT-multi, which first employs GCNs to extract hidden features, then predicts relations for all word pairs of an entity mention pair extracted by a sequence tagger.
We call our proposed span-based extract-then-label method ETL-Span. In addition, to assess the performance influence of our span-based scheme, we also implement another competitive baseline, denoted ETL-BIES, by replacing our tagger with the widely used BiLSTM-CRF without any change in the input features ($h_i$ and $a_i$). ETL-BIES uses a BIES-based scheme accordingly, which associates each type tag (entity type or relation type) with four position tags to indicate the position of entities and their types simultaneously.
Experimental Results and Analyses
Table 2 summarizes the comparison results on the three datasets. Overall, our method significantly outperforms the others and achieves the state-of-the-art F1 score on all three datasets. Compared to the current best extract-then-classify method, GraphRel, ETL-Span achieves a substantial improvement of 17.2% in F1 on the NYT-multi dataset. We attribute the performance gain to two design choices: (1) the integration of tail-entity and relation extraction, which captures the interdependency between entity recognition and relation classification; (2) the exclusion of redundant (non-relation) entity pairs through the recognition of head-entities that are likely to take part in some relation. On the NYT-single dataset, ETL-Span outperforms PA-LSTM by 5.6% in F1. We consider that this is because (1) we decompose the difficult joint extraction task into several more manageable subtasks and handle them in a mutually enhancing way; and (2) our TER extractor effectively captures the semantic and position information of the head-entity it depends on, while PA-LSTM detects tail-entities and relations relying on a single query word.
We can also observe that ETL-Span performs remarkably better than ETL-BIES. We conjecture that this is because ETL-BIES must do additional work to learn the semantics of the BIES tags, while in ETL-Span the entity position is naturally encoded by the set of type labels, which reduces the tag space of each functional tagger. Another advantage of span-based tagging is that it avoids the computing overhead of the CRF. As shown in Table 3, ETL-Span accelerates the decoding speed of ETL-BIES by up to 2.4 times. The main reason is that decoding the best chain of labels with a CRF requires a significant amount of computing resources. Besides, during training ETL-Span takes only about 1/4 of the time per batch and 1/3 of the GPU memory compared with ETL-BIES, which further verifies the superiority of our span-based scheme.
We notice that the Precision of our model drops compared with NovelTagging and PA-LSTM on the Wiki-KBP dataset. One possible reason is that many overlapping relations are not annotated in the test data of Wiki-KBP. Following Dai et al. (2019), we add some gold triplets to the Wiki-KBP test set and further achieve a large improvement of 13.3% in F1 and 16.9% in Precision compared with the results in Table 2.
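Table 3: Decoding speed in batches per second (Bat/s).
| Model | NYT-single | NYT-multi | Wiki-KBP |
|---|---|---|---|
| ETL-BIES | 10.9 Bat/s | 11.4 Bat/s | 16.2 Bat/s |
| ETL-Span | 26.1 Bat/s | 25.8 Bat/s | 27.9 Bat/s |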
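The reported speedup can be checked directly from the decoding speeds in Table 3. The short sketch below computes the per-column ratio of ETL-Span to ETL-BIES; the variable names are illustrative.

```python
# Decoding speeds in batches per second, copied from Table 3 (one entry per column).
etl_bies = [10.9, 11.4, 16.2]
etl_span = [26.1, 25.8, 27.9]

speedups = [span / bies for span, bies in zip(etl_span, etl_bies)]
print([round(s, 2) for s in speedups])  # [2.39, 2.26, 1.72]
print(round(max(speedups), 1))          # 2.4
```

The largest ratio, about 2.39, is the source of the "up to 2.4 times" figure, while the smallest gain is roughly 1.7 times.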
To demonstrate the effectiveness of each component, we remove one component at a time to understand its impact on performance. Concretely, we investigate character embedding, POS embedding, position embedding and hierarchical tagging (the last is ablated by tagging start positions and end positions at the outermost BiLSTM layer). Table 4 summarizes the results on NYT-single. From these ablations, we find that: (1) Consistent with previous work [2], the character-level representations and POS embeddings help capture morphological information and deal with OOV words. (2) Introducing the global representation is an efficient way to incorporate sentence-level content and make predictions for each word from a global perspective. (3) When we remove $p_i^{ht}$, the score drops by 3.0%, which indicates that it is vital to make the tail-entity extractor aware of the position of the given head-entity so that irrelevant entities are filtered out by an implicit distance constraint. (4) Removing the hierarchical tagging structure hurts the result by 2.3% in F1 score, which indicates that predicting end positions benefits from the prediction results of start positions.
Table 4: Ablation results on NYT-single.
| Model | F1 |
|---|---|
| – Char embedding | 56.7 |
| – POS embedding | 57.6 |
| – Global representation | 56.9 |
| – Position embedding | 56.0 |
| – Hierarchical tagging | 56.7 |
Analysis on Joint Learning
As shown in Figure 3, we analyze the influence of different values of $\lambda$ on the performance of HE, TER and overall triplet extraction. As $\lambda$ increases, our model gradually pays more attention to HE extraction, and vice versa. It is interesting to see that leads to the worst HE extraction performance, and similar trends are also observable on TER extraction. This demonstrates that our HE extractor and TER extractor actually promote each other, which again confirms the effectiveness and rationality of our decomposition strategy. Another intriguing observation is that the performance of all three tasks peaks when , which means our model needs to concentrate more on TER extraction, presumably because TER extraction, with its larger decision space, is more difficult than HE extraction.
Analysis on Overlapping Relation Extraction
Following Zeng et al. (2018) and Fu et al. (2019), we divide the test set of NYT-multi into three categories: Normal, SingleEntityOverlap (SEO), and EntityPairOverlap (EPO), to verify the effectiveness of our model on extracting overlapping relations. A sentence belongs to the Normal class if none of its triplets has overlapping entities. If the entity pairs of two triplets are identical but the relations are different, the sentence is added to the EPO set. A sentence belongs to the SEO class if some of its triplets share an entity and these triplets do not share an entity pair. Note that a sentence in the EPO set may contain multiple Normal and SEO triplets. The results are shown in Figure 4. (Footnote 1: Here we do not compare our method with PA-LSTM because its source code has not been released and it is difficult to reproduce the results reported in the original paper.)
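The three categories can be made concrete with a small classifier over the gold triplets of a sentence. This is a sketch under stated assumptions: the function name `overlap_classes` is hypothetical, entity pairs are compared as unordered pairs, and a sentence may receive both the SEO and EPO labels.

```python
from itertools import combinations
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (head-entity, relation, tail-entity)

def overlap_classes(triplets: List[Triplet]) -> Set[str]:
    """Return the overlap categories ("Normal", "SEO", "EPO") of one sentence."""
    classes: Set[str] = set()
    for (h1, r1, t1), (h2, r2, t2) in combinations(triplets, 2):
        pair1, pair2 = {h1, t1}, {h2, t2}
        if pair1 == pair2:
            if r1 != r2:
                classes.add("EPO")  # same entity pair, different relations
        elif pair1 & pair2:
            classes.add("SEO")      # one shared entity, different pairs
    return classes or {"Normal"}

print(overlap_classes([("Trump", "Born_In", "New York City"),
                       ("Trump", "President_Of", "United States")]))  # {'SEO'}
print(overlap_classes([("A", "r1", "B"), ("A", "r2", "B")]))          # {'EPO'}
print(overlap_classes([("A", "r", "B"), ("C", "r", "D")]))            # {'Normal'}
```

The “Trump” example from the tagging scheme falls into the SEO class, which is the case the head-entity-conditioned TER extractor is designed to handle.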
Among the compared baselines, GraphRel and MultiDecoder are the only two models that have the capacity to handle EPO triplets. For this purpose, GraphRel predicts relations for all word pairs; as a result, its relation classifier can be overwhelmed by superfluous candidates. Readers may have noticed that our model cannot solve the problem of entity pair overlapping. Nevertheless, we still surpass the baselines by a substantial margin in all categories. Specifically, our model outperforms GraphRel by 17.4% on the Normal class, 16.9% on the SEO class, and 4.1% on the EPO class. In fact, even in the EPO set, there is still a significant number of triplets whose entity pairs do not overlap. The most common triplets in real-life corpora are those of the Normal and SEO classes, and our substantial gains on these two categories outweigh our shortcomings on the EPO class. We leave the identification of EPO triplets for future work.
Researchers have proposed several methods to extract both entities and relations. Traditional pipelined methods [18, 1] neglect the relevance between entity extraction and relation prediction. To resolve this problem, several joint models have been proposed. Feature-based works [17, 11] need a complicated process of feature engineering. Neural models for joint relation extraction are investigated in recent studies [4, 21]; they show promising results but completely give up on overlapping relations. To overcome this limitation, Zeng et al. (2018) propose a sequence-to-sequence model to decode overlapping relations but fail to generate multi-word entities. Sun et al. (2018) optimize a global loss function to jointly train the two models under the framework of Minimum Risk Training. Dai et al. (2019) extract triplets by tagging one sentence $n$ times, which is time-consuming with $O(n^2)$ time complexity. TME [16] solves this task via ranking with a translation mechanism. Takanobu et al. (2019) deal with relation extraction by first determining relations and then recognizing entity pairs via reinforcement learning. Li et al. (2019) cast the task as a multi-turn QA problem and generate questions with relation-specific templates. Sun et al. (2019) develop an entity-relation bipartite graph to perform joint inference on entity types and relation types. Fu et al. (2019) also utilize a graph convolutional network to extract overlapping relations by splitting entity mention pairs into several word pairs and considering all pairs for prediction.
Our span-based tagging scheme is inspired by recent advances in machine reading comprehension [14], which derive the answer by predicting its start and end indices in the paragraph. Hu et al. (2019) also apply this sort of architecture to open-domain aspect extraction. However, unlike these works, which predict the start index and end index at one level, our approach passes the prediction information of start indices to a higher layer to obtain the end indices, thus better capturing the links between boundary positions.
In this paper, we hierarchically decompose the entity-relation extraction task into several sequence labeling subtasks with partial labels, and solve them in a unified framework. Experimental results show that the functional decomposition of the original task simplifies the learning process and leads to a better overall learning outcome, achieving a new state-of-the-art on three datasets. In the future, we will investigate how to apply such a decomposition strategy to other information extraction tasks.
[1] Chan and Roth (2011) Exploiting syntactico-semantic structures for relation extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 551–560.
[2] Dai et al. (2019) Joint extraction of entities and overlapping relations using position-attentive sequence labeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6300–6308.
[3] Fu et al. (2019) GraphRel: modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Italy, pp. 1409–1418.
[4] Gupta et al. (2016) Table filling multi-task recurrent neural network for joint entity and relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2537–2547.
[5] Hochreiter and Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[6] Hu et al. (2019) Open-domain targeted sentiment analysis via span-based extraction and classification. arXiv preprint arXiv:1906.03820.
[7] Kingma and Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[8] Li and Ji (2014) Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 402–412.
[9] Liu et al. (2018) Empower sequence labeling with task-aware neural language model. In Thirty-Second AAAI Conference on Artificial Intelligence.
[10] Miwa and Bansal (2016) End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv preprint arXiv:1601.00770.
[11] Miwa and Sasaki (2014) Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1858–1869.
[12] Pennington et al. (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
[13] Riedel et al. (2010) Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 148–163.
[14] Seo et al. (2016) Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
[15] Sun et al. (2019) Joint type inference on entities and relations via graph convolutional networks. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 1361–1370.
[16] Tan et al. (2019) Jointly extracting multiple triplets with multilayer translation constraints. In Proceedings of the AAAI Conference on Artificial Intelligence.
[17] Yu and Lam (2010) Jointly identifying entities and extracting relations in encyclopedia text via a graphical model approach. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 1399–1407.
[18] Zelenko et al. (2003) Kernel methods for relation extraction. Journal of Machine Learning Research 3 (Feb), pp. 1083–1106.
[19] Zeng et al. (2018) Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 506–514.
[20] Zhang and Goldwasser (2019) Sentiment tagging with partial labels using modular architectures. arXiv preprint arXiv:1906.00534.
[21] Zheng et al. (2017) Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1227–1236.