Combining Spans into Entities: A Neural Two-Stage Approach for Recognizing Discontiguous Entities


Bailin Wang
ILCC, School of Informatics
University of Edinburgh
Wei Lu
StatNLP Research Group
Singapore University of Technology and Design

In medical documents, an entity of interest may not only span a discontiguous sequence of words but also overlap with another entity. Entities with such structures are intrinsically hard to recognize due to the large space of possible entity combinations. In this work, we propose a neural two-stage approach to recognizing discontiguous and overlapping entities by decomposing the problem into two subtasks: 1) it first detects all the (potentially overlapping) spans that either form entities on their own or appear as segments of discontiguous entities, based on the representation of segmental hypergraphs; 2) it then learns to combine these segments into discontiguous entities with a classifier, which filters out incorrect combinations of segments. Two neural components are designed for these subtasks respectively, and they are learned jointly using a shared text encoder. Our model achieves state-of-the-art performance on a standard dataset, even in the absence of the external features used by previous methods.

1 Introduction

Named entity recognition (NER) aims at identifying shallow semantic elements in text and has been a crucial step towards natural language understanding tjong2003introduction. Extracted entities can facilitate various downstream tasks like question answering abney2000answer, relation extraction mintz2009distant; liu2017heterogeneous, event extraction riedel2011fast; lu2012automatic; li-ji-huang:2013:ACL2013, and coreference resolution soon2001machine; ng2002improving; chang2013constrained.

The underlying assumptions behind most NER systems are that an entity contains a contiguous sequence of words and that entities do not overlap with each other. However, such assumptions do not always hold in practice. First, entities or mentions (mentions are defined as references to entities that could be named, nominal or pronominal florian2004statistical) with overlapping structures frequently exist in news doddington2004automatic and biomedical documents kim2003genia. Second, entities can be discontiguous, especially in clinical text pradhan2014evaluating. For example, Figure 1 shows three entities, two of which are discontiguous (“laceration … esophagus” and “stomach … lac”); the second discontiguous entity also overlaps with another entity (“blood in stomach”).

Figure 1: Entities are highlighted with colored underlines. “laceration … esophagus” and “stomach … lac” contain discontiguous sequences of words, and the latter also overlaps with another entity, “blood in stomach”.

Such discontiguous entities are intrinsically hard to recognize considering the large search space of possible combinations of entities that have discontiguous and overlapping structures. muis2016learning proposed a hypergraph-based representation to compactly encode discontiguous entities. However, this representation suffers from the ambiguity issue during decoding – one particular hypergraph corresponds to multiple interpretations of entity combinations. As a result, it resorted to heuristics to deal with such an issue.

Motivated by their work, we take a novel approach to resolve the ambiguity issue in this work. Our core observation is that though it is hard to exactly encode the exponential space of all possible discontiguous entities, recent work on extracting overlapping structures D18-1019 can be employed to efficiently explore the space of all the span combinations of discontiguous entities. Based on this observation, we decompose the problem of recognizing discontiguous entities into two subtasks: 1) segment extraction: learning to detect all (potentially overlapping) spans that either form entities on their own or present as parts of a discontiguous entity; 2) segment merging: learning to form entities by merging certain spans into discontiguous entities.

Our contributions are summarized as follows:


  • By decomposing the problem of extracting discontiguous entities into two subtasks, we propose a two-stage approach that does not have the ambiguity issue.

  • Under this decomposition, we design two neural components for these two subtasks respectively. We further show that the joint learning setting where the two components use a shared text encoder is beneficial.

  • Empirical results show that our system achieves a significant improvement compared with previous methods, even in the absence of the external features that those methods used. Our code is available at

Though we only focus on discontiguous entity recognition in this work, our model may find applications in other tasks that involve discontiguous structures, such as detecting gappy multiword expressions schneider2014discriminative.

2 Related Work

The task of extracting overlapping entities has long been studied zhang2004enhancing; zhou2004recognizing; zhou2006recognizing; mcdonald2005flexible; alex2007recognising; finkel2009nested; lu2015joint; muis2017labeling. As neural models collobert2011natural; lample2016neural; huang2015bidirectional; chiu2016named; ma-hovy:2016:P16-1 are proven effective for NER, several neural systems have recently been proposed to handle entities with overlapping structures N18-1131; N18-1079; D18-1124; D18-1309; D18-1019; strakova-etal-2019-neural; lin-etal-2019-sequence; fisher-vlachos-2019-merge. Our system is based on the model of neural segmental hypergraphs D18-1019, which encodes all the possible combinations of overlapping entities using a compact hypergraph representation without ambiguity. Note that other systems for extracting overlapping structures could also fit into our two-stage approach.

For discontiguous and overlapping entity recognition, tang2013recognizing; zhang2014uth_ccb; xu2015uth extended the BIO tagging scheme to encode such complex structures so that a traditional linear-chain CRF lafferty2001conditional can be employed. However, the model suffers greatly from ambiguity during decoding due to the use of the extended tagset. muis2016learning proposed a hypergraph-based representation to reduce the level of ambiguity. Essentially, these systems trade expressiveness for efficiency: they inexactly encode the whole space of discontiguous entities with ambiguity for training, and then rely on heuristics to handle the ambiguity during decoding. (We briefly introduce these systems and their heuristics later as baselines in our experiments.) Considering that it is intrinsically hard to exactly identify discontiguous entities in one stage using a structured model, our work decomposes the task into two subtasks to resolve the ambiguity issue.

This task is also related to joint entity and relation extraction kate2010joint; li2014incremental; miwa2014modeling where the discontiguous entities can be viewed as relation links between segments. The major difference is that discontiguous entities require explicitly modeling overlapping entities and linking multiple segments.

3 Model

Our goal is to extract a set of entities that may have overlapping and discontiguous structures given a natural language sentence. We use $x$ to denote a sentence and $y$ to denote a set of discontiguous entities, where each entity of type $t$ contains a list of spans, e.g., $x_{1,2}$ and $x_{4,5}$, with subscripts indicating the starting and ending positions of a span. Hence, this task can be viewed as extracting and labelling a sequence of spans as an entity.

Our two-stage approach first extracts spans of interest, such as $x_{1,2}$, which are parts of discontiguous entities. Then it merges these extracted spans into discontiguous entities. In the more general setting where discontiguous entities are typed, our approach jointly extracts and labels the spans at the first stage, then merges only the spans of the same type at the second stage. We call such an intermediate typed span a segment in the rest of the paper.
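To make the span-list representation concrete, the following is a minimal sketch (our own illustrative code and token indices, not the paper's implementation) of storing a discontiguous entity as a typed list of spans and rendering its surface form:

```python
# Illustrative sketch (not the authors' code): an entity is a typed
# list of (start, end) token spans with inclusive, 0-indexed positions.
tokens = "laceration of the esophagus and blood in stomach".split()
entity = {"type": "Disorder", "spans": [(0, 0), (3, 3)]}

def surface(entity, tokens):
    """Render a (possibly discontiguous) entity's surface form."""
    return " ... ".join(" ".join(tokens[i:j + 1]) for i, j in entity["spans"])

print(surface(entity, tokens))  # laceration ... esophagus
```

A contiguous entity is simply the special case with a single span, e.g., `{"type": "Disorder", "spans": [(5, 7)]}` for "blood in stomach".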

Formally, our model aims at maximizing the conditional probability $p(y \mid x)$, which is decomposed as:

$$p(y \mid x) = p(y \mid s, x)\, p(s \mid x)$$

where $s$ denotes the set of segments that leads to $y$ through a specific combination. (We note that each $y$ corresponds to one unique $s$.) That is, we divide the problem of extracting discontiguous entities into two subtasks, namely segment extraction and segment merging.

3.1 Segment Extraction

The entity segments of interest in a given sentence may themselves overlap with each other. For example, in Figure 1 the entity “blood in stomach” contains another segment, “stomach”. To make our model capable of extracting such overlapping segment combinations, we employ the model of neural segmental hypergraphs from D18-1019, which uses a hypergraph-based representation to encode all the possible combinations of segments without ambiguity. Specifically, the segmental hypergraph adopts a log-linear approach to model the conditional probability of each segment combination for a given sentence:


$$p(s \mid x) = \frac{\exp\big(f(x, s)\big)}{\sum_{s'} \exp\big(f(x, s')\big)}$$

where $f(x, s)$ is the score function for any pair of input sentence $x$ and output segment combination $s$.

In segmental hypergraphs, each segment combination corresponds to a hyperpath. Following D18-1019, the score for a hyperpath is the sum of the scores for each hyperedge, which are based on the word-level and span-level representations through LSTM graves2005framewise:


$$\mathbf{h}_i = \mathrm{BiLSTM}_{w}(\mathbf{w}_1, \dots, \mathbf{w}_n)_i, \qquad \mathbf{s}_{i,j} = \mathrm{BiLSTM}_{s}(\mathbf{h}_i, \dots, \mathbf{h}_j)$$

where $\mathbf{w}_i$ is the corresponding word embedding for word $w_i$, $\mathbf{h}_i$ denotes the representation for the $i$-th word, and $\mathbf{s}_{i,j}$ denotes the representation for the span from the $i$-th to the $j$-th word.

On top of the segmental hypergraph representation, the partition function, which is the denominator of Equation 2, can be computed using dynamic programming. The inference algorithm has a quadratic time complexity in the number of words, which can be further reduced to linear if we bound the maximal length $L$ of a segment. We regard $L$ as a hyperparameter.
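The effect of bounding segment length can be illustrated with a small counting sketch (the helper below is ours, not the authors' code): with $n$ words and a maximal segment length, the number of candidate spans grows linearly in $n$ rather than quadratically.

```python
def enumerate_spans(n, max_len):
    """All candidate spans (i, j), 0-indexed inclusive, with length <= max_len."""
    return [(i, j) for i in range(n) for j in range(i, min(i + max_len, n))]

# Unbounded: O(n^2) spans; bounded: roughly n * max_len spans.
print(len(enumerate_spans(10, 10)))  # 55
print(len(enumerate_spans(10, 6)))   # 45
```

For a 100-word sentence with the paper's bound of 6, this yields 585 candidate spans instead of 5050, which is what makes the bounded inference linear in practice.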

         # sents   # entities (%)                        # o.l. (%)
                   1 segment   2 segments   3 segments
Train    534       544 (46)    607 (51)     44 (4)       205 (17)
Dev      303       357 (45)    421 (53)     18 (2)       240 (30)
Test     430       584 (48)    610 (50)     16 (1)       327 (27)
Table 1: Statistics of the dataset. o.l.: overlapping entities, sents: sentences.

3.2 Segment Merging

Given a set of segments, our next subtask is to merge them into entities. First, we enumerate all the valid segment combinations, denoted as $C$, based on the assumption that the segments in the same entity should have the same type and should not overlap with each other. Our model then independently decides whether each valid segment combination forms an entity. We call these valid segment combinations entity candidates. For brevity, let us use $c = (s_1, \dots, s_m)$ to denote an entity candidate, where each segment $s_i$ belongs to $s$.
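A hedged sketch of this candidate enumeration (the function names and the (start, end, type) span format are our own, not the paper's code): a combination of segments is valid only if all its segments share a type and no two of them overlap.

```python
from itertools import combinations

def overlaps(a, b):
    """Spans are (start, end, type) tuples with inclusive indices."""
    return not (a[1] < b[0] or b[1] < a[0])

def entity_candidates(segments, max_segments=3):
    """Enumerate valid segment combinations: same type, pairwise disjoint."""
    cands = []
    for k in range(1, max_segments + 1):
        for combo in combinations(segments, k):
            if len({s[2] for s in combo}) > 1:
                continue  # segments in one entity must share a type
            if any(overlaps(a, b) for a, b in combinations(combo, 2)):
                continue  # segments in one entity must not overlap
            cands.append(combo)
    return cands

segs = [(5, 5, "Disorder"), (5, 7, "Disorder"), (0, 0, "Disorder")]
print(len(entity_candidates(segs)))  # 5
```

Here the three singletons are always candidates, the pair (5, 5)/(5, 7) is rejected because the spans overlap, and the remaining two pairs survive; the classifier described below then scores each surviving candidate independently.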

Formally, given segments $s$, the probability of generating entities $y$ can be represented as:

$$p(y \mid s, x) = \prod_{c \in C} p\big(\mathbb{1}[c \in y] \mid c, x\big)$$

where $\mathbb{1}[\cdot]$ is an indicator function. We use a binary classifier to model $p(\mathbb{1}[c \in y] \mid c, x)$.

To capture the interactions between segments within the same combination, we employ yet another LSTM on top of segments as follows:


$$\mathbf{c} = \mathrm{LSTM}_{e}\big(\mathbf{s}_{i_1,j_1}, \dots, \mathbf{s}_{i_m,j_m}\big)$$

where $\mathbf{c}$ denotes the representation of the segment combination $c$, which then serves as a feature vector for a binary classifier to determine whether $c$ is an entity. Note that we reuse the span representations $\mathbf{s}_{i,j}$ from Equation 4, meaning that the encoders for words and spans are shared between segment extraction and merging.

The binary classifier for each $c$ in Equation 5 is computed as:

$$p\big(\mathbb{1}[c \in y] \mid c, x\big) = \sigma\big(\mathbf{w}^{\top} \mathrm{ReLU}(\mathbf{c}) + b\big)$$

where we use a rectified linear unit (ReLU) glorot2011deep and a linear layer, parameterized by $\mathbf{w}$ and $b$, to map the representation $\mathbf{c}$ from Equation 6 to a scalar score. This score is normalized into a probability by the sigmoid function $\sigma$.
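One plausible minimal sketch of such a classifier head, assuming a scalar logit squashed by a sigmoid (the function name and the toy representation and parameters below are ours, not the paper's):

```python
import math

def entity_probability(c, w, b):
    """ReLU over the combination representation, a linear layer to a
    scalar score, then a sigmoid to turn the score into a probability."""
    h = [max(x, 0.0) for x in c]                      # ReLU
    score = sum(wi * hi for wi, hi in zip(w, h)) + b  # scalar logit
    return 1.0 / (1.0 + math.exp(-score))             # sigmoid

p = entity_probability([0.5, -1.0, 2.0], [1.0, 1.0, 0.5], 0.0)
print(round(p, 3))  # 0.818
```

The candidate is accepted as an entity when this probability exceeds a decision threshold (0.5 for a standard binary classifier).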

In the joint model, we stack three separate LSTMs to encode text at different levels: words, spans, and then discontiguous entities. Intuitively, the word-level and span-level LSTMs capture the lower-level information needed for segment extraction, while the entity-level LSTM captures the higher-level information needed for segment merging.

3.3 Learning and Decoding

For a dataset consisting of sentence-entities pairs $\{(x^{(k)}, y^{(k)})\}_{k=1}^{N}$, our objective is to minimize the negative log-likelihood as follows:

$$\mathcal{L}(\theta) = -\sum_{k=1}^{N} \log p\big(y^{(k)} \mid x^{(k)}\big) + \lambda \lVert \theta \rVert^{2}$$

where $\theta$ denotes the model parameters and $\lambda$ is the regularization coefficient. $p(y \mid x)$ is computed by $p(y \mid s, x)\, p(s \mid x)$, where $s$ is inferred from $y$.

During the decoding stage, the system first predicts the most probable segments from the neural segmental hypergraph by $s^{*} = \arg\max_{s} p(s \mid x)$. Then it feeds this prediction to the next stage of merging segments and outputs the discontiguous entities by $y^{*} = \arg\max_{y} p(y \mid s^{*}, x)$.

4 Experiments

4.1 Setup


We evaluated our model on the task of recognizing mentions in clinical text from the ShARe/CLEF eHealth Evaluation Lab (SHEL) 2013 suominen2013overview and SemEval-2014 pradhan2014semeval. The task is to extract mentions of disorders from clinical documents according to the Unified Medical Language System (UMLS). The original dataset has only a small percentage of discontiguous entities, making it unsuitable for comparing the effectiveness of different models at handling discontiguous entities. Following muis2016learning, we use a subset of the original data in which each sentence contains at least one discontiguous entity.

We split the dataset according to the setting of SemEval-2014. Statistics are shown in Table 1. In this subset, 53.6% of entities are discontiguous, and overlapping entities also appear frequently. Since an entity has at most three segments, we constrain entity candidates to no more than three segments during segment merging.

Note that all entities in this dataset hold the same type, disorder. Our model is nonetheless intrinsically able to handle discontiguous entities of multiple types: segments are typed during segment extraction, and only segments of the same type can be merged into an entity of that type. To assess its ability to deal with multiple entity types, we conducted a further analysis (see Section 4.2).


We use the pretrained word embeddings from chiu2016train, which are trained on the PubMed corpus. A dropout layer srivastava2014dropout is applied after each word is mapped to its embedding. The dropout rate and the number of hidden units in the LSTMs are tuned based on performance on the development set. We set the maximal length of a segment to 6 during segment extraction. Our model is trained with Adam kingma2014adam. (See Appendix C for the full hyperparameters.)


The first baseline we consider extends the traditional BIO tagging scheme to seven tags, following tang2013recognizing. With this tagging scheme, each word in a sentence is assigned a label, and a linear-chain CRF models the sequence labelling process. The next baseline is the hypergraph-based method of muis2016learning, which encodes each entity combination into a directed graph based on six types of nodes, each with its specific semantics.

Since both baselines are ambiguous, heuristics are required during decoding. Following muis2016learning, we explored two heuristics: given a model’s ambiguous output, either a tag sequence or a hypergraph, the “enough” heuristic finds the minimal set of entities that corresponds to it, while “all” decodes the union of all the possible sets of entities. Please refer to muis2016learning for details; we also describe them in the Appendix for self-containedness.

We compare our approach to these baselines in two settings. In the non-neural setting, we compare models using the same set of handcrafted features, including external features from a POS tagger and Brown clusters, following muis2016learning. In the neural setting, we implement a linear-chain CRF model using the same neural encoder. We aim to see whether our model performs better in both settings. Note that none of the neural models in our experiments leverage handcrafted features.

            Model                  P     R     F1
Non-neural  CRF (enough)           54.7  41.2  47.0
            CRF (all)              15.2  44.9  22.7
            Graph (enough)         76.9  40.1  52.7
            Graph (all)            76.0  40.5  52.8
            Our model              76.3  41.4  53.7
Neural      CRF (enough)           43.7  54.3  48.4
            CRF (all)              15.7  55.8  24.5
            Our model              48.4  66.5  56.1
             w.o. shared encoder   46.2  65.1  54.0
Table 2: Main results (precision, recall, F1). Graph: the hypergraph-based model by muis2016learning. “enough” and “all” denote the heuristics used in the ambiguous models. w.o. shared encoder: without using a shared encoder.

4.2 Results and Analysis

The main results are listed in Table 2. In both the non-neural and neural settings, our model achieves better results in terms of $F_1$ than the other baselines, revealing the effectiveness of our methodology of decomposing the task into two stages. Our neural model achieves the best performance even without using any external handcrafted features.

We also assess the performance when our model uses separate encoders for segment extraction and merging. From the results, we observe that the setting of using a shared encoder is very beneficial for our two-stage system.

Compared with the non-neural models, the neural models are better in terms of $F_1$, for both the CRF and our model. The gain mostly comes from the ability to recall more entities. Handcrafted features in non-neural models lead to high precision but do not seem general enough to recall most entities.

The “enough” heuristic works better than “all” in most cases, so we use it when evaluating the models’ ability to handle multiple entity types.

Handling Multiple Entity Types

To assess the effectiveness of handling entities of multiple types, we further categorize each entity into three types based on its Concept Unique Identifier (CUI), following muis2016learning. (Note that a CUI label is not available for all entities; entities without a CUI are assigned a default NULL label.) In this setting, segments are jointly extracted and labelled with these three categories during segment extraction, and an entity candidate can only contain segments of the same type during segment merging.

The results are listed in Table 3. Our neural model again achieves the best performance among all models in terms of $F_1$. Compared with the neural CRF, our model is significantly better at recalling entities. Similar to the previous observation, the neural encoder consistently boosts the performance of the CRF by recalling more entities, compared with its non-neural counterpart.

            Model           P     R     F1
Non-neural  CRF (enough)    55.3  37.4  44.6
            Graph (enough)  67.3  37.5  48.2
Neural      CRF (enough)    41.6  52.3  46.3
            Our model       43.3  65.8  52.2
Table 3: Results on handling multiple entity types (precision, recall, F1).

5 Conclusion and Future Work

In this work, we propose a neural two-stage approach for recognizing discontiguous entities, which learns to extract and merge segments jointly without suffering from the ambiguity issue. Empirically, it achieves a significant improvement compared with previous methods that rely heavily on handcrafted features.

During training, the classifier for merging segments is only exposed to correct segments, making it unable to recover from errors of segment extraction during decoding. This issue is similar to exposure bias D16-1137, and it might be beneficial if the segment-merging classifier were exposed to incorrect segments during training. We leave this for future work.


Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. Wei Lu is supported by Singapore Ministry of Education Academic Research Fund (AcRF) Tier 2 Project MOE2017-T2-1-156, and is partially supported by SUTD project PIE-SGP-AI-2018-01.


Appendix A Segment Extraction

Neural segmental hypergraphs D18-1019 were proposed for modeling overlapping structures in entity mentions. We directly adopt their approach to model segments with overlapping structures. Note that our segments also hold entity-type information, so the resulting system for segment extraction can also be viewed as performing sub-mention recognition. Next, we illustrate how the segmental hypergraph encodes overlapping segments with a concrete example. For brevity, we only show an example annotated with one entity type; it can be trivially extended to the case of multiple entity types.

Given a phrase “He had blood in his mouth and on his tongue”, there exist two disorder mentions: ‘blood in his mouth’ and ‘blood … on his tongue’ where the second mention has a discontiguous sequence of words. Our two-stage approach first extracts segments that lead to these two mentions. In this example, the segments consist of ‘blood’, ‘blood in his mouth’ and ‘on his tongue’. We observe that the first two segments overlap with each other.
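In this example, the stage-1 target segments are simply the union of the gold entities' span lists; a small sketch (our own helper, with 0-indexed token positions for the example phrase):

```python
# Hedged sketch: the gold segments for stage 1 are the union of span
# lists over all gold entities. Token positions refer to the phrase
# "He had blood in his mouth and on his tongue".
entities = [
    [(2, 5)],            # 'blood in his mouth'
    [(2, 2), (7, 9)],    # 'blood ... on his tongue'
]
segments = sorted({span for spans in entities for span in spans})
print(segments)  # [(2, 2), (2, 5), (7, 9)]
```

Note how the set union keeps both (2, 2) and (2, 5) even though they overlap; this is exactly why the segment extractor must support overlapping spans.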

Segmental hypergraph encodes this segment combination based on five types of nodes:

  • $\mathbf{A}_i$ encodes all segments that start with the $i$-th or a later word

  • $\mathbf{E}_i$ encodes all segments that start exactly with the $i$-th word

  • $\mathbf{T}_i^{k}$ represents all segments of type $k$ starting with the $i$-th word

  • $\mathbf{I}_{i,j}^{k}$ represents all segments of type $k$ that contain the $j$-th word and start with the $i$-th word

  • $\mathbf{X}$ marks the end of a segment.

Each segment can be expressed in terms of these five node types and corresponds to a path in the segmental hypergraph. As a result, each segment combination corresponds to a hyperpath, where hyperedges are designated to connect multiple nodes so as to model overlapping segments. Figure 2 shows such a hyperpath for the segment combination in our example phrase. Since we only have one entity type in this example, we eliminate the superscript that indicates the entity type in the type-specific nodes.



Figure 2: A hyperpath that encodes three mention segments: ‘blood’, ‘blood in his mouth’ and ‘on his tongue’.

Starting from the third word, ‘blood’, there exist two segments: ‘blood’ and ‘blood in his mouth’. The brown hyperedge is responsible for connecting these two overlapping segments: it indicates both that one segment ends at the third word (a link to the node marking the end of a segment) and that another segment continues to the next word. The segment ‘on his tongue’ is directly mapped to a simple path of its own.

The score for each hyperpath is the sum of the scores computed over its hyperedges. Since some nodes encode word-level information and others encode span-level information, two LSTMs are employed to capture interactions at the word level and the span level respectively. We use the authors’ original implementation, which is publicly available.

Appendix B Heuristics for Handling Ambiguity

Figure 3: Entities are highlighted with colored underlines. “laceration … esophagus” and “stomach … lac” contain discontiguous sequences of words, and the latter also overlaps with another entity, “blood in stomach”.
Figure 4: Entities annotated using seven tags.

This section explains the two heuristics, “enough” and “all”, that are applied when ambiguous tag sequences occur. We use the extended BIO tagging scheme tang2013recognizing; muis2016learning as an example.

To encode the three entities in Figure 3, this tagset has seven tags:

  • B/I: Beginning and Inside of contiguous entities

  • BH/IH: Beginning and Inside of head where head refers to segments shared by multiple discontiguous entities.

  • BD/ID: Beginning and Inside of body where body refers to segments that are not shared across entities.

  • O: Outside of entities.

The resulting tag sequence is shown in Figure 4. Since this tagging scheme cannot model the correspondence between different tags, tag sequences are very likely to have multiple interpretations. For instance, it is not clear whether “laceration” should be combined with “esophagus” or with “stomach”.
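This ambiguity can be made concrete with a toy sketch (the tag positions below are illustrative, not the actual sentence): the scheme records which segments are heads and which are bodies, but not which of them belong together.

```python
# Toy tag sequence in the 7-tag scheme (positions are illustrative):
# two body segments (BD) and one shared head segment (BH).
tags = ["BD", "O", "BD", "O", "BH", "O", "BD"]
bodies = [i for i, t in enumerate(tags) if t == "BD"]
heads = [i for i, t in enumerate(tags) if t == "BH"]

# Every body could attach to the head, so even two-part readings
# of this single tag sequence are already ambiguous:
readings = [(b, h) for b in bodies for h in heads]
print(len(readings))  # 3
```

With longer sentences and multiple heads, the number of readings grows combinatorially, which is why the decoding heuristics below are needed.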

The “all” heuristic extracts all the possible entities that could exist in the tag sequence. In this case, it produces “laceration … esophagus”, “stomach … lac”, “blood in stomach”, “laceration … lac”, “esophagus … lac”, “laceration … esophagus … lac”, “laceration … stomach”, “esophagus … stomach”, “laceration … stomach … lac”, and “esophagus … stomach … lac”.

The “enough” heuristic tries to find a minimal set of entities that corresponds to the tag sequence. In this case, several such combinations of three entities exist, e.g., {“laceration … esophagus”, “stomach … lac”, “blood in stomach”} or {“laceration … lac”, “blood in stomach”, “esophagus … stomach”}. We impose further constraints to generate only one combination, following muissupplementary.

Appendix C Hyperparameters

The hyperparameters used in our neural two-stage model are listed in Table 4. Since our dataset is relatively small, dropout is crucial to prevent overfitting, considering that the pre-trained word embeddings have a dimension of 200. The length of most segments is not greater than 6, so we set the maximal length to 6 to improve the efficiency of segment extraction.

We also tried incorporating a character-level component lample2016neural to capture morphological and orthographic information. However, it did not have a significant effect on performance in terms of $F_1$.

word embedding dim 200
LSTM(word) hidden size 128
LSTM(span) hidden size 128
LSTM(entity) hidden size 64
maximal length 6
dropout 0.8
Table 4: Hyperparameters of our joint model.