A Unified Labeling Approach by Pooling Diverse Datasets for Entity Typing



Evolution of entity typing (ET) has led to the generation of multiple datasets. These datasets span from being coarse-grained to fine-grained encompassing numerous domains. Existing works primarily focus on improving the performance of a model on an individual dataset, independently. This narrowly focused view of ET causes two issues: 1) type assignment when information about the test data domain or target label set is not available; 2) fine-grained type prediction when there is no dataset in the same domain with finer-type annotations. Our goal is to shift the focus from individual domain-specific datasets to all the datasets available for ET. In our proposed approach, we convert the label set of all datasets to a unified hierarchical label set while preserving the semantic properties of the individual labels. Then utilizing a partial label loss, we train a single neural network based classifier using every available dataset for the ET task. We empirically evaluate the effectiveness of our approach on seven real-world diverse ET datasets. The results convey that the combined training on multiple datasets helps the model to generalize better and to predict fine-types across all domains without relying on a specific domain or label set information during evaluation.



Entity typing (ET) is the problem of assigning a label to an entity mention. For example, in the sentence “Max was a Western lowland gorilla held at the Johannesburg Zoo.”, the entity type for the mention Max is animal name, for the mention Western lowland gorilla is species, and for the mention Johannesburg Zoo is zoo. ET is used in several natural language processing applications such as relation extraction [Yaghoobzadeh, Adel, and Schütze 2016], question answering [Yahya et al. 2013], search [Dalton, Dietz, and Allan 2014] and knowledge base construction [Roth et al. 2015].

Figure 1: This figure illustrates the label set intersection of various entity typing datasets.

Several datasets exist for the ET task. These datasets can differ from each other in their domain, their label set, or both. Here, a domain represents the writing style and the vocabulary, and the label set represents the entity types annotated. For example, the CADEC [Karimi et al. 2015] dataset contains sentences from the social media domain annotated with labels related to adverse drug reactions. Different datasets often overlap partially in their domains and label sets. For example, the domain of the BBN [Weischedel and Brunstein 2005] dataset (Wall Street Journal texts) is also included in the OntoNotes [Weischedel et al. 2013] dataset, yet their label sets overlap only partially. Figure 1 illustrates the label set overlap of seven ET datasets. Existing work in entity typing focuses on individual datasets, completely ignoring other datasets available for the task. The outcome is multiple models, each tuned to a specific domain and label set. This narrowly focused approach of treating a single dataset as the whole world for ET creates two issues.

The first issue concerns entity type assignment without a priori knowledge of the test data characteristics. Given a test input without any information about its source domain and target label set, how do we assign an entity type to the input, and using which model? Existing ET work always assumes that the source domain and label set information is available, and learning models are benchmarked on their generalization capability on the same domain and label set. Thus, some models are better suited to a particular domain or label set than others. When these models are deployed, there is no control over the test input a user can submit, and if there is a mismatch in domain or label set, the performance of the models suffers.

The second issue concerns fine-grained type prediction. The fine types are distributed across different datasets covering limited domains. For example, only the Wiki dataset [Ling and Weld 2012], with Wikipedia sentences, has the type sports event, and only the BBN dataset, with Wall Street Journal text, has the type nationality. Often these types are mentioned in a single sentence. Hence, even if an oracle provides information about the target types of the entities present in a sentence, multiple models are needed to make fine-grained predictions for that single sentence.

In this paper, we propose a novel framework to address both issues. The central idea of our work is that three components used in conjunction resolve both issues to a great extent: a unified label set encapsulating all datasets, a label mapping from each dataset-specific label to the unified label set, and a partial loss function.

In our proposed approach, we construct a unified hierarchical label set (UHLS) from the collection of all available label sets for the ET task. In the UHLS, the nodes are contributed by different datasets, and a parent-child relationship among nodes translates to a coarse-fine label relationship. During construction of the UHLS, a mapping from every dataset-specific label to the UHLS nodes is also constructed. We expect to have one-to-many mappings, as is the case with real-world datasets. For example, a coarse-grained label from one dataset could be mapped to two or more nodes in the UHLS introduced by some other dataset. To preserve the semantic properties of different labels, human judgment is involved during the process. Utilizing the UHLS and the mapping, we can view several domain-specific datasets as a single multi-domain dataset having the same label set, i.e., the UHLS. On this combined dataset, we propose to use a partial loss function. It enables fine-grained prediction across domains, even when the dataset available for some domain contributes labels only up to a coarse level of the UHLS. The end product is a single neural network classifier that can predict fine-grained labels across all available datasets. Further, during evaluation, the single classifier does not need any information about the target domain or the label set.

We validate our proposed approach in an experimental setting consisting of seven diverse datasets spanning several domains with different label sets. In an idealistic setting, where the test data characteristics are known during evaluation, our proposed framework performs competitively with the state-of-the-art narrowly focused models. The real advantage of our approach is in a realistic setting, where the test data characteristics are not known. In this setting, our proposed approach outperforms competitive baselines by a considerable margin.

Our two major contributions can be described as follows. First, we propose a novel framework which makes it possible to train a single classifier on an amalgam of diverse ET datasets, enabling finer prediction across all the datasets. Second, we propose evaluation settings and the evaluation metrics to compare learning models predicting in different label sets.

Figure 2: An overview of the proposed framework.

Terminologies and Problem Description

In this section we formally define the related terminologies and the problem description.

Dataset: A dataset, $\mathcal{D}$, is a triple $(X, D, Y)$. Here, $X$ corresponds to a corpus of sentences with entity boundaries annotated, $D$ corresponds to the domain and $Y$ is the set of labels used to annotate each entity mention in $X$. Two datasets may share the same domain but have different label sets, or vice versa. Here the domain means the data characteristics such as writing style and vocabulary.

Label space: A label space for a particular label $y$ is defined as the set of entities that can be assigned the label $y$. For example, the label space for a label car includes mentions of all cars, including the label spaces of hatchback, SUV, etc. Across datasets, even if two labels have the same name, their label spaces can differ. The label space information is defined in the annotation guidelines used to create the datasets.

Type Hierarchy: A type hierarchy, $T$, is a natural way to organize a label set in a hierarchy. It is formally defined as $T = (Y, R)$, where $Y$ is the type set and $R$ is the relation set, in which $(y_i, y_j) \in R$ means that $y_i$ is a subtype of $y_j$, or in other words, the label space of $y_i$ is subsumed within the label space of $y_j$.

Problem description: Given $n$ datasets, $\mathcal{D}_1, \dots, \mathcal{D}_n$, each having its own domain and label set, $D_i$ and $Y_i$ respectively, the objective is to predict the finest label possible from the set of all labels, $Y_a = \bigcup_{i=1}^{n} Y_i$, for a test entity mention.

Proposed Approach

Our proposed approach is based on the following key observations and ideas:

  1. From the set of all available labels, $Y_a$, it is possible to construct a type hierarchy $T_u = (Y_u, R_u)$ where $Y_u \subseteq Y_a$.

  2. We can map each label $y \in Y_a$ to one or more nodes in $T_u$, such that the label space of $y$ is the same as the union of the label spaces of the mapped nodes.

  3. Using the above hierarchy and mapping, even if some datasets provide only coarse labels, i.e., labels mapped to non-leaf nodes, a learning model with a partial hierarchy-aware loss function can predict fine labels.

Figure 2 gives the complete overview of the proposed approach, which is described in the following subsections.

Unified Hierarchy Label Set and Label Mapping

The labels of entity mentions can be arranged in a hierarchy. For example, the label space of airports is subsumed in the label space of facilities. In the literature, there exist several taxonomies, such as WordNet [Miller 1995] and ConceptNet [Liu and Singh 2004]. Two ET datasets, BBN [Weischedel and Brunstein 2005] and FIGER [Ling and Weld 2012], also organize their labels in a hierarchy.

When we analyzed the labels of several ET datasets, we observed that even if two or more datasets share a label name, their label spaces are not necessarily the same semantically. For example, in the CoNLL dataset [Tjong Kim Sang and De Meulder 2003], the label space for the label location includes facilities, whereas in the OntoNotes dataset [Weischedel et al. 2013] the location label space excludes facilities. These differences arise because the datasets were created by different organizations, at different times and for different objectives. Figure 3 illustrates this label space interaction. Additionally, some of these labels are very specific to their domains, and not all of them are present in publicly available taxonomies such as WordNet and ConceptNet, or even in knowledge bases such as Freebase [Bollacker et al. 2008] or WikiData [Vrandečić and Krötzsch 2014].

Thus, to construct UHLS, we analyzed the annotation guidelines of several datasets and came up with an algorithm formally described in Algorithm 1 and explained below.

Given the set of all labels, $Y_a$, the goal is to construct a type hierarchy, $T_u = (Y_u, R_u)$, and a label mapping $\phi: Y_a \to \mathcal{P}(Y_u)$. Here, $Y_u$ is the set of labels present in the hierarchy, $R_u$ is the relation set and $\mathcal{P}(Y_u)$ is the power set of the label set. To construct $T_u$, we start with an initial type hierarchy, which can be empty or initialized by any existing hierarchy. We process each label $y \in Y_a$ in turn, decide whether $T_u$ needs to be updated, and update the mapping $\phi$. For each label there are only two possible cases: either $T_u$ is updated or it is not.
Case 1, $T_u$ is updated: In this case $y$ is added as a child of an existing node in $T_u$, say $v$. While updating, it is ensured that $v$ has the smallest possible label space that completely subsumes the label space of $y$. After the update, if there are existing subtrees under $v$ whose label space is subsumed by the label space of $y$, then $y$ becomes the root of those subtrees. In this case the label mapping is updated as $\phi(y) = \{y\}$, i.e., the label in an individual dataset is mapped to the node with the same label name in the UHLS. Additionally, if there exist any other nodes in $T_u$, outside the subtree rooted at $y$, whose label space is subsumed by the label space of $y$, we also add them to $\phi(y)$. This additional condition ensures that even in the cases where the actual hierarchy would be a directed acyclic graph, we restrict it to a tree hierarchy by adding additional mappings.
Case 2, $T_u$ is not updated: In this case, there exists a subset of nodes $S \subseteq Y_u$ whose union of label spaces is equal to the label space of $y$. If $|S| > 1$, intuitively this means that the label space of $y$ is a mixed space, and labels with finer label spaces were added to $T_u$ from some other datasets. If $|S| = 1$, this means that some other dataset added a label which has the same label space. In this case we only update the label mapping as $\phi(y) = S$.

In Algorithm 1, whenever a decision has to be made by comparing two label spaces, we consult a domain expert. The expert makes the decision based on the annotation guidelines for the queried label spaces and on the existing organization of the queried label space in WordNet or Freebase, if the queried labels are present in these resources. We argue that since the overall size of $Y_a$ is several orders of magnitude smaller than the number of annotated instances, having a human in the loop is affordable and preserves the overall semantic property of the tree, which will be exploited by a partial loss function to enable finer prediction across domains. An illustration of UHLS and label mapping is provided in Figure 3.

In the next section, we will describe how the UHLS and the label mapping will be used by a learning model to make fine predictions across datasets.

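Result: Unified Hierarchical Label Set (UHLS), $T_u = (Y_u, R_u)$, and label mapping, $\phi$.
Initialize: $T_u$ (empty or an existing hierarchy), $\phi$ (empty)
for $y \in Y_a$ do
      if there exists $S \subseteq Y_u$ whose union of label spaces equals the label space of $y$ then // Case 2
            $\phi(y) \leftarrow S$
      else // Case 1
            add $y$ to $Y_u$ as a child of $v$, the node with the smallest label space subsuming that of $y$
            $\phi(y) \leftarrow \{y\}$
            for each other child $c$ of $v$ do // Update existing nodes
                  if the label space of $y$ subsumes that of $c$ then
                        make $y$ the parent of $c$
            for each $v' \in Y_u$ outside the subtree rooted at $y$ do // Restrict to tree hierarchy
                  if the label space of $y$ subsumes that of $v'$ then
                        $\phi(y) \leftarrow \phi(y) \cup \{v'\}$
Algorithm 1: UHLS and label mapping creation algorithm.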
Figure 3: An illustration of the unified hierarchical label set and the label mapping from individual datasets. This is a simplified view for illustration purposes.
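The construction procedure can be sketched in a few lines of Python. This is only an illustrative sketch under strong assumptions, not the authors' implementation: the domain expert's judgment about label spaces is simulated by giving every label an explicit toy set of entities (`space`), labels are qualified with hypothetical dataset prefixes such as `d1:`, and the function names `build_uhls`, `find_cover` and `is_descendant` are invented for this example. The tree restriction step follows one reading of the description above.

```python
def is_descendant(node, ancestor, parent):
    """True if `ancestor` lies on the path from `node` up to the top level."""
    node = parent.get(node)
    while node is not None:
        if node == ancestor:
            return True
        node = parent.get(node)
    return False


def find_cover(space, parent, target):
    """Case 2 test: top-most existing nodes whose label spaces union to `target`."""
    inside = [v for v in parent if space[v] <= target]
    covered = set().union(*(space[v] for v in inside))
    if covered != target:
        return None
    return {v for v in inside if parent[v] not in inside}


def build_uhls(labels, space):
    """Toy UHLS construction; `space` stands in for the expert's decisions."""
    parent = {}   # UHLS node -> parent node (None for a top-level node)
    mapping = {}  # dataset label -> set of UHLS nodes
    for y in labels:
        cover = find_cover(space, parent, space[y])
        if cover is not None:          # Case 2: hierarchy is not updated
            mapping[y] = cover
            continue
        # Case 1: attach y under the smallest node that strictly subsumes it.
        supers = [v for v in parent if space[y] < space[v]]
        p = min(supers, key=lambda v: len(space[v])) if supers else None
        for c in list(parent):         # children of p subsumed by y move under y
            if parent[c] == p and space[c] < space[y]:
                parent[c] = y
        parent[y] = p
        mapping[y] = {y}
        for v in parent:               # tree restriction: extra mappings, no extra edges
            if v != y and space[v] < space[y] and not is_descendant(v, y, parent):
                mapping[y].add(v)
    return parent, mapping


# Hypothetical toy label spaces for three made-up datasets d1, d2, d3.
space = {
    "d1:city": {"Paris", "Delhi"},
    "d1:country": {"India"},
    "d2:gpe": {"Paris", "Delhi", "India"},
    "d3:location": {"Paris", "Delhi", "India", "Nile"},
}
parent, mapping = build_uhls(list(space), space)
print(mapping["d2:gpe"])   # the coarse label maps to the two finer nodes (Case 2)
print(parent["d1:city"])   # the later, coarser label became the parent (Case 1)
```

In this toy run the coarse label `d2:gpe` never becomes a node; it is mapped to the two finer nodes contributed by another dataset, which is the one-to-many situation described in the text.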

Learning Model

Our learning model can be decomposed into two parts: (1) Neural Mention and Context Encoders to encode the entity mention and its surrounding context into a feature vector; (2) Unified Type Predictor to infer entity types in the UHLS.

Neural Mention and Context Encoder

The input to our model is a sentence with the start and end indices of the entity mentions. Following [Shimaoka et al. 2017; Abhishek, Anand, and Awekar 2017; Xu and Barbosa 2018], we use bi-directional LSTMs [Hochreiter and Schmidhuber 1997; Graves, Mohamed, and Hinton 2013] to encode the left and right context surrounding the entity mention, and a character-level LSTM to encode the entity mention itself. We then concatenate the outputs of the three encoders to generate a single representation for the input. Let us denote this representation as $\mathbf{h}$.

Unified Type Predictor

Given the input representation, $\mathbf{h}$, the objective of the predictor is to assign a type from the unified label set $Y_u$. Thus, during model training, using the mapping function $\phi$ we convert individual dataset-specific labels to the unified label set, $Y_u$. Due to the one-to-many mapping, there are now multiple positive labels available for each individual input label $y$. Let us call the mapped label set for an input label $y$ as $Y_m$. Now, if any of the mapped labels has descendants, then the descendants are also added to $Y_m$.

For example, if the label GPE from the OntoNotes dataset is mapped to the label GPE in the UHLS, then GPE as well as all of its descendant nodes are possible candidates for the original label GPE. This is because, even though the original example in OntoNotes may be the name of a city, the annotation guidelines restrict labeling to the coarse level. Thus the mapped set would be updated to {GPE, City, Country, County, …}. Additionally, some labels have a one-to-many mapping; for example, for the label MISC in the CoNLL dataset, the candidate labels could be {product, event, …}.
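The candidate set expansion can be illustrated with a short Python sketch. The dictionaries below are toy fragments built from the labels named in the example above; the function names `descendants` and `candidate_set`, the `children` representation of the UHLS, and the dataset-prefixed keys are assumptions made for illustration only.

```python
def descendants(node, children):
    """All nodes strictly below `node` in the hierarchy."""
    found, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def candidate_set(label, mapping, children):
    """Mapped UHLS nodes of a dataset label plus all of their descendants."""
    candidates = set()
    for node in mapping[label]:
        candidates.add(node)
        candidates |= descendants(node, children)
    return candidates


# Toy fragment of a UHLS (parent -> children) and of the label mapping.
children = {"GPE": ["City", "Country", "County"]}
mapping = {
    "OntoNotes:GPE": {"GPE"},             # one-to-one mapping to a non-leaf node
    "CoNLL:MISC": {"product", "event"},   # one-to-many mapping
}
print(sorted(candidate_set("OntoNotes:GPE", mapping, children)))
print(sorted(candidate_set("CoNLL:MISC", mapping, children)))
```

A label mapped to a leaf node yields a single candidate, which is the case that gives the model unambiguous supervision for fine types.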

A partial label loss function will select the best candidate label from the mapped label set during model training. Due to the inherent design of the UHLS and label mapping, there will always be examples that are mapped to only a single leaf node. Thus, allowing fine labels in the candidate set for examples whose actual labels are coarse will encourage the model to predict finer labels across datasets.

Partial Hierarchical Label Loss

A partial label loss deals with the situation where each training example has a set of candidate labels, among which only a subset is correct for that example [Nguyen and Caruana 2008; Cour, Sapp, and Taskar 2011; Zhang, Yu, and Tang 2017].

In our case, this situation arises because of the mapping of the individual dataset labels to the UHLS. We use a hierarchy-aware partial loss function as proposed in [Xu and Barbosa 2018]. We first compute the probability distribution over the labels available in $Y_u$ as described in equation 1. Here $W$ is a weight matrix of size $|Y_u| \times d$, $\mathbf{h} \in \mathbb{R}^{d}$ is the input representation, and $x$ is the input entity mention along with its context.

$$p(y \mid x) = \mathrm{softmax}(W \mathbf{h}) \qquad (1)$$

Then we compute $p^{*}(y \mid x)$, a distribution adjusted to include a weighted sum of the ancestors' probabilities for each label, as defined in equation 2. Here $\mathcal{A}_y$ is the set of ancestors of the label $y$ in $T_u$ and $\beta$ is a hyperparameter.

$$p^{*}(y \mid x) = p(y \mid x) + \beta \sum_{t \in \mathcal{A}_y} p(t \mid x) \qquad (2)$$

Then we normalize $p^{*}(y \mid x)$. From this normalized distribution, we select the label that has the highest probability and is also a member of the mapped labels $Y_m$. We assume the selected label to be correct and propagate the log-likelihood loss. The intuition behind this is that, given the design of the UHLS and label mapping, there will always be examples where $Y_m$ contains only one element; in that case, the model gets trained for that label. When there are multiple candidate labels, the model has already built a belief about the fine label suitable for that example, because it is simultaneously trained on inputs having a single mapped label. Restricting that belief to the mapped labels encourages correct fine predictions for these coarsely labeled examples.
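The loss computation described above can be sketched in plain Python. This is a minimal sketch under stated assumptions rather than the training code: the raw scores stand in for the classifier output before the softmax, the adjustment is taken to be the label probability plus a hyperparameter-weighted sum of its ancestors' probabilities, the value 0.3 for that hyperparameter is arbitrary, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a dict of label -> raw score."""
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def adjusted_distribution(p, ancestors, beta):
    """Add beta times the ancestors' probability mass to each label, then renormalize."""
    adj = {y: p[y] + beta * sum(p[a] for a in ancestors.get(y, ())) for y in p}
    total = sum(adj.values())
    return {y: v / total for y, v in adj.items()}


def partial_label_loss(scores, candidates, ancestors, beta=0.3):
    """Pick the most probable candidate label and return it with its negative log-likelihood."""
    p_star = adjusted_distribution(softmax(scores), ancestors, beta)
    best = max(sorted(candidates), key=lambda y: p_star[y])
    return best, -math.log(p_star[best])


# Toy example: a mention annotated only with the coarse label GPE.
scores = {"GPE": 1.0, "City": 2.0, "Country": 0.5, "person": 3.0}
ancestors = {"City": ["GPE"], "Country": ["GPE"]}
candidates = {"GPE", "City", "Country"}   # mapped label plus its descendants
label, loss = partial_label_loss(scores, candidates, ancestors)
print(label, round(loss, 3))
```

Note that the label with the highest overall score, `person`, is never selected because it lies outside the candidate set; the loss is propagated only for the best label among the candidates.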

Experiments and Analysis

In this section, we evaluate the proposed method using seven real-world publicly available ET datasets.


Table 1 describes the datasets used in this work. These datasets cover several domains, and no two of them have an identical label set. Some datasets capture fine-grained labels while others only have coarse labels. The Wiki [Ling and Weld 2012] dataset is automatically generated using a distant supervision process [Craven and Kumlien 1999] and has multiple labels per entity mention. The remaining datasets have a single label per entity mention.

Dataset | Domain | No. of labels | Mention count | Fine labels | Reference
BC5CDR | Medical abstract | 2 | 9,385 | No | [Li et al. 2016]
CoNLL | Newswire | 4 | 23,499 | No | [Tjong Kim Sang and De Meulder 2003]
JNLPBA | Medical abstract | 5 | 46,750 | Yes | [Kim et al. 2004]
CADEC | Social media | 5 | 5,807 | Yes | [Karimi et al. 2015]
OntoNotes | Newswire, conversations, newsgroups, weblogs | 18 | 116,465 | No | [Weischedel et al. 2013]
BBN | Newswire | 71 | 86,921 | Yes | [Weischedel and Brunstein 2005]
Wiki | Wikipedia | 118 | 2,000,000 | Yes | [Ling and Weld 2012]
Table 1: Description of the ET datasets used in this work.

UHLS and Label Mapping

We followed Algorithm 1 to create the UHLS and the label mapping. To reduce the load on the human experts for verification of the label spaces, we initialized the UHLS with the BBN dataset hierarchy. We kept updating the initial hierarchy until all the labels from the seven datasets were processed. The final UHLS, $Y_u$, had fewer labels than the set of all labels, $Y_a$. This difference in label count arises because many labels were mapped to one or more existing nodes without the creation of a new node, i.e., Case 2 of the UHLS creation process. This indicates the diverse and overlapping nature of the seven datasets. The label set overlap is also illustrated in Figure 1. The label MISC from the CoNLL dataset has the highest number of mappings. The Wiki and BBN datasets were the largest contributors of fine labels at the leaves of the UHLS. Only some of these fine-grained labels were common to the Wiki and BBN datasets. This indicates that even though these are fine-grained datasets with two of the largest label sets, each of them provides complementary labels to the UHLS.


We compared our learning model's performance with two baseline models under several evaluation schemes described later. The first baseline is a learning model trained only on a single dataset. We name these silo models. In this baseline, the input is fed through a mention and context encoder, and the output labels are the same as those available in the original dataset. In the case of a single-label dataset, we use a standard softmax-based cross-entropy loss. For multi-label datasets, we use a sigmoid-based cross-entropy loss.

The second baseline is a learning model trained in a classic hard parameter sharing multi-task learning framework [Caruana 1997]. In this baseline, all seven datasets are fed through a common mention and context encoder. For each dataset, there is a separate classifier head whose output labels are the same as those available in the respective original dataset. We name this the multi-head baseline. Similar to the silo models, the appropriate loss function is selected for each head. The only difference between the silo and multi-head models is the way mention and context representations are learned. In the multi-head model, the representations are shared across datasets. In silo models, the representations are learned separately for each dataset.

Both of these baselines use the same mention and context encoder architecture as used by our model, i.e., the LSTM based encoders.

Model Training

For each of the seven datasets, we use the standard train, validation and testing splits. If standard splits are not available, we randomly split the available data into train, validation, and testing sets. In the case of the silo model, for each dataset, we train a model on its training split and select the best model using its validation split. In the case of the multi-head and our proposed model, we train the model on the training splits of all seven datasets together and select the best model using the combined validation split. The source code along with the detailed description of the model training procedure and hyperparameters used will be publicly available.

Experimental Setup

In our experimental setup, we have seven silo models, a multi-head model, and our proposed model. We compare these models under two evaluation measures, an idealistic and a realistic measure. In each of these measures, we use two evaluation metrics, a best effort metric and a fine-grained prediction metric.

Evaluation Measures

An idealistic measure assumes that a single dataset is representative of the ET world. Existing work in ET evaluates learning models only under this narrowly focused measure. We name this measure idealistic because, during testing, we know the test data domain and the candidate label set. In this measure, given a test dataset, we pick the silo model (or the head of the multi-head model) that has been trained on a training dataset with the same characteristics as the test dataset. In this measure, for the silo and multi-head models, if the source label set is coarse-grained, then the predictions are always limited to the coarse label set. In our proposed model, there is no selection step, i.e., all seven test datasets are indistinguishable, as there is only one classifier, which always predicts in the UHLS across all datasets.

A realistic measure assumes that a single dataset is representative of only a small subset of the ET world. We name this measure realistic because we do not have any information about the test data domain or the candidate label set. For the silo and multi-head models, the missing information creates a major issue of type assignment. We have multiple silo/head models, and for a test entity mention, we need to assign a type from one of these models. In this measure, we pass every test example through all of the silo models and all of the heads of the multi-head model. Each model/head produces a ranked list of its labels with confidence scores. We pick a final label based on the two schemes described below:

Highest confidence label (HCL): We pick a label which has the highest confidence score across all model/head for a test example. If there are ties, then these are resolved using the RHCL scheme.
Relative highest confidence label (RHCL): In this scheme, we re-rank the model/head confidence relative to the random chance. For example, if the model/head is making a prediction from a label set of three with confidence [0.1, 0.2, 0.7], we adjust the confidence of each label prediction relative to the 33.33% chance. The new scores will be [0.3, 0.6, 2.1]. After re-ranking, we pick a label which has the highest relative confidence score across all models/heads for a test example.
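The two selection schemes can be sketched as follows. This is an illustrative sketch, not the evaluation code: the model names, labels and confidence values are made up, and the relative score is computed as the confidence divided by the uniform chance level, i.e., multiplied by the size of that model's label set, which matches the worked example above.

```python
def flatten(predictions):
    """Yield (raw confidence, relative confidence, model, label) for every prediction."""
    for model, dist in predictions.items():
        n = len(dist)  # label set size; the chance level is 1 / n
        for label, conf in dist.items():
            yield conf, conf * n, model, label


def hcl(predictions):
    """Highest confidence label; ties on raw confidence fall back to relative confidence."""
    best = max(flatten(predictions), key=lambda t: (t[0], t[1]))
    return best[2], best[3]


def rhcl(predictions):
    """Relative highest confidence label."""
    best = max(flatten(predictions), key=lambda t: t[1])
    return best[2], best[3]


# Hypothetical outputs of two silo models for one test mention.
predictions = {
    "small": {"a": 0.1, "b": 0.2, "c": 0.7},                         # 3 labels
    "large": {"l0": 0.6, "l1": 0.1, "l2": 0.1, "l3": 0.1, "l4": 0.1},  # 5 labels
}
print(hcl(predictions))   # picks the largest raw confidence
print(rhcl(predictions))  # picks the largest confidence relative to chance
```

In this toy case the two schemes disagree: the raw confidence favors the model with the smaller label set, while the relative score rewards a confident prediction made over a larger label set.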

Recall that the experimental setup includes multiple models, each having a different label set. The existing classifier integration strategies [Zhou 2012], such as the sum rule or majority voting, are not suitable for this work because every classifier has a different label set.

Evaluation metrics

Under these evaluation measures, there are cases where the label set of the model’s prediction does not match the label set of the gold dataset. For this reason, without re-annotating the test portion of all the datasets, we cannot have an exhaustive comparison among models. To overcome this issue, we propose two evaluation metrics through which a comparison can be made with minimum re-annotation effort.

In the first metric, we compute an aggregate micro-averaged F1 score on a best effort basis. It is based on the intuition that if the labels are only annotated at a coarse level in the gold test annotations, then a model that predicts a fine label within that coarse label should not be penalized. To find the fine-coarse subtype information, we exploit the UHLS and the label mapping. To compute this metric, we map both the prediction and the gold label to the UHLS and evaluate in that space. By design, this metric will not capture errors made at a finer level, which the next metric will capture. We compute this metric under both the idealistic and the realistic measure. We pool all the test samples from the seven datasets into a single test dataset. In the idealistic setting, the corresponding silo/head model is picked based on the source of each test example. In the realistic setting, these samples are indistinguishable.
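One possible reading of the best effort metric is sketched below. The exact counting rules are not spelled out in the text, so this sketch makes assumptions that are its own: a predicted node counts as correct if it equals a gold node or lies below it in the UHLS, a gold node counts as recalled if some prediction equals it or lies below it, and precision and recall are micro-averaged over all mentions. The function names and the toy hierarchy are hypothetical.

```python
def is_within(pred, gold, parent):
    """True if `pred` equals `gold` or is one of its descendants in the hierarchy."""
    node = pred
    while node is not None:
        if node == gold:
            return True
        node = parent.get(node)
    return False


def best_effort_micro_f1(examples, parent):
    """examples: list of (predicted UHLS nodes, gold UHLS nodes), one pair per mention."""
    hit_pred = n_pred = hit_gold = n_gold = 0
    for preds, golds in examples:
        n_pred += len(preds)
        n_gold += len(golds)
        # A fine prediction inside a coarse gold label is not penalized.
        hit_pred += sum(any(is_within(p, g, parent) for g in golds) for p in preds)
        hit_gold += sum(any(is_within(p, g, parent) for p in preds) for g in golds)
    precision = hit_pred / n_pred if n_pred else 0.0
    recall = hit_gold / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy hierarchy (node -> parent) and two mentions whose gold label is the coarse GPE.
parent = {"GPE": None, "City": "GPE", "Country": "GPE", "person": None}
examples = [({"City"}, {"GPE"}), ({"person"}, {"GPE"})]
print(best_effort_micro_f1(examples, parent))
```

Under these assumptions the first mention is scored as correct even though the prediction is finer than the gold annotation, while the second is an error; whether a fine prediction such as City is actually right is left to the fine-grained metric.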

Figure 4: Comparison of learning models in the idealistic and realistic scenarios using error bar plot.
Figure 5: Analysis of Fine-grained label predictions. The two columns specify results for nationality and sports event label. Each row represents a model used for prediction. The results can be interpreted as, out of 351 entity mentions with type nationality, model Silo (CoNLL) predicted 338 as MISC type and the remaining as other types illustrated.

In the second metric, we measure how good the fine-grained predictions are on examples where the gold dataset has only coarse labels. An example is a model predicting city on a dataset where all cities and countries are clubbed together under the label GPE. We re-annotate a representative sample of a coarse-grained dataset and evaluate the model’s performance on this sample.


The results for both the idealistic and realistic measures are available in Figure 4.

Idealistic measure

In this measure, we can observe that when the information about the test data characteristics is known, the multi-head model outperforms our proposed approach and the silo models. The primary reason could be that the model learns better shared representations using the multi-task framework and also has an independent head for each dataset to learn dataset-specific idiosyncrasies. In our proposed model, the overall complexity, including the label search space, increases compared to both the multi-head and silo models. Despite this increased complexity, we can observe that its performance is competitive.

Realistic measure

In this measure, we can observe that when the information about data characteristics is not available, there is a significant performance drop in the silo and the multi-head models. The primary reason is that the silo/head models are trained to optimize on a narrowly focused single dataset. Ideally, for a given test example, only the model trained on the dataset with the same characteristics would show a peaked probability distribution, and for out-of-domain mentions a model's distribution would be diffuse. However, in practice, this does not happen, as demonstrated by our experimental setup. Our proposed model has the same performance as in the idealistic setting because it does not require any dataset characteristics information during testing. This experiment highlights the first issue related to type assignment. In cases where several models trained on different datasets are available, it is difficult to assign a single type. Our proposed framework overcomes this difficulty by treating all the available datasets as a single multi-domain dataset with the labels in the UHLS.

Fine-grained predictions

For this analysis, we re-annotate the examples of type MISC from the CoNLL test set into nationality (support of 351), sports event (support of 117) and others (support of 234). We analyzed the predictions of different models for the nationality and sports event labels. These two labels are contributed by different datasets, BBN and Wiki respectively. The results are available in Figure 5.

Figure 6: Example output of our proposed approach. Sentence 1, 2, 3 are from CoNLL, BBN and BC5CDR dataset respectively.

We can observe that the Silo (CoNLL) and MH (CoNLL) models, which are trained on the dataset with the same characteristics, can accurately predict the entity mentions as MISC, which is not a fine-grained prediction. The Silo (BBN) and MH (BBN) models can predict the nationality type accurately; however, they can only predict the coarse type events other for entity mentions of type sports event. The type events other also includes elections. The Silo (Wiki) and MH (Wiki) models can predict the type sports event to some extent but perform poorly on the type nationality. The nationality mentions are mostly predicted as location. In all of the above models, we used some information about the target domain and type when selecting a learning model for prediction. In the realistic measure, we can observe that the Silo (HCL) model can make completely out-of-scope errors, such as predicting the adverse drug reaction type for sports event mentions. Our proposed approach (UHLS) can predict fine types for both categories even though these labels were contributed by different datasets.

Our results convey that by pooling diverse datasets, a limitation of one dataset can be covered by some other dataset, and a model trained on an amalgam of diverse datasets generalizes better in the real-world setting.

Example output on different datasets: Figure 6 shows the labels assigned by the proposed approach on sentences from the CoNLL, BBN and BC5CDR datasets. We can observe that, even though the BBN dataset is fine-grained, it has labels complementary to those of the Wiki dataset. For example, for the entity mention Magellan, the label spacecraft was assigned. The proposed approach can aggregate fine labels across datasets and make unified predictions. Additionally, even in sentences from biomedical abstracts, the proposed approach assigns fine types that came from a dataset in the social media domain. Also, for labels such as MISC, even without adding this label to the UHLS, a suitable fine type is assigned through the one-to-many mapping.

Related Work

In this section, we describe related work in which models are trained on multiple diverse datasets together. To the best of our knowledge, the closest work to ours is that of Redmon and Farhadi (2017) on the visual object recognition task. They consider two datasets, the first coarse-grained and the second fine-grained. The label set of the first dataset is assumed to be subsumed by the label set of the second dataset. Thus, coarse-grained labels can be mapped to the labels of the fine-grained dataset in a one-to-one mapping. Additionally, they did not propagate the coarse labels to the finer labels. As demonstrated by our experiments, when several real-world datasets are merged, a one-to-one mapping is not possible. In our work, we provide a principled approach in which multiple datasets can contribute to fine-grained labels. In our framework, a partial loss function enables fine-label propagation on datasets with coarse labels.

In the area of cross-lingual syntactic parsing, there is a notion of a universal POS tagset [Petrov, Das, and McDonald 2012]. This tagset is a collection of coarse tags that exist in similar form across languages. Utilizing this tagset and a mapping from language-specific fine tags, it becomes possible to train a single model in a cross-lingual setting. In this case, the mapping is many-to-one, i.e., from a fine category to a coarse category; thus the models are limited to predicting a coarse-grained label.
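The direction of the mapping can be illustrated with a small sketch. The dictionary below is only an illustrative excerpt of a Penn Treebank to universal tagset mapping, and the helper names `to_coarse` and `invert` are hypothetical. It shows that a many-to-one mapping discards fine distinctions, whereas reading the same table in the opposite direction yields a one-to-many mapping from a coarse tag to several fine candidates.

```python
# Illustrative excerpt of a many-to-one mapping from Penn Treebank tags
# to universal POS tags (not the complete mapping).
FINE_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ",
}


def to_coarse(fine_tags):
    """Map each fine tag to its single coarse tag (many-to-one)."""
    return [FINE_TO_COARSE[tag] for tag in fine_tags]


def invert(mapping):
    """Invert a many-to-one mapping into a one-to-many mapping."""
    inverse = {}
    for fine, coarse in mapping.items():
        inverse.setdefault(coarse, set()).add(fine)
    return inverse


# Each coarse tag now points to a set of candidate fine tags.
COARSE_TO_FINE = invert(FINE_TO_COARSE)

if __name__ == "__main__":
    print(to_coarse(["NNP", "VBD", "JJ", "NNS"]))
    print(sorted(COARSE_TO_FINE["NOUN"]))
```

A model trained only on the outputs of `to_coarse` never sees the distinction between, say, NN and NNP, which is why it is limited to coarse predictions.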

Conclusion and Future Work

The key idea of our paper is that by using a UHLS, one-to-many label mappings, and a partial loss function in conjunction, we can train a single classifier on several diverse datasets together. Currently, the UHLS creation process involves human judgment, which is a one-time effort per new dataset added to the UHLS. This effort is significantly less than that of annotating a multi-domain dataset with fine-grained annotations. Our proposed framework makes it possible to predict fine-grained labels across all the diverse ET datasets.
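To make the interplay of the one-to-many mapping and the partial loss concrete, the following is a minimal sketch and not necessarily the exact loss used in our experiments. It assumes a flat softmax over UHLS labels and a common partial-label formulation that minimizes the negative log of the total probability mass on the candidate set. The label names, the mapping entries, and the function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical flat view of a few UHLS labels.
UHLS_LABELS = ["nationality", "sports_event", "spacecraft", "zoo"]

# Hypothetical one-to-many mapping from (dataset, source label)
# to the set of UHLS labels that the source label may stand for.
LABEL_MAP = {
    ("CoNLL", "MISC"): {"nationality", "sports_event", "spacecraft"},
    ("BBN", "nationality"): {"nationality"},
}


def candidates_for(dataset, source_label):
    """Return the UHLS candidate set for a source annotation."""
    return LABEL_MAP[(dataset, source_label)]


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def partial_label_loss(scores, candidates):
    """Negative log of the probability mass assigned to the candidate set.

    With a single candidate this reduces to ordinary cross-entropy; with
    several candidates the model is free to pick any one of them.
    """
    probs = softmax(scores)
    mass = sum(p for label, p in zip(UHLS_LABELS, probs) if label in candidates)
    return -math.log(mass)


if __name__ == "__main__":
    scores = [2.0, 0.5, -1.0, 0.0]  # made-up classifier scores
    print(partial_label_loss(scores, candidates_for("BBN", "nationality")))
    print(partial_label_loss(scores, candidates_for("CoNLL", "MISC")))
```

Under this formulation, a mention annotated with a coarse source label is not penalized for any fine label inside its candidate set, so fine-grained supervision from one dataset can shape predictions on mentions that another dataset labels only coarsely.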

Our analysis yields the following observations. First, treating each dataset as the whole world of ET is not suitable for real-world purposes. Second, in the real world, any combination of domain and label set is possible. Third, in cases where the source label is very coarse-grained, a one-to-many mapping helps to assign finer labels. Fourth, even though BBN and Wiki are both large fine-grained datasets, they provide complementary information. Fifth, our proposed model enables fine-grained predictions across all datasets by capitalizing on the knowledge learned from labels coming from diverse datasets.

There are several possible extensions of this work. First, combining several diverse datasets introduces a problem of class imbalance. Currently, we use a naive approach to address this; a more sophisticated approach could further improve performance. Second, there is scope for exploring methods that can create the UHLS automatically without a human in the loop. Third, the proposed framework can be extended to several other classification tasks.


We thank Riddhiman Dasgupta and Srikanth Tamilselvam for helpful discussions about model training and hierarchy construction.


  1. Throughout the paper, entity mentions are written in typewriter font and entity types in italics.
  2. This is exempted when the annotated label is a coarse label and a fine label from the same dataset exists in the subtree.
  3. Here, since the “task” is the same, i.e., entity typing, we use the term multi-head instead of multi-task for the baseline.
  4. Here our assumption is that the collection of all the domains and label sets present in the available datasets is representative of the ET world. Test data can have any combination of domain and label set from this world; however, the exact domain and label set are not known.
  5. The exception is where the source dataset also has fine-grained labels.


  1. Abhishek, A.; Anand, A.; and Awekar, A. 2017. Fine-grained entity type classification by jointly learning representations and label embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, 797–807.
  2. Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 1247–1250. ACM.
  3. Caruana, R. 1997. Multitask learning. Machine learning 28(1):41–75.
  4. Cour, T.; Sapp, B.; and Taskar, B. 2011. Learning from partial labels. Journal of Machine Learning Research 12(May):1501–1536.
  5. Craven, M., and Kumlien, J. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 77–86. AAAI Press.
  6. Dalton, J.; Dietz, L.; and Allan, J. 2014. Entity query feature expansion using knowledge base links. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 365–374. ACM.
  7. Graves, A.; Mohamed, A.-r.; and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6645–6649. IEEE.
  8. Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  9. Karimi, S.; Metke-Jimenez, A.; Kemp, M.; and Wang, C. 2015. CADEC: A corpus of adverse drug event annotations. Journal of biomedical informatics 55:73–81.
  10. Kim, J.-D.; Ohta, T.; Tsuruoka, Y.; Tateisi, Y.; and Collier, N. 2004. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, 70–75. Association for Computational Linguistics.
  11. Li, J.; Sun, Y.; Johnson, R. J.; Sciaky, D.; Wei, C.-H.; Leaman, R.; Davis, A. P.; Mattingly, C. J.; Wiegers, T. C.; and Lu, Z. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016.
  12. Ling, X., and Weld, D. S. 2012. Fine-grained entity recognition. In AAAI, volume 12, 94–100.
  13. Liu, H., and Singh, P. 2004. Conceptnet—a practical commonsense reasoning tool-kit. BT technology journal 22(4):211–226.
  14. Miller, G. A. 1995. WordNet: a lexical database for English. Communications of the ACM 38(11):39–41.
  15. Nguyen, N., and Caruana, R. 2008. Classification with partial labels. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 551–559. ACM.
  16. Petrov, S.; Das, D.; and McDonald, R. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012).
  17. Redmon, J., and Farhadi, A. 2017. YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–6525. IEEE.
  18. Roth, B.; Monath, N.; Belanger, D.; Strubell, E.; Verga, P.; and McCallum, A. 2015. Building knowledge bases with universal schema: Cold start and slot-filling approaches. In Proceedings of the Eighth Text Analysis Conference (TAC2015).
  19. Shimaoka, S.; Stenetorp, P.; Inui, K.; and Riedel, S. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, 1271–1280.
  20. Tjong Kim Sang, E. F., and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, 142–147. Association for Computational Linguistics.
  21. Vrandečić, D., and Krötzsch, M. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10):78–85.
  22. Weischedel, R., and Brunstein, A. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia 112.
  23. Weischedel, R.; Palmer, M.; Marcus, M.; Hovy, E.; Pradhan, S.; Ramshaw, L.; Xue, N.; Taylor, A.; Kaufman, J.; Franchini, M.; et al. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.
  24. Xu, P., and Barbosa, D. 2018. Neural fine-grained entity type classification with hierarchy-aware loss. arXiv preprint arXiv:1803.03378.
  25. Yaghoobzadeh, Y.; Adel, H.; and Schütze, H. 2016. Noise mitigation for neural entity typing and relation extraction. arXiv preprint arXiv:1612.07495.
  26. Yahya, M.; Berberich, K.; Elbassuoni, S.; and Weikum, G. 2013. Robust question answering over the web of linked data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, 1107–1116. ACM.
  27. Zhang, M.-L.; Yu, F.; and Tang, C.-Z. 2017. Disambiguation-free partial label learning. IEEE Transactions on Knowledge and Data Engineering 29(10):2155–2167.
  28. Zhou, Z.-H. 2012. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC.