Fine-Grained Entity Type Classification by Jointly Learning Representations and Label Embeddings
Fine-grained entity type classification (FETC) is the task of classifying an entity mention into a broad set of types. The distant supervision paradigm is extensively used to generate training data for this task. However, the generated training data assigns the same set of labels to every mention of an entity without considering its local context. Existing FETC systems have two major drawbacks: they assume the training data to be noise free, and they rely on hand-crafted features. Our work overcomes both drawbacks. We propose a neural network model that jointly learns representations of entity mentions and their context, eliminating the use of hand-crafted features. Our model treats training data as noisy and uses a non-parametric variant of the hinge loss function. Experiments show that the proposed model outperforms previous state-of-the-art methods on two publicly available datasets, namely Figer(gold) and BBN, with an average relative improvement of 2.69% in micro-F1 score. Knowledge learnt by our model on one dataset can be transferred to other datasets while using the same model or other FETC systems. These approaches to transferring knowledge further improve the performance of the respective models.
Entity type classification is the task of assigning types or labels, such as organization and location, to entity mentions in a document. This classification is useful for many natural language processing (NLP) tasks such as relation extraction, machine translation, question answering, and knowledge base construction.
There has been a considerable amount of work on Named Entity Recognition (NER), which classifies entity mentions into a small set of mutually exclusive types, such as Person, Location, Organization and Misc. However, these types are not enough for some NLP applications such as relation extraction, knowledge base construction (KBC) and question answering. In relation extraction and KBC, knowing fine-grained types for entities can significantly increase the performance of the relation extractor, since this helps in filtering out candidate relation types that do not satisfy the type constraints. In question answering, fine-grained entity types provide additional information when matching questions to their potential answers, which significantly improves performance. For example, Li and Roth rank questions based on their expected answer types (will the answer be food, vehicle or disease).
Typically, FETC systems use over a hundred labels, arranged in a hierarchical structure. An important aspect of FETC is that, based on local context, two different mentions of the same entity can have different labels. We illustrate this through an example in Figure ?. All three sentences S1, S2, and S3 mention the same entity, Barack Obama. However, looking at the context, we can infer that S1 mentions Obama as a person/author, S2 mentions Obama only as a person, and S3 mentions Obama as a person/politician.
Available training data for FETC has noisy labels. Creating manually annotated training data for FETC is time consuming, expensive, and error prone: a human annotator has to assign a subset of correct labels, from a set of around a hundred labels, to each entity mention in the corpus. Existing FETC systems therefore use the distant supervision paradigm to generate training data automatically. Distant supervision maps each entity in the corpus to knowledge bases such as Freebase, DBpedia, and YAGO. This method assigns the same set of labels to all mentions of an entity across the corpus. For example, Barack Obama is a person, politician, lawyer, and author. If a knowledge base has these four matching labels for Barack Obama, then distant supervision assigns all of them to every mention of Barack Obama. Training data generated with distant supervision will therefore fail to distinguish between the mentions of Barack Obama in sentences S1, S2, and S3.
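The labelling behaviour described above can be illustrated with a minimal sketch. The knowledge-base dictionary, the function name `distant_labels`, and the three sentences are all invented for illustration (they stand in for S1, S2 and S3, whose figure is not reproduced here); this is not the pipeline used to build the datasets.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical knowledge-base entry, using the four types named in the text.
KB_TYPES: Dict[str, Set[str]] = {
    "Barack Obama": {"person", "politician", "lawyer", "author"},
}


def distant_labels(mentions: List[Tuple[str, str]]) -> List[Tuple[str, Set[str]]]:
    """Give each mention every KB type of its linked entity, ignoring context."""
    return [(sentence, set(KB_TYPES.get(entity, ()))) for sentence, entity in mentions]


# Invented sentences: (sentence text, linked entity).
corpus = [
    ("Obama's first book was published in 1995.", "Barack Obama"),
    ("Obama was born in Hawaii.", "Barack Obama"),
    ("Obama signed the bill into law.", "Barack Obama"),
]

labelled = distant_labels(corpus)
for sentence, labels in labelled:
    print(sorted(labels), "<-", sentence)
```

Every sentence receives the same four labels, although only some of them are supported by each local context. This is the label noise the rest of the paper tries to handle.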
Existing FETC systems have one or both of the following drawbacks: (1) they assume the training data to be noise free; (2) they rely on hand-crafted features.
We have observed that, for real-world datasets, more than twenty-five percent of the training data has noisy labels. The first drawback propagates this noise in the training data to the FETC model. Hand-crafted features are extracted using various NLP tools. Since errors inevitably exist in such tools, the second drawback propagates the errors of these tools to the FETC model.
We propose a neural network based model to overcome the two drawbacks of existing FETC systems. First, we separate the training data into clean and noisy partitions using the same method as in the AFET system. For these partitions, we use a simple yet effective non-parametric variant of the hinge loss function during training. To avoid the use of hand-crafted features, we learn representations for a given entity mention and its context.
Additionally, we investigate the effectiveness of transfer learning for the FETC task at both the feature and the model level. We show that feature level transfer learning can be used to improve the performance of other FETC systems, such as AFET, by up to 4.5% in micro-F1 score. Similarly, model level transfer learning can be used to improve the performance of the same model on a different dataset by up to 3.8% in micro-F1 score.
Our contributions can be summarized as follows:
We propose a simple neural network model that learns representations for an entity mention and its context, and incorporates noisy label information using a non-parametric variant of the hinge loss function. Experimental results on two publicly available datasets demonstrate the effectiveness of the proposed model, with an average relative improvement of 2.69% in micro-F1 score.
We investigate the use of feature level and model level transfer learning strategies for the FETC task. The proposed transfer learning strategies further improve the state of the art on the BBN dataset by 3.8% in micro-F1 score.
Ling et al. proposed the first system for the FETC task, which used 112 overlapping labels. They used a linear perceptron classifier for multi-label classification. Yosef et al. used multiple binary SVM classifiers arranged in a hierarchy to classify an entity mention into a set of 505 types. While the initial work assumed that all labels present in a training dataset for an entity mention are correct, Gillick et al. introduced context-dependent FETC and proposed a set of heuristics for pruning labels that might not be relevant given the entity mention’s local context. Yogatama et al. proposed an embedding based model in which user-defined features and labels were embedded into a low-dimensional feature space to facilitate information sharing among labels.
Shimaoka et al. proposed an attentive neural network model that used LSTMs to encode an entity mention’s context and an attention mechanism to let the model focus on relevant expressions in that context. However, the model assumed that all labels obtained via distant supervision are correct. In contrast, our model does not assume that all labels are correct. To learn the entity representation, we propose a scheme that is simpler yet more effective.
Most recently, Ren et al. proposed AFET, an FETC system that separates the loss function for clean and noisy entity mentions. AFET uses label-label correlation information obtained from the given data in its parametric loss function, which introduces a model parameter. During inference, AFET uses a similarity threshold parameter to separate positive types from negative types. However, AFET’s loss function is sensitive to changes in these parameters, which are data dependent. Figure ? shows the effect of these two parameters on AFET performance evaluated on different datasets. In contrast, our model uses a simple yet effective variant of the hinge loss function, which does not require tuning a similarity threshold.
Transfer learning has been applied successfully to many NLP applications, such as cross-domain document classification, multi-lingual word clustering, and sentiment classification. Initializing word vectors with pre-trained word vectors in neural network models can be considered one of the best examples of transfer learning in NLP. Wang et al. provide a broad overview of transfer learning techniques used for language processing.
3 The Proposed Model
Our task is to automatically classify type information of entity mentions present in natural language sentences. Figure ? shows a general overview of our proposed approach.
Input: The input to the model is a training and a testing corpus, each consisting of a set of sentences in which entity mentions have been identified. In the training corpus, every entity mention has corresponding labels according to a given hierarchy. Formally, a training corpus $D_{train}$ consists of a set of sentences $S = \{s^i\}_{i=1}^{N}$. Each sentence $s^i$ has one or more entity mentions denoted by $m^i_{j,k}$, where $j$ and $k$ denote the indices of the start and end tokens, respectively. The set $M$ consists of all the entity mentions $m^i_{j,k}$. For every entity mention $m^i_{j,k}$, there is a corresponding label vector $l^i_{j,k} \in \{0,1\}^K$, a binary vector whose $t$-th component is one if the $t$-th type is true and zero otherwise. $K$ denotes the total number of labels in a given hierarchy $\Psi$. The testing corpus $D_{test}$ contains only sentences and entity mentions.
Output: For entity mentions in the testing corpus $D_{test}$, predict their corresponding labels.
3.2 Training set partition
Similar to AFET, we partition the mention set $M$ of the training corpus $D_{train}$ into two parts: a set $M_c$ consisting only of clean entity mentions, and a set $M_n$ consisting only of noisy entity mentions. An entity mention is said to be clean if its labels belong to only a single path (not necessarily ending at a leaf) in the hierarchy $\Psi$, that is, if its labels are not ambiguous; otherwise, it is noisy. For example, as per the hierarchy given in figure ?, an entity mention with labels person, artist and politician is considered noisy, whereas an entity mention with labels person, artist and actor is considered clean.
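The clean/noisy test can be sketched as a check that all labels of a mention lie on one root-to-node path. The hierarchy fragment below is an assumption inferred from the example in the text (artist and politician under person, actor under artist), and the names `PARENT`, `path_to_root` and `is_clean` are hypothetical. The sketch also assumes that a label set only needs to be a subset of a single path.

```python
from typing import Dict, List, Optional, Set

# Assumed hierarchy fragment: child -> parent (None marks a top-level type).
PARENT: Dict[str, Optional[str]] = {
    "person": None,
    "artist": "person",
    "actor": "artist",
    "politician": "person",
}


def path_to_root(label: str) -> List[str]:
    """Return the label followed by all of its ancestors."""
    path: List[str] = []
    node: Optional[str] = label
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def is_clean(labels: Set[str]) -> bool:
    """True if every label lies on the path from the deepest label to the root."""
    if not labels:
        raise ValueError("a mention needs at least one label")
    deepest = max(labels, key=lambda label: len(path_to_root(label)))
    return labels <= set(path_to_root(deepest))


print(is_clean({"person", "artist", "actor"}))       # single path
print(is_clean({"person", "artist", "politician"}))  # two branches under person
```

Mentions for which `is_clean` holds would go to the clean partition and all others to the noisy partition.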
Mention representation: This representation captures information about the entity mention’s morphology and orthography. We decompose an entity mention into a character sequence and use a vanilla LSTM encoder to encode the character sequence into a fixed-dimensional vector. Formally, for an entity mention $m^i_{j,k}$ (spanning tokens $j$ to $k$ of sentence $s^i$), we decompose it into a sequence of character tokens $c_1, c_2, \ldots, c_{|m|}$, where $|m|$ denotes the total number of characters present in the entity mention. For an entity mention containing multiple tokens, we join these tokens with a space in between. Every character has a corresponding vector representation in a lookup table for characters. The character sequence is then fed, one character at a time, to an LSTM encoder, and the final output is used as the feature representation of the entity mention. We denote this process by a function $F_m$ that maps an entity mention to a vector in $\mathbb{R}^{D_m}$, where $D_m$ is the number of dimensions of the mention representation. The whole process is illustrated in figure ? (Mention representation).
Context representation: This representation captures information about the context surrounding the entity mention. The context representation is further divided into two parts, the left and the right context representation. The left context consists of the sequence of tokens from the start of the sentence up to the last token of the entity mention. The right context consists of the sequence of tokens from the start of the entity mention to the end of the sentence. We use bi-directional LSTM encoders to encode the token-level sequences of both contexts into fixed-dimensional vectors. Formally, for an entity mention $m^i_{j,k}$ present in a sentence $s^i$, we decompose $s^i$ into a sequence of tokens $s^i_1, s^i_2, \ldots, s^i_k$ for the left context and $s^i_j, s^i_{j+1}, \ldots, s^i_{|s^i|}$ for the right context, where $|s^i|$ denotes the number of tokens in the sentence. Every token has a corresponding vector representation in a lookup table for tokens. The token sequence is then fed, one token at a time, to a bi-directional LSTM encoder, and the final output is used as the feature representation. We denote this whole process by a function $F_{lc}$ for computing the left context representation in $\mathbb{R}^{D_{lc}}$ and a function $F_{rc}$ for computing the right context representation in $\mathbb{R}^{D_{rc}}$, where $D_{lc}$ and $D_{rc}$ are the number of dimensions of the left and the right context representation, respectively. The whole process is illustrated in figure ? (Left and right context representation).
The context representation described above differs slightly from previously proposed context representations: here we include the entity mention tokens within both the left and the right context, to explicitly encode context relative to an entity mention.
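As a rough illustration of the inputs to the three encoders, the sketch below builds the character sequence of a mention and the overlapping left and right contexts from a tokenized sentence. The function name `split_inputs`, the inclusive end index, and the example sentence are assumptions made for this sketch; the LSTM encoders themselves are not shown.

```python
from typing import List, Tuple


def split_inputs(tokens: List[str], start: int, end: int) -> Tuple[List[str], List[str], List[str]]:
    """Return (mention characters, left context, right context).

    `start` and `end` are the inclusive token indices of the mention.
    Both contexts include the mention tokens themselves.
    """
    if not 0 <= start <= end < len(tokens):
        raise ValueError("mention indices out of range")
    mention_chars = list(" ".join(tokens[start:end + 1]))  # tokens joined by a space
    left_context = tokens[:end + 1]   # sentence start .. last mention token
    right_context = tokens[start:]    # first mention token .. sentence end
    return mention_chars, left_context, right_context


sentence = ["In", "2009", ",", "Barack", "Obama", "visited", "Oslo", "."]
chars, left, right = split_inputs(sentence, start=3, end=4)
print(chars)
print(left)
print(right)
```

Because the mention tokens appear in both contexts, each context encoder sees where the mention sits relative to the rest of the sentence.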
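In the end, we concatenate the entity mention representation and its context representations into a single $D_f$-dimensional vector, where $D_f = D_m + D_{lc} + D_{rc}$ is the sum of the dimensions of the mention, left context and right context representations. This complete process is denoted by a function $F$ given by:
$$F(m^i_{j,k}, s^i) = F_m(m^i_{j,k}) \oplus F_{lc}(m^i_{j,k}, s^i) \oplus F_{rc}(m^i_{j,k}, s^i) \qquad (1)$$
where $\oplus$ denotes vector concatenation. For brevity, we will now omit the subscripts from $m^i_{j,k}$ and $l^i_{j,k}$, and will use $f_i$ to denote the feature representation of an entity mention and its context obtained via equation 1.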
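3.4 Feature and label embeddings
Similar to Yogatama et al. and Ren et al., we embed feature representations and labels in the same space such that an object is embedded closer to objects that share similar types than to objects that do not. Formally, we learn linear mapping functions $\phi_M : \mathbb{R}^{D_f} \rightarrow \mathbb{R}^{D_e}$ and $\phi_L : \mathbb{R}^{K} \rightarrow \mathbb{R}^{D_e}$, where $D_f$ is the size of the feature representation, $K$ is the number of labels, and $D_e$ is the size of the embedding space. These mappings are given by:
$$\phi_M(f_i) = f_i^{\top} U, \qquad \phi_L(l_t) = l_t^{\top} V \qquad (2)$$
where $U \in \mathbb{R}^{D_f \times D_e}$ and $V \in \mathbb{R}^{K \times D_e}$ are projection matrices for feature representations and type labels, respectively, and $l_t$ is the one-hot vector representation of label $t$.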
We assign a score to each pair of label type and feature vector as the dot product of their embeddings. Formally, we denote the score as:
$$s(f_i, l_t) = \phi_M(f_i) \cdot \phi_L(l_t) \qquad (3)$$
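The scoring step can be sketched in plain Python: a feature vector is projected by one matrix, a one-hot label vector selects a row of a second matrix, and the score is their dot product. The matrix names `U` and `V`, the toy sizes, and all numeric values are assumptions chosen for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(vec: Vector, mat: Matrix) -> Vector:
    """Row vector (d) times matrix (d x e), giving a vector of size e."""
    return [sum(v * row[col] for v, row in zip(vec, mat)) for col in range(len(mat[0]))]


def score(feature: Vector, label_index: int, U: Matrix, V: Matrix) -> float:
    """Dot product of the embedded feature vector and the embedded label."""
    feature_emb = project(feature, U)
    label_emb = V[label_index]  # a one-hot vector times V selects one row of V
    return sum(a * b for a, b in zip(feature_emb, label_emb))


# Toy setup: 2-dimensional features, 2-dimensional embedding space, 2 labels.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 1.0], [-1.0, 0.0]]
feature = [1.0, 2.0]

print(score(feature, 0, U, V))
print(score(feature, 1, U, V))
```

In a trained model both matrices would be learned jointly with the encoders; here they are fixed only to make the arithmetic easy to follow.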
We use two different loss functions to model clean and noisy entity mentions. For clean entity mentions, we use a hinge loss function. The intuition is simple: maintain a margin, centered at zero, between positive and negative type scores. The scores are computed as the similarity between an entity mention and the label types (eq. 3). The hinge loss function has two advantages. First, it intuitively separates positive and negative labels during inference. Second, it is independent of any data-dependent parameter. Formally, for a given entity mention with feature representation $f_i$ and label vector $l^i$, we compute the associated loss as:
$$L_c(f_i, l^i) = \sum_{t \in \gamma} \max(0, 1 - s(f_i, l_t)) + \sum_{t \in \bar{\gamma}} \max(0, 1 + s(f_i, l_t)) \qquad (4)$$
where $s(f_i, l_t)$ is the score of eq. 3, and $\gamma$ and $\bar{\gamma}$ are the sets of indices that have positive and negative labels, respectively.
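A minimal sketch of the clean-mention hinge loss follows. It assumes the form suggested by the surrounding text: every positive label is pushed to a score of at least one, and every negative label to a score of at most minus one. The function name `clean_loss` and the example scores are invented.

```python
from typing import List, Set


def clean_loss(scores: List[float], positive: Set[int]) -> float:
    """Hinge loss over all labels of a clean mention.

    scores[t] is the score of label t; `positive` holds the indices of the
    labels assigned to the mention, and all other indices are negative.
    """
    loss = 0.0
    for t, s in enumerate(scores):
        if t in positive:
            loss += max(0.0, 1.0 - s)  # positive labels should score >= 1
        else:
            loss += max(0.0, 1.0 + s)  # negative labels should score <= -1
    return loss


scores = [2.0, 0.5, -3.0, -0.2]
loss = clean_loss(scores, positive={0, 1})
print(round(loss, 6))
```

Labels 0 and 2 already satisfy their margins and contribute nothing. Label 1 (positive, score 0.5) contributes 0.5 and label 3 (negative, score -0.2) contributes 0.8.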
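For noisy entity mentions, we propose a variant of the hinge loss in which, as in the clean loss $L_c$, the score for all negative labels should go below $-1$. However, for positive labels, as we do not know which labels are relevant to the entity mention’s local context, we propose that the maximum score from the set of given positive labels should be greater than one. This maintains a margin between all negative types and the most relevant positive type. Formally, the noisy label loss $L_n$ is defined as:
$$L_n(f_i, l^i) = \max(0, 1 - s(f_i, l_{t^*})) + \sum_{t \in \bar{\gamma}} \max(0, 1 + s(f_i, l_t)), \qquad t^* = \arg\max_{t \in \gamma} s(f_i, l_t) \qquad (5)$$
where $\gamma$ and $\bar{\gamma}$ are the sets of indices of positive and negative labels, respectively. Again, using this loss function makes it intuitive to set a threshold of zero during inference.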
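The noisy-mention variant can be sketched in the same style. Only the best-scoring positive label is required to exceed one, while every negative label is still pushed below minus one. The function name `noisy_loss` and the example scores are hypothetical, and the sketch assumes each noisy mention has at least one positive label.

```python
from typing import List, Set


def noisy_loss(scores: List[float], positive: Set[int]) -> float:
    """Hinge-loss variant for a mention whose positive labels may be noisy."""
    if not positive:
        raise ValueError("a mention needs at least one positive label")
    best_positive = max(scores[t] for t in positive)
    loss = max(0.0, 1.0 - best_positive)  # only the most relevant positive label
    loss += sum(max(0.0, 1.0 + s) for t, s in enumerate(scores) if t not in positive)
    return loss


# Label 1 is marked positive by distant supervision but scores low.
scores = [2.0, -0.5, -3.0, -0.2]
loss = noisy_loss(scores, positive={0, 1})
print(round(loss, 6))
```

Here the low-scoring positive label 1 is not penalised, because label 0 already clears the margin. Only the negative label 3 (score -0.2) contributes, so the model is free to treat label 1 as irrelevant to this context.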
These loss functions differ from those used in previous work in that we impose a strict absolute criterion to distinguish between positive and negative labels, whereas previous work only requires positive labels to have a higher score than negative labels. As that scoring is relative, the final result varies with the threshold used to separate positive and negative labels.
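To train on the clean and noisy partitions together, we formulate the joint objective as:
$$\min_{\theta} O = \sum_{m_i \in M_c} L_c(f_i, l^i) + \sum_{m_i \in M_n} L_n(f_i, l^i) \qquad (6)$$
where $M_c$ and $M_n$ are the sets of clean and noisy entity mentions, $f_i$ and $l^i$ are the feature representation and label vector of mention $m_i$, and $\theta$ is the collection of all model parameters that need to be learned. To jointly optimize the objective $O$, we use Adam, a stochastic gradient-based optimization algorithm.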
For every entity mention in the set $M$ from $D_{test}$, we perform a top-down search in the given type hierarchy $\Psi$ and estimate the correct type path $\Psi^*$. Starting from the tree root, we recursively select the best type among the current node’s children by computing each child’s score with the obtained feature representation and choosing the child with the maximum score. We continue this process until a leaf node is encountered or the score associated with the selected node falls below an absolute threshold of zero. The threshold is fixed across all datasets used.
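The top-down search can be sketched as a greedy walk over the hierarchy. The hierarchy fragment, the label scores, and the names `CHILDREN` and `infer_path` are assumptions for illustration; in the model the scores would come from the learned feature and label embeddings.

```python
from typing import Dict, List

# Assumed hierarchy fragment: node -> children; "ROOT" is a virtual root.
CHILDREN: Dict[str, List[str]] = {
    "ROOT": ["person", "location"],
    "person": ["artist", "politician"],
    "artist": ["actor"],
}


def infer_path(scores: Dict[str, float], root: str = "ROOT", threshold: float = 0.0) -> List[str]:
    """Greedy top-down search: follow the best-scoring child at each level."""
    path: List[str] = []
    node = root
    while CHILDREN.get(node):               # stop at a leaf
        best = max(CHILDREN[node], key=lambda child: scores[child])
        if scores[best] < threshold:        # stop when the best score is below zero
            break
        path.append(best)
        node = best
    return path


scores = {"person": 1.4, "location": -2.0, "artist": -0.3, "politician": 0.9, "actor": 0.1}
print(infer_path(scores))

# With a negative politician score, no child of person clears the threshold.
weaker = dict(scores, politician=-0.5)
print(infer_path(weaker))
```

Note that actor is never reached in the first example even though its score is positive, because the search only descends through the best child at each level.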
We want to investigate whether the feature representations learnt for an entity mention are useful beyond the proposed model. We study what contribution these feature representations make to an existing feature engineering based method such as AFET. We train the proposed model on one training dataset, namely the Wiki dataset, which has the highest number of entity mentions among the datasets, and use this model to generate feature representations for the training and testing data of another dataset. These fixed-dimensional representation vectors are used as features for an existing state-of-the-art model, AFET, in place of the hand-crafted features that were originally used. The AFET model is then trained using these feature representations. We call this feature level transfer learning. We also evaluate model level transfer learning, where we initialize the weights of the LSTM encoders for a new dataset with the weights learnt by the model trained on another dataset, namely the Wiki dataset.
We evaluate the proposed model on three publicly available datasets, provided in a pre-processed tokenized format by Ren et al. Statistics of the datasets used in this work are shown in Table ?. The details of the datasets are as follows:
Wiki/Figer(gold): The training data consists of Wikipedia sentences and was automatically generated in the distant supervision paradigm, by mapping hyperlinks in Wikipedia articles to Freebase. The test data, mainly consisting of sentences from news reports, was manually annotated.
OntoNotes: The OntoNotes dataset consists of sentences from newswire documents present in the OntoNotes text corpus. DBpedia Spotlight was used to automatically link entity mentions in sentences to Freebase. For this corpus, manually annotated test data was shared by Gillick et al.
BBN: The BBN dataset consists of sentences from Wall Street Journal articles and is completely manually annotated.
Please refer to  for more details of the datasets.
We compared the proposed model with state-of-the-art entity classification methods: FIGER, HYENA, AFET and its variants AFET-NoCo and AFET-CoH, and Attentive.
We compare these baselines with variants of our proposed model: (1) our: the complete model; (2) our-AllC: assuming all mentions are clean; (3) our-NoM: without the mention representation.
We use Accuracy (or Strict-F1 score), Macro-averaged F1 score, and Micro-averaged F1 score as evaluation metrics. Existing methods for FETC use the same measures. We removed entity mentions that do not have any label, in the training as well as the test set. We also removed entity mentions that have spurious indices (i.e., entity mention length of ).
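For readers unfamiliar with these measures, the sketch below computes them under the definitions commonly used in the fine-grained typing literature: strict counts exact matches of the predicted and gold label sets, macro averages per-mention precision and recall, and micro pools label counts over all mentions. These definitions are assumed here rather than quoted from this paper, and the function names and toy data are invented.

```python
from typing import Dict, List, Set


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def evaluate(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Strict accuracy, macro F1 and micro F1 over aligned lists of label sets."""
    n = len(gold)
    pairs = list(zip(gold, pred))
    strict = sum(g == p for g, p in pairs) / n
    # Macro: average per-mention precision/recall (empty sets contribute 0).
    macro_p = sum(len(g & p) / len(p) for g, p in pairs if p) / n
    macro_r = sum(len(g & p) / len(g) for g, p in pairs if g) / n
    # Micro: pool the counts over all mentions.
    overlap = sum(len(g & p) for g, p in pairs)
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    micro_p = overlap / n_pred if n_pred else 0.0
    micro_r = overlap / n_gold if n_gold else 0.0
    return {"strict": strict, "macro_f1": f1(macro_p, macro_r), "micro_f1": f1(micro_p, micro_r)}


gold = [{"person", "politician"}, {"location"}]
pred = [{"person"}, {"location"}]
result = evaluate(gold, pred)
print({name: round(value, 4) for name, value in result.items()})
```

In this toy case the first prediction misses one gold label, so strict accuracy is 0.5 while the looser macro and micro scores still give partial credit.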
Hyperparameter setting: All the neural network based models in this paper used 300-dimensional pre-trained word embeddings distributed by Pennington et al. The hidden-layer size of the word level bi-directional LSTM was 100, and that of the character level LSTM was 200. Vectors for character embeddings were randomly initialized and were of size 200. We use dropout with a probability of 0.5 on the output of the LSTM encoders. The embedding dimension used was 500. We use Adam as the optimization method, with a learning rate in the range of 0.0005 to 0.001 and a mini-batch size in the range of 800 to 1500. The proposed model and some of the baselines were implemented using TensorFlow.
In feature level transfer learning, we use the best performing proposed model trained on the Wiki dataset to generate a fixed-dimensional feature representation vector for every entity mention present in the train, development, and test sets of the BBN and OntoNotes datasets. Figure ? illustrates an example of the encoding process. We then use these representations as feature vectors in place of the user-defined features and train the AFET model. Its hyper-parameters were tuned on the development set. These results are shown in table ? as feature level transfer learning.
In model level transfer learning, we take the learnt weights of the LSTM encoders from the best performing proposed model trained on the Wiki dataset and initialize the LSTM encoders of the same model with these weights while training on the BBN and OntoNotes datasets. These results are shown in table ? as model level transfer learning.
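Model level transfer can be pictured as copying only the encoder weights from a source model into a freshly initialized target model. The sketch below uses plain dictionaries in place of real TensorFlow variables. The parameter names, the prefixes in `ENCODER_PREFIXES`, and the decision to leave the label projection untouched (since label sets differ between datasets) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]

# Hypothetical name prefixes identifying the three LSTM encoders.
ENCODER_PREFIXES: Tuple[str, ...] = ("char_lstm/", "left_lstm/", "right_lstm/")


def init_from_source(target: Params, source: Params) -> Params:
    """Copy encoder weights from `source`; keep all other target weights."""
    initialised = {name: list(values) for name, values in target.items()}
    for name, values in source.items():
        if name.startswith(ENCODER_PREFIXES) and name in initialised:
            initialised[name] = list(values)
    return initialised


# Toy parameter sets standing in for a Wiki-trained and a fresh BBN model.
wiki_params = {"char_lstm/w": [0.3, -0.1], "left_lstm/w": [0.7], "label_proj/V": [0.5, 0.5]}
bbn_params = {"char_lstm/w": [0.0, 0.0], "left_lstm/w": [0.0], "label_proj/V": [0.0]}

bbn_init = init_from_source(bbn_params, wiki_params)
print(bbn_init)
```

Training on the target dataset would then start from `bbn_init` instead of a purely random initialization.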
4.4 Performance comparison and analysis
Table ? shows the results of the proposed method, its variants and the baseline methods.
Comparison with other feature learning methods: The proposed model and its variants (our-AllC, our-NoM) consistently perform better than the existing feature learning method by Shimaoka et al. (Attentive) on all datasets. This indicates the benefits of the proposed representation scheme and of jointly learning representations and label embeddings.
Comparison with feature engineering methods: The proposed model performs better than the existing feature engineered methods (FIGER, HYENA, AFET-NoCo, AFET-CoH) consistently across all datasets on the Micro-F1 and Macro-F1 evaluation metrics. These methods do not model label-label correlation based on data. In comparison with AFET, the proposed model outperforms AFET on the Wiki and BBN datasets in terms of the Micro-F1 evaluation metric. This indicates the benefits of feature learning as well as of data-driven label-label correlation. We do a type-wise performance comparison on the OntoNotes dataset in subsection 4.5.
Comparison with variants of our model: The proposed model performs better than our-AllC on all datasets in terms of micro-F1 score. However, we find that the performance difference on the Wiki and OntoNotes datasets is not statistically significant. We investigated this further and found that, across all three datasets, there exist only a few entity types for which more than 85% of entity mentions are noisy. These types make up approximately 3-4% of the test set, and our model fails on these types (zero micro-F1 score). However, our-AllC performs relatively well on these types. Examples of such types are: /building, /person/political_figure, /GPE/STATE_PROVINCE. This indicates two limitations of the proposed model. First, separating clean and noisy mentions based on the hierarchy has the inherent limitation of assuming that labels within a path are correct. Second, our model learns better if more clean examples are available, at the cost of not learning very noisy types. We will try to address these limitations in our future work. Compared with our-NoM, the proposed model performs slightly better across all datasets in terms of micro-F1 score.
Feature level transfer learning analysis: We observed an increase in the micro-F1 score of AFET on the BBN dataset after replacing the hand-crafted features with feature representations generated by the proposed model. This indicates the usefulness of the learnt feature representations. However, when we repeat the same process with the OntoNotes dataset, there is only a subtle change in performance. This is mainly because the data distribution of the OntoNotes dataset differs from that of the Wiki dataset. This issue is discussed in the next subsection.
Model level transfer learning analysis: In model level transfer learning, sharing knowledge from a similar dataset (Wiki to BBN) increases the performance by 3.8% in terms of micro-F1 score. However, sharing knowledge from the Wiki to the OntoNotes dataset increases the performance only slightly, by 0.4% in terms of micro-F1 score.
4.5 Case analysis: OntoNotes dataset
We observed three things: (i) all models perform relatively poorly on the OntoNotes dataset compared to their performance on the other two datasets; (ii) the proposed model outperforms other models, including AFET, on the other two datasets, but gives worse performance on the OntoNotes dataset; (iii) the two variants of transfer learning significantly improve the performance of the proposed model on the BBN dataset but result in only a subtle performance change on the OntoNotes dataset.
Statistics of the datasets (Table ?) indicate that pronominal and other generic kinds of mentions are relatively more frequent in the test set of OntoNotes than in the test sets of the other two datasets. Examples of such mentions are 100 people, It, the director, etc. Table ? shows 20 randomly sampled entity mentions from the test set of the OntoNotes dataset. Some of these mentions are very generic and likely to depend on previous sentences. As all the methods use features based solely on the current sentence, they fail to transfer knowledge across sentence boundaries. Removing pronominal mentions from the test set increases the performance of all feature learning methods by around 3%.
Next, we analyse where the proposed model fails compared to AFET. For this, we look at type-wise performance for the top-10 most frequent types in the OntoNotes test dataset. Results are shown in Table ?. Compared to AFET, the proposed model performs better on all of the top-10 frequent types except other. The other type is dominant in the test set and is a collection of multiple broad subtypes such as product, event, art, living_thing, and food. The performance of AFET drops significantly (AFET-NoCo) when data-driven label-label correlation is ignored, which indicates that modeling data-driven correlation helps. However, as shown in Figure ?, the use of label-label correlation depends on appropriate values of parameters, which vary from one dataset to another.
5 Conclusion and Future Work
In this paper, we propose a neural network based model for the task of fine-grained entity classification. The proposed model learns representations for an entity mention and its context, and incorporates label noise information in a non-parametric variant of the hinge loss function. Experiments show that the proposed model outperforms existing state-of-the-art models on two publicly available datasets without explicitly tuning data-dependent parameters.
Our analysis leads to the following observations. First, the OntoNotes dataset has a different distribution of entity mentions compared with the other two datasets. Second, if the data distribution is similar, then transfer learning is very helpful. Third, incorporating data-driven label-label correlation helps in the case of labels of mixed types. Fourth, there is an inherent limitation in assuming all labels to be clean if they belong to the same path of the hierarchy. Fifth, the proposed model fails to learn label types that are very noisy.
Future work could analyse the effect of label noise reduction techniques on the proposed model, revisit the definition of clean and noisy labels, and model label-label correlation in a principled way that does not depend on dataset-specific parameters.
We thank the anonymous reviewers for their invaluable and insightful comments. Abhishek is supported by MHRD fellowship, Government of India. We acknowledge the use of computing resources made available from the Board of Research in Nuclear Science (BRNS), Dept. of Atomic Energy (DAE), Govt. of India sponsored project (No.2013/13/8-BRNS/10026) by Dr. Aryabartta Sahu at Department of Computer Science and Engineering, IIT Guwahati.
- Whenever possible, the baseline results are reported from previously published work; otherwise, we re-implemented the baseline methods based on the descriptions available in the corresponding papers.
- The code to replicate the work is available at https://github.com/abhipec/fnet
- *These results are from  that also uses 10% of the test set as development set and the remaining for evaluation.
- We used the publicly available code distributed by Ren et al. .
- All of these results are on the exact same train, development and test sets.
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data.
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge.
Michael Collins and Yoram Singer. Unsupervised models for named entity classification.
Mark Craven and Johan Kumlien. Constructing biological knowledge bases by extracting information from text sources.
Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. Improving efficiency and accuracy in multilingual entity extraction.
Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion.
Li Dong, Furu Wei, Hong Sun, Ming Zhou, and Ke Xu. A hybrid neural model for type classification of entity mentions.
Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. Context-dependent fine-grained entity type tagging.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
Mitchell Koch, John Gilmer, Stephen Soderland, and Daniel S. Weld. Type-aware distantly supervised relation extraction with linked arguments.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation.
Xin Li and Dan Roth. Learning question classifiers.
Thomas Lin, Mausam, and Oren Etzioni. No noun phrase left behind: Detecting and typing unlinkable entities.
Xiao Ling and Daniel S. Weld. Fine-grained entity recognition.
Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit.
Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. Distant supervision for relation extraction without labeled data.
T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. Never-ending learning.
Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. Distilling word embeddings: An encoding approach.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation.
Lorien Y. Pratt. Discriminability-based transfer between neural networks.
Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition.
Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding.
Lei Shi, Rada Mihalcea, and Mingjun Tian. Cross language text classification by model translation and semi-supervised learning.
Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. An attentive neural architecture for fine-grained entity type classification.
Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: A core of semantic knowledge.
Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. Cross-lingual word clusters for direct transfer of linguistic structure.
Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition.
Dong Wang and Thomas Fang Zheng. Transfer learning for speech and language processing.
Ralph Weischedel and Ada Brunstein. BBN Pronoun Coreference and Entity Type Corpus LDC2005T33.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. OntoNotes Release 5.0 LDC2013T19.
Dani Yogatama, Daniel Gillick, and Nevena Lazic. Embedding methods for fine grained entity type classification.
Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. HYENA: Hierarchical type classification for entity names.