Falcon 2.0: An Entity and Relation Linking Tool over Wikidata
Natural Language Processing (NLP) tools and frameworks have contributed significant solutions to the problems of extracting entities and relations from text and linking them to related knowledge graphs. Albeit effective, the majority of existing tools are available for only one knowledge graph. In this paper, we present Falcon 2.0, a rule-based tool capable of accurately mapping entities and relations in short texts to resources in both DBpedia and Wikidata, following the same approach in both cases. The input of Falcon 2.0 is a short natural language text in English. Falcon 2.0 resorts to fundamental principles of English morphology (e.g., N-Gram tiling and N-Gram splitting) and to background knowledge of label alignments obtained from the studied knowledge graph; the output comprises entity and relation resources from either the DBpedia or the Wikidata knowledge graph. We have empirically studied the impact of using only Wikidata on Falcon 2.0 and observed that it is knowledge graph-agnostic, i.e., the performance and behavior of Falcon 2.0 are not affected by the knowledge graph used as background knowledge. Falcon 2.0 is public and can be reused by the community. Additionally, Falcon 2.0 and its background knowledge bases are available as resources at https://labs.tib.eu/falcon/falcon2/.
Keywords: NLP · Entity Linking · Relation Linking · Background Knowledge · English Morphology · DBpedia · Wikidata
Resource Type: APIs and software frameworks. Repository: https://github.com/SDM-TIB/Falcon2.0. Web API: https://labs.tib.eu/falcon/falcon2/. License: GNU General Public License v3.0.
Entity linking (EL), interchangeably used as Named Entity Disambiguation (NED), is a well-studied research domain for aligning unstructured text to its structured mentions in various knowledge repositories (e.g., Wikipedia, DBpedia, Freebase, or Wikidata). Entity linking comprises two sub-tasks. The first is named entity recognition (NER), in which an approach aims at identifying entity labels (or surface forms) in an input sentence. Entity disambiguation is the second sub-task, where the goal is to link entity surface forms to semi-structured knowledge repositories.
With the growing popularity of publicly available knowledge graphs, researchers have developed several approaches and tools for entity linking over knowledge graphs. Some of these approaches implicitly perform the NER task and directly link mentions of entity surface forms in the sentence to the knowledge graph (often referred to as end-to-end EL approaches). Other attempts (e.g., MAG) assume recognized surface forms of the entities as additional inputs besides the input sentence to perform entity linking. Irrespective of the input format and underlying technologies, the majority of existing attempts in the EL research domain are confined to well-structured knowledge graphs such as DBpedia, Freebase, and Yago. These knowledge graphs rely on a well-defined process of extracting information directly from Wikipedia infoboxes and do not give users direct access to add or delete entities. Wikidata, on the other hand, allows users to edit Wikidata pages directly, add newer entities, and define new relations between objects. The popularity of Wikidata can be measured by the fact that, since its launch in 2012, over 1 billion edits have been made by users across the world.
Motivation, Approach, and Contributions.
We motivate our work by the fact that, in spite of the vast popularity of Wikidata, there are limited attempts to target entity linking over Wikidata. In this paper, we present Falcon 2.0, a tool for joint entity and relation linking over Wikidata that provides Wikidata mentions of entity and relation surface forms in a short sentence. In our previous work, we proposed Falcon, a rule-based approach for entity and relation linking of short texts over DBpedia. Falcon rests on two novel concepts: 1) a linguistics-based approach that relies on several English morphology principles such as tokenization and N-gram tiling; 2) a knowledge graph that serves as a source of background knowledge. This knowledge graph is a collection of entities from DBpedia enriched with Wikidata labels. We resort to the Falcon approach for developing Falcon 2.0; hence, we do not claim novelty in the underlying linguistics-based approach. However, for Falcon 2.0, we extend the background knowledge graph of Falcon and enrich it with Wikidata entities and their associated alias labels. In this paper, we propose the following two reusable, open-source, and easily accessible resources:
Falcon 2.0: We propose Falcon 2.0, a tool for joint entity and relation linking over Wikidata. Falcon 2.0 relies on fundamental principles of English morphology (tokenization and compounding) and links entity and relation surface forms in a short sentence to their Wikidata mentions. Falcon 2.0 is available as an online API and can be accessed at https://labs.tib.eu/falcon/falcon2/. We empirically evaluated Falcon 2.0 on question answering datasets tailored for Wikidata, and Falcon 2.0 significantly outperforms the baseline. For ease of use, we integrate the Falcon API into Falcon 2.0, so users can also obtain the corresponding DBpedia URIs of entities and predicates present in an input short text.
Falcon 2.0 Background Knowledge Base: We replaced the background knowledge base of Falcon with a new background KG specially tailored for Wikidata. We extracted 48,042,867 Wikidata entities from its public dump and aligned these entities with the aliases present in Wikidata. For example, Barack Obama is a Wikidata entity Wiki:Q76. We created a mapping between the label (Barack Obama) of Wiki:Q76 and its aliases, such as President Obama, Barack Hussein Obama, and Barry Obama, and stored it in the background knowledge graph. We did a similar alignment for 15,645 properties/relations of Wikidata. The background knowledge graph is an indexed graph and can be easily queried using ElasticSearch.
The rest of this paper is organized as follows: Section 2 describes our two resources and the approach to build Falcon 2.0. Section 3 presents experiments evaluating the performance of Falcon 2.0, and Section 4 discusses the importance and impact of this work for the research community. The availability and sustainability of the resources are explained in Section 5, and their maintenance is discussed in Section 6. Section 7 reviews the state of the art, and we close with conclusions and future work in Section 8.
2 Falcon 2.0
In this section, we present Falcon 2.0. We first explain the architecture of Falcon 2.0. Next, we discuss the background knowledge used to match the surface forms in the text to resources in a specific knowledge graph.
2.1 Architecture
The Falcon 2.0 architecture is depicted in Figure 1. Falcon 2.0 receives short input texts and outputs a set of entities and relations extracted from the text; each entity and relation in the output is associated with a unique IRI in Wikidata. Falcon 2.0 resorts to background knowledge and a catalog of rules for performing entity and relation linking. The background knowledge combines Wikidata labels and their corresponding aliases. Additionally, it comprises alignments between nouns and entities in the Wikidata knowledge graph. Alignments are stored in a text search engine, e.g., ElasticSearch, while the knowledge source is maintained in an RDF triple store accessible via a SPARQL endpoint. The rules representing the English morphology are maintained in a catalog; a forward chaining inference process is performed on top of the catalog during the extraction and linking tasks. Falcon 2.0 also comprises several modules that identify and link entities and relations to the Wikidata knowledge graph. These modules implement POS Tagging, Tokenization & Compounding, N-Gram Tiling, Candidate List Generation, Matching & Ranking, Query Classifier, and N-Gram Splitting, and are reused from the implementation of Falcon.
2.2 Background Knowledge
Wikidata contains over 52 million entities and 3.9 billion facts (subject-predicate-object triples). A significant portion of this extensive information is not useful for entity and relation linking. Therefore, we sliced Wikidata and extracted all the entity and relation labels to create a local background knowledge graph. For example, the entity United States of America (Wiki:Q30) is aligned with aliases such as USA, America, and the States.
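As an illustration, these alias alignments can be pictured as small documents, one per surface form, indexed by the text search engine. The document layout below is a sketch for exposition, not Falcon 2.0's actual ElasticSearch mapping:

```python
# Minimal sketch of the label-alias alignments kept in the background
# knowledge graph; the field names are illustrative assumptions.

def build_alignment_docs(entity_id, label, aliases):
    """Create one searchable document per surface form of an entity."""
    docs = [{"id": entity_id, "label": label, "surface_form": label}]
    for alias in aliases:
        docs.append({"id": entity_id, "label": label, "surface_form": alias})
    return docs

docs = build_alignment_docs(
    "Q76", "Barack Obama",
    ["President Obama", "Barack Hussein Obama", "Barry Obama"],
)
# Every alias resolves to the same Wikidata identifier.
assert all(d["id"] == "Q76" for d in docs)
```

In the actual system, documents of this shape would be bulk-indexed into ElasticSearch so that a full-text query over `surface_form` retrieves the candidate identifiers.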
2.3 Catalog of Rules
Falcon 2.0 is a rule-based approach. A catalog of rules, based on English morphological principles, is predefined to extract entities and relations from the text. For example, Falcon 2.0 excludes all verbs from the entity candidate list based on the rule verbs are not entities. Similarly, the N-Gram tiling module in the Falcon 2.0 architecture resorts to the rule entities with only stopwords between them are one entity. Another such rule, When -> date, Where -> place, resolves the ambiguity of matching the correct relation when the short text is a question, by looking at the question headword. Some question words determine the range of the relation, which resolves the ambiguity. For example, given the two questions When did Princess Diana die? and Where did Princess Diana die?, the relation died can refer to either the place of death or the year of death. The question headword (When/Where) is the only insight for resolving the ambiguity here. When the question word is Where, Falcon 2.0 matches only relations that have a place as their range.
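The headword rule can be pictured as a small lookup from question words to expected relation ranges. The sketch below is illustrative: P20 (place of death) and P570 (date of death) are real Wikidata properties, but the rule representation and range labels are simplified assumptions, not Falcon 2.0's internal catalog:

```python
# Illustrative sketch of the rule "When -> date, Where -> place".

HEADWORD_RANGE = {"when": "date", "where": "place"}

# Candidate relations for the surface form "die", with their ranges.
CANDIDATES = {
    "P20": {"label": "place of death", "range": "place"},
    "P570": {"label": "date of death", "range": "date"},
}

def filter_by_headword(question, candidates):
    """Keep only relations whose range matches the question headword."""
    headword = question.split()[0].lower()
    expected = HEADWORD_RANGE.get(headword)
    if expected is None:
        return dict(candidates)  # no constraint applies
    return {pid: c for pid, c in candidates.items() if c["range"] == expected}

matched = filter_by_headword("Where did Princess Diana die?", CANDIDATES)
assert list(matched) == ["P20"]  # only the place-ranged relation survives
```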
The extraction phase of Falcon 2.0 consists of three modules: POS tagging, tokenization & compounding, and N-Gram tiling. The input of this phase is the natural language text; the output is the list of surface forms related to entities or relations.
Part-of-Speech (POS) Tagging
receives the natural language text as input. It tags each word in the text with its related tag, e.g., noun, verb, or adverb. This module differentiates between nouns and verbs, enabling the application of the morphological rules from the catalog. The output of the module is a list of (word, tag) pairs.
Tokenization & Compounding
builds the token list by removing the stopwords from the input and splitting verbs from nouns. For example, if the input is What is the operating income for Qantas?, the output of this module is a list of three tokens: [operating, income, Qantas].
N-Gram Tiling
combines tokens that have only stopwords between them, relying on one of the rules from the catalog of rules. For example, considering the output of the previous module as the input of the N-gram tiling module, the tokens operating and income are combined into one token. The output of the module is a list of two tokens: [operating income, Qantas].
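The two steps above can be sketched as follows. The stopword list is abbreviated for illustration, and this simplified tiling merges only content words that are directly adjacent in the original text, whereas the catalog rule also covers tokens separated by stopwords:

```python
# Simplified sketch of tokenization & compounding followed by tiling.

STOPWORDS = {"what", "is", "the", "for", "a", "an", "of", "in", "did"}

def tokenize(text):
    """Drop stopwords and question marks, keeping content tokens
    together with their positions in the original text."""
    words = text.replace("?", "").split()
    return [(i, w) for i, w in enumerate(words) if w.lower() not in STOPWORDS]

def tile_adjacent(indexed_tokens):
    """Merge content tokens that are adjacent in the original text."""
    tiled = []
    for idx, word in indexed_tokens:
        if tiled and idx == tiled[-1][0] + 1:
            tiled[-1] = (idx, tiled[-1][1] + " " + word)
        else:
            tiled.append((idx, word))
    return [w for _, w in tiled]

tokens = tile_adjacent(tokenize("What is the operating income for Qantas?"))
assert tokens == ["operating income", "Qantas"]
```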
The linking phase consists of four modules — candidate list generation, matching & ranking, relevant rule selection, and n-gram splitting.
Candidate List Generation
receives the output of the recognition phase. The module queries the text search engine for each token; each token is then associated with a candidate list of resources. For example, the retrieved candidate list of the token operating income is [(P3362, operating income), (P2139, income), (P3362, operating profit)], where the first element is the Wikidata predicate identifier and the second is an associated label of the predicate that matched the query "operating income".
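A minimal in-memory stand-in for this lookup is sketched below; the index content, the word-overlap matching, and the Qantas identifier are illustrative stand-ins for ElasticSearch full-text search, not the production index:

```python
# Toy index of (identifier, label) alignments; Q12345 is a hypothetical id.
INDEX = [
    ("P3362", "operating income"),
    ("P2139", "income"),
    ("P3362", "operating profit"),
    ("Q12345", "Qantas"),
]

def candidate_list(token):
    """Return resources whose label shares a word with the token
    (a crude proxy for ElasticSearch full-text matching)."""
    words = set(token.lower().split())
    return [(rid, label) for rid, label in INDEX
            if words & set(label.lower().split())]

assert candidate_list("operating income") == [
    ("P3362", "operating income"),
    ("P2139", "income"),
    ("P3362", "operating profit"),
]
```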
Matching & Ranking
ranks the candidate list received from the candidate list generation module and matches candidate entities and relations. Since facts in a knowledge graph are represented as triples, the matching and ranking module creates triples consisting of the entities and relations from the candidate lists. Then, for each pair of entity and relation, the module checks whether the triple exists in the RDF triple store (Wikidata). The check is done by executing a simple ASK query over the RDF triple store. For each existing triple, the module increases the rank of the involved relations and entities. The output of this module is a ranked, sorted list of candidates.
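The ranking step can be sketched as follows, with the SPARQL ASK check stubbed by a toy fact set instead of a live Wikidata endpoint (the fact pairs are illustrative):

```python
# Toy (subject, predicate) pairs standing in for the RDF triple store.
FACTS = {("Q76", "P26"), ("Q76", "P39")}

def ask(entity, relation):
    """Stand-in for: ASK { wd:<entity> wdt:<relation> ?o . }"""
    return (entity, relation) in FACTS

def rank(entities, relations):
    """Reward every entity and relation that forms an existing triple."""
    scores = {c: 0 for c in entities + relations}
    for e in entities:
        for r in relations:
            if ask(e, r):       # triple exists in the store
                scores[e] += 1  # reward both members of the pair
                scores[r] += 1
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank(["Q76", "Q42"], ["P26", "P39"])
assert ranking[0] == "Q76"  # the entity supported by existing triples wins
```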
Relevant Rule Selection
interacts with the matching & ranking module by suggesting increases to the ranks of some candidates, relying on the catalog of rules. One such suggestion considers the question headword to resolve the ambiguity between two relations based on the ranges of the relations in the knowledge graph. For example, if the question word is "where", then the recognized relation should be linked to a property in the knowledge graph whose range is "place".
N-Gram Splitting
is called if none of the triples tested in the matching & ranking module exists in the triple store, i.e., the compounding performed in the tokenization & compounding module led to combining two separate entities. The module splits the tokens from the right side and passes the tokens again to the candidate list generation module. Splitting the tokens from the right side resorts to one of the fundamentals of English morphology: compound words in English always have their headword towards the right side.
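This right-hand split can be sketched as follows, using a hypothetical over-tiled token as input:

```python
# Sketch of N-gram splitting: when no triple validates, peel the rightmost
# word off a compound token (English compounds are right-headed) and retry.

def split_right(token):
    """Split one word off the right side of a multi-word token."""
    words = token.split()
    if len(words) < 2:
        return [token]  # nothing left to split
    return [" ".join(words[:-1]), words[-1]]

# A wrongly tiled token is broken back into two candidate surface forms,
# which would then be sent to candidate list generation again.
assert split_right("Barack Obama Senate") == ["Barack Obama", "Senate"]
assert split_right("Qantas") == ["Qantas"]
```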
Text Search Engine
stores all the alignments of the labels. ElasticSearch is used as the text search engine. It receives a token as input, then returns all related resources whose labels are similar to the received token.
RDF Triple store
can be seen as a local copy of the Wikidata endpoint. It consists of all the RDF triples of Wikidata labeled in the English language. The RDF triple store is used to check the existence of the triples passed from the Matching & Ranking module; it keeps around 3.9 billion triples.
3 Experimental Study
We report on the following metrics: Precision, Recall, and F-measure. Precision is the fraction of relevant resources among the retrieved resources (Equation 1).
Recall is the fraction of relevant resources that have been retrieved over the total amount of relevant resources (Equation 2).
F-Measure or F-Score is a measure that combines Precision and Recall; it is the harmonic mean of precision and recall (Equation 3).
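With TP, FP, and FN denoting true positives, false positives, and false negatives, respectively, the three referenced equations are the standard definitions:

```latex
\text{Precision} = \frac{TP}{TP + FP} \qquad (1)

\text{Recall} = \frac{TP}{TP + FN} \qquad (2)

\text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)
```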
We relied on two different question answering datasets, namely the SimpleQuestion dataset (annotated-wd-data-test-answerable) for Wikidata and LC-QuAD 2.0. The SimpleQuestion dataset contains 5,622 test questions that are answerable using Wikidata as the underlying knowledge graph. We randomly selected 1,000 questions from LC-QuAD 2.0 to test the robustness of our tool on complex questions.
We chose OpenTapioca  as our baseline for entity linking. OpenTapioca is available as a web API; it can provide Wikidata URIs for relations and entities. We are not aware of any other tool/approach that provides end-to-end Wikidata entity linking.
A laptop with eight cores and 16 GB RAM running Ubuntu 18.04 was used for implementing Falcon 2.0. We deployed its web API on a server with 723 GB RAM and 96 cores (Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10 GHz) running Ubuntu 18.04. This publicly available API is used to calculate the standard metrics of Precision, Recall, and F-score.
3.1 Experimental Results
Table 1: Entity linking performance.

| Tool | Dataset | Precision | Recall | F-score |
|---|---|---|---|---|
| OpenTapioca | SimpleQuestion Uppercase Entities | 0.16 | 0.28 | 0.20 |
| Falcon 2.0 | SimpleQuestion Uppercase Entities | 0.38 | 0.44 | 0.41 |
Experimental Results 1:
In the first experiment, we compare the entity linking performance of Falcon 2.0 with the baseline OpenTapioca. We first chose the SimpleQuestion dataset. Surprisingly, we observed that for the baseline, the values of Precision, Recall, and F-score are approximately 0.0. We analyzed the source of errors and found that, out of 5,622 questions, only 246 have entity labels in uppercase letters. OpenTapioca cannot recognize or link any entity written in lowercase letters. Case sensitivity is a common issue for entity linking tools over short texts, as reported by Singh et al. [23, 24] in a detailed analysis. Of these 246 questions, OpenTapioca gives the correct answer for only 70. Given that OpenTapioca is limited to uppercase entity surface forms, we evaluated Falcon 2.0 and OpenTapioca on the 246 questions of SimpleQuestion to make the evaluation fair for the baseline. OpenTapioca reports an F-score of 0.20 on this subset of SimpleQuestion, whereas Falcon 2.0 reports an F-score of 0.41 on the same dataset (cf. Table 1). Because of OpenTapioca's limitation with lowercase letters, we also randomly selected 1,000 questions from the LC-QuAD 2.0 dataset and compared the two tools. OpenTapioca reports an F-score of 0.31 against Falcon 2.0's F-score of 0.69, as reported in Table 1.
Compared to Falcon, Falcon 2.0 shows a drop in performance (see the Falcon paper for a detailed performance analysis). We analyzed the sources of errors. The first source of error is the datasets themselves: in both datasets, many questions are grammatically incorrect. For example, where was hank cochran birthed is one of the questions of the SimpleQuestion dataset. Falcon 2.0 resorts to fundamental principles of English morphology and overcomes the state of the art in recognizing entities in grammatically correct questions. The same issue persists in LC-QuAD 2.0, where a large portion of the dataset contains grammatically incorrect questions. Furthermore, in questions such as i) Tell me art movement whose name has the word yamato in it and ii) which doctrine starts with the letter t, there is no clear Wikidata relation, and our tool is not able to identify any entity.
Figure 6 provides a more detailed description of the results of this experiment. As observed in Figure 6, the number of questions with Recall equal to 0.0 is much lower for Falcon 2.0 than for OpenTapioca. As mentioned before, OpenTapioca cannot recognize entities that are not uppercase, which explains its high number of questions with Recall equal to 0.0, whereas Falcon 2.0 is able to recognize lowercase entities.
Table 2: Relation linking performance.

| Tool | Dataset | Precision | Recall | F-score |
|---|---|---|---|---|
| Falcon 2.0 | SimpleQuestion Uppercase Entities | 0.30 | 0.34 | 0.32 |
Experimental Results 2:
In the second experiment, we evaluate the relation linking performance of Falcon 2.0. We are not aware of any other baseline for relation linking over Wikidata. Table 2 summarizes the relation linking performance. For relation linking, Falcon 2.0 reports performance comparable to that of Falcon over DBpedia.
4 Importance and Impact
In August 2019, Wikidata became the first Wikimedia project to cross 1 billion edits, and there are over 20,000 active Wikidata editors.
5 Adoption and Reusability
Falcon 2.0 is open source, and the code is available in our public GitHub repository (https://github.com/SDM-TIB/Falcon2.0) for reusability and reproducibility. It is currently available for the English language; however, no assumption in the approach or in the construction of the background knowledge base restricts its adaptation or extension to other languages. The background knowledge of Falcon 2.0 is available for the community. It consists of 48,042,867 alignments for Wikidata entities and 15,645 alignments for Wikidata predicates. The GNU General Public License v3.0 allows for the free distribution and reuse of Falcon 2.0. We hope the research community and industry practitioners will use the Falcon 2.0 resources for various purposes, such as linking entities and relations to Wikidata, annotating unstructured text, developing new language resources, and others.
6 Maintenance and Sustainability
Falcon 2.0 is released as a publicly available resource offered by the Scientific Data Management (SDM) group at TIB, Hannover.
7 Related Work
Several surveys provide a detailed overview of the advancements in techniques employed for entity linking over knowledge graphs [22, 2]. Various reading lists and online forums also cover the topic.
There is concrete evidence in the literature that machine learning-based models trained over generic datasets such as WikiDisamb30 and CoNLL (YAGO) do not perform well when applied to short texts. Singh et al. evaluated over 20 entity linking tools for short texts (questions, in this case) and concluded that issues like capitalization of surface forms, implicit entities, and multi-word entities affect the performance of EL tools on short input texts. Falcon addresses the specific challenges of short texts by applying a rule-based approach for EL over DBpedia. Falcon not only links entities to DBpedia but also provides DBpedia URIs of the relations in a short text. EARL is another tool, which proposes a traveling-salesman-based approach for joint entity and relation linking over DBpedia. Besides EARL and Falcon, we are not aware of any other tool that provides joint entity and relation linking.
Entity linking over Wikidata is a relatively new domain. Cetoli et al. propose a neural network-based approach for linking entities to Wikidata; the authors also align an existing Wikipedia-corpus-based dataset to Wikidata. However, this work only targets entity disambiguation and assumes that the entities are already recognized in the sentences. Arjun is the latest work for Wikidata entity linking and uses an attention-based neural network for linking Wikidata entity labels. OpenTapioca is another attempt that performs end-to-end entity linking over Wikidata; it is the closest to our work, even though OpenTapioca does not provide Wikidata IDs of the relations in a sentence. OpenTapioca is also available as an API and is utilized as our baseline.
8 Conclusion and Future Work
We presented the resource Falcon 2.0, a rule-based entity and relation linking tool able to recognize entities and relations in short texts and to link them to existing knowledge graphs, e.g., DBpedia and Wikidata. Although there are various approaches for entity and relation linking to DBpedia, Falcon 2.0 is one of the few tools able to perform this task over Wikidata. Thus, given the number of facts, both generic and domain-specific, that compose Wikidata, Falcon 2.0 has the potential to impact researchers and practitioners who resort to NLP tools for transforming semi-structured data into structured facts. Falcon 2.0 is open source, and the API is publicly accessible and maintained on the servers of the TIB labs.
This work has received funding from the EU H2020 Project No. 727658 (IASIS).
- ReMatch has used the Levenshtein algorithm for the task of relation linking.
- (2007) DBpedia: a nucleus for a web of open data. In ISWC, pp. 722–735.
- (2018) Entity-Oriented Search. Springer Open.
- Joint entity and relation linking using EARL.
- (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD, pp. 1247–1250.
- (2018) Neural collective entity linking. arXiv:1811.08603.
- (2019) A neural approach to entity linking on Wikidata. In European Conference on Information Retrieval, pp. 78–86.
- (2019) OpenTapioca: lightweight entity linking for Wikidata. arXiv:1904.09131.
- (2017) Question answering benchmarks for Wikidata.
- (2019) LC-QuAD 2.0: a large dataset for complex question answering over Wikidata and DBpedia. In International Semantic Web Conference, pp. 69–78.
- (2010) TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In CIKM 2010, pp. 1625–1628.
- (2017) Deep joint entity disambiguation with local neural attention. In EMNLP 2017, pp. 2619–2629.
- (2015) Elasticsearch: The Definitive Guide. O'Reilly Media.
- (2011) Robust disambiguation of named entities in text. In EMNLP 2011, pp. 782–792.
- (2018) A sequence learning method for domain-specific entity linking. In Proceedings of the Seventh Named Entities Workshop, pp. 14–21.
- (2018) End-to-end neural entity linking. In CoNLL 2018, pp. 519–529.
- (2017) MAG: a multilingual, knowledge-base agnostic and deterministic entity linking approach. In Proceedings of the Knowledge Capture Conference, p. 9.
- (2017) Matching natural language relations to knowledge graph properties for question answering. In SEMANTiCS 2017, pp. 89–96.
- (2019) Context-aware entity linking with attentive neural networks on Wikidata knowledge graph. arXiv:1912.06214.
- (2018) DeepType: multilingual entity linking by neural type system evolution. In AAAI 2018.
- (2019) Old is gold: linguistic driven approach for entity and relation linking of short text. In NAACL-HLT 2019 (Long Papers), pp. 2336–2346.
- (2015) Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27(2), pp. 443–460.
- (2018) No one is perfect: analysing the performance of question answering components over the DBpedia knowledge graph. arXiv:1809.10044.
- (2018) Why reinvent the wheel: let's build question answering systems together. In The Web Conference, pp. 1247–1256.
- (2012) Wikidata: a new platform for collaborative data collection. In WWW 2012 Companion, pp. 1063–1064.
- (1981) On the notions "lexically related" and "head of a word". Linguistic Inquiry 12(2), pp. 245–274.