Why we have switched from building full-fledged taxonomies to simply detecting hypernymy relations
The study of taxonomies and hypernymy relations has been extensive in the Natural Language Processing (NLP) literature. However, the evaluation of taxonomy learning approaches has traditionally been troublesome, as it mainly relies on ad-hoc experiments which are hardly reproducible and require costly manual effort. Partly because of this, research has lately been focusing on the hypernymy detection task. In this paper we reflect on this trend, analyzing issues related to current evaluation procedures. Finally, we propose two potential avenues for future work so that is-a relations and resources based on them play a more important role in downstream NLP applications.
Jose Camacho-Collados Department of Computer Science Sapienza University of Rome email@example.com
Taxonomies are hierarchical organizations of concepts from domains of knowledge. They generally constitute the backbone of ontologies (Velardi et al., 2013) and contribute to applications such as information search, retrieval, website navigation and records management (Bordea et al., 2015), to name a few. In order to construct a taxonomy, a prior step of extracting hypernymy (is-a) relations between pairs of concepts needs to be performed. The result of this prior step is a list of edges which are later integrated into the taxonomic data structure. This prior step has been shown to directly help in downstream applications such as Question Answering (Prager et al., 2008; Yahya et al., 2013) or semantic search (Hoffart et al., 2014). Interestingly, research attention seems to have shifted from building full-fledged taxonomies entirely or partly from scratch to this prior step of inducing hypernymy relations. We argue that this is largely due not only to the reduced complexity of collecting training data, but also to a more straightforward evaluation. There are no standard benchmarks for taxonomy evaluation, and existing evaluations mainly rely on non-replicable manual judgments (Gupta et al., 2016). In contrast, the hypernymy detection task is easier to evaluate, as it has standard evaluation benchmarks which make the comparison of systems rather straightforward. However, previous work strongly questions the fitness of using a manually-crafted taxonomy like WordNet (Miller, 1995) for evaluating systems that harvest terms and their corresponding hypernyms from text (Hovy et al., 2009). Additionally, there are other issues and questions related to the utility of hypernymy detection techniques in downstream applications that may arise due to current evaluation practices.
This discussion paper is structured as follows: we first introduce basic notions on the automatic acquisition of hierarchically-structured knowledge (Section 2). Then, we discuss current problems for evaluating taxonomy learning techniques, and explain the shift of current research towards the development of hypernymy detection models (Section 3). Finally, we propose two lines of research for future work (Section 4).
Representing domain-specific knowledge in the form of hierarchical concepts, i.e. taxonomies (see Figure 1 for an example), has been a long-standing research problem. However, building resources like WordNet (or any domain-specific ontology) requires an almost prohibitive manual effort (Fountain and Lapata, 2012). Therefore, building taxonomies automatically (or semi-automatically) has attracted the interest of the NLP community. The process of automatically building taxonomies is usually divided into two main steps: finding hypernyms for concepts, which may constitute a field of research in itself (Section 2.1) and refining the structure into a taxonomy (Section 2.2).
2.1 Hypernymy identification
Hypernymy identification approaches are generally split into two broad categories: pattern-based and distributional. Pattern-based approaches have also been referred to as path-based (Shwartz et al., 2016) or rule-based (Navigli and Velardi, 2010) in the literature. Pattern-based techniques exploit the co-occurrences in text corpora of the terms which compose the hypernymy relation. Pattern-based methods have traditionally been based on the so-called Hearst’s patterns (Hearst, 1992), which are a set of lexico-syntactic patterns to identify hypernymy relations. Other approaches have built upon these patterns with the aid of various syntactic and statistical techniques (Etzioni et al., 2005; Snow et al., 2006; Kozareva and Hovy, 2010). Syntactic clues have also played an important role in more recent approaches exploiting supervised techniques (Navigli and Velardi, 2010; Boella and Di Caro, 2013).
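To make the pattern-based idea concrete, the following is a minimal sketch of two Hearst-style patterns written as regular expressions. The pattern set, the function names and the restriction to single-word terms are simplifying assumptions made for illustration; actual systems typically match full noun phrases with the help of a tagger or parser.

```python
import re

# Two illustrative Hearst-style patterns over single-word terms (an assumption;
# real systems match full noun phrases using POS tags or a parser).
# Each entry is (compiled pattern, True if the hypernym comes first).
PATTERNS = [
    # "animals such as dogs, cats and horses"
    (re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: (?:and|or) \w+)?)"), True),
    # "dogs, cats and other animals"
    (re.compile(r"((?:\w+, )*\w+),? (?:and|or) other (\w+)"), False),
]


def split_list(text):
    """Split an enumeration such as 'dogs, cats and horses' into its items."""
    return [item for item in re.split(r",\s*|\s+(?:and|or)\s+", text) if item]


def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs matched by the patterns above."""
    pairs = []
    for pattern, hypernym_first in PATTERNS:
        for match in pattern.finditer(sentence):
            if hypernym_first:
                hypernym, hyponyms = match.group(1), split_list(match.group(2))
            else:
                hyponyms, hypernym = split_list(match.group(1)), match.group(2)
            pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


print(extract_isa("animals such as dogs, cats and horses"))
# [('dogs', 'animals'), ('cats', 'animals'), ('horses', 'animals')]
print(extract_isa("dogs, cats and other animals"))
# [('dogs', 'animals'), ('cats', 'animals')]
```

Note that both terms must appear in the same sentence for a pattern to fire, which is the assumption that distributional models drop.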
The second branch of hypernymy identification exploits distributional models. These methods generally address the problem using supervised techniques over the distributional representations of the terms included in the hypernymy relation (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014; Levy et al., 2015; Yu et al., 2015; Roller and Erk, 2016). Unlike pattern-based techniques, distributional models do not assume that terms and hypernyms co-occur in the same sentence. Recent approaches have also leveraged supervised distributional models exploiting a domain-aware transformation matrix between the vector spaces of terms and hypernyms (Fu et al., 2014; Espinosa-Anke et al., 2016a; Camacho-Collados and Navigli, 2017). Finally, Shwartz et al. (2016) proposed an LSTM-based architecture encoding both syntactic and distributional information which proved effective in the hypernymy detection task.
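The transformation-matrix idea can be illustrated with a toy example: a linear map is fitted so that the vector of a term is projected close to the vector of its hypernym, and the projected vector is then compared against candidate hypernyms. This is only a sketch under strong assumptions: the two-dimensional vectors, the term names, the plain stochastic gradient descent and the absence of any domain-awareness are all hypothetical choices, not the training procedure of the cited systems.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def fit_projection(pairs, dim, lr=0.1, epochs=500):
    """Fit phi minimising the squared error ||phi x - y||^2 over (x, y) pairs."""
    phi = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(epochs):
        for x, y in pairs:
            error = [p - t for p, t in zip(matvec(phi, x), y)]
            for i in range(dim):
                for j in range(dim):
                    # Gradient of 0.5 * ||phi x - y||^2 w.r.t. phi[i][j].
                    phi[i][j] -= lr * error[i] * x[j]
    return phi


def nearest(vector, candidates):
    """Name of the candidate whose vector is closest (squared Euclidean)."""
    return min(
        candidates,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(vector, candidates[name])),
    )


# Hypothetical toy embeddings, invented purely for illustration.
TERMS = {"dog": [1.0, 0.0], "rose": [0.0, 1.0], "puppy": [0.9, 0.1]}
HYPERNYMS = {"animal": [0.0, 1.0], "plant": [1.0, 0.0]}
TRAIN = [(TERMS["dog"], HYPERNYMS["animal"]), (TERMS["rose"], HYPERNYMS["plant"])]

phi = fit_projection(TRAIN, dim=2)
guess = nearest(matvec(phi, TERMS["puppy"]), HYPERNYMS)
print(guess)  # animal
```

The unseen term "puppy" never co-occurs with "animal" in any sentence here; the prediction comes only from its position in the vector space.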
2.2 Taxonomy learning
As explained earlier, taxonomy learning is generally divided into two main phases: (1) identifying hypernymy relations from textual data, and (2) inducing a full-fledged taxonomy based on the relations extracted in the first step. In the literature, most taxonomies have been constructed by exploiting pattern-based approaches (Yang and Callan, 2009; Kozareva and Hovy, 2010; Navigli et al., 2011; Nakashole et al., 2012; Luu Anh et al., 2014; Alfarone and Davis, 2015; Espinosa-Anke et al., 2016b; Flati et al., 2016; Faralli et al., 2016; Gupta et al., 2016). The second phase aims at forming the graph structure of a taxonomy. The techniques used to achieve this goal vary from one model to another, but generally include the following steps: domain filtering, graph-based induction, and edge pruning and recovery (Velardi et al., 2013).
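As a rough illustration of the second phase, the sketch below turns a list of scored is-a edges into an acyclic graph by greedily keeping the highest-scoring edges and pruning any edge that would close a cycle. The greedy strategy, the scores and the function names are assumptions made for this example; it is far simpler than the graph-based induction algorithms cited above and omits domain filtering and edge recovery.

```python
def has_path(graph, start, goal):
    """Depth-first search along hyponym -> hypernym edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def induce_taxonomy(scored_edges):
    """Greedily add edges by descending score, pruning those that create cycles.

    scored_edges: iterable of (hyponym, hypernym, score) triples.
    Returns a dict mapping each node to the set of its direct hypernyms.
    """
    graph = {}
    for hyponym, hypernym, _score in sorted(scored_edges, key=lambda e: -e[2]):
        if hyponym == hypernym or has_path(graph, hypernym, hyponym):
            continue  # edge pruning: this edge would introduce a cycle
        graph.setdefault(hyponym, set()).add(hypernym)
        graph.setdefault(hypernym, set())
    return graph


# Hypothetical output of the hypernymy identification phase.
EDGES = [
    ("dog", "mammal", 0.9),
    ("mammal", "animal", 0.8),
    ("dog", "animal", 0.6),
    ("animal", "dog", 0.3),  # noisy edge, pruned because it closes a cycle
]
taxonomy = induce_taxonomy(EDGES)
print(taxonomy)
```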
3 Analysis of Current Trends
3.1 Taxonomy learning evaluation
Traditional procedures to evaluate taxonomies have focused on measuring the quality of the edges, i.e., assessing the quality of the is-a relations (Ponzetto and Strube, 2011; Flati et al., 2014). This process typically consists of extracting a random sample of edges and having human judges label them manually. In addition to the manual effort required to perform this evaluation, this procedure is not easily replicable from taxonomy to taxonomy (which would most likely include different sets of concepts), and does not reflect the overall quality of a taxonomy (Gupta et al., 2016).
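The edge-level procedure described above can be sketched as follows: a random sample of edges is drawn, each sampled edge receives a human judgment, and precision is the fraction judged correct. The edge list, the judgments and the function names are hypothetical; the point is that the sample, and hence the score, is tied to the edge set of one particular taxonomy.

```python
import random


def sample_edges(edges, sample_size, seed=0):
    """Draw a reproducible random sample of edges for manual annotation."""
    rng = random.Random(seed)
    return rng.sample(sorted(edges), min(sample_size, len(edges)))


def edge_precision(judgements):
    """Fraction of judged edges labeled as correct is-a relations.

    judgements: dict mapping (hyponym, hypernym) -> bool (human label).
    """
    if not judgements:
        return 0.0
    return sum(judgements.values()) / len(judgements)


# Hypothetical taxonomy edges and hypothetical human labels.
EDGES = [("dog", "mammal"), ("mammal", "animal"), ("dog", "cat"), ("rose", "plant")]
JUDGEMENTS = {("dog", "mammal"): True, ("dog", "cat"): False, ("rose", "plant"): True}

to_annotate = sample_edges(EDGES, 2)
print(to_annotate)
print(edge_precision(JUDGEMENTS))
```

A second taxonomy with different concepts would yield a different sample requiring fresh annotation, so the two precision figures would not be measured on the same items.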
Moreover, some taxonomy learning approaches link their concepts to existing resources such as Wikipedia (Nakashole et al., 2012; Flati et al., 2014, 2016; Gupta et al., 2016), BabelNet (Espinosa-Anke et al., 2016b) or WordNet (Suchanek et al., 2007; Yamada et al., 2011; Jurgens and Pilehvar, 2015), while others remain at the word level (Kozareva and Hovy, 2010; Alfarone and Davis, 2015). This poses additional problems for evaluating the quality across different taxonomies. This issue extends directly to the hypernymy detection and hypernym discovery tasks as well (see Section 3.2).
3.2 Hypernymy detection
Recently the research focus seems to have switched to the study of hypernymy relations only, which may be viewed as the first phase of taxonomy learning techniques (see Section 2.2) or as a research field in itself. In particular, current approaches have been specializing in the hypernymy detection task (Santus et al., 2014; Weeds et al., 2014; Roller et al., 2014; Shwartz et al., 2016, 2017). Hypernymy detection is a binary task which consists of deciding, given a pair of words, whether a hypernymic relation holds between them or not. In our view, this shift has occurred due to two main factors:
(1) The evaluation is definitely easier and more reliable since, as mentioned in Section 3.1, taxonomies generally rely on human-based evaluations that are hard to replicate.
(2) The rise of supervised distributional models and neural networks, which can be effectively used to detect hypernymy relations by framing the task as a binary classification problem. (This simplification of traditional tasks may extend to other areas of NLP, where complex tasks have been reduced to simpler classification problems. While this issue is clearly relevant and worth discussing, analyzing these trends from a more general perspective is beyond the scope of this paper.)
The first reason is definitely a valid concern, as taxonomies have proved difficult to evaluate, usually relying on ad-hoc manual evaluation which is hardly reproducible from one work to another (see Section 3.1). In fact, the creation of reliable hypernymy detection datasets has contributed to a rise of brand-new algorithms in the area. A popular dataset to evaluate hypernymy detection is BLESS (Baroni and Lenci, 2011), which includes additional relations such as meronymy or co-hyponymy. However, it is relatively small, as it only contains 200 distinct target concepts. Other datasets directly rely on existing hand-crafted taxonomies like WordNet (Snow et al., 2004; Boleda et al., 2017) or include additional resources such as Wikidata (Vrandečić and Krötzsch, 2014) or DBPedia (Auer et al., 2007), as in Shwartz et al. (2016).
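The binary formulation of hypernymy detection makes evaluation mechanical, as the sketch below shows: every labeled word pair is passed to a classifier and standard precision, recall and F1 are computed. The gold pairs and the lookup-based toy classifier are hypothetical stand-ins; any real dataset or model would simply replace them.

```python
def evaluate_detection(gold, predict):
    """Precision, recall and F1 of a binary hypernymy detector.

    gold: dict mapping (word, candidate_hypernym) -> bool.
    predict: callable taking (word, candidate_hypernym) and returning bool.
    """
    tp = fp = fn = 0
    for (word, candidate), label in gold.items():
        guess = predict(word, candidate)
        if guess and label:
            tp += 1
        elif guess and not label:
            fp += 1
        elif label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold standard in the style of a detection dataset.
GOLD = {
    ("dog", "animal"): True,
    ("dog", "mammal"): True,
    ("animal", "dog"): False,
    ("dog", "cat"): False,
}

# Toy "classifier": a lookup table standing in for a trained model.
KNOWN = {("dog", "animal"), ("dog", "cat")}
scores = evaluate_detection(GOLD, lambda w, c: (w, c) in KNOWN)
print(scores)  # (0.5, 0.5, 0.5)
```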
However, what is the main utility of detecting a hypernymy relation between a pair of words? The most direct answer is that the main application is to be able to find hypernymy relations for a given concept, which is arguably the main practical feature in downstream applications, e.g. question answering (What is the longest river in Asia?). However, if this is the main application, why does the evaluation focus on detecting hypernymy relations only? The step from detecting hypernymy relations to discovering or finding hypernyms for a given concept is feasible but unfortunately not trivial. To the best of our knowledge there are no approaches which use a hypernymy detection system to “discover” hypernyms as a result.
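To illustrate why the step from detection to discovery is not trivial, the sketch below shows one naive way it could be attempted: score the query term against every candidate in a vocabulary with a detection model and return the top-ranked candidates. This is a hypothetical construction for illustration, not an approach taken from the literature; the vocabulary, the scores and the function names are invented.

```python
def discover(term, vocabulary, score, top_k=5):
    """Rank every vocabulary item as a candidate hypernym of `term`.

    score: callable (term, candidate) -> float, e.g. the confidence of a
    hypernymy detection model. One call is needed per vocabulary item.
    """
    candidates = [c for c in vocabulary if c != term]
    ranked = sorted(candidates, key=lambda c: score(term, c), reverse=True)
    return ranked[:top_k]


# Hypothetical detector confidences standing in for a trained model.
SCORES = {("dog", "animal"): 0.9, ("dog", "mammal"): 0.8, ("dog", "cat"): 0.2}
VOCABULARY = ["animal", "cat", "dog", "mammal"]

top = discover("dog", VOCABULARY, lambda t, c: SCORES.get((t, c), 0.0), top_k=2)
print(top)  # ['animal', 'mammal']
```

Even in this toy form, each query requires one detector call per vocabulary item, and the detector is asked to rank candidates although it was only evaluated on yes/no decisions over pre-selected pairs.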
Additionally, considering that research on hypernyms is mainly envisaged as a proxy for either constructing new taxonomies or integrating them into end-user applications, automatic systems that are evaluated on the hypernymy detection task using already existing taxonomies such as WordNet or Wikidata may not be reliable enough. One would definitely be more confident using the “gold-standard” WordNet or Wikidata in applications instead. Wikidata and especially WordNet clearly do not have full coverage, especially in specialized domains. However, automatic systems are usually not evaluated outside these resources, which makes them unreliable (in the sense that they have not been tested) outside these resources.
In recent work, Espinosa-Anke et al. (2016a) presented a supervised domain-aware hypernym discovery system and evaluated it inside and outside Wikidata. Some results were encouraging but also showed that we are still far from having a reliable hypernym discovery system which could replace existing taxonomies in many domains. Further research should focus on constructing better benchmarks and developing methods which are ready to be deployed in downstream applications by extending and/or accurately replacing the is-a edges of current hand-crafted taxonomies, particularly in specialized domains.
4 Conclusion and Future Work
In this paper we have discussed the current state of hypernym and taxonomy research. We have particularly focused on some issues arising from current standard evaluation practices. Based on the main insights extracted from this discussion, we present two possible lines for future research: improvement of current evaluation practices for taxonomy evaluation (Section 4.1) and the development of systems for the hypernym discovery task along with the creation of new challenging benchmarks (Section 4.2).
4.1 Improvement of taxonomy evaluation procedures
As explained in Section 3.1, the evaluation of taxonomy learning techniques has been troublesome for a number of reasons, one of them being the importance given to measuring the accuracy of their is-a relations. The is-a relations of a taxonomy should certainly be evaluated quantitatively, but the manual effort and non-replicability of these experiments make them undesirable. One possible solution to this problem could be to rely on the hypernym discovery task (see the following section). However, this is not enough to provide a global evaluation of the taxonomy. Further research should focus on developing reliable evaluation frameworks for the different features of a taxonomy, following the line of Gupta et al. (2016), who proposed a comprehensive evaluation framework going beyond the edge level by additionally evaluating the granularity and generalization paths of a taxonomy.
4.2 Construction of new challenging benchmarks for hypernym discovery
As discussed in Section 3.2, we argue that future research should pay renewed attention to the hypernym discovery task in addition to the hypernymy detection task. Hypernym discovery may be viewed as the first step of taxonomy learning techniques and constitutes a research field in itself. Systems may be tested on extensions of existing hypernymy detection datasets based on resources like Wikidata or WordNet, e.g. (Snow et al., 2004; Shwartz et al., 2016). Standard datasets are reduced to the hypernymy detection binary classification task, i.e., the system should decide whether dog-animal or dog-mammal constitute a hypernymy relation or not. Instead, using existing taxonomies and a similar procedure, a dataset could be created for the hypernym discovery task, i.e., given the word dog, the system should be able to discover its hypernyms mammal or animal, among others. For the evaluation we can borrow some traditional information retrieval measures such as Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), R-Precision (R-P) or Precision at k (P@k) (see Bian et al. (2008) for more details on these measures).
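The ranking measures mentioned above can be computed as in the following sketch, where each query term has a ranked list of predicted hypernyms and a set of gold hypernyms. The predictions and gold sets are invented for illustration, and average precision is normalized here by the number of gold hypernyms, which is one common convention among several.

```python
def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct hypernym, or 0 if none is retrieved."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, gold):
    """Mean of precision values at each correct hit, normalized by |gold|."""
    hits, total = 0, 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k predictions that are correct hypernyms."""
    return sum(1 for c in ranked[:k] if c in gold) / k if k else 0.0


def r_precision(ranked, gold):
    """Precision at R, where R is the number of gold hypernyms."""
    return precision_at_k(ranked, gold, len(gold))


def mean_over_queries(metric, predictions, gold):
    """Average a per-query metric over all query terms in the gold standard."""
    return sum(metric(predictions.get(q, []), gold[q]) for q in gold) / len(gold)


# Hypothetical system output and gold standard.
PREDICTIONS = {"dog": ["pet", "mammal", "animal"], "rose": ["flower", "tree"]}
GOLD = {"dog": {"mammal", "animal"}, "rose": {"flower", "plant"}}

mrr = mean_over_queries(reciprocal_rank, PREDICTIONS, GOLD)
map_score = mean_over_queries(average_precision, PREDICTIONS, GOLD)
rp = mean_over_queries(r_precision, PREDICTIONS, GOLD)
p_at_1 = mean_over_queries(lambda r, g: precision_at_k(r, g, 1), PREDICTIONS, GOLD)
print(mrr, map_score, rp, p_at_1)
```

For this toy input MRR is 0.75, since the first correct hypernym appears at rank 2 for dog and at rank 1 for rose.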
In order to assess the reliability of hypernym detection systems, new benchmarks would be required. Apart from current hypernymy detection datasets based on existing lexical resources which can be re-framed for the hypernymy discovery task, we should construct challenging datasets independent from existing lexical resources, or constructed on top of these resources. This will show the potential of our systems to go beyond existing manually-crafted taxonomies. There are some existing domain-specific datasets from SemEval (Bordea et al., 2015, 2016) which are rarely used in practice but may constitute an interesting starting point for this new evaluation branch. As an extrinsic evaluation it may be useful to construct datasets which are a direct proxy of downstream NLP applications such as question answering or information retrieval.
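Re-framing a detection dataset for discovery, as suggested above, amounts to grouping the positive pairs by their hyponym so that each term is associated with its full set of gold hypernyms. The sketch below assumes a detection dataset given as (hyponym, candidate, label) triples; the triples and the function name are hypothetical.

```python
from collections import defaultdict


def to_discovery(detection_pairs):
    """Group positive detection pairs into a term -> gold hypernyms mapping.

    detection_pairs: iterable of (hyponym, candidate_hypernym, is_hypernym).
    Negative pairs are dropped: in discovery, any candidate not listed as
    gold is implicitly treated as incorrect.
    """
    gold = defaultdict(set)
    for hyponym, candidate, is_hypernym in detection_pairs:
        if is_hypernym:
            gold[hyponym].add(candidate)
    return dict(gold)


# Hypothetical detection-style dataset.
DETECTION = [
    ("dog", "animal", True),
    ("dog", "mammal", True),
    ("dog", "cat", False),
    ("rose", "plant", True),
]
discovery_gold = to_discovery(DETECTION)
print(discovery_gold)
```

Terms that only appear in negative pairs disappear from the resulting benchmark, and the gold sets are only as complete as the underlying resource, which is one reason the text argues for datasets that go beyond existing taxonomies.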
Additionally, it has been shown by Boleda et al. (2017) that concepts and entities/instances behave differently with respect to hypernymy relations. Laika-dog (Laika being an instance) and dog-mammal (dog being a concept) are, respectively, examples of an instantiation and a hypernymy relation. These two different kinds of relation, which are often used interchangeably in the literature, may be further distinguished in these datasets.
Jose Camacho-Collados is supported by a Google Doctoral Fellowship in Natural Language Processing. I would also like to thank Luis Espinosa-Anke for our interesting discussions on this topic.
- Alfarone and Davis (2015) Daniele Alfarone and Jesse Davis. 2015. Unsupervised learning of an is-a taxonomy from a limited domain-specific corpus. In Proceedings of IJCAI 2015. pages 1434–1441.
- Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web, Springer, pages 722–735.
- Baroni et al. (2012) Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of EACL. pages 23–32.
- Baroni and Lenci (2011) Marco Baroni and Alessandro Lenci. 2011. How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics. Association for Computational Linguistics, pages 1–10.
- Bian et al. (2008) Jiang Bian, Yandong Liu, Eugene Agichtein, and Hongyuan Zha. 2008. Finding the right facts in the crowd: factoid question answering over social media. In Proceedings of the 17th international conference on World Wide Web. ACM, pages 467–476.
- Boella and Di Caro (2013) Guido Boella and Luigi Di Caro. 2013. Supervised learning of syntactic contexts for uncovering definitions and extracting hypernym relations in text databases. In Machine learning and knowledge discovery in databases, Springer, pages 64–79.
- Boleda et al. (2017) Gemma Boleda, Abhijeet Gupta, and Sebastian Padó. 2017. Instances and concepts in distributional space. In Proceedings of EACL (2). Association for Computational Linguistics, Valencia, Spain.
- Bordea et al. (2015) Georgeta Bordea, Paul Buitelaar, Stefano Faralli, and Roberto Navigli. 2015. Semeval-2015 task 17: Taxonomy extraction evaluation (texeval). In Proceedings of the SemEval workshop.
- Bordea et al. (2016) Georgeta Bordea, Els Lefever, and Paul Buitelaar. 2016. Semeval-2016 task 13: Taxonomy extraction evaluation (texeval-2). In SemEval-2016. Association for Computational Linguistics, pages 1081–1091.
- Camacho-Collados and Navigli (2017) Jose Camacho-Collados and Roberto Navigli. 2017. BabelDomains: Large-Scale Domain Labeling of Lexical Resources. In Proceedings of EACL (2). Valencia, Spain.
- Espinosa-Anke et al. (2016a) Luis Espinosa-Anke, Jose Camacho-Collados, Claudio Delli Bovi, and Horacio Saggion. 2016a. Supervised distributional hypernym discovery via domain adaptation. In Proceedings of EMNLP. pages 424–435.
- Espinosa-Anke et al. (2016b) Luis Espinosa-Anke, Horacio Saggion, Francesco Ronzano, and Roberto Navigli. 2016b. Extasem! extending, taxonomizing and semantifying domain terminologies. In Proceedings of AAAI.
- Etzioni et al. (2005) Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial intelligence 165(1):91–134.
- Faralli et al. (2016) Stefano Faralli, Alexander Panchenko, Chris Biemann, and Simone P Ponzetto. 2016. Linked disambiguated distributional semantic networks. In International Semantic Web Conference. pages 56–64.
- Flati et al. (2014) Tiziano Flati, Daniele Vannella, Tommaso Pasini, and Roberto Navigli. 2014. Two is bigger (and better) than one: the wikipedia bitaxonomy project. In Proceedings of ACL.
- Flati et al. (2016) Tiziano Flati, Daniele Vannella, Tommaso Pasini, and Roberto Navigli. 2016. MultiWiBi: the Multilingual Wikipedia Bitaxonomy Project. Artificial Intelligence, to appear.
- Fountain and Lapata (2012) Trevor Fountain and Mirella Lapata. 2012. Taxonomy induction using hierarchical random graphs. In Proceedings of NAACL. Association for Computational Linguistics, pages 466–476.
- Fu et al. (2014) Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of ACL.
- Gupta et al. (2016) Amit Gupta, Francesco Piccinno, Mikhail Kozhevnikov, Marius Pasca, and Daniele Pighin. 2016. Revisiting taxonomy induction over wikipedia. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan, pages 2300–2309. http://aclweb.org/anthology/C16-1217.
- Hearst (1992) Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics. pages 539–545.
- Hoffart et al. (2014) Johannes Hoffart, Dragan Milchevski, and Gerhard Weikum. 2014. Stics: searching with strings, things, and cats. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, pages 1247–1248.
- Hovy et al. (2009) Eduard Hovy, Zornitsa Kozareva, and Ellen Riloff. 2009. Toward completeness in concept extraction and classification. In Proceedings of EMNLP. pages 948–957.
- Jurgens and Pilehvar (2015) David Jurgens and Mohammad Taher Pilehvar. 2015. Reserating the awesometastic: An automatic extension of the wordnet taxonomy for novel terms. In HLT-NAACL. pages 1459–1465.
- Jurgens and Pilehvar (2016) David Jurgens and Mohammad Taher Pilehvar. 2016. Semeval-2016 task 14: Semantic taxonomy enrichment. Proceedings of SemEval pages 1092–1102.
- Kozareva and Hovy (2010) Zornitsa Kozareva and Eduard Hovy. 2010. A semi-supervised method to learn and construct taxonomies using the web. In Proceedings of EMNLP. pages 1110–1118.
- Levy et al. (2015) Omer Levy, Steffen Remus, Chris Biemann, Ido Dagan, and Israel Ramat-Gan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of NAACL 2015. Denver, Colorado, USA.
- Luu Anh et al. (2014) Tuan Luu Anh, Jung-jae Kim, and See Kiong Ng. 2014. Taxonomy construction using syntactic contextual evidence. In Proceedings of EMNLP. pages 810–819.
- Miller (1995) George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
- Nakashole et al. (2012) Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A Taxonomy of Relational Patterns with Semantic Types. In Proceedings of EMNLP-CoNLL. pages 1135–1145.
- Navigli and Velardi (2010) Roberto Navigli and Paola Velardi. 2010. Learning word-class lattices for definition and hypernym extraction. In ACL. pages 1318–1327.
- Navigli et al. (2011) Roberto Navigli, Paola Velardi, and Stefano Faralli. 2011. A graph-based algorithm for inducing lexical taxonomies from scratch. In IJCAI. pages 1872–1877.
- Ponzetto and Strube (2011) Simone Paolo Ponzetto and Michael Strube. 2011. Taxonomy induction based on a collaboratively built knowledge repository. Artificial Intelligence 175(9-10):1737–1756.
- Prager et al. (2008) John Prager, Jennifer Chu-Carroll, Eric W Brown, and Krzysztof Czuba. 2008. Question answering by predictive annotation. In Advances in Open Domain Question Answering, Springer, pages 307–347.
- Roller and Erk (2016) Stephen Roller and Katrin Erk. 2016. Relations such as hypernymy: Identifying and exploiting hearst patterns in distributional vectors for lexical entailment. In Proceedings of EMNLP. Austin, Texas, pages 2163–2172. https://aclweb.org/anthology/D16-1234.
- Roller et al. (2014) Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING 2014. Dublin, Ireland.
- Santus et al. (2014) Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte Im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In EACL. pages 38–42.
- Shwartz et al. (2016) Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of ACL. Berlin, Germany.
- Shwartz et al. (2017) Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of EACL. Association for Computational Linguistics, Valencia, Spain.
- Snow et al. (2004) Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2004. Learning syntactic patterns for automatic hypernym discovery. Advances in Neural Information Processing Systems 17.
- Snow et al. (2006) Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of COLING/ACL 2006. pages 801–808.
- Suchanek et al. (2007) Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In WWW. ACM, pages 697–706.
- Velardi et al. (2013) Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn Reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics 39(3):665–707.
- Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10):78–85.
- Weeds et al. (2014) Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pages 2249–2259.
- Yahya et al. (2013) Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, and Gerhard Weikum. 2013. Robust question answering over the web of linked data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, pages 1107–1116.
- Yamada et al. (2011) Ichiro Yamada, Jong-Hoon Oh, Chikara Hashimoto, Kentaro Torisawa, Jun’ichi Kazama, Stijn De Saeger, and Takuya Kawada. 2011. Extending wordnet with hypernyms and siblings acquired from wikipedia. In IJCNLP. pages 874–882.
- Yang and Callan (2009) Hui Yang and Jamie Callan. 2009. A metric-based framework for automatic taxonomy induction. In Proceedings of ACL/IJCNLP. Association for Computational Linguistics, pages 271–279.
- Yu et al. (2015) Zheng Yu, Haixun Wang, Xuemin Lin, and Min Wang. 2015. Learning term embeddings for hypernymy identification. In Proceedings of IJCAI. pages 1390–1397.