We present an approach to combining distributional semantic representations induced from text corpora with manually constructed lexical-semantic networks. While both kinds of semantic resources are available with high lexical coverage, our aligned resource combines the domain specificity and availability of contextual information from distributional models with the conciseness and high quality of manually crafted lexical networks. We start with a distributional representation of induced senses of vocabulary terms, which are accompanied by rich context information given by related lexical items. We then automatically disambiguate such representations to obtain a full-fledged proto-conceptualization, i.e. a typed graph of induced word senses. In a final step, this proto-conceptualization is aligned to a lexical ontology, resulting in a hybrid aligned resource. Moreover, unmapped induced senses are associated with a semantic type in order to connect them to the core resource. Manual evaluations against ground-truth judgments for different stages of our method as well as an extrinsic evaluation on a knowledge-based Word Sense Disambiguation benchmark all indicate the high quality of the new hybrid resource. Additionally, we show the benefits of enriching top-down lexical knowledge resources with bottom-up distributional information from text for addressing high-end knowledge acquisition tasks such as cleaning hypernym graphs and learning taxonomies from scratch.
A Framework for Enriching Lexical Semantic Resources with Distributional Semantics
Chris Biemann, Stefano Faralli,
Alexander Panchenko, Simone P. Ponzetto
Language Technology Group, Department of Informatics, Faculty of Mathematics,
Informatics, and Natural Sciences, Universität Hamburg, Germany
Data and Web Science Group, School of Business Informatics and Mathematics,
Universität Mannheim, Germany
Automatic enrichment of semantic resources is an important problem (?; ?) as it attempts to alleviate the extremely costly process of manual lexical resource construction. Distributional semantics (?; ?; ?) provides an alternative automatic meaning representation framework that has been shown to benefit many text-understanding applications.
Recent years have witnessed an impressive amount of work on combining the symbolic semantic information available in manually constructed lexical resources with distributional information, where words are usually represented as dense numerical vectors, a.k.a. embeddings. Examples of such approaches include methods that modify the Skip-gram model (?) to learn sense embeddings (?) based on the sense inventory of WordNet, methods that learn embeddings of synsets as given in a lexical resource (?) or methods that acquire word vectors by applying random walks over lexical resources to learn a neural language model (?). Besides, alternative approaches like NASARI (?) and MUFFIN (?) looked at ways to produce joint lexical and semantic vectors for a common representation of word senses in text and in lexical resources. Retrofitting approaches, e.g. (?), efficiently “consume” lexical resources as input in order to improve the quality of word embeddings, but do not add anything to the resource itself. Other approaches, such as AutoExtend (?), NASARI and MUFFIN, learn vector representations for existing synsets that can be added to the resource, but they are not able to add missing senses discovered from text.
In these contributions, the benefits of such hybrid knowledge sources for tasks in computational semantics such as semantic similarity and Word Sense Disambiguation (WSD) (?) have been extensively demonstrated. However, none of the existing approaches, to date, have been designed to use distributional information for the enrichment of lexical semantic resources, i.e. adding new symbolic entries.
In this article, we consequently set out to fill this gap by developing a framework for enriching lexical semantic resources with distributional semantic models. The goal of such a framework is the creation of a resource that brings together the ‘best of both worlds’, namely the precision and interpretability from the lexicographic tradition that describes senses and semantic relations manually, and the versatility of data-driven, corpus-based approaches that derive senses automatically.
A lexical resource enriched with additional knowledge induced from text can boost the performance of natural language understanding tasks like WSD or Entity Linking (?; ?), where it is crucial to have a comprehensive list of word senses as well as the means to assign the correct one of the many possible senses to a given word in context.
Consider, for instance, the sentence:
“Regulator of calmodulin signalling (RCS) knockout mice display anxiety-like behavior and motivational deficits”. (Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3622044)
No synset for “RCS” can be found in either WordNet (?) or BabelNet (?), yet it is distributionally related to other bio-medical concepts and thus can help to disambiguate the ambiguous term mice to its ‘animal’ meaning, as opposed to the ‘computer peripheral device’.
Our approach yields a hybrid resource that combines symbolic and statistical meaning representations while i) staying purely on the lexical-symbolic level, ii) explicitly distinguishing word senses, and iii) being human readable. These properties are crucial to be able to embed such a resource in an explicit semantic data space such as, for instance, the Linguistic Linked Open Data ecosystem (?). According to (?), the Semantic Web and natural language understanding are placed at the heart of current efforts to understand the Web on a large scale.
We take the current line of research on hybrid semantic representations one step forward by presenting a methodology for inducing distributionally-based sense representations from text, and for linking them to a reference lexical resource. Central to our method is a novel sense-based distributional representation that we call proto-conceptualization (PCZ). A PCZ is a repository of word senses induced from text, where each sense is represented with related senses, hypernymous senses, and aggregated clues for contexts in which the respective sense occurs in text. Besides, to further improve interpretability, each sense has an associated image. We build a bridge between the PCZ and lexical semantic networks by establishing a mapping between these two kinds of resources. (We use WordNet and BabelNet; however, our approach can be used with similar resources, e.g. those listed at http://globalwordnet.org/wordnets-in-the-world.) This results in a new knowledge resource that we refer to as hybrid aligned resource: here, senses induced from text are aligned to a set of synsets from a reference lexical resource, whereas induced senses that cannot be matched are included as additional synsets. By linking our distributional representations to a repository of symbolically-encoded knowledge, we are able to complement wide-coverage statistical meaning representations with explicit relational knowledge as well as to extend the coverage of the reference lexical resource based on the senses induced from a text collection. The main contributions of this article are listed as follows:
A framework for enriching lexical semantic resources: we present a methodology for combining information from distributional semantic models with manually constructed lexical semantic resources.
A hybrid lexical semantic resource: we apply our framework to produce a novel hybrid resource obtained by linking disambiguated distributional lexical semantic networks to WordNet and BabelNet.
Applications of the hybrid resource: besides intrinsic evaluations of our approach, we test the utility of our resource extrinsically on the tasks of word sense disambiguation and taxonomy induction, demonstrating the benefits of combining distributional and symbolic lexical semantic knowledge.
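To make the notions of a PCZ and a hybrid aligned resource more concrete, the following minimal Python sketch shows one possible way to represent induced senses and to align them with reference synsets. It is an illustration under stated assumptions, not the implementation used in this article: the class `InducedSense`, the function `build_hybrid_resource`, the `min_overlap` threshold, the identifiers such as `ref:mouse-animal`, and all toy entries are hypothetical, and the simple word-overlap criterion is only a placeholder for the linking method described later.

```python
from dataclasses import dataclass, field


@dataclass
class InducedSense:
    """One PCZ entry: a word sense induced from text (field names are illustrative)."""
    sense_id: str                                     # e.g. "mouse#1"
    related: set = field(default_factory=set)         # related senses
    hypernyms: set = field(default_factory=set)       # hypernymous senses
    context_clues: set = field(default_factory=set)   # aggregated context clues


def build_hybrid_resource(pcz, synsets, min_overlap=2):
    """Align induced senses to reference synsets; keep unmatched senses as new entries.

    `synsets` maps a synset id to a bag of words describing that synset.
    The overlap criterion is a placeholder chosen for illustration only.
    """
    aligned, added = {}, []
    for sense in pcz:
        # Strip the sense index ("rat#1" -> "rat") to compare with synset words.
        bag = {s.split("#")[0] for s in sense.related | sense.hypernyms}
        best_id, best_score = None, 0
        for synset_id, words in synsets.items():
            score = len(bag & words)
            if score > best_score:
                best_id, best_score = synset_id, score
        if best_score >= min_overlap:
            aligned[sense.sense_id] = best_id    # induced sense mapped to a synset
        else:
            added.append(sense.sense_id)         # unmapped sense becomes a new entry
    return aligned, added


# Toy data (hypothetical), loosely following the "RCS knockout mice" example.
pcz = [
    InducedSense("mouse#1", {"rat#1", "rodent#1"}, {"animal#1"}, {"knockout", "behavior"}),
    InducedSense("mouse#2", {"keyboard#1", "joystick#1"}, {"device#1"}, {"click", "usb"}),
    InducedSense("rcs#1", {"calmodulin#1", "kinase#1"}, {"protein#1"}, {"signalling"}),
]
synsets = {
    "ref:mouse-animal": {"rat", "rodent", "animal", "mammal"},
    "ref:mouse-device": {"keyboard", "device", "computer", "peripheral"},
}
aligned, added = build_hybrid_resource(pcz, synsets)
```

In this toy setting, both induced senses of "mouse" are mapped to reference synsets, whereas the induced sense of "RCS" has no counterpart in the reference resource and is therefore kept as an additional entry, mirroring the coverage extension described above.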
The remainder of this article is organized as follows: we first review related work in Section 2 and provide an overview of our framework to build a hybrid aligned resource from distributional semantic vectors and a reference knowledge graph in Section 3. Next, we provide details on our methodology to construct PCZs and to link them to a lexical resource in Sections LABEL:sec:DDTdesc and LABEL:sec:linking, respectively. In Section LABEL:sec:experiments, we benchmark the quality of our resource in different evaluation settings, including an intrinsic and an extrinsic evaluation on the task of knowledge-based WSD using a dataset from a SemEval task. Section LABEL:sec:applications provides two application-based evaluations that demonstrate how our hybrid resource can be used for taxonomy induction. We conclude with final remarks and future directions in Section LABEL:sec:conclusions.
2 Related Work
2.1 Automatic Construction of Lexical Semantic Resources
In the past decade, large efforts have been undertaken to research the automatic acquisition of machine-readable knowledge on a large scale by mining large repositories of textual data (?; ?; ?; ?). In this line of work, collaboratively constructed resources have been exploited, used either in isolation (?; ?; ?), or complemented with manually assembled knowledge sources (?; ?; ?; ?). As a result, recent years have seen a remarkable renaissance of knowledge-rich approaches for many different artificial intelligence tasks (?). Knowledge contained within these very large knowledge repositories, however, has major limitations in that these resources typically do not contain any linguistically grounded probabilistic representation of concepts, instances, and their attributes – namely, the bridge between wide-coverage conceptual knowledge and its instantiation within natural language texts. While there are large-scale lexical resources derived from large corpora such as ProBase (?), these are usually not sense-aware but conflate the notions of term and concept. With this work, we provide a framework that aims at augmenting any of these wide-coverage knowledge sources with distributional semantic information, thus extending them with text-based contextual information.
Another line of research has looked at the problem of Knowledge Base Completion (KBC) (?). Many approaches to KBC focus on exploiting other KBs (?; ?) for acquiring additional knowledge, or rely on text corpora – either based on distant supervision (?; ?; ?) or by rephrasing KB relations as queries to a search engine (?) that returns results from the web as corpus. Alternative methods primarily rely on existing information in the KB itself (?; ?; ?; ?) to simultaneously learn continuous representations of KB concepts and KB relations by exploiting the KB structure as the ground truth for supervision, inferring additional relations from existing ones. Lexical semantic resources and text are synergistic sources, as shown by complementary work from (?), who improve the quality of semantic vectors based on lexicon-derived relational information.
Here, we follow this intuition of combining structured knowledge resources with distributional semantic information from text, but focus instead on providing hybrid semantic representations for KB concepts and entities, as opposed to the classification task of KBC that aims at predicting additional semantic relations between known entities.
2.2 Combination of Distributional Semantics with Lexical Resources
Several prior approaches combined distributional information extracted from text with information available in lexical resources such as WordNet. This includes a model (?) to learn word embeddings based on lexical relations of words from WordNet and PPDB (?). The objective function of this model combines that of the skip-gram model (?) with a term that takes into account lexical relations of target words. In work aimed at retrofitting word vectors (?), a related approach was proposed that performs post-processing of word embeddings on the basis of lexical relations from lexical resources. Finally, (?) also aim at improving word vector representations by using lexical relations from WordNet, targeting similar representations of synonyms and dissimilar representations of antonyms. While all three of these approaches show excellent performance on word relatedness evaluations, they do not model word senses – in contrast to other work aimed instead at learning sense embeddings using the word sense inventory of WordNet (?).
A parallel line of research has recently focused on learning unified statistical and symbolic semantic representations. Approaches aimed at providing unified semantic representations from distributional information and lexical resources have accordingly received an increasing level of attention (?; ?; ?; ?; ?), inter alia (cf. also our introductory discussion in Section 1), and hybrid meaning representations have been shown to benefit challenging semantic tasks such as WSD and semantic similarity at word level and text level.
All these diverse contributions indicate the benefits of hybrid knowledge sources for learning word and sense representations: here, we further elaborate along this line of research and develop a new hybrid resource that combines information from the knowledge graph with distributional sense representations that are human readable and easy to interpret, in contrast to dense vector representations, a.k.a. word embeddings like GloVe (?) or word2vec (?). As a result of this, we are able to provide, to the best of our knowledge, the first hybrid knowledge resource that is fully integrated and embedded within the Semantic Web ecosystem provided by the Linguistic Linked Open Data cloud (?). Note that this is complementary to recent efforts aimed at linking natural language expressions in text with semantic relations found within LOD knowledge graphs (?), in that we focus instead on combining explicit semantic information with statistical, distributional semantic representations of concepts and entities into an augmented resource.