Specialising Word Vectors for Lexical Entailment

Ivan Vulić    Nikola Mrkšić
University of Cambridge
{iv250,nm480}@cam.ac.uk

Abstract

We present lear (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasise the asymmetric relation of lexical entailment (LE), also known as the is-a or hyponymy-hypernymy relation. By injecting external linguistic constraints (e.g., WordNet links) into the initial vector space, the LE specialisation procedure brings true hyponymy-hypernymy pairs closer together in the transformed Euclidean space. The proposed asymmetric distance measure adjusts the norms of word vectors to reflect the actual WordNet-style hierarchy of concepts. Simultaneously, a joint objective enforces semantic similarity using the symmetric cosine distance, yielding a vector space specialised for both lexical relations at once. lear specialisation achieves state-of-the-art performance in the tasks of hypernymy directionality, hypernymy detection and graded lexical entailment, demonstrating the effectiveness and robustness of the proposed model.

1 Introduction

Word representation learning has become a research area of central importance in NLP, with its usefulness demonstrated across application areas such as parsing (Chen and Manning, 2014), machine translation (Zou et al., 2013), and many others (Turian et al., 2010; Collobert et al., 2011). Standard techniques for inducing word embeddings rely on the distributional hypothesis (Harris, 1954), using co-occurrence information from large textual corpora to learn meaningful word representations (Mikolov et al., 2013; Levy and Goldberg, 2014; Pennington et al., 2014; Bojanowski et al., 2017).

A major drawback of the distributional hypothesis is that it coalesces different relationships, such as synonymy and topical relatedness, into a single vector space. A popular solution is to go beyond stand-alone unsupervised learning and fine-tune distributional vector spaces by using external knowledge from human- or automatically-constructed knowledge bases. This is often done as a post-processing step, where distributional vectors are gradually refined to satisfy linguistic constraints extracted from lexical resources such as WordNet (Faruqui et al., 2015; Mrkšić et al., 2016), the Paraphrase Database (PPDB) (Wieting et al., 2015), or BabelNet (Mrkšić et al., 2017; Vulić et al., 2017a). One advantage of post-processing methods is that they treat the input vector space as a black box, making them applicable to any input space.

A key property of these methods is their ability to transform the vector space by specialising it for a particular relationship between words.[1] Prior work has predominantly focused on distinguishing between semantic similarity and conceptual relatedness (Faruqui et al., 2015; Mrkšić et al., 2017; Vulić et al., 2017b). In this paper, we introduce a novel post-processing model which specialises vector spaces for the lexical entailment (le) relation.

[1] Distinguishing between synonymy and antonymy has a positive impact on real-world language understanding tasks such as Dialogue State Tracking (Mrkšić et al., 2017).

Word-level lexical entailment is an asymmetric semantic relation (Collins and Quillian, 1972; Beckwith et al., 1991). It is a key principle determining the organization of semantic networks into hierarchical structures such as semantic ontologies (Fellbaum, 1998). Automatic reasoning about le supports tasks such as taxonomy creation (Snow et al., 2006; Navigli et al., 2011), natural language inference (Dagan et al., 2013; Bowman et al., 2015), text generation (Biran and McKeown, 2013), and metaphor detection (Mohler et al., 2013).

Our novel le specialisation model, termed lear (Lexical Entailment Attract-Repel), is inspired by Attract-Repel, a state-of-the-art general specialisation framework (Mrkšić et al., 2017).[2] The key idea of lear, illustrated by Figure 1, is to pull desirable (attract) examples described by the constraints closer together, while at the same time pushing undesirable (repel) word pairs away from each other. Concurrently, lear (re-)arranges vector norms so that norm values in the Euclidean space reflect the hierarchical organization of concepts according to the given le constraints: put simply, higher-level concepts are assigned larger norms. Therefore, lear simultaneously captures the hierarchy of concepts (through vector norms) and their similarity (through their cosine distance). The two pivotal pieces of information are combined into an asymmetric distance measure which quantifies the LE strength in the specialised space.

[2] https://github.com/nmrksic/attract-repel

Figure 1: An illustration of lear specialisation. lear controls the arrangement of vectors in the transformed vector space by: 1) emphasising symmetric similarity of le pairs through cosine distance (by enforcing small angles between the vectors of each le pair); and 2) by imposing an le ordering using vector norms, adjusting them so that higher-level concepts have larger norms than the concepts they subsume.

After specialising four well-known input vector spaces with lear, we test them in three standard word-level le tasks (Kiela et al., 2015): 1) hypernymy directionality; 2) hypernymy detection; and 3) combined hypernymy detection/directionality. Our specialised vectors yield notable improvements over the strongest baselines for each task, with each input space, demonstrating the effectiveness and robustness of lear specialisation.

The employed asymmetric distance allows one to make graded assertions about hierarchical relationships between concepts in the specialised space. This property is evaluated using HyperLex, a recent graded LE dataset (Vulić et al., 2017). The lear-specialised vectors push state-of-the-art Spearman’s correlation from 0.540 to 0.686 on the full dataset (2,616 word pairs), and from 0.512 to 0.705 on its noun subset (2,163 word pairs).

2 Methodology

2.1 The Attract-Repel Framework

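Let $V$ be the vocabulary, $A$ the set of Attract word pairs (e.g., intelligent and brilliant), and $R$ the set of Repel word pairs (e.g., vacant and occupied). The Attract-Repel procedure operates over mini-batches of such pairs, $\mathcal{B}_A$ and $\mathcal{B}_R$. For ease of notation, let each word pair $(x_l, x_r)$ in these two sets correspond to a vector pair $(\mathbf{x}_l, \mathbf{x}_r)$, so that a mini-batch of $k_1$ word pairs is given by $\mathcal{B}_A = [(\mathbf{x}_l^1, \mathbf{x}_r^1), \ldots, (\mathbf{x}_l^{k_1}, \mathbf{x}_r^{k_1})]$ (similarly for $\mathcal{B}_R$, which consists of $k_2$ example pairs).

Next, the sets of pseudo-negative examples $T_A = [(\mathbf{t}_l^1, \mathbf{t}_r^1), \ldots, (\mathbf{t}_l^{k_1}, \mathbf{t}_r^{k_1})]$ and $T_R = [(\mathbf{t}_l^1, \mathbf{t}_r^1), \ldots, (\mathbf{t}_l^{k_2}, \mathbf{t}_r^{k_2})]$ are defined as pairs of negative examples for each Attract and Repel example pair in mini-batches $\mathcal{B}_A$ and $\mathcal{B}_R$. These negative examples are chosen from the word vectors present in $\mathcal{B}_A$ or $\mathcal{B}_R$ so that, for each Attract pair $(\mathbf{x}_l, \mathbf{x}_r)$, the negative example pair $(\mathbf{t}_l, \mathbf{t}_r)$ is chosen so that $\mathbf{t}_l$ is the vector closest (in terms of cosine distance) to $\mathbf{x}_l$ and $\mathbf{t}_r$ is closest to $\mathbf{x}_r$. Similarly, for each Repel pair $(\mathbf{x}_l, \mathbf{x}_r)$, the negative example pair $(\mathbf{t}_l, \mathbf{t}_r)$ is chosen from the remaining in-batch vectors so that $\mathbf{t}_l$ is the vector furthest away from $\mathbf{x}_l$ and $\mathbf{t}_r$ is furthest from $\mathbf{x}_r$.

The negative examples are used to: a) force Attract pairs to be closer to each other than to their respective negative examples; and b) to force Repel pairs to be further away from each other than from their negative examples. The first term of the cost function pulls Attract pairs together:

$$Att(\mathcal{B}_A, T_A) = \sum_{i=1}^{k_1} \Big[ \tau\big(\delta_{att} + \cos(\mathbf{x}_l^i, \mathbf{t}_l^i) - \cos(\mathbf{x}_l^i, \mathbf{x}_r^i)\big) + \tau\big(\delta_{att} + \cos(\mathbf{x}_r^i, \mathbf{t}_r^i) - \cos(\mathbf{x}_l^i, \mathbf{x}_r^i)\big) \Big]$$

where $\tau(x) = \max(0, x)$ is the hinge loss function, $\cos$ denotes cosine similarity, and $\delta_{att}$ is the attract margin which determines how much closer these vectors should be to each other than to their respective negative examples. The second part of the cost function pushes Repel word pairs away from each other, with $\delta_{rep}$ as the corresponding repel margin:

$$Rep(\mathcal{B}_R, T_R) = \sum_{i=1}^{k_2} \Big[ \tau\big(\delta_{rep} + \cos(\mathbf{x}_l^i, \mathbf{x}_r^i) - \cos(\mathbf{x}_l^i, \mathbf{t}_l^i)\big) + \tau\big(\delta_{rep} + \cos(\mathbf{x}_l^i, \mathbf{x}_r^i) - \cos(\mathbf{x}_r^i, \mathbf{t}_r^i)\big) \Big]$$

In addition to these two terms, an additional regularisation term is used to preserve the abundance of high-quality semantic content present in the distributional vector space, as long as this information does not contradict the injected linguistic constraints. If $V(\mathcal{B})$ is the set of all word vectors present in the given mini-batch, then:

$$Reg(\mathcal{B}_A, \mathcal{B}_R) = \sum_{\mathbf{x}_i \in V(\mathcal{B}_A \cup \mathcal{B}_R)} \lambda_{reg} \, \lVert \widehat{\mathbf{x}}_i - \mathbf{x}_i \rVert_2$$

where $\lambda_{reg}$ is the L2 regularization constant and $\widehat{\mathbf{x}}_i$ denotes the original (distributional) word vector for word $x_i$. The full Attract-Repel cost function is given by the sum of all three terms.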
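The three terms described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the representation of a mini-batch as a list of vectors plus index pairs, and the choice to exclude a word's own partner when picking its negative example are all assumptions made here for the sake of a small runnable example.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hinge(x):
    """Hinge loss: max(0, x)."""
    return max(0.0, x)

def pick_negative(anchor, partner, vectors, closest=True):
    """Index of the in-batch vector most similar (closest=True) or least
    similar (closest=False) to vectors[anchor].
    Assumption: the anchor itself and its partner are not eligible."""
    candidates = [i for i in range(len(vectors)) if i not in (anchor, partner)]
    similarity = lambda i: cos_sim(vectors[anchor], vectors[i])
    return max(candidates, key=similarity) if closest else min(candidates, key=similarity)

def attract_term(pairs, vectors, delta_att):
    """Each Attract pair should be more similar to each other than to the
    nearest in-batch negative example, by at least the margin delta_att."""
    total = 0.0
    for l, r in pairs:
        t_l = pick_negative(l, r, vectors, closest=True)
        t_r = pick_negative(r, l, vectors, closest=True)
        pair_sim = cos_sim(vectors[l], vectors[r])
        total += hinge(delta_att + cos_sim(vectors[l], vectors[t_l]) - pair_sim)
        total += hinge(delta_att + cos_sim(vectors[r], vectors[t_r]) - pair_sim)
    return total

def repel_term(pairs, vectors, delta_rep):
    """Each Repel pair should be less similar to each other than to the
    furthest in-batch negative example, by at least the margin delta_rep."""
    total = 0.0
    for l, r in pairs:
        t_l = pick_negative(l, r, vectors, closest=False)
        t_r = pick_negative(r, l, vectors, closest=False)
        pair_sim = cos_sim(vectors[l], vectors[r])
        total += hinge(delta_rep + pair_sim - cos_sim(vectors[l], vectors[t_l]))
        total += hinge(delta_rep + pair_sim - cos_sim(vectors[r], vectors[t_r]))
    return total

def reg_term(vectors, originals, lambda_reg):
    """Scaled L2 distance between each updated vector and its original."""
    return sum(lambda_reg * math.dist(x, x0) for x, x0 in zip(vectors, originals))
```

In this sketch the full cost for a mini-batch would simply be `attract_term(...) + repel_term(...) + reg_term(...)`; a real implementation would compute the same quantities with batched tensor operations and differentiate through them.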

2.2 Lear: Encoding Lexical Entailment

In this section, the Attract-Repel framework is extended to model lexical entailment jointly with (symmetric) semantic similarity. To do this, the method uses an additional source of external lexical knowledge: let $L$ be the set of directed lexical entailment constraints such as (corgi, dog), (dog, animal), or (corgi, animal), with lower-level concepts on the left and higher-level ones on the right (the source of these constraints will be discussed in Section 3). The optimisation proceeds in the same way as before, considering a mini-batch of le pairs $\mathcal{B}_L$ consisting of $k_3$ word pairs standing in the (directed) lexical entailment relation.

Unlike symmetric similarity, lexical entailment is an asymmetric relation which encodes a hierarchical ordering between concepts. Inferring the direction of the entailment relation between word vectors requires the use of an asymmetric distance function. We define three different ones, all of which use the word vector’s norms to impose an ordering between high- and low-level concepts:

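$$D_1(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} \rVert - \lVert \mathbf{y} \rVert \tag{1}$$

$$D_2(\mathbf{x}, \mathbf{y}) = \frac{\lVert \mathbf{x} \rVert - \lVert \mathbf{y} \rVert}{\lVert \mathbf{x} \rVert + \lVert \mathbf{y} \rVert} \tag{2}$$

$$D_3(\mathbf{x}, \mathbf{y}) = \frac{\lVert \mathbf{x} \rVert - \lVert \mathbf{y} \rVert}{\max(\lVert \mathbf{x} \rVert, \lVert \mathbf{y} \rVert)} \tag{3}$$

The lexical entailment term (for the $j$-th asymmetric distance, $j \in \{1, 2, 3\}$) is defined over the $k_3$ vector pairs $(\mathbf{x}_i, \mathbf{y}_i)$ of the le mini-batch $\mathcal{B}_L$ as:

$$LE_j(\mathcal{B}_L) = \sum_{i=1}^{k_3} D_j(\mathbf{x}_i, \mathbf{y}_i) \tag{4}$$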

The first distance serves as the baseline: it uses the word vectors’ norms to order the concepts, that is, to decide which of the words is likely to be the higher-level concept. In this case, the magnitude of the difference between the two norms determines the ‘intensity’ of the le relation. This is potentially problematic, as this distance does not impose a limit on the vectors’ norms. The second and third metrics take a more sophisticated approach, using the ratios of the differences between the two norms and either: a) the sum of the two norms; or b) the larger of the two norms. In doing that, these metrics ensure that the cost function only considers the norms’ ratios. This means that the cost function no longer has the incentive to increase word vectors’ norms past a certain point, as the magnitudes of norm ratios grow in size much faster than the linear relation defined by the first distance function.
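A small numeric sketch can make the difference between the unbounded and the ratio-based distances concrete. The code below assumes the lower-level concept is passed as the first argument, so that a true le pair yields a negative value; the toy vectors and function names are made up for illustration and are not taken from the paper's code.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def d1(x, y):
    """Plain difference of norms: unbounded."""
    return norm(x) - norm(y)

def d2(x, y):
    """Norm difference divided by the sum of the norms: lies in (-1, 1)."""
    return (norm(x) - norm(y)) / (norm(x) + norm(y))

def d3(x, y):
    """Norm difference divided by the larger norm: lies in (-1, 1)."""
    return (norm(x) - norm(y)) / max(norm(x), norm(y))

def le_term(pairs, distance):
    """Sum of an asymmetric distance over (hyponym, hypernym) vector pairs."""
    return sum(distance(x, y) for x, y in pairs)

# Made-up toy vectors: a hyponym with norm 1 and a hypernym with norm 5.
hyponym, hypernym = [0.6, 0.8], [3.0, 4.0]

# Rescaling both vectors changes d1 but leaves d2 and d3 untouched.
for scale in (1, 10, 100):
    x = [scale * a for a in hyponym]
    y = [scale * a for a in hypernym]
    print(scale, round(d1(x, y), 3), round(d2(x, y), 3), round(d3(x, y), 3))
```

Because the first distance keeps decreasing as both norms are scaled up, an optimiser minimising it can always lower the cost by inflating norms, whereas the two ratio-based variants depend only on the relative sizes of the norms.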

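To model the semantic and the le relations jointly, the lear cost function jointly optimises the terms of the expanded cost function, which adds two terms computed over the le mini-batch to the Attract, Repel and regularisation terms of Section 2.1:

$$C = Att(\mathcal{B}_A, T_A) + Rep(\mathcal{B}_R, T_R) + Reg(\mathcal{B}_A, \mathcal{B}_R, \mathcal{B}_L) + Att(\mathcal{B}_L, T_L) + LE_j(\mathcal{B}_L)$$

where $T_L$ denotes the negative examples selected for the pairs in $\mathcal{B}_L$.

Le Pairs as Attract Constraints

The combined cost function makes use of the batch of lexical constraints $\mathcal{B}_L$ twice: once in the defined asymmetric cost function $LE_j$, and once in the symmetric Attract term $Att(\mathcal{B}_L, T_L)$. This means that words standing in the lexical entailment relation are forced to be similar both in terms of cosine distance (via the symmetric Attract term) and in terms of the asymmetric distance from Eq. (4).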

Decoding Lexical Entailment

The defined cost function serves to encode semantic similarity and le relations in the same vector space. Whereas the similarity can be inferred from the standard cosine distance, the lear optimisation embeds lexical entailment as a combination of the symmetric Attract term and the newly defined asymmetric cost function. Consequently, the metric used to determine whether two words stand in the le relation must combine the two cost terms as well. We define the le decoding metric as:

$$I_{LE}(\mathbf{x}, \mathbf{y}) = dcos(\mathbf{x}, \mathbf{y}) + D_j(\mathbf{x}, \mathbf{y}) \tag{5}$$

where $dcos$ denotes the cosine distance. This decoding function combines the symmetric and the asymmetric cost term, in line with the combination of the two used to perform lear specialisation. In the evaluation, we show that combining the two cost terms has a synergistic effect, with both terms contributing to stronger performance across all le tasks used for evaluation.
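As a rough illustration of the decoding step, the sketch below adds a cosine distance to a norm-based asymmetric distance. It assumes the sum-normalised variant (norm difference over norm sum) and the convention that the candidate hyponym is the first argument; the function names and the toy vectors are hypothetical.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def cos_dist(x, y):
    """Cosine distance: 1 minus the cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (norm(x) * norm(y))

def norm_ratio(x, y):
    """Asymmetric part: norm difference divided by the sum of norms."""
    return (norm(x) - norm(y)) / (norm(x) + norm(y))

def le_score(x, y):
    """Combined decoding score: symmetric cosine distance plus the
    asymmetric norm-based distance."""
    return cos_dist(x, y) + norm_ratio(x, y)

# Made-up vectors pointing in similar directions but with different norms.
small = [1.0, 0.0]   # stands in for a lower-level concept
large = [3.0, 0.3]   # stands in for a higher-level concept

print(le_score(small, large))  # hyponym first
print(le_score(large, small))  # reversed order
```

The cosine part is identical for both argument orders, so any difference between the two printed values comes from the norm-based part, which is what makes the combined score direction-sensitive.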

3 Experimental Setup

Starting Distributional Vectors

To test the robustness of lear specialisation, we experiment with a variety of well-known, publicly available English word vectors: 1) Skip-Gram with Negative Sampling (SGNS) (Mikolov et al., 2013) trained on the Polyglot Wikipedia (Al-Rfou et al., 2013) by Levy and Goldberg (2014); 2) Glove Common Crawl (Pennington et al., 2014); 3) context2vec (Melamud et al., 2016), which replaces CBOW contexts with contexts based on bidirectional LSTMs (Hochreiter and Schmidhuber, 1997); and 4) fastText (Bojanowski et al., 2017), a SGNS variant which builds word vectors as the sum of their constituent character n-gram vectors.[3]

[3] All vector collections share the same dimensionality except for context2vec, whose vectors have a different dimensionality; for further details regarding the architectures and training setup of the used vector collections, we refer the reader to the original papers.

Linguistic Constraints

We use three groups of linguistic constraints in the lear specialisation model, covering three different relation types which are all beneficial to the specialisation process: directed 1) lexical entailment (le) pairs; 2) synonymy pairs; and 3) antonymy pairs. Synonyms are included as symmetric attract pairs (i.e., the Attract pairs of Section 2.1) since they can be seen as defining a trivial symmetric is-a relation (Rei and Briscoe, 2014; Vulić et al., 2017). For a similar reason, antonyms are clear repel constraints as they anti-correlate with the LE relation.[4] Synonymy and antonymy constraints are taken from prior work (Zhang et al., 2014; Ono et al., 2015): they are extracted from WordNet (Fellbaum, 1998) and Roget (Kipfer, 2009). In total, we work with 1,023,082 synonymy pairs (11.7 synonyms per word on average) and 380,873 antonymy pairs (6.5 per word).[5]

[4] In short, the question “Is X a type of Y?” is trivially true when X and Y are synonyms, and trivially false when they are antonyms.

[5] https://github.com/tticoin/AntonymDetection

As in prior work (Nguyen et al., 2017; Nickel and Kiela, 2017), le constraints are extracted from the WordNet hierarchy, relying on the transitivity of the le relation. This means that we include both direct and indirect le pairs in our set of constraints (e.g., (pangasius, fish), (fish, animal), and (pangasius, animal)). We retained only noun-noun and verb-verb pairs, while the rest were discarded: the final number of le constraints is 1,545,630.[6]

[6] We also experimented with additional 30,491 le constraints from the Paraphrase Database (PPDB) 2.0 (Pavlick et al., 2015). Adding them to the WordNet-based le pairs makes no significant impact on the final performance. We also used synonymy and antonymy pairs from other sources, such as word pairs from PPDB used previously by Wieting et al. (2015), and BabelNet (Navigli and Ponzetto, 2012) used by Mrkšić et al. (2017), reaching the same conclusions.
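The step from direct to indirect le pairs amounts to taking the transitive closure of the hypernymy links. The sketch below shows one way this could be done on a toy list of direct (hyponym, hypernym) pairs; it does not read WordNet itself, and the function name and data layout are assumptions made for illustration.

```python
from collections import defaultdict

def le_closure(direct_pairs):
    """Expand direct (hyponym, hypernym) links into all direct and
    indirect pairs by following hypernym links upwards."""
    parents = defaultdict(set)
    for hyponym, hypernym in direct_pairs:
        parents[hyponym].add(hypernym)

    def ancestors(word, seen):
        # Depth-first walk up the hierarchy; `seen` guards against revisits.
        for parent in parents.get(word, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    return {(word, anc) for word in list(parents) for anc in ancestors(word, set())}

# Toy input using the example from the text.
direct = [("pangasius", "fish"), ("fish", "animal")]
for pair in sorted(le_closure(direct)):
    print(pair)
```

On the toy input this yields the two direct pairs plus the indirect pair (pangasius, animal), matching the example in the text. A filter on part of speech would be applied separately to keep only noun-noun and verb-verb pairs.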

Training Setup

We adopt the original Attract-Repel model setup without any fine-tuning. The hyperparameters (the attract margin, the repel margin, and the L2 regularisation constant) take the values used by Mrkšić et al. (2017). The models are trained for 5 epochs, with batch sizes chosen for faster convergence.

4 Results and Discussion

We test and analyse lear-specialised vector spaces in two standard word-level le tasks used in prior work: hypernymy directionality and detection (Section 4.1) and graded le (Section 4.2).

4.1 LE Directionality and Detection

The first evaluation uses three classification-style tasks with increased levels of difficulty. The tasks are evaluated on three datasets used extensively in the le literature (Roller et al., 2014; Santus et al., 2014; Weeds et al., 2014; Shwartz et al., 2017; Nguyen et al., 2017), compiled into an integrated evaluation set by Kiela et al. (2015).[7]

[7] http://www.cl.cam.ac.uk/dk427/generality.html

Figure 2: Summary of the results on three different word-level LE subtasks: (a) directionality (BLESS); (b) detection (WBLESS); (c) detection and directionality (BIBLESS). Vertical bars denote the results obtained by different input word vector spaces which are post-processed/specialised by our lear specialisation model using three variants of the asymmetric distance (D1, D2, D3), see Section 2. Thick horizontal red lines refer to the best reported scores on each subtask for these datasets; the baseline scores are taken from Nguyen et al. (2017).

The first task, le directionality, is conducted on 1,337 le pairs originating from the bless evaluation set (Baroni and Lenci, 2011). Given a true le pair, the task is to predict the correct hypernym. With lear-specialised vectors this is achieved by simply comparing the vector norms of each concept in a pair: the one with the larger norm is the hypernym (see Figure 1).
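The norm comparison used for directionality is simple enough to write down directly. The following sketch assumes word vectors are stored in a dictionary keyed by word; the helper names and the toy vectors are invented for illustration and do not come from the evaluation scripts.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def predict_hypernym(word_a, word_b, vectors):
    """Return whichever word has the larger vector norm."""
    return word_a if norm(vectors[word_a]) > norm(vectors[word_b]) else word_b

def directionality_accuracy(gold_pairs, vectors):
    """Share of gold (hyponym, hypernym) pairs whose hypernym is
    recovered by the norm comparison."""
    correct = sum(predict_hypernym(hypo, hyper, vectors) == hyper
                  for hypo, hyper in gold_pairs)
    return correct / len(gold_pairs)

# Made-up 2-dimensional vectors whose norms grow with generality.
toy_vectors = {
    "terrier": [0.5, 0.7],
    "dog": [2.0, 1.7],
    "animal": [9.0, 7.8],
}
print(directionality_accuracy([("terrier", "dog"), ("dog", "animal")], toy_vectors))
```

Note that the prediction ignores the angle between the two vectors entirely, which is appropriate here because every pair in this task is already known to be a true le pair.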

The second task, le detection, involves a binary classification on the wbless dataset (Weeds et al., 2014) which comprises 1,668 word pairs standing in a variety of relations (le, meronymy-holonymy, co-hyponymy, reversed le, no relation). The model has to detect a true le pair, that is, to distinguish the pairs for which the statement “X is a (type of) Y” is true from all other pairs. With lear vectors, this classification is based on the asymmetric distance score: if the score is above a certain threshold, we classify the pair as “true le”, otherwise as “other”. While Kiela et al. (2015) manually define the threshold value, we follow the approach of Nguyen et al. (2017) and cross-validate: in each of the 1,000 iterations, 2% of the pairs are sampled for threshold tuning, and the remaining 98% are used for testing. The reported numbers are therefore average accuracy scores.[8]

[8] We have conducted more le directionality and detection experiments on other datasets such as EVALution (Santus et al., 2015), the dataset of Baroni et al. (2012), and the dataset of Lenci and Benotto (2012) with similar performances and findings. We do not report all these results for brevity and clarity of presentation.
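The threshold cross-validation protocol can be sketched as follows. This is a guess at a reasonable implementation rather than the evaluation script used in the paper: the way candidate thresholds are enumerated, the tie-breaking, and the function names are all assumptions. Scores are treated as in the text, with a pair classified as true le when its score exceeds the threshold.

```python
import random

def accuracy(scores, labels, threshold):
    """Fraction of pairs where (score > threshold) matches the gold label."""
    hits = sum((s > threshold) == bool(l) for s, l in zip(scores, labels))
    return hits / len(scores)

def best_threshold(scores, labels):
    """Pick the candidate threshold with the highest tuning accuracy.
    Candidates: every observed score, plus one value below the minimum."""
    observed = sorted(set(scores))
    candidates = [observed[0] - 1.0] + observed
    return max(candidates, key=lambda t: accuracy(scores, labels, t))

def cross_validated_accuracy(scores, labels, iterations=1000,
                             tune_fraction=0.02, seed=0):
    """Average test accuracy over repeated random tune/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(scores)))
    n_tune = max(1, round(tune_fraction * len(indices)))
    total = 0.0
    for _ in range(iterations):
        rng.shuffle(indices)
        tune, test = indices[:n_tune], indices[n_tune:]
        threshold = best_threshold([scores[i] for i in tune],
                                   [labels[i] for i in tune])
        total += accuracy([scores[i] for i in test],
                          [labels[i] for i in test], threshold)
    return total / iterations
```

Tuning on only 2% of the pairs keeps almost all data for testing, at the price of a noisy threshold in each iteration; averaging over many random splits smooths out that noise.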

The final task, le detection and directionality, concerns a three-way classification on bibless, a relabeled version of wbless. The task is now to distinguish both le pairs (label 1) and reversed le pairs (label −1) from other relations (label 0), and then additionally select the correct hypernym in each detected le pair. We apply the same test protocol as in the le detection task.

Results and Analysis

The original paper of Kiela et al. (2015) reports the following best scores on each task: 0.88 (bless), 0.75 (wbless), 0.57 (bibless). These scores were recently surpassed by Nguyen et al. (2017), who, instead of post-processing, combine WordNet-based constraints with an SGNS-style objective into a joint model. They report the best scores to date: 0.92 (bless), 0.87 (wbless), and 0.81 (bibless).

The performance of the four lear-specialised word vector collections is shown in Figure 2 (together with the strongest baseline scores for each of the three tasks). The comparative analysis confirms the increased complexity of subsequent tasks. lear specialisation of each of the starting vector spaces consistently outperformed all baseline scores across all three tasks. The extent of the improvements is correlated with task difficulty: it is lowest for the easiest directionality task (from 0.92 to 0.96 for the best lear vectors), and highest for the most difficult detection plus directionality task (from 0.81 to 0.88).

The results show that the two lear variants which do not rely on absolute norm values and perform a normalisation step in the asymmetric distance (D2 and D3) have a slight edge over the D1 variant which operates with unbounded norms. The difference in performance between D2/D3 and D1 is more pronounced in the graded LE task (see Section 4.2). This shows that the use of unbounded vector norms diminishes the importance of the symmetric cosine distance in the combined asymmetric distance. Conversely, the synergistic combination used in D2/D3 does not suffer from this issue.

The high scores achieved with each of the four word vector collections show that lear is not dependent on any particular word representation architecture. Moreover, the extent of the performance improvements in each task suggests that lear is able to reconstruct the concept hierarchy coded in the input linguistic constraints.

Concept       Norm     Concept     Norm     Concept            Norm
terrier       0.87     laptop      0.60     cabriolet          0.74
dog           2.64     computer    2.96     car                3.59
mammal        8.57     machine     6.15     vehicle            7.78
vertebrate   10.96     device     12.09     transport          8.01
animal       11.91     artifact   17.71     instrumentality   14.56
organism     20.08     object     23.55

Table 1: L2 norms for selected concepts from the WordNet hierarchy. Input: fastText; lear: D2.

Further Discussion

To verify that the knowledge concerning the position in the semantic hierarchy actually arises from vector norms, we also manually inspect the norms after lear specialisation. A few examples are provided in Table 1. They indicate a desirable pattern in the norm values which imposes a hierarchical ordering on the concepts. Note that the original distributional SGNS model (Mikolov et al., 2013) does not normalise vectors to unit length after training. However, these norms are not at all correlated with the desired hierarchical ordering, and are therefore useless for le-related applications: the non-specialised distributional SGNS model scores 0.44, 0.48, and 0.34 on the three tasks, respectively.

Model               All
freq-ratio          0.279
sgns (cos)          0.205
slqs-sim            0.228
visual              0.209
wn-best             0.234
word2gauss          0.206
sim-spec            0.320
order-emb           0.191
Poincaré (nouns)    0.512
HyperVec            0.540
Best lear           0.686

Figure 3: Results on the graded LE task defined by HyperLex, in three panels: (a) HyperLex: All; (b) HyperLex: Nouns; (c) Summary (the table above). Following Nickel and Kiela (2017), we use Spearman’s rank correlation scores on: a) the entire dataset (2,616 noun and verb pairs); and b) its noun subset (2,163 pairs). The summary table shows the performance of other well-known architectures on the full HyperLex dataset, compared to the best results achieved using lear specialisation.

4.2 Graded Lexical Entailment

Asymmetric distances in the lear-specialised space quantify the degree of lexical entailment between any two concepts. This means that they can be used to make fine-grained assertions regarding the hierarchical relationships between concepts. We test this property on HyperLex (Vulić et al., 2017), a gold standard dataset for evaluating how well word representation models capture graded le, grounded in the notions of concept (proto)typicality (Rosch, 1973; Medin et al., 1984) and category vagueness (Kamp and Partee, 1995; Hampton, 2007) from cognitive science. HyperLex contains 2,616 word pairs (2,163 noun pairs and 453 verb pairs) scored by human raters in the interval [0, 6] following the question “To what degree is X a (type of) Y?”[9]

[9] From another perspective, one might say that graded le provides finer-grained human judgements on a continuous scale rather than simplifying the judgements into binary discrete decisions. For instance, the HyperLex score for the pair (girl, person) is 5.91/6, the score for (guest, person) is 4.33, while the score for the reversed pair (person, guest) is 1.73.

As shown by the high inter-annotator agreement on HyperLex (0.85), humans are able to consistently reason about graded le.[10] However, current state-of-the-art representation architectures are far from this ceiling. For instance, Vulić et al. (2017) evaluate a plethora of architectures and report a high score of only 0.320 (see the summary table in Figure 3). Two recent representation models (Nickel and Kiela, 2017; Nguyen et al., 2017) focused on the LE relation in particular (and employing the same set of WordNet-based constraints as lear) report the highest score of 0.540 (on the entire dataset) and 0.512 (on the noun subset).

[10] For further details concerning HyperLex, we refer the reader to the resource paper (Vulić et al., 2017). The dataset is available at: http://people.ds.cam.ac.uk/iv250/hyperlex.html

Results and Analysis

We scored all HyperLex pairs using the combined asymmetric distance described by Equation (5), and then computed Spearman’s rank correlation with the ground-truth ranking. Our results, together with the strongest baseline scores, are summarised in Figure 3.
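For readers unfamiliar with the metric, Spearman's rank correlation is the Pearson correlation computed on ranks. The sketch below is a dependency-free version with average ranks for ties; the model scores in the usage example are invented, and they are assumed to be oriented so that a higher value means stronger entailment, while the three gold ratings are the HyperLex examples quoted in Section 4.2.

```python
import math

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(predicted, gold):
    """Spearman's rank correlation between model scores and gold ratings."""
    return pearson(average_ranks(predicted), average_ranks(gold))

# Gold ratings for (girl, person), (guest, person), (person, guest).
gold = [5.91, 4.33, 1.73]
# Hypothetical model scores for the same three pairs.
predicted = [0.8, 0.9, 0.1]
print(spearman(predicted, gold))
```

Only the ordering of the scores matters for this metric, so any monotone rescaling of the model's output leaves the result unchanged.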

The summary table in Figure 3(c) shows the HyperLex performance of several prominent le models. We provide only a quick outline of these models here; further details can be found in the original papers. freq-ratio exploits the fact that more general concepts tend to occur more frequently in textual corpora. sgns (cos) uses non-specialised SGNS vectors and quantifies the LE strength using the symmetric cosine distance between vectors. A comparison of these models to the best-performing lear vectors shows the extent of the improvements achieved using the specialisation approach.

lear-specialised vectors also outperform slqs-sim (Santus et al., 2014) and visual (Kiela et al., 2015), two le detection models similar in spirit to lear. These models combine symmetric semantic similarity (through cosine distance) with an asymmetric measure of lexical generality obtained either from text (slqs-sim) or visual data (visual). The results on HyperLex indicate that the two generality-based measures are too coarse-grained for graded le judgements. These models were originally constructed to tackle le directionality and detection tasks (see Section 4.1), but their performance is surpassed by lear on those tasks as well. The visual model outperforms slqs-sim. However, its numbers on bless (0.88), wbless (0.75), and bibless (0.57) are far from the top-performing lear vectors (0.96, 0.92, 0.88).

wn-best denotes the best result with asymmetric similarity measures which use the WordNet structure as their starting point (Wu and Palmer, 1994; Pedersen et al., 2004). The reported results suggest it is more effective to use WordNet as the source of constraints for specialisation models.

word2gauss (Vilnis and McCallum, 2015) represents words as multivariate Gaussians rather than points in the embedding space: it is therefore naturally asymmetric and was used in le tasks before, but its performance on HyperLex indicates that it cannot effectively capture the subtleties required to model graded le.

Most importantly, lear outperforms three recent (and conceptually different) architectures: order-emb (Vendrov et al., 2016), Poincaré (Nickel and Kiela, 2017), and HyperVec (Nguyen et al., 2017). Like lear, all of these models complement distributional knowledge with external linguistic constraints extracted from WordNet. Each model uses a different strategy to exploit the hierarchical relationships encoded in these constraints (their approaches are discussed in Section 5). However, lear, as the first le-oriented post-processor, is able to utilise the constraints more effectively than its competitors. Another advantage of lear is its applicability to any input vector space.

Figures 3(a) and 3(b) indicate that the two lear variants which rely on norm ratios (D2 and D3), rather than on absolute (unbounded) norm differences (D1), achieve stronger performance on HyperLex. The highest correlation scores are again achieved by D2 with all input vector spaces.

Why Symmetric + Asymmetric?

In another experiment, we analyse the contributions of both le-related terms in the lear combined objective function (see Section 2.2). We compare three variants of lear: 1) a symmetric variant which does not arrange vector norms using the asymmetric le term from Eq. (4) (sym-only); 2) a variant which arranges norms, but does not use le constraints as additional symmetric attract constraints (asym-only); and 3) the full lear model, which uses both cost terms (full). The results with one input space (similar results are achieved with others) are shown in Table 2. This table shows that, while the stand-alone asym-only term seems more beneficial than the sym-only one, using the two terms jointly yields the strongest performance across all LE tasks.

lear variant   wbless   bibless   hl-a    hl-n
sym-only       0.687    0.679     0.469   0.429
asym-only      0.867    0.824     0.529   0.565
full           0.912    0.875     0.686   0.705

Table 2: Analysing the importance of the synergy in the full lear model on the final performance on wbless, bibless, HyperLex-All (hl-a) and HyperLex-Nouns (hl-n). Input: fastText. D2.

Le and Semantic Similarity

We also test whether the asymmetric term harms the (norm-independent) cosine distances used to represent semantic similarity. The lear model is compared to the original attract-repel model making use of the same set of linguistic constraints. Two true semantic similarity datasets are used for evaluation: SimLex-999 (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016). There is no significant difference in performance between the two models, both of which yield similar results on SimLex (Spearman’s rank correlation of 0.71) and SimVerb (0.70). This shows that cosine distances remain preserved during the optimization of the asymmetric objective performed by the joint lear model.

5 Related Work

Word Vectors and Lexical Entailment

Since the hierarchical le relation is one of the fundamental building blocks of semantic taxonomies and hierarchical concept categorisations (Beckwith et al., 1991; Fellbaum, 1998), a significant amount of research in semantics has been invested into its automatic detection and classification. Early work relied on asymmetric directional measures (Weeds et al., 2004; Clarke, 2009; Kotlerman et al., 2010; Lenci and Benotto, 2012, i.a.) which were based on the distributional inclusion hypothesis (Geffet and Dagan, 2005) or the distributional informativeness or generality hypothesis (Herbelot and Ganesalingam, 2013; Santus et al., 2014). However, these approaches have recently been superseded by methods based on word embeddings. These methods build dense real-valued vectors for capturing the LE relation, either directly in the LE-focused space (Vilnis and McCallum, 2015; Vendrov et al., 2016; Henderson and Popa, 2016; Nickel and Kiela, 2017; Nguyen et al., 2017) or by using the vectors as features for supervised LE detection models (Tuan et al., 2016; Shwartz et al., 2016; Nguyen et al., 2017; Glavaš and Ponzetto, 2017).

Several le models embed useful hierarchical relations from external resources such as WordNet into le-focused vector spaces, with solutions coming in different flavours. The model of Yu et al. (2015) is a dynamic distance-margin model optimised for the le detection task using hierarchical WordNet constraints. This model was extended by Tuan et al. (2016) to make use of contextual sentential information. A major drawback of both models is their inability to make directionality judgements. Further, their performance has recently been surpassed by the HyperVec model of Nguyen et al. (2017). This model combines WordNet constraints with the SGNS distributional objective into a joint model. As such, the model is tied to the SGNS objective and any change of the distributional modelling paradigm implies a change of the entire HyperVec model. This makes their model less versatile than the proposed lear framework. Moreover, lear specialisation achieves substantially better performance across all le tasks used for evaluation.

Another model similar in spirit to lear is the order-emb model of Vendrov et al. (2016), which encodes hierarchical structure by imposing a partial order in the embedding space: higher-level concepts get assigned higher per-coordinate values in the vector space. The model minimises the violation of the per-coordinate orderings during training by relying on hierarchical WordNet constraints between word pairs. Finally, the Poincaré model of Nickel and Kiela (2017) makes use of hyperbolic spaces to learn general-purpose LE embeddings based on Poincaré balls which encode both hierarchy and semantic similarity, again using the WordNet constraints. In this paper, we demonstrate that le-specialised word embeddings with stronger performance can be induced using a simpler model operating in more intuitively interpretable Euclidean vector spaces.

6 Conclusion and Future Work

This paper proposed lear, a vector space specialisation procedure which simultaneously injects symmetric and asymmetric constraints into existing vector spaces, performing joint specialisation for two properties: lexical entailment and semantic similarity. Since the former is not symmetric, lear uses an asymmetric cost function which encodes the hierarchy between concepts by manipulating the norms of word vectors, assigning higher norms to higher-level concepts. Specialising the vector space for both relations has a synergistic effect: lear-specialised vectors attain state-of-the-art performance in judging semantic similarity and set new high scores across four different lexical entailment tasks. The code for the lear model is available from: github.com/nmrksic/lear.

In future work, we plan to apply a similar methodology to other asymmetric relations (e.g., meronymy), as well as to investigate fine-grained models which can account for differing path lengths from the WordNet hierarchy. Porting the model to other languages and enabling cross-lingual applications is another future direction.

Acknowledgments

IV is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no 648909).
