# Specialising Word Vectors for Lexical Entailment

## Abstract

We present lear (**L**exical **E**ntailment **A**ttract-**R**epel), a novel post-processing method that transforms any input word vector space to emphasise the asymmetric relation of *lexical entailment* (LE), also known as the is-a or hyponymy-hypernymy relation. By injecting external linguistic constraints (e.g., WordNet links) into the initial vector space, the LE specialisation procedure brings true hyponymy-hypernymy pairs closer together in the transformed Euclidean space. The proposed asymmetric distance measure adjusts the norms of word vectors to reflect the actual WordNet-style hierarchy of concepts. Simultaneously, a joint objective enforces semantic similarity using the symmetric cosine distance, yielding a vector space specialised for both lexical relations at once. lear specialisation achieves state-of-the-art performance in the tasks of hypernymy directionality, hypernymy detection and graded lexical entailment, demonstrating the effectiveness and robustness of the proposed model.

## 1 Introduction

Word representation learning has become a research area of central importance in NLP, with its usefulness demonstrated across application areas such as parsing [?], machine translation [?], and many others [?]. Standard techniques for inducing word embeddings rely on the *distributional hypothesis* [?], using co-occurrence information from large textual corpora to learn meaningful word representations [?].

A major drawback of the distributional hypothesis is that it coalesces different relationships, such as synonymy and topical relatedness, into a single vector space. A popular solution is to go beyond stand-alone unsupervised learning and fine-tune distributional vector spaces by using external knowledge from human- or automatically-constructed knowledge bases. This is often done as a *post-processing* step, where distributional vectors are gradually refined to satisfy linguistic constraints extracted from lexical resources such as WordNet [?], the Paraphrase Database (PPDB) [?], or BabelNet [?]. One advantage of post-processing methods is that they treat the input vector space as a *black box*, making them applicable to any input space.

A key property of these methods is their ability to transform the vector space by *specialising* it for a particular relationship between words.^{1} This paper focuses on specialising vector spaces for the *lexical entailment* (le) relation.

Word-level lexical entailment is a fundamental *asymmetric* semantic relation [?]. It is a key principle determining the organization of semantic networks into hierarchical structures such as semantic ontologies [?]. Automatic reasoning about le supports tasks such as taxonomy creation [?], natural language inference [?], text generation [?], and metaphor detection [?].

Our novel le specialisation model, termed lear (**L**exical **E**ntailment **A**ttract-**R**epel), is inspired by Attract-Repel, a state-of-the-art general specialisation framework [?].^{2} lear captures the hierarchy of concepts through vector norms and their similarity through cosine distance, and combines the two into an *asymmetric distance measure* which quantifies the LE strength in the specialised space.

After specialising four well-known input vector spaces with lear, we test them in three standard word-level le tasks [?]: **1)** hypernymy *directionality*; **2)** hypernymy *detection*; and **3)** *combined* hypernymy detection/directionality. Our specialised vectors yield notable improvements over the strongest baselines for each task, with each input space, demonstrating the effectiveness and robustness of lear specialisation.

The employed asymmetric distance allows one to make graded assertions about hierarchical relationships between concepts in the specialised space. This property is evaluated using HyperLex, a recent *graded LE* dataset [?]. The lear-specialised vectors push state-of-the-art Spearman’s correlation from 0.540 to 0.686 on the full dataset (2,616 word pairs), and from 0.512 to 0.705 on its noun subset (2,163 word pairs).

## 2 Methodology

### 2.1 The Attract-Repel Framework

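Let $V$ be the vocabulary, $A$ the set of Attract word pairs (e.g., *intelligent* and *brilliant*), and $R$ the set of Repel word pairs (e.g., *vacant* and *occupied*). The Attract-Repel procedure operates over mini-batches of such pairs $\mathcal{B}_A$ and $\mathcal{B}_R$. For ease of notation, let each word pair $(x_l, x_r)$ in these two sets correspond to a vector pair $(\mathbf{x}_l, \mathbf{x}_r)$, so that a mini-batch of $k_1$ word pairs is given by $\mathcal{B}_A = [(\mathbf{x}_l^1, \mathbf{x}_r^1), \ldots, (\mathbf{x}_l^{k_1}, \mathbf{x}_r^{k_1})]$ (similarly for $\mathcal{B}_R$, which consists of $k_2$ example pairs).

Next, the sets of pseudo-negative examples $T_A = [(\mathbf{t}_l^1, \mathbf{t}_r^1), \ldots, (\mathbf{t}_l^{k_1}, \mathbf{t}_r^{k_1})]$ and $T_R = [(\mathbf{t}_l^1, \mathbf{t}_r^1), \ldots, (\mathbf{t}_l^{k_2}, \mathbf{t}_r^{k_2})]$ are defined as pairs of *negative examples* for each Attract and Repel example pair in mini-batches $\mathcal{B}_A$ and $\mathcal{B}_R$. These negative examples are chosen from the word vectors present in $\mathcal{B}_A$ or $\mathcal{B}_R$ so that, for each Attract pair $(\mathbf{x}_l, \mathbf{x}_r)$, the negative example pair $(\mathbf{t}_l, \mathbf{t}_r)$ is chosen so that $\mathbf{t}_l$ is the vector closest (in terms of cosine distance) to $\mathbf{x}_l$ and $\mathbf{t}_r$ is closest to $\mathbf{x}_r$. Similarly, for each Repel pair $(\mathbf{x}_l, \mathbf{x}_r)$, the negative example pair $(\mathbf{t}_l, \mathbf{t}_r)$ is chosen from the remaining in-batch vectors so that $\mathbf{t}_l$ is the vector furthest away from $\mathbf{x}_l$ and $\mathbf{t}_r$ is furthest from $\mathbf{x}_r$.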
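The in-batch selection of negative examples for Attract pairs can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`cosine`, `attract_negatives`) and the toy vectors are hypothetical, and we assume that both members of a pair are excluded from that pair's candidate set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def attract_negatives(batch):
    """For each Attract pair (x_l, x_r), pick t_l and t_r as the most
    similar in-batch vectors, excluding the pair itself (an assumption)."""
    negatives = []
    for i, (x_l, x_r) in enumerate(batch):
        # Candidates: every vector that belongs to one of the other pairs.
        candidates = [v for j, pair in enumerate(batch) if j != i for v in pair]
        t_l = max(candidates, key=lambda v: cosine(x_l, v))
        t_r = max(candidates, key=lambda v: cosine(x_r, v))
        negatives.append((t_l, t_r))
    return negatives

# Toy two-dimensional batch of three Attract pairs (illustrative values only).
batch = [
    ([1.0, 0.0], [0.9, 0.1]),
    ([0.0, 1.0], [0.1, 0.9]),
    ([0.8, 0.2], [-1.0, 0.0]),
]
negatives = attract_negatives(batch)
```

For Repel pairs, the same routine would use `min` in place of `max`, so that the negatives are the in-batch vectors furthest away from each member of the pair.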

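The negative examples are used to: **a)** force Attract pairs to be closer to each other than to their respective negative examples; and **b)** force Repel pairs to be further away from each other than from their negative examples. The first term of the cost function pulls Attract pairs together:

$$
Att(\mathcal{B}_A, T_A) = \sum_{i=1}^{k_1} \Big[ \tau\big(\delta_{att} + \cos(\mathbf{x}_l^i, \mathbf{t}_l^i) - \cos(\mathbf{x}_l^i, \mathbf{x}_r^i)\big) + \tau\big(\delta_{att} + \cos(\mathbf{x}_r^i, \mathbf{t}_r^i) - \cos(\mathbf{x}_l^i, \mathbf{x}_r^i)\big) \Big]
$$

where $\cos$ denotes the cosine similarity, $\tau(x) = \max(0, x)$ is the hinge loss function, and $\delta_{att}$ is the attract margin which determines how much closer these vectors should be to each other than to their respective negative examples. The second part of the cost function pushes Repel word pairs away from each other:

$$
Rep(\mathcal{B}_R, T_R) = \sum_{i=1}^{k_2} \Big[ \tau\big(\delta_{rep} + \cos(\mathbf{x}_l^i, \mathbf{x}_r^i) - \cos(\mathbf{x}_l^i, \mathbf{t}_l^i)\big) + \tau\big(\delta_{rep} + \cos(\mathbf{x}_l^i, \mathbf{x}_r^i) - \cos(\mathbf{x}_r^i, \mathbf{t}_r^i)\big) \Big]
$$

where $\delta_{rep}$ is the repel margin. In addition to these two terms, a regularisation term is used to *preserve* the abundance of high-quality semantic content present in the distributional vector space, as long as this information does not contradict the injected linguistic constraints. If $V(\mathcal{B}_A \cup \mathcal{B}_R)$ is the set of all word vectors present in the given mini-batch, then:

$$
Reg(\mathcal{B}_A, \mathcal{B}_R) = \sum_{\mathbf{x}_i \in V(\mathcal{B}_A \cup \mathcal{B}_R)} \lambda_{reg} \left\| \widehat{\mathbf{x}}_i - \mathbf{x}_i \right\|_2
$$

where $\lambda_{reg}$ is the L2 regularization constant and $\widehat{\mathbf{x}}_i$ denotes the original (distributional) word vector for word $x_i$. The full Attract-Repel cost function is given by the sum of all three terms.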
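The three terms can be sketched as follows, assuming hinge losses over cosine similarities for the Attract and Repel terms and an unsquared L2 distance for the regulariser. The function names, the toy vectors, and the margin and regularisation values are illustrative assumptions, not the settings or the code of the original system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hinge(x):
    """Hinge loss tau(x) = max(0, x)."""
    return max(0.0, x)

def attract_cost(pairs, negatives, delta_att):
    """Penalise Attract pairs that are not closer to each other than to their negatives."""
    total = 0.0
    for (x_l, x_r), (t_l, t_r) in zip(pairs, negatives):
        pair_sim = cosine(x_l, x_r)
        total += hinge(delta_att + cosine(x_l, t_l) - pair_sim)
        total += hinge(delta_att + cosine(x_r, t_r) - pair_sim)
    return total

def repel_cost(pairs, negatives, delta_rep):
    """Penalise Repel pairs that are closer to each other than to their negatives."""
    total = 0.0
    for (x_l, x_r), (t_l, t_r) in zip(pairs, negatives):
        pair_sim = cosine(x_l, x_r)
        total += hinge(delta_rep + pair_sim - cosine(x_l, t_l))
        total += hinge(delta_rep + pair_sim - cosine(x_r, t_r))
    return total

def reg_cost(current, original, lambda_reg):
    """Weighted L2 distance between each updated vector and its original."""
    return sum(lambda_reg * math.dist(x_hat, x) for x_hat, x in zip(original, current))

# One toy pair with one toy negative pair (illustrative values only).
pairs = [([1.0, 0.0], [0.0, 1.0])]
negatives = [([1.0, 1.0], [1.0, 1.0])]
att = attract_cost(pairs, negatives, delta_att=0.6)
rep = repel_cost(pairs, negatives, delta_rep=0.0)
reg = reg_cost([[3.0, 4.0]], [[0.0, 0.0]], lambda_reg=0.5)
```

In this toy case the orthogonal pair is heavily penalised when treated as an Attract pair, but incurs no cost as a Repel pair, because its members are already further from each other than from the negatives.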

### 2.2 Lear: Encoding Lexical Entailment

In this section, the Attract-Repel framework is extended to model lexical entailment jointly with (symmetric) semantic similarity. To do this, the method uses an additional source of external lexical knowledge: let $L$ be the set of *directed* lexical entailment constraints such as *(corgi, dog)*, *(dog, animal)*, or *(corgi, animal)*, with lower-level concepts on the left and higher-level ones on the right (the source of these constraints will be discussed in Section 3). The optimisation proceeds in the same way as before, considering a mini-batch of le pairs $\mathcal{B}_L$ consisting of $k_3$ word pairs $(\mathbf{x}_i, \mathbf{y}_i)$ standing in the (directed) lexical entailment relation.

Unlike symmetric similarity, lexical entailment is an asymmetric relation which encodes a hierarchical ordering between concepts. Inferring the direction of the entailment relation between word vectors requires the use of an asymmetric distance function. We define three different ones, all of which use the word vectors’ norms to impose an ordering between high- and low-level concepts:

$$
D_1(\mathbf{x}, \mathbf{y}) = \|\mathbf{x}\| - \|\mathbf{y}\|
$$

$$
D_2(\mathbf{x}, \mathbf{y}) = \frac{\|\mathbf{x}\| - \|\mathbf{y}\|}{\|\mathbf{x}\| + \|\mathbf{y}\|}
$$

$$
D_3(\mathbf{x}, \mathbf{y}) = \frac{\|\mathbf{x}\| - \|\mathbf{y}\|}{\max(\|\mathbf{x}\|, \|\mathbf{y}\|)}
$$

The lexical entailment term (for the $j$-th asymmetric distance, $j \in \{1, 2, 3\}$) is defined as:

$$
LE_j(\mathcal{B}_L) = \sum_{i=1}^{k_3} D_j(\mathbf{x}_i, \mathbf{y}_i)
$$

The first distance serves as the baseline: it uses the word vectors’ norms to order the concepts, that is, to decide which of the words is likely to be the higher-level concept. In this case, the magnitude of the difference between the two norms determines the ‘intensity’ of the le relation. This is potentially problematic, as this distance does not impose a limit on the vectors’ norms. The second and third metrics take a more sophisticated approach, using the ratio of the difference between the two norms to either: **a)** the sum of the two norms; or **b)** the larger of the two norms. In doing that, these metrics ensure that the cost function only considers the norms’ ratios. This means that the cost function no longer has the incentive to increase word vectors’ norms past a certain point, as the magnitudes of norm ratios grow in size much faster than the linear relation defined by the first distance function.
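The contrast between the unbounded first distance and the two ratio-based ones can be illustrated with a short sketch. It assumes the form described in the text (norm difference, optionally divided by the sum or the maximum of the two norms) with the lower-level concept as the first argument; the function names and the toy vectors are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def d1(x, y):
    """Unbounded difference of norms."""
    return norm(x) - norm(y)

def d2(x, y):
    """Norm difference relative to the sum of the norms."""
    return (norm(x) - norm(y)) / (norm(x) + norm(y))

def d3(x, y):
    """Norm difference relative to the larger of the two norms."""
    return (norm(x) - norm(y)) / max(norm(x), norm(y))

def le_term(pairs, dist):
    """Sum of the chosen asymmetric distance over a mini-batch of le pairs."""
    return sum(dist(x, y) for x, y in pairs)

# Toy hyponym/hypernym vectors with norms 5 and 10 (illustrative values only).
hypo, hyper = [3.0, 4.0], [6.0, 8.0]
# The same vectors scaled by a factor of ten.
hypo10, hyper10 = [30.0, 40.0], [60.0, 80.0]

small = (d1(hypo, hyper), d2(hypo, hyper), d3(hypo, hyper))
large = (d1(hypo10, hyper10), d2(hypo10, hyper10), d3(hypo10, hyper10))
batch_cost = le_term([(hypo, hyper), (hypo10, hyper10)], d2)
```

Scaling both vectors by ten makes the first distance ten times more negative, while the two ratio-based distances are unchanged. This is why only the first distance rewards ever-growing norms.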

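To model the semantic and the le relations jointly, the lear cost function jointly optimises the four terms of the expanded cost function (Att, Rep, Reg, and $LE_j$, with Att applied to both the Attract and the le mini-batches):

$$
C = Att(\mathcal{B}_A, T_A) + Rep(\mathcal{B}_R, T_R) + Reg(\mathcal{B}_A, \mathcal{B}_R, \mathcal{B}_L) + Att(\mathcal{B}_L, T_L) + LE_j(\mathcal{B}_L)
$$

**Le Pairs as Attract Constraints** The combined cost function makes use of the batch of lexical constraints $\mathcal{B}_L$ twice: once in the defined asymmetric cost function $LE_j$, and once in the symmetric Attract term $Att(\mathcal{B}_L, T_L)$. This means that words standing in the lexical entailment relation are forced to be similar both in terms of cosine distance (via the symmetric Attract term) and in terms of the asymmetric distance $D_j$ defined above.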

**Decoding Lexical Entailment** The defined cost function serves to encode semantic similarity and le relations in the same vector space. Whereas the similarity can be inferred from the standard cosine distance, the lear optimisation embeds lexical entailment as a combination of the symmetric Attract term and the newly defined asymmetric cost function. Consequently, the metric used to determine whether two words stand in the le relation must combine the two cost terms as well. We define the le *decoding* metric as:

$$
I_{LE}(\mathbf{x}, \mathbf{y}) = dcos(\mathbf{x}, \mathbf{y}) + D_j(\mathbf{x}, \mathbf{y})
$$

where $dcos$ denotes the cosine distance. This decoding function combines the symmetric and the asymmetric cost term, in line with the combination of the two used to perform lear specialisation. In the evaluation, we show that combining the two cost terms has a synergistic effect, with both terms contributing to stronger performance across all le tasks used for evaluation.
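A minimal sketch of the decoding step follows, assuming that the metric is the sum of the cosine distance (taken here as one minus the cosine similarity) and one of the norm-based asymmetric distances. The names `le_score` and `d2` and the toy vectors are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def cosine_distance(x, y):
    """One minus the cosine similarity (an assumed definition of dcos)."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (norm(x) * norm(y))

def d2(x, y):
    """Norm difference relative to the sum of the norms."""
    return (norm(x) - norm(y)) / (norm(x) + norm(y))

def le_score(x, y, asym=d2):
    """Symmetric cosine distance plus the asymmetric norm-based distance."""
    return cosine_distance(x, y) + asym(x, y)

# Two toy vectors pointing in the same direction, with norms 5 and 10.
hypo, hyper = [3.0, 4.0], [6.0, 8.0]
forward = le_score(hypo, hyper)   # candidate hyponym first
backward = le_score(hyper, hypo)  # reversed pair
```

The cosine part is identical in both directions, so the two scores differ only through the norm-based term. Under the sign convention assumed here, the score is lower when the second word has the larger norm; an evaluation that expects larger scores for stronger entailment would negate it.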

## 3 Experimental Setup

**Starting Distributional Vectors** To test the robustness of lear specialisation, we experiment with a variety of well-known, publicly available English word vectors: **1)** Skip-Gram with Negative Sampling (SGNS) [?] trained on the Polyglot Wikipedia [?] by [?]; **2)** GloVe Common Crawl [?]; **3)** context2vec [?], which replaces CBOW contexts with contexts based on bidirectional LSTMs [?]; and **4)** fastText [?], an SGNS variant which builds word vectors as the sum of their constituent character n-gram vectors.^{3}

**Linguistic Constraints** We use three groups of linguistic constraints in the lear specialisation model, covering three different relation types which are all beneficial to the specialisation process: directed **1)** *lexical entailment* (le) *pairs*; **2)** *synonymy pairs*; and **3)** *antonymy pairs*. Synonyms are included as symmetric attract pairs (i.e., the $\mathcal{B}_A$ pairs) since they can be seen as defining a trivial symmetric is-a relation [?]. For a similar reason, antonyms are clear repel constraints as they anti-correlate with the LE relation.^{4}^{5}

As in prior work [?], le constraints are extracted from the WordNet hierarchy, relying on the transitivity of the le relation. This means that we include both direct and indirect le pairs in our set of constraints (e.g., (*pangasius, fish*), (*fish, animal*), and *(pangasius, animal)*). We retained only noun-noun and verb-verb pairs, while the rest were discarded: the final number of le constraints is 1,545,630.^{6}
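The use of transitivity to obtain both direct and indirect pairs can be sketched on a toy hierarchy. The dictionary format and the function name `le_constraints` are assumptions made for the example; this is not the WordNet API or the authors' extraction script.

```python
def le_constraints(direct_hypernyms):
    """Return all (hyponym, hypernym) pairs, direct and indirect.

    direct_hypernyms maps a word to the list of its immediate hypernyms.
    """
    pairs = set()
    for word in direct_hypernyms:
        stack = list(direct_hypernyms[word])
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            pairs.add((word, ancestor))
            # Follow the chain upwards to collect indirect hypernyms as well.
            stack.extend(direct_hypernyms.get(ancestor, []))
    return pairs

# Toy hierarchy built from the example in the text.
toy_hierarchy = {"pangasius": ["fish"], "fish": ["animal"]}
constraints = le_constraints(toy_hierarchy)
```

On this toy input, the two direct links yield three constraints, because *(pangasius, animal)* follows from the other two.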

**Training Setup** We adopt the original Attract-Repel model setup without any fine-tuning: the hyperparameters $\delta_{att}$, $\delta_{rep}$, and $\lambda_{reg}$ take the values used in [?]. The models are trained for 5 epochs, with the batch sizes $k_1$, $k_2$, and $k_3$ chosen for faster convergence.

## 4 Results and Discussion

We test and analyse lear-specialised vector spaces in two standard word-level le tasks used in prior work: hypernymy directionality and detection (Section 4.1) and graded le (Section 4.2).

### 4.1 LE Directionality and Detection

The first evaluation uses three classification-style tasks with increasing levels of difficulty. The tasks are evaluated on three datasets used extensively in the le literature [?], compiled into an integrated evaluation set by [?].^{7}

The first task, le directionality, is conducted on 1,337 le pairs originating from the bless evaluation set [?]. Given a true le pair, the task is to predict the correct hypernym. With lear-specialised vectors this is achieved by simply comparing the vector norms of each concept in a pair: the one with the larger norm is the hypernym (see Figure 1).
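The norm comparison used for directionality amounts to a one-line decision rule. The sketch below applies it to a few of the norms reported in Table 1; the function names and the choice of gold pairs are illustrative assumptions.

```python
# Norms of lear-specialised vectors, copied from Table 1.
norms = {"terrier": 0.87, "dog": 2.64, "mammal": 8.57, "animal": 11.91}

def predict_hypernym(word_a, word_b, norms):
    """Return the word with the larger vector norm as the predicted hypernym."""
    return word_a if norms[word_a] > norms[word_b] else word_b

def directionality_accuracy(gold_pairs, norms):
    """Fraction of (hyponym, hypernym) pairs whose hypernym is predicted correctly."""
    correct = sum(
        predict_hypernym(hypo, hyper, norms) == hyper for hypo, hyper in gold_pairs
    )
    return correct / len(gold_pairs)

# Illustrative (hyponym, hypernym) pairs, not taken from the bless set.
gold_pairs = [("terrier", "dog"), ("dog", "mammal"), ("mammal", "animal")]
accuracy = directionality_accuracy(gold_pairs, norms)
```

With real vectors, the `norms` dictionary would be filled by computing the Euclidean norm of each specialised word vector.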

The second task, le detection, involves a binary classification on the wbless dataset [?] which comprises 1,668 word pairs standing in a variety of relations (le, meronymy-holonymy, co-hyponymy, reversed le, no relation). The model has to detect a true le pair, that is, to distinguish the pairs for which the statement *X is a (type of) Y* is true from all other pairs. With lear vectors, this classification is based on the asymmetric distance score: if the score is above a certain threshold, we classify the pair as “true le”, otherwise as “other”. While [?] manually define the threshold value, we follow the approach of [?] and cross-validate: in each of the 1,000 iterations, 2% of the pairs are sampled for threshold tuning, and the remaining 98% are used for testing. The reported numbers are therefore average accuracy scores.^{8}
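The cross-validation protocol can be sketched as below. The text does not specify how the threshold is selected from the tuning pairs, so the sketch assumes it is the tuning score that maximises tuning accuracy. The function names and the synthetic scores are hypothetical and do not reproduce the wbless data.

```python
import random

def accuracy(scored, threshold):
    """Share of pairs classified correctly when scores >= threshold mean 'true le'."""
    return sum((score >= threshold) == label for score, label in scored) / len(scored)

def best_threshold(tuning):
    """Pick the tuning score that gives the highest accuracy on the tuning pairs."""
    candidates = sorted({score for score, _ in tuning})
    return max(candidates, key=lambda t: accuracy(tuning, t))

def cross_validate(scored, iterations=1000, tune_fraction=0.02, seed=0):
    """Average test accuracy over repeated random tuning/test splits."""
    rng = random.Random(seed)
    n_tune = max(1, round(tune_fraction * len(scored)))
    results = []
    for _ in range(iterations):
        shuffled = scored[:]
        rng.shuffle(shuffled)
        tuning, test = shuffled[:n_tune], shuffled[n_tune:]
        results.append(accuracy(test, best_threshold(tuning)))
    return sum(results) / len(results)

# Synthetic (score, is_true_le) pairs, separable at 0.5 by construction.
scored = [(i / 100, i >= 50) for i in range(100)]
mean_accuracy = cross_validate(scored)
```

Because only 2% of the pairs are used for tuning, the selected threshold varies between iterations, which is why the protocol reports an average over many splits.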

The final task, le detection *and* directionality, concerns a three-way classification on bibless, a relabeled version of wbless. The task is now to distinguish both le pairs ($\rightarrow 1$) and reversed le pairs ($\rightarrow -1$) from other relations ($\rightarrow 0$), and then additionally select the correct hypernym in each detected le pair. We apply the same test protocol as in the le detection task.

**Results and Analysis** The original paper of [?] reports the following best scores on each task: 0.88 (bless), 0.75 (wbless), 0.57 (bibless). These scores were recently surpassed by [?], who, instead of post-processing, combine WordNet-based constraints with an SGNS-style objective into a joint model. They report the best scores to date: 0.92 (bless), 0.87 (wbless), and 0.81 (bibless).

The performance of the four lear-specialised word vector collections is shown in Figure ? (together with the strongest baseline scores for each of the three tasks). The comparative analysis confirms the increased complexity of subsequent tasks. lear specialisation of *each* of the starting vector spaces consistently outperformed *all* baseline scores across *all* three tasks. The extent of the improvements is correlated with task difficulty: it is lowest for the easiest directionality task (0.92 $\rightarrow$ 0.96), and highest for the most difficult detection plus directionality task (0.81 $\rightarrow$ 0.88).

The results show that the two lear variants which do not rely on absolute norm values and perform a normalisation step in the asymmetric distance (D2 and D3) have a slight edge over the D1 variant which operates with unbounded norms. The difference in performance between D2/D3 and D1 is more pronounced in the graded LE task (see Section 4.2). This shows that the use of unbounded vector norms diminishes the importance of the symmetric cosine distance in the combined asymmetric distance. Conversely, the synergistic combination used in D2/D3 does not suffer from this issue.

The high scores achieved with each of the four word vector collections show that lear is not dependent on any particular word representation architecture. Moreover, the extent of the performance improvements in each task suggests that lear is able to reconstruct the concept hierarchy coded in the input linguistic constraints.

| | Norm | | Norm | | Norm |
|---|---|---|---|---|---|
| terrier | 0.87 | laptop | 0.60 | cabriolet | 0.74 |
| dog | 2.64 | computer | 2.96 | car | 3.59 |
| mammal | 8.57 | machine | 6.15 | vehicle | 7.78 |
| vertebrate | 10.96 | device | 12.09 | transport | 8.01 |
| animal | 11.91 | artifact | 17.71 | instrumentality | 14.56 |
| organism | 20.08 | object | 23.55 | – | – |

**Further Discussion** To verify that the knowledge concerning the position in the semantic hierarchy actually arises from vector norms, we also manually inspect the norms after lear specialisation. A few examples are provided in Table 1. They indicate a desirable pattern in the norm values which imposes a hierarchical ordering on the concepts. Note that the original distributional SGNS model [?] does not normalise vectors to unit length after training. However, these norms are not at all correlated with the desired hierarchical ordering, and are therefore useless for le-related applications: the non-specialised distributional SGNS model scores 0.44, 0.48, and 0.34 on the three tasks, respectively.

### 4.2 Graded Lexical Entailment

Asymmetric distances in the lear-specialised space quantify the degree of lexical entailment between any two concepts. This means that they can be used to make fine-grained assertions regarding the hierarchical relationships between concepts. We test this property on HyperLex [?], a gold standard dataset for evaluating how well word representation models capture graded le, grounded in the notions of *concept (proto)typicality* [?] and *category vagueness* [?] from cognitive science. HyperLex contains 2,616 word pairs (2,163 noun pairs and 453 verb pairs) scored by human raters in the interval $[0, 6]$ following the question *“To what degree is X a (type of) Y?”*^{9}

As shown by the high inter-annotator agreement on HyperLex (0.85), humans are able to consistently reason about graded le.^{10}

**Results and Analysis** We scored all HyperLex pairs using the combined asymmetric distance given by the le decoding metric (Section 2.2), and then computed Spearman’s rank correlation with the ground-truth ranking. Our results, together with the strongest baseline scores, are summarised in Figure ?.
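For readers unfamiliar with the evaluation measure, Spearman's rank correlation can be computed as the Pearson correlation of the two rank vectors, using average ranks for ties. The sketch below is a plain-Python illustration. The three gold ratings are the HyperLex examples quoted in the footnotes, while the model scores are made up for the example.

```python
import math

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))

# Gold ratings for (girl, person), (guest, person), (person, guest).
gold = [5.91, 4.33, 1.73]
# Hypothetical model scores, oriented so that larger means stronger entailment.
model = [0.9, 0.7, 0.1]
rho = spearman(gold, model)
rho_reversed = spearman(gold, [-s for s in model])
```

If the model produces a distance for which smaller values indicate stronger entailment, the scores need to be negated before the correlation is computed, as the sign flip in `rho_reversed` shows.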

The summary table in Figure ?(c) shows the HyperLex performance of several prominent le models. We provide only a quick outline of these models here; further details can be found in the original papers. freq-ratio exploits the fact that more general concepts tend to occur more frequently in textual corpora. sgns (cos) uses non-specialised SGNS vectors and quantifies the LE strength using the symmetric cosine distance between vectors. A comparison of these models to the best-performing lear vectors shows the extent of the improvements achieved using the specialisation approach.

lear-specialised vectors also outperform slqs-sim [?] and visual [?], two le detection models similar in spirit to lear. These models combine symmetric semantic similarity (through cosine distance) with an asymmetric measure of lexical generality obtained either from text (slqs-sim) or visual data (visual). The results on HyperLex indicate that the two generality-based measures are too coarse-grained for graded le judgements. These models were originally constructed to tackle le directionality and detection tasks (see Section 4.1), but their performance is surpassed by lear on those tasks as well. The visual model outperforms slqs-sim. However, its numbers on bless (0.88), wbless (0.75), and bibless (0.57) are far from the top-performing lear vectors (0.96, 0.92, 0.88).

wn-best denotes the best result with asymmetric similarity measures which use the WordNet structure as their starting point [?]. The reported results suggest it is more effective to use WordNet as the source of constraints for specialisation models.

word2gauss [?] represents words as multivariate $d$-dimensional Gaussians rather than points in the embedding space: it is therefore naturally asymmetric and was used in le tasks before, but its performance on HyperLex indicates that it cannot effectively capture the subtleties required to model graded le.

Most importantly, lear outperforms three recent (and conceptually different) architectures: order-emb [?], Poincaré [?], and HyperVec [?]. Like lear, all of these models complement distributional knowledge with external linguistic constraints extracted from WordNet. Each model uses a different strategy to exploit the hierarchical relationships encoded in these constraints (their approaches are discussed in Section 5). However, lear, as the first le-oriented post-processor, is able to utilise the constraints more effectively than its competitors. Another advantage of lear is its applicability to any input vector space.

Figures ?(a) and ?(b) indicate that the two lear variants which rely on norm ratios (D2 and D3), rather than on absolute (unbounded) norm differences (D1), achieve stronger performance on HyperLex. The highest correlation scores are again achieved by D2 with all input vector spaces.

**Why Symmetric + Asymmetric?** In another experiment, we analyse the contributions of both le-related terms in the lear combined objective function (see Section 2.2). We compare three variants of lear: **1)** a symmetric variant which does not arrange vector norms using the $LE_j$ term (sym-only); **2)** a variant which arranges norms, but does not use le constraints as additional symmetric attract constraints (asym-only); and **3)** the full lear model, which uses both cost terms (full). The results with one input space (similar results are achieved with others) are shown in Table ?. This table shows that, while the stand-alone asym-only term seems more beneficial than the sym-only one, using the two terms jointly yields the strongest performance across all LE tasks.

| lear variant | wbless | bibless | hl-a | hl-n |
|---|---|---|---|---|
| sym-only | 0.687 | 0.679 | 0.469 | 0.429 |
| asym-only | 0.867 | 0.824 | 0.529 | 0.565 |
| full | 0.912 | 0.875 | 0.686 | 0.705 |

**Le and Semantic Similarity** We also test whether the asymmetric term harms the (norm-independent) cosine distances used to represent semantic similarity. The lear model is compared to the original Attract-Repel model making use of the same set of linguistic constraints. Two true semantic similarity datasets are used for evaluation: SimLex-999 [?] and SimVerb-3500 [?]. There is no significant difference in performance between the two models, both of which yield similar results on SimLex (Spearman’s rank correlation of 0.71) and SimVerb (0.70). This proves that cosine distances remain preserved during the optimization of the asymmetric objective performed by the joint lear model.

## 5 Related Work

**Word Vectors and Lexical Entailment** Since the hierarchical le relation is one of the fundamental building blocks of semantic taxonomies and hierarchical concept categorisations [?], a significant amount of research in semantics has been invested into its automatic detection and classification. Early work relied on asymmetric directional measures [?] which were based on the distributional inclusion hypothesis [?] or the distributional informativeness or generality hypothesis [?]. However, these approaches have recently been superseded by methods based on word embeddings. These methods build dense real-valued vectors for capturing the LE relation, either directly in the LE-focused space [?] or by using the vectors as features for supervised LE detection models [?].

Several le models embed useful hierarchical relations from external resources such as WordNet into le-focused vector spaces, with solutions coming in different flavours. The model of [?] is a dynamic distance-margin model optimised for the le detection task using hierarchical WordNet constraints. This model was extended by [?] to make use of contextual sentential information. A major drawback of both models is their inability to make directionality judgements. Further, their performance has recently been surpassed by the HyperVec model of [?]. This model combines WordNet constraints with the SGNS distributional objective into a joint model. As such, the model is tied to the SGNS objective and any change of the distributional modelling paradigm implies a change of the entire HyperVec model. This makes their model less versatile than the proposed lear framework. Moreover, lear specialisation achieves substantially better performance across all le tasks used for evaluation.

Another model similar in spirit to lear is the order-emb model of [?], which encodes hierarchical structure by imposing a partial order in the embedding space: higher-level concepts get assigned higher per-coordinate values in a $d$-dimensional vector space. The model minimises the violation of the per-coordinate orderings during training by relying on hierarchical WordNet constraints between word pairs. Finally, the Poincaré model of [?] makes use of hyperbolic spaces to learn general-purpose LE embeddings based on $d$-dimensional Poincaré balls which encode both hierarchy and semantic similarity, again using the WordNet constraints. In this paper, we demonstrate that le-specialised word embeddings with stronger performance can be induced using a simpler model operating in more intuitively interpretable Euclidean vector spaces.

## 6 Conclusion and Future Work

This paper proposed lear, a vector space specialisation procedure which simultaneously injects symmetric and asymmetric constraints into existing vector spaces, performing joint specialisation for two properties: *lexical entailment* and *semantic similarity*. Since the former is not symmetric, lear uses an asymmetric cost function which encodes the hierarchy between concepts by manipulating the norms of word vectors, assigning higher norms to higher-level concepts. Specialising the vector space for both relations has a synergistic effect: lear-specialised vectors attain state-of-the-art performance in judging semantic similarity and set new high scores across four different lexical entailment tasks. The code for the lear model is available from: github.com/nmrksic/lear.

In future work, we plan to apply a similar methodology to other asymmetric relations (e.g., *meronymy*), as well as to investigate fine-grained models which can account for differing path lengths from the WordNet hierarchy. Porting the model to other languages and enabling cross-lingual applications is another future direction.

## Acknowledgments

IV is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no 648909).

### Footnotes

- Distinguishing between synonymy and antonymy has a positive impact on real-world language understanding tasks such as Dialogue State Tracking [?].
- https://github.com/nmrksic/attract-repel
- All vectors are -dimensional except for the -dimensional context2vec vectors; for further details regarding the architectures and training setup of the used vector collections, we refer the reader to the original papers.
- In short, the question “*Is $X$ a type of $X$?*” (synonymy) is trivially true, while the question “*Is $\neg X$ a type of $X$?*” (antonymy) is trivially false.
- https://github.com/tticoin/AntonymDetection
- We also experimented with an additional 30,491 le constraints from the Paraphrase Database (PPDB) 2.0 [?]. Adding them to the WordNet-based le pairs makes no significant impact on the final performance. We also used synonymy and antonymy pairs from other sources, such as word pairs from PPDB used previously by [?], and BabelNet [?] used by [?], reaching the same conclusions.
- http://www.cl.cam.ac.uk/dk427/generality.html
- We have conducted more le directionality and detection experiments on other datasets such as EVALution [?], the dataset of [?], and the dataset of [?], with similar performances and findings. We do not report all these results for brevity and clarity of presentation.
- From another perspective, one might say that graded le provides finer-grained human judgements on a continuous scale rather than simplifying the judgements into binary discrete decisions. For instance, the HyperLex score for the pair *(girl, person)* is 5.91/6, the score for *(guest, person)* is 4.33, while the score for the reversed pair *(person, guest)* is 1.73.
- For further details concerning HyperLex, we refer the reader to the resource paper [?]. The dataset is available at: *http://people.ds.cam.ac.uk/iv250/hyperlex.html*