Analyzing Structures in the Semantic Vector Space: A Framework for Decomposing Word Embeddings

Abstract

Word embeddings are rich word representations which, in combination with deep neural networks, lead to large performance gains for many NLP tasks. However, word embeddings are represented by dense, real-valued vectors and are therefore not directly interpretable. Thus, computational operations based on them are also not well understood. In this paper, we present an approach for analyzing structures in the semantic vector space to get a better understanding of the underlying semantic encoding principles. We present a framework for decomposing word embeddings into smaller meaningful units which we call sub-vectors. The framework opens up a wide range of possibilities for analyzing phenomena in vector space semantics, as well as for solving concrete NLP problems: We introduce the category completion task and show that a sub-vector based approach is superior to supervised techniques; we present a sub-vector based method for solving the word analogy task, which substantially outperforms different variants of the traditional vector-offset method.


1 Introduction

Word embeddings are word representations that are based on the distributional hypothesis Harris (1954) and express the meaning of a word by a vector. Due to their expressive power, word embeddings have become very popular, and in combination with deep neural networks, significant performance gains have been achieved in many NLP tasks Collobert et al. (2011); Hirschberg and Manning (2015); Young et al. (2018); Devlin et al. (2018). Modern approaches for learning word embeddings are based on the idea of predicting words in a local context window Mikolov et al. (2013a). As a result, low-dimensional vectors are obtained that capture rich semantic information about a word, as has been demonstrated by numerous studies (e.g., Mikolov et al. 2013c, Li et al. 2015).

Nevertheless, since word embeddings are represented by dense, real-valued vectors, the encoded information cannot be directly interpreted. As a result, the computational operations in neural networks based on word embeddings are also not well understood. To address the issue and make word embeddings more interpretable, a number of different methods have been proposed: rotation of the word embeddings to align the dimensions of the vectors with certain attributes Jang and Myaeng (2017); Zobnin (2017); transformation of word embeddings into a sparse higher-dimensional space where individual dimensions represent attributes of word embeddings Murphy et al. (2012); Fyshe et al. (2014); Faruqui et al. (2015); and deriving interpretable distributed representations using lexical resources Koç et al. (2018).

Even though the goal of making word embeddings more transparent is valuable in itself, we argue that identifying structures in the semantic vector space is a more beneficial objective. We believe that this would give us a better theoretical understanding of the underlying semantic encoding principles and lead to more theoretically driven research on distributed representations.

From this perspective, we consider the following studies as being of particular importance. The vector-offset method introduced by Mikolov et al. (2013c) allows solving the word analogy task Jurgens et al. (2012) on the basis of offsets between two word vectors. In fact, the vector-offset analogy is a persistent pattern in the semantic vector space and holds for a broad range of relations Vylomova et al. (2015). Follow-up work has shown that a word vector can be decomposed into a linear combination of constituent vectors that represent the attributes of this word vector. Rothe and Schütze (2015) decomposed word embeddings using WordNet Miller (1995) into representations of lexemes. Cotterell et al. (2016) presented an approach for deriving representations of morphemes using lexical resources. These representations can be combined to predict word embeddings for rare words in languages with rich inflectional morphology. Arora et al. (2018) used sparse coding to derive 2000 elementary vectors from word embeddings called discourse atoms. Going beyond these approaches, in this paper, we further elaborate the assumption that word embeddings are linearly composed from smaller vectors.

The contributions of this work are as follows: (1) We introduce a framework for decomposing word embeddings which does not require lexical resources and is not restricted to a fixed set of elementary vectors. Instead, we are able to decompose word vectors into an arbitrary number of vectors, which we call sub-vectors, by contrasting different word embeddings with each other. The approach is simple and easy to implement, but at the same time very flexible, as we are able to control which properties we would like to extract from a word vector. For a rigorous definition of the framework, we introduce the distributional decomposition hypothesis, which defines the rules for a legitimate decomposition of a word vector.

(2) On the basis of the introduced hypothesis, we propose semantic space networks (SSNs), which is an approach for decomposing word embeddings in a systematic manner. Using SSNs, we analyze semantic and grammatical categories and are able to identify sub-vectors capturing different attributes of words, such as gender, number, and tense.

(3) We also show that SSNs can be used in a weakly supervised setting. Given a number of instances of a category, the method allows us to retrieve other words belonging to the same category with high precision. We frame this problem setting as the category completion task and present two corpora for it. We evaluate the performance of our approach on a newly constructed corpus and demonstrate that the method is much more data-efficient than supervised methods. Moreover, we present an algorithm to derive sub-vectors for solving the word analogy task Jurgens et al. (2012) and show that the method outperforms different variants of the vector-offset method Mikolov et al. (2013c).

The presented framework opens up a wide range of new possibilities for analyzing different phenomena in vector space semantics, but also for solving concrete NLP problems. In this study, we apply the method to traditional word embeddings, such as word2vec Mikolov et al. (2013a) and GloVe Pennington et al. (2014). However, the approach is not restricted to static word embeddings and can also be used to analyze contextualized word embeddings, such as those derived by ELMo Peters et al. (2018) and BERT Devlin et al. (2018). Our data and the source code will be made publicly available for future research.1

2 Compositionality of word embeddings

Modern approaches for learning word embeddings are based on predicting a word given its context or vice versa Mikolov et al. (2013a). The skip-gram model is the most prominent modern approach and is trained by predicting the context of a word using a neural model. It shall serve here as an example model for the following discussion. Nevertheless, other prediction-based models, such as GloVe or Dependency-Based Word Embeddings Levy and Goldberg (2014a), are trained in a similar manner, and the following discussion also applies to them.

The goal of the skip-gram model is to maximize the probability

$p(\vec{c} \mid \vec{w}) = \frac{\exp(\vec{c}^{\,\top}\vec{w})}{\exp(\vec{c}^{\,\top}\vec{w}) + \sum_{i=1}^{k}\exp(\vec{n}_i^{\,\top}\vec{w})}$   (1)

where $\vec{w}$ is the vector of the considered word and $\vec{c}$ is the vector of a context word.2 The vectors $\vec{n}_1, \dots, \vec{n}_k$ represent words which do not occur in the context of $\vec{w}$ and serve as negative examples. The more frequently a word appears in a particular context, the larger the vector gets, which maximizes the probability of this context Schakel and Wilson (2015). The context thereby can be defined as a set of words which form a consistent context window. The context words mother and father, for example, are expected to drive a word vector in a similar direction, whereas mother and vehicle are expected to push the vector to two different points in the semantic space. The direction of the resulting vector is therefore associated with its context, which also defines its meaning. The length of the vector, on the other hand, expresses how often the word occurs within a particular context window and can therefore be interpreted as the magnitude of the meaning associated with the context window. However, words are often polysemous and carry different meanings. This implies that they appear in different contexts, and when the word vector is updated during training, it is simultaneously driven in different directions. On the basis of this observation, we argue that the resulting word vector, and therefore its meaning, is the sum of the vectors representing its different meanings. If a vector representing a particular meaning of a word is more prominent than the vectors representing its other meanings, this meaning is more strongly represented in the resulting word vector.3
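As a rough illustration of objective (1), the following numpy sketch computes the probability of a context word contrasted against a set of negative samples; the function name and the toy vectors are our own choices and not taken from any released training code.

import numpy as np

def context_probability(w, c, negatives):
    # Probability of observing context word c for word w, contrasted
    # against the sampled negative examples (cf. Eq. 1).
    pos = np.exp(np.dot(c, w))
    neg = sum(np.exp(np.dot(n, w)) for n in negatives)
    return pos / (pos + neg)

# Toy illustration: a context vector pointing in a similar direction as the
# word vector receives a higher probability than one pointing away from it.
w = np.array([1.0, 0.5])
negatives = [np.array([-1.0, 0.2]), np.array([0.1, -0.8])]
print(context_probability(w, np.array([0.9, 0.4]), negatives))   # higher
print(context_probability(w, np.array([-0.9, 0.1]), negatives))  # lower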

Moreover, it must be noted that different contexts are often incompatible, that is, the corresponding vectors point in opposing directions in the semantic vector space. Thus, some of the vector components will cancel each other out and the length of the resulting word vector will be reduced. This phenomenon is particularly pronounced for frequent words, as these are more often polysemous than less frequent words, as observed by Schakel and Wilson (2015). As a consequence, the resulting word vectors can be considered as losing a part of the words’ meaning. We consider the different meanings of a word encoded in a word vector not as discrete separable lexeme vectors Rothe and Schütze (2017) but rather as a continuum of an infinite number of sub-vectors. We believe that the transition between different context windows is not necessarily discrete. Polysemy, homonymy, metaphor, metonymy, and vagueness Lakoff (2008) give rise to a continuum of meanings a word can take, which extends through the semantic vector space, rather than forming a number of clearly separable vectors representing the synsets of a word. Thus, a word vector can be decomposed into an infinite number of sets of sub-vectors, each of which represents the attributes encoded in a word vector in a different manner. Nevertheless, from our perspective, not all possible decompositions of a word vector are reasonable; thus, we formally define the properties of a meaningful set of sub-vectors in the distributional decomposition hypothesis below.

Distributional decomposition hypothesis:

A word vector $\vec{w}$ can be linearly decomposed into a finite, meaningful set of sub-vectors $\{\vec{s}_1, \dots, \vec{s}_n\}$:

$\vec{w} = \sum_{i=1}^{n} \vec{s}_i$   (2)

A vector $\vec{s}$ is considered to be a sub-vector of the word vector $\vec{w}$ if the projection of $\vec{w}$ onto $\vec{s}$ is greater than or equal to the length of the sub-vector $\vec{s}$, i.e.

$\frac{\vec{w} \cdot \vec{s}}{\|\vec{s}\|} \geq \|\vec{s}\|$   (3)

We further define the set of word vectors of which $\vec{s}$ is a sub-vector as the set of its children, i.e.

$ch(\vec{s}) = \left\{ \vec{w} : \frac{\vec{w} \cdot \vec{s}}{\|\vec{s}\|} \geq \|\vec{s}\| \right\}$   (4)

A sub-vector is considered to be a meaningful representation since it is shared by one or more word vectors, which are indicative of its meaning. More concretely, the meaning of a sub-vector can be derived from the properties that all of its children have in common.
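As a minimal illustration of definitions (3) and (4), the following numpy sketch tests the sub-vector condition and collects the children of a candidate sub-vector over a toy vocabulary; the function names and the two-dimensional toy vectors are our own assumptions.

import numpy as np

def is_subvector(s, w):
    # Condition (3): s is a sub-vector of w if the projection of w onto s
    # is at least as long as s itself.
    s_norm = np.linalg.norm(s)
    return np.dot(w, s) / s_norm >= s_norm

def children(s, vocab):
    # Definition (4): all word vectors in the vocabulary of which s is a sub-vector.
    return [word for word, w in vocab.items() if is_subvector(s, w)]

# Toy vocabulary (two-dimensional for readability).
vocab = {
    "january": np.array([1.0, 1.0]),
    "may":     np.array([1.2, 0.8]),
    "table":   np.array([-0.5, 1.0]),
}
s = np.array([0.6, 0.6])      # candidate sub-vector
print(children(s, vocab))     # ['january', 'may'] for this toy setup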

3 Decomposition of word embeddings

In this section, we introduce a framework for systematically decomposing word embeddings, which is in agreement with the distributional decomposition hypothesis.

3.1 Semantic tree model

The semantic tree model allows us to decompose word embeddings but also to analyze different linguistic phenomena. In this sub-section, we introduce the semantic tree model and analyze phenomena like categorization and grammatical categories. Phenomena like antonymy, polysemy, and hypernymy are analyzed in Appendix A.2.

Semantic tree model

A chosen set of word vectors which are used to define a semantic tree will be referred to as support vectors $\vec{w}_1, \dots, \vec{w}_n$.

The semantic tree for two word embeddings is schematically illustrated in Figure 1. The support vectors of the semantic tree in Figure 1 are the dashed word vectors $\vec{w}_1$ and $\vec{w}_2$.

Figure 1: Semantic tree model in two different representations (left: vector representation, right: network representation). The dashed lines $\vec{w}_1$ and $\vec{w}_2$ are the support vectors, $\vec{r}$ is the root, i.e. the projected vector onto the unit vector $\hat{r}$, $\vec{b}_1$ and $\vec{b}_2$ are the branch sub-vectors, and $\vec{o}$ is the vector offset, so that $\vec{w}_i = \vec{r} + \vec{b}_i$.

The sub-vector $\vec{r}$ is defined as the largest vector which is shared by all support vectors. It will therefore also be denoted as the root of the tree. Since $\vec{r}$ is shared by all support vectors, it represents an attribute which all these word vectors have in common. In order to determine $\vec{r}$, we first need to find its unit vector $\hat{r}$, which is defined as the unit vector of the sum of the support vectors, i.e.

$\hat{r} = \frac{\sum_{i=1}^{n} \vec{w}_i}{\left\| \sum_{i=1}^{n} \vec{w}_i \right\|}$   (5)

Mikolov et al. (2013b) argue that additive composition of word vectors is a good representation of larger text units and link this property to the skip-gram training objective. In fact, an averaged word-embedding representation of a sentence is a hard-to-beat baseline Rücklé et al. (2018). Thus, we use the direction of the word vector sum as the direction of the shared sub-vector.

The sub-vector $\vec{r}$ is defined as the smallest projected vector onto $\hat{r}$ of all the support vectors:

$l_i = \vec{w}_i \cdot \hat{r}$   (6)

$\vec{r} = \min_i(l_i)\,\hat{r}$   (7)

For the sake of simplicity, the operation of deriving the root will be denoted as $\vec{r} = \mathrm{root}(\vec{w}_1, \dots, \vec{w}_n)$. The support vectors do not necessarily correspond to the children of the sub-vector $\vec{r}$, as other word vectors might also share $\vec{r}$, thus $\{\vec{w}_1, \dots, \vec{w}_n\} \subseteq ch(\vec{r})$. The branches of the tree are defined as the vectors leading from the top of the root to the individual support vectors. They are therefore computed by simply subtracting the root from the individual word vectors, i.e. $\vec{b}_i = \vec{w}_i - \vec{r}$. In Figure 1, the two branches are denoted as $\vec{b}_1$ and $\vec{b}_2$. The branches are also sub-vectors, but in contrast to the root, they represent individual attributes of the support vectors. The procedure of deriving the branches will be denoted as $\vec{b}_1, \dots, \vec{b}_n = \mathrm{branch}(\vec{w}_1, \dots, \vec{w}_n)$ henceforth.
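A minimal numpy sketch of the root and branch operations defined in Equations (5)-(7); the helper names mirror root() and branch() from the text, while the toy vectors are our own.

import numpy as np

def root(*support_vectors):
    # Root of a semantic tree (Eqs. 5-7): the smallest projection of the
    # support vectors onto the unit vector of their sum.
    s = np.sum(support_vectors, axis=0)
    r_hat = s / np.linalg.norm(s)                           # Eq. (5)
    lengths = [np.dot(w, r_hat) for w in support_vectors]   # Eq. (6)
    return min(lengths) * r_hat                             # Eq. (7)

def branch(*support_vectors):
    # Branches of a semantic tree: support vectors minus the shared root.
    r = root(*support_vectors)
    return [w - r for w in support_vectors]

# Toy example with two support vectors of equal length: the branches are
# opposed to each other (Eq. 8) and orthogonal to the root.
w1, w2 = np.array([1.0, 0.2]), np.array([0.2, 1.0])
b1, b2 = branch(w1, w2)
print(np.allclose(b1, -b2))                          # True
print(np.isclose(np.dot(b1, root(w1, w2)), 0.0))     # True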

In addition to the branches, we define the vector offset $\vec{o}$ of the tree. The vector offset results from the subtraction of one support vector from another, $\vec{o} = \vec{w}_1 - \vec{w}_2$, and is identical to the vector offset used to solve the word analogy task by Mikolov et al. (2013c).

The derived semantic tree can be viewed as a semantic network where the nodes $\vec{w}_1, \dots, \vec{w}_n$ and $\vec{r}$ represent concepts which are linked by the edges $\vec{b}_1, \dots, \vec{b}_n$. Whereas the support vectors $\vec{w}_1, \dots, \vec{w}_n$ represent concrete concepts, $\vec{r}$ is an abstract concept which captures the attributes shared by the support vectors $\vec{w}_1, \dots, \vec{w}_n$. In case we have only two support vectors that are also of equal length, which for example can be obtained by normalization, the two branches are of equal length and point in exactly opposite directions, i.e.

$\vec{b}_1 = -\vec{b}_2$   (8)

This implies that their meanings are opposed to each other. In such a case, the two branches are also orthogonal to the root, i.e. $\vec{b}_i \cdot \vec{r} = 0$, which means that the attributes they describe are unrelated.

The proposed model allows us to separate different properties of word vectors and therefore, analyze what kind of information is encoded in individual word vectors. Thus, in contrast to the cosine similarity, much more nuanced relationships between words can be discovered. Moreover, the model allows us to identify sub-spaces in the semantic space containing sets of words and we are therefore able to perform set-theoretic operations. In order to illustrate the advantages of the approach, below, we analyze different lexical relations using the semantic tree model. The experimental details are given in the Appendix A.1.

Categorization

The derivation of the root of a semantic tree can be viewed as defining the properties of a category. The properties shared by all support vectors are thereby taken as the properties of the category, and the set of the children ch($\vec{r}$), which share these properties, represents the members of the category.

In Example 1 below, a number of categories are formed using this approach. E.g., the sub-vector which the given month vectors have in common is also shared by the eight other word vectors representing the eight remaining months, but by no other word vector. Thus, we are able to form a category on the basis of a couple of examples.

Example 1.

= [];
= root(); ch() = 12
ch = [

]

= [];
= root(); ch() = 9;
ch = [
];

= [];
= root(); ch() = 6;
ch= [];

= [];
= root(); ch() = 24;
ch = [
];

Meaning of semantic tree branches

As discussed in Section 3.1, the branches of a semantic tree capture specific meanings of the individual support vectors. This phenomenon is illustrated in Example 2. E.g., the branch sub-vector in Example 2 leading from the root to the word vector of Spain is also a sub-vector of the word vectors of Barcelona and Madrid. Thus, the words Spain, Barcelona and Madrid are indicative of the meaning of the derived branch sub-vector.

Example 2.
= []
= root(); ch() = 52;
ch() = [
]

ch(
ch(
ch(

ch(

ch(

Grammatical categories

In Example 3, we use the semantic tree model in order to identify branch sub-vectors which represent grammatical categories.

Example 3.
Tense:
= [];
ch() =
ch()=

Comparatives, superlatives:
= [];
ch(
ch(

ch(

Plural, singular:
= [];
ch(
ch(

3.2 Semantic space networks

The semantic tree model allows decomposing a word vector by splitting it up into two sub-vectors. However, the derived sub-vectors can be further decomposed into more fine-grained representations, whereby the two derived sub-vectors serve as support vectors for further semantic trees. The derived representations are also sub-vectors but describe more subtle properties compared to the original sub-vectors. Using this technique, an arbitrary number of trees can be constructed based on the derived sub-vectors in each case. The derived trees share sub-vectors and can give rise to networks of arbitrary complexity, which we call semantic space networks (SSNs).

In order to illustrate our approach, in this subsection, we present a constructed binary tree as one possible combination of semantic trees. Further examples can be found in Appendix A.3.

Binary tree

The roots of two semantic trees can serve as support vectors for a third tree, which gives rise to a binary tree. Such a tree structure is schematically illustrated in Figure 2.

Figure 2: Binary tree network: Given two pairs of support vectors, two semantic trees can be constructed. Their roots can then be used as support vectors for a third tree, giving rise to a binary tree.
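The construction in Figure 2 can be sketched with the root() and branch() helpers from Section 3.1; the stand-in random embeddings and the family-relation word choice (in the spirit of Example 4) are our own assumptions.

import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings; in practice these would be looked up in a trained model.
emb = {w: rng.normal(size=50) for w in ["father", "mother", "brother", "sister"]}

# Two lower trees, one per pair of support vectors.
root_a = root(emb["father"], emb["mother"])
root_b = root(emb["brother"], emb["sister"])

# Their roots serve as support vectors of a third tree (the top of Figure 2).
top_root = root(root_a, root_b)
branch_a, branch_b = branch(root_a, root_b)

# The branches of the lower trees carry the more specific attributes
# of the individual support vectors.
leaves_a = branch(emb["father"], emb["mother"])
leaves_b = branch(emb["brother"], emb["sister"])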

Example 4.
If we choose the four support vectors = for the nodes , , , of the binary tree, we obtain the following sub-vectors:
ch() =

ch() =
ch() =
ch() =

ch() =

ch() =

ch() =

ch() =

ch() =
The sub-vector at the top of the tree represents a concept referring to a wide range of different family relations. The sub-vector represented by one of the intermediate nodes refers to parenthood, and the other refers to concepts related to brotherhood and sisterhood. The branch sub-vectors below them refer to these attributes more specifically, whereby the properties represented by the top root are omitted. For some of the sub-vectors, noisy terms are introduced. The sub-vectors at the leaves describe more specific attributes, mostly associated with gender information. It must be noted that the tree represents a hierarchy of concepts: the top sub-vector can be considered a hypernym of the concepts below it, the two intermediate sub-vectors can be viewed as hypernyms of the concepts in their respective sub-trees, and the lower sub-vectors can be considered hypernyms of the individual support vectors. Further examples of SSNs in Appendix A.3 demonstrate how sub-vectors can be further decomposed by combining word vectors in different configurations. Thus, it can be illustrated which meanings the word vectors have in common and which are distinct.

4 Experiments

In this section, we use SSNs to solve a categorization task and the word analogy problem. The results of the experiments are discussed in the last sub-section.

4.1 Categorization

Given a number of concepts, humans are able to construct ad hoc categories Barsalou (1983), which are based on attributes that the given concepts share. In contrast to formal reasoning systems based on knowledge bases, which come with a predefined set of categories, humans are able to form an unlimited number of new categories, which allows them to solve problems in new situations. In order to address this problem with machine learning systems, we define the category completion task. SSNs naturally address the category completion task, as the root of a semantic tree defines the attributes which are shared by a number of concepts. It therefore defines the criteria according to which the remaining members of the category can be found. As presented in Example 1, the shared sub-vector of the given month embeddings can for instance be used to recover the remaining eight months, which are the other members of the month category.

In this section, we perform category completion experiments on two corpora: a newly constructed closed-set category corpus and a corpus of categories based on the Google word analogy corpus Mikolov et al. (2013a).

Corpora

We introduce a closed-set category corpus with 13 categories: world_countries, months, weekdays, digits, rainbow_colors, planets, family_relations, personal_pronouns, world_capitals, us_states, modals, possessives, question_words. The number of instances ranges from 7 members, as in the category rainbow_colors, to 116 members, as in the category world_countries. The total number of instances is 374.

The Google analogy corpus contains 28 categories with a total number of 1146 instances. Here closed set categories, such as world_countries and world_capitals, are mixed with open set categories, such as common_countries or sets of adjectives and adverbs. The latter cases are difficult to solve since the number of instances is larger than the given sets of words, and the category boundaries are fuzzy.

Categorization experiments

% data                  10      20      30      40
baseline                .182    .333    .461    .571
closed category corpus:
SSNs                    .349    .494    .646    .678
SVM_100                 .443    .435    .361    .311
SVM_500                 .582    .613    .585    .566
Google analogy corpus:
SSNs                    .282    .406    .329    .305
SVM_100                 .357    .267    .218    .201
SVM_500                 .468    .474    .429    .395
Table 1: F1 scores for SSNs and for the SVM with different numbers of negative samples (SVM_100: 100 negative samples, SVM_500: 500 negative samples)

To be able to run comparable experiments for categories with a different number of instances, we provide a certain percentage of instances as example data for each category instead of giving the same number of instances in each case. E.g., if we want to perform experiments for 25% of the data, we provide two planet names out of the eight instances in the planet category to find the remaining six planets, or three month names from the month category to predict the remaining nine. In the experiments, we restrict ourselves to a vocabulary of 50,000 words to omit rare and noisy word embeddings. The defined problem is challenging, as the models need to identify a small number of words out of 50,000 instances.

In Table 1, the performance of the support vector machine (SVM) classifier is compared to the performance of the SSNs on the two corpora using the GloVe word embeddings. The SVM is superior to other classifiers, such as Logistic Regression, Random Forests, and K-Nearest Neighbors, on this task and was therefore chosen as a baseline. We compute the F1 scores by comparing the example data in combination with the predicted new instances to the entire set of instances of the considered category. The baseline results illustrated in Table 1 are obtained by simply considering the example data without predicting any additional instances. More concretely, we consider the example data as the prediction and take all instances of the category (including the example data) for the evaluation.

To be able to solve the task with an SVM classifier, we consider one category at a time. We split the entire vocabulary into two classes: the instances of the considered category and the remaining words of the vocabulary. We then train the SVM on the word vectors of the example data and on additional samples from the remaining words in the vocabulary, which we call negative samples (indicated by the subscript numbers 100 and 500). The trained SVM is then used to classify whether each word vector in the vocabulary belongs to the considered category or not. In order to reduce the variance of the results for the SVM and the SSNs, we report the mean values of 5 experiments, in each of which we randomly exchange the instances in the training and the testing set.

The results in Table 1 show that, compared to the closed category corpus, the performance on the Google analogy corpus is significantly lower. This is due to the problem of open set categories described above. The results also demonstrate that the SVM requires a large number of negative samples (in addition to the example data) in order to reach performance equivalent to SSNs (which only rely on the example data). SSNs are therefore much more data efficient and require about two orders of magnitude fewer examples than the SVM model.
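The SSN-based category completion described above can be sketched as follows, assuming the root() helper from Section 3.1 and a dictionary vocab mapping the 50,000 most frequent words to their vectors; the F1 computation mirrors the evaluation protocol, although the exact data handling of the original experiments may differ.

import numpy as np

def complete_category(example_words, vocab):
    # Compute the root of the example vectors and return its children,
    # i.e. the predicted members of the category.
    r = root(*[vocab[w] for w in example_words])
    r_norm = np.linalg.norm(r)
    return [w for w, v in vocab.items() if np.dot(v, r) / r_norm >= r_norm]

def f1_against_gold(example_words, predicted, gold):
    # F1 of (example data + predictions) against the full category.
    system, gold = set(example_words) | set(predicted), set(gold)
    tp = len(system & gold)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0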

4.2 Word analogy task

The traditional word analogy task is defined such that given two related words from different categories A and B, such as a country and its capital, and given a third word from the first category, such as Germany, one needs to find the word from the second category which stands in the same relation to the third word. In the discussed example, Berlin satisfies the inferred relation and is therefore the solution. We solve the word analogy task by constructing SSNs and extracting sub-vectors which are suitable to predict the missing word. E.g., to find the vector for Berlin, we need to find the sub-vectors representing the abstract concepts of capital and German. To derive capital, we can take the root sub-vector of all the members in the category capitals (e.g., Paris, Rome, Moscow, …). To define German, we can take all state names and take the branch vector for Germany. This method will be denoted as SSNbranch. However, branches or individual word vectors contain idiosyncrasies which hurt the performance on the word analogy task Drozd et al. (2016). In order to remove idiosyncrasies from the branch sub-vectors, we define Algorithm 1 described below.

1: Input: query vector $\vec{a}_q$ and example pairs $(\vec{a}_1, \vec{b}_1), \dots, (\vec{a}_n, \vec{b}_n)$, where $\vec{a}_i \in A$ and $\vec{b}_i \in B$
2: Desired output: approximation of $\vec{b}_q$
3: for $\vec{a}_j$ in $\{\vec{a}_1, \dots, \vec{a}_n\}$ not equal to $\vec{a}_q$:
4:    $\vec{c}_j$ = branch($\vec{a}_q$, $\vec{a}_j$)   [branch leading to $\vec{a}_q$]
5: root_branch = root($\vec{c}_1$, …, $\vec{c}_j$, …, $\vec{c}_n$)
6: root_B = root($\vec{b}_1$, …, $\vec{b}_n$)
7: $\vec{b}_q$ = root_branch + root_B
Algorithm 1 Filtering algorithm

The algorithm computes the branches of n−1 trees (steps 3 and 4). Thereby, the query word vector $\vec{a}_q$ is combined with every other word vector from its category to form trees from which the branches leading to $\vec{a}_q$ are extracted. Next, the root of these branches is determined (step 5). This root represents a filtered version of the branch of $\vec{a}_q$. In step 6, the root of the category B is computed. Finally, an approximation of the word vector $\vec{b}_q$ is obtained (step 7). The resulting method will be denoted as SSNfilter.
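A possible implementation sketch of Algorithm 1 (SSNfilter), again assuming the root() and branch() helpers from Section 3.1; the nearest-word lookup via cosine similarity is our own addition for illustration.

import numpy as np

def ssn_filter(a_query, a_others, b_vectors):
    # a_query   -- vector of the query word from category A (e.g. Germany)
    # a_others  -- vectors of the other known members of category A
    # b_vectors -- vectors of the known members of category B (e.g. capitals)
    # Steps 3-4: branches of the trees formed by a_query and every other member
    # of A; branch() returns the branches in the order of its arguments,
    # so index 0 is the branch leading to a_query.
    branches = [branch(a_query, a_j)[0] for a_j in a_others]
    # Step 5: the root of these branches filters out idiosyncrasies of a_query.
    filtered_branch = root(*branches)
    # Step 6: the root of category B represents the shared attribute (e.g. capital).
    root_b = root(*b_vectors)
    # Step 7: approximation of the missing word vector b_q.
    return filtered_branch + root_b

def nearest_word(vector, vocab, exclude=()):
    # Return the vocabulary word whose vector is closest in cosine similarity.
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(vocab[w], vector))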
We also compare the results to three variants of the vector-offset method: (i) the traditional vector-offset method (VecOfAdd) Mikolov et al. (2013c), (ii) a formulation of the problem as a multiplicative combination of three pairwise word similarities (VecOfMul) Levy and Goldberg (2014b), and (iii) taking the average offset vector of the given example pairs (VecOfAvr) Drozd et al. (2016). The results are illustrated in Table 2.4 As can be noticed, the vector-offset average method VecOfAvr is superior to VecOfAdd and VecOfMul. Relying only on the original branch sub-vectors using SSNbranch yields worse performance compared to the vector-offset methods. However, when applying filtering in SSNfilter, we are able to substantially outperform the vector-offset methods. Since the problem setting is deterministic, the results are significant. Compared to the traditional vector-offset methods VecOfAdd and VecOfMul, VecOfAvr and the SSN-based methods use additional information in the form of all the given instances from the categories A and B. This allows the methods to remove idiosyncrasies and improve performance.

method GloVe word2vec
VecOfAdd .717 .726
VecOfMul .725 .739
VecOfAvr .754 .740
SSNbranch .620 .588
SSNfilter .797 .781
Table 2: Comparison of different methods on the Google word analogy task
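For reference, the three vector-offset baselines in Table 2 can be sketched as follows; the cosine-similarity search over the vocabulary and the shifted similarities in the multiplicative variant follow common practice, but details of the cited implementations may differ.

import numpy as np

def _cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def vec_of_add(a, a_star, b, vocab, exclude=()):
    # VecOfAdd: the word closest to a* - a + b (e.g. Paris - France + Germany ≈ Berlin).
    target = a_star - a + b
    return max((w for w in vocab if w not in exclude), key=lambda w: _cos(vocab[w], target))

def vec_of_mul(a, a_star, b, vocab, exclude=(), eps=1e-3):
    # VecOfMul: multiplicative combination of pairwise similarities,
    # with similarities shifted to [0, 1].
    def sim(u, v):
        return (_cos(u, v) + 1.0) / 2.0
    def score(w):
        x = vocab[w]
        return sim(x, a_star) * sim(x, b) / (sim(x, a) + eps)
    return max((w for w in vocab if w not in exclude), key=score)

def vec_of_avr(pairs, b, vocab, exclude=()):
    # VecOfAvr: add the average offset of the example pairs (a_i, a_i*) to b.
    avg_offset = np.mean([a_star - a for a, a_star in pairs], axis=0)
    target = b + avg_offset
    return max((w for w in vocab if w not in exclude), key=lambda w: _cos(vocab[w], target))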

4.3 Discussion

The presented experiments show that we can represent various linguistic properties by sub-vectors and achieve superior performance compared to other approaches. Nevertheless, in our error analysis we have observed that in many cases we cannot perfectly isolate the features of word embeddings. There are often words contained in a derived category which a human would not assign to it (see for instance Example 4, where Photo and joins are included in the brother-sister category). If we use a larger vocabulary (more than 50k words), even more foreign words are included in the derived categories.

On the other hand, the categories almost always contain fewer words than would actually belong to them, e.g. the categories comparatives, superlatives, and plural in Example 3. These observations suggest that the different attributes are not perfectly represented in the semantic vector space, and the rarer a word is, the less likely it is that it will be contained in the appropriate category. We suspect that these problems have two different causes: (1) The semantic spaces are irregular in some regions of the space and are therefore deficient. (2) The attributes are represented in the semantic vector space but are not linearly separable. Given the fact that our method works for a large number of word embeddings, we believe that, in principle, a vector space can be derived where all the attributes are linearly separable, and this should be explored in future work.

5 Related Work

As outlined in the introduction, there has been much work on making word embeddings more interpretable. Here, we restrict ourselves to a few studies which are most related to the analysis of sub-word representations.

Yaghoobzadeh and Schütze (2016) propose a framework for intrinsic evaluation of word embeddings, in which they evaluate whether a desired feature is present in a word vector using an SVM classifier. Rothe and Schütze (2017) present a system for learning embeddings for non-word objects like synsets, lexemes, and entities for lexical resources such as WordNet. Cotterell et al. (2016) develop a method for deriving word vectors for rare words for languages with rich inflectional morphology. They rely on morphological resources to derive representations for morphemes which are then linearly combined to predict a representation of a rare inflection of a word. Nevertheless, we believe that word vectors possess much more information than can be extracted using lexical resources, and that the proposed decomposition using SSNs allows for a more fine-grained analysis of word embeddings. Rothe et al. (2016) introduce a new method for transforming word embeddings into a dense low-dimensional space where the features of interest are represented in each separate dimension. Arora et al. (2018) present an approach to derive vectors representing different senses of an ambiguous word. They assume that a word vector of a polysemous word is a weighted linear combination of its other meanings, which is in agreement with our discussion in Section 2. Nevertheless, our assumption goes further since we believe that the decomposition of word vectors allows us to analyze the properties encoded in word vectors in general and not just the different senses of an ambiguous word.
The vector-offset method introduced by Mikolov et al. (2013c) directly relates to our vector decomposition approach, as the vector offset also represents specific attributes of word vectors. In a number of follow-up studies to Mikolov et al. (2013c), different variants of the vector-offset method have been proposed, or entirely new approaches presented, in order to solve the word analogy task Levy and Goldberg (2014b); Vylomova et al. (2015); Drozd et al. (2016). In Section 4.2, we compare our approach for solving the task to these unsupervised methods.

6 Conclusion

In this study, we presented a novel approach for decomposing word embeddings into meaningful sub-word representations, which we call sub-vectors. The method allows analyzing the information encoded in a word vector or the relation between groups of words. For a rigorous definition of the approach, we defined the distributional decomposition hypothesis. On the basis of the defined hypothesis, we introduced semantic space networks (SSNs), which are a framework for a systematic decomposition of word embeddings. Using the proposed framework, we have been able to identify sub-vectors capturing different attributes of words, such as gender, number, or tense. Moreover, we introduced the category completion task and demonstrated that SSNs are much more data efficient than supervised classifiers on the task. We also proposed an approach to solve the word analogy task based on SSNs and showed that the method outperforms different variants of the vector-offset method.

Important future applications of SSNs lie in the diagnostics of models in downstream tasks. By decomposing input word embeddings, we can find out what kind of features are fed into a model, and by adding or removing sub-vectors, we are able to manipulate the input. We can then analyze how the changes affect the predictions of the model and whether it has learned the desired input-output relations.

7 Acknowledgements

This work has been supported by the German Research Foundation as part of the Research Training Group ”Adaptive Preparation of Information from Heterogeneous Sources” (AIPHES) at the Technische Universität Darmstadt under grant No. GRK 1994/1.

Appendix A

A.1 Analysis of lexical relations: experimental details

The experiments have been performed using pretrained, unnormalized word2vec embeddings.5 Since the vector space is more regular for more frequent words, we restrict the vocabulary to the 11,000 highest-ranked words in the word2vec vocabulary. In order to further reduce the influence of noisy word vectors, we have omitted all multi-word expressions.

A.2 Analysis of lexical relations using the semantic tree model

Antonymy

The evaluation of the antonymy relation using word embeddings is problematic, as the antonym word vectors are often not symmetric. In the antonymy relation analyzed in Example 5, the word vector for woman, for example, is significantly larger than the vector for man. This is because the word man is used in more contexts and has therefore more meanings. As discussed in Section 2, the word vector in such cases ”loses” meaning and becomes shorter. As a result, the branch vector for man is shorter than the one for woman. Moreover, the branch vector for woman has many more children word vectors, which means that it captures richer semantic information. However, it also includes words which are in general not associated with the attribute in question. We therefore also derive the component of the word vector for woman that is orthogonal to the root, which we denote by $\vec{b}_\perp$. It must be noted that this vector is in exact opposition to the branch vector for man. The vector $\vec{b}_\perp$ has fewer children vectors, but these are more obviously associated with the attribute in question.
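A small numpy sketch of the orthogonal-component construction used above, assuming the root() helper from Section 3.1; the toy vectors and variable names are ours.

import numpy as np

def orthogonal_component(w, r):
    # Component of the word vector w that is orthogonal to the root r.
    r_hat = r / np.linalg.norm(r)
    return w - np.dot(w, r_hat) * r_hat

# For two support vectors, the orthogonal components are exact opposites of
# each other; for the support vector with the smaller projection onto the root
# direction, the orthogonal component coincides with its branch.
w_man, w_woman = np.array([1.0, 0.3]), np.array([0.4, 1.2])
r = root(w_man, w_woman)
print(np.allclose(orthogonal_component(w_woman, r),
                  -orthogonal_component(w_man, r)))   # True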
Example 5.
= [];
= 2.311; = 2.656;
= root; ch() = 2;
ch() = [];

ch() =

ch() = 46;
ch() =

ch() = 17
ch() =

Polysemy

Polysemous words have a number of different meanings, which are to some extent represented in their word embeddings. Using the semantic tree model, we can to some degree recover a vector which represents a particular synset of the considered word. As illustrated in Example 6, by subtracting the meaning of different words associated with chairman from the word vector of chair, one can to some extent recover its second meaning, namely that of a piece of furniture.
Example 6.
cos_sim

= [
];
= root; =
cos_sim
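The recovery described above can be sketched as follows, assuming the root() helper from Section 3.1 and an embedding lookup emb; the particular chairman-related words are an illustrative guess at the spirit of Example 6, not the exact set used there.

# Recover the furniture sense of chair by removing the sub-vector it shares
# with chairman-related words; the word list below is illustrative only.
chairman_like = [emb[w] for w in ["chairman", "president", "committee"]]
office_sense = root(emb["chair"], *chairman_like)   # attribute shared with the office sense
furniture_sense = emb["chair"] - office_sense       # remaining component

# One would then inspect cosine similarities of furniture_sense with words
# such as "table" or "sofa" to check whether the furniture sense was recovered.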

Hypernymy

The defined sub-vectors can naturally be considered the hypernyms of their children vectors, as they represent the attribute which is shared by all of these word vectors. Hence, the question arises why word vectors of hypernym words are not sub-vectors of their hyponyms. In fact, the cosine similarity between the word vectors of hyponyms and their hypernyms is often very small. We analyze this phenomenon in Example 7.

Example 7.
cos_sim() = [
]
= [];
= root(); ch() = 12
= 2.084; = 2.401;
cos_sim() = [
]

As the example illustrates, the word vector of the hypernym is also a member of a category which refers to temporal concepts. It follows that hypernyms can also contain different meanings which are not associated with their hyponyms.

A.3 Semantic space networks

The semantic column model

Figure 3: Semantic column: one single word vector is split into two sub-vectors

=

= ;   par_count() = 5;
ch() = [, , , , ];

= ;   par_count() = 17;
ch() = [ch(),

Ternary tree

Figure 4: Ternary tree: the interrelation of the three word vectors is explored by combining three semantic trees

= =

ch() =


ch() =



ch() =



ch() =
ch() =
ch() =
ch() =

ch() =

ch() =

Quadruple relations (analogy problem)

Figure 5: Quadruple relation: four semantic trees are joined

=

ch() =
ch() =
ch() =
ch() =

ch() =
ch() =
ch() =
ch() =

ch() =
ch() =

ch() =

ch() =

Footnotes

  1. https://github.com/hanselowski/embedding_decomp
  2. In practice, the Noise Contrastive Estimation objective function is typically used Mikolov et al. (2013b).
  3. This interpretation is not unique and a similar line of reasoning is presented in Arora et al. (2018) or Rothe and Schütze (2017).
  4. To facilitate reproducibility and comparison of the results we have used the word-embeddings-benchmarks platform in our experiments https://github.com/kudkudak/word-embeddings-benchmarks
  5. https://code.google.com/archive/p/word2vec/

References

  1. Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association of Computational Linguistics, 6:483–495.
  2. Lawrence W Barsalou. 1983. Ad hoc categories. Memory & cognition, 11(3):211–227.
  3. Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537.
  4. Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 2016. Morphological smoothing and extrapolation of word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1651–1660.
  5. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  6. Aleksandr Drozd, Anna Gladkova, and Satoshi Matsuoka. 2016. Word embeddings, analogies, and machine learning: Beyond king-man+ woman= queen. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3519–3530.
  7. Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004.
  8. Alona Fyshe, Partha P Talukdar, Brian Murphy, and Tom M Mitchell. 2014. Interpretable semantic vectors from a joint model of brain-and text-based meaning. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2014, page 489. NIH Public Access.
  9. Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
  10. Julia Hirschberg and Christopher D Manning. 2015. Advances in natural language processing. Science, 349(6245):261–266.
  11. Kyoung-Rok Jang and Sung-Hyon Myaeng. 2017. Elucidating conceptual properties from word embeddings. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 91–95.
  12. David A Jurgens, Peter D Turney, Saif M Mohammad, and Keith J Holyoak. 2012. Semeval-2012 task 2: Measuring degrees of relational similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 356–364. Association for Computational Linguistics.
  13. Aykut Koç, Ihsan Utlu, Lutfi Kerem Senel, and Haldun M Ozaktas. 2018. Imparting interpretability to word embeddings. arXiv preprint arXiv:1807.07279.
  14. George Lakoff. 2008. Women, fire, and dangerous things. University of Chicago press.
  15. Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 302–308.
  16. Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In Proceedings of the eighteenth conference on computational natural language learning, pages 171–180.
  17. Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in nlp. arXiv preprint arXiv:1506.01066.
  18. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  19. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  20. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.
  21. George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
  22. Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pages 1933–1950. The COLING 2012 Organizing Committee.
  23. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  24. Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  25. Sascha Rothe, Sebastian Ebert, and Hinrich Schütze. 2016. Ultradense word embeddings by orthogonal transformation. arXiv preprint arXiv:1602.07572.
  26. Sascha Rothe and Hinrich Schütze. 2015. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. arXiv preprint arXiv:1507.01127.
  27. Sascha Rothe and Hinrich Schütze. 2017. Autoextend: Combining word embeddings with semantic resources. Computational Linguistics, 43(3):593–617.
  28. Andreas Rücklé, Steffen Eger, Maxime Peyrard, and Iryna Gurevych. 2018. Concatenated -mean word embeddings as universal cross-lingual sentence representations. arXiv preprint arXiv:1803.01400.
  29. Adriaan MJ Schakel and Benjamin J Wilson. 2015. Measuring word significance using distributed representations of words. arXiv preprint arXiv:1508.02297.
  30. Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2015. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. arXiv preprint arXiv:1509.01692.
  31. Yadollah Yaghoobzadeh and Hinrich Schütze. 2016. Intrinsic subspace evaluation of word embedding representations. arXiv preprint arXiv:1606.07902.
  32. Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational intelligenCe magazine, 13(3):55–75.
  33. Alexey Zobnin. 2017. Rotations and interpretability of word embeddings: The case of the Russian language. In International Conference on Analysis of Images, Social Networks and Texts, pages 116–128. Springer.