Towards Understanding Linear Word Analogies
A surprising property of word vectors is that vector algebra can often be used to solve word analogies. However, it is unclear why – and when – linear operators correspond to non-linear embedding models such as skip-gram with negative sampling (SGNS). We provide a rigorous explanation of this phenomenon without making the strong assumptions that past work has made about the vector space and word distribution. Our theory has several implications. Past work has often conjectured that linear structures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel theoretical justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information theoretic interpretation of Euclidean distance in vector spaces, providing rigorous justification for its use in capturing word dissimilarity.
Distributed representations of words are a cornerstone of current methods in natural language processing. Word embeddings, also known as word vectors, can be generated by a variety of models, all of which share Firth’s philosophy (1957) that the meaning of a word is defined by “the company it keeps”. The simplest such models obtain word vectors by constructing a low-rank approximation of a matrix containing a co-occurrence statistic (Landauer and Dumais, 1997; Rohde et al., 2006). In contrast, neural network models (Bengio et al., 2003; Mikolov et al., 2013b) learn word embeddings by trying to predict words using the contexts they appear in, or vice-versa.
A surprising property of word vectors derived via neural networks is that word analogies can often be solved with vector algebra. For example, ‘king is to ? as man is to woman’ can be solved by finding the closest vector to $\vec{king} - \vec{man} + \vec{woman}$, which should be $\vec{queen}$. It is unclear why linear operators can effectively compose embeddings generated by non-linear models like skip-gram with negative sampling (SGNS). There have been two attempts to rigorously explain this phenomenon, but both have made strong assumptions about either the embedding space or the word distribution. The paraphrase model (Gittens et al., 2017) hinges on words having a uniform distribution rather than the typical Zipfian distribution, which the authors themselves acknowledge is unrealistic. The latent variable model (Arora et al., 2016) makes many a priori assumptions about the word vectors, such as the assumption that word vectors are generated by randomly scaling vectors sampled uniformly at random from a unit sphere.
In this paper, we explain why – and under what conditions – word analogies in GloVe and SGNS embedding spaces can be solved with vector algebra, without making the strong assumptions past work has. We begin by formalizing word analogies as functions that transform one word vector into another. When this transformation is simply the addition of a displacement vector – as is the case when using vector algebra – we call the analogy a linear analogy.
We first prove that in both SGNS and GloVe embedding spaces without reconstruction error, a linear analogy holds over a set of ordered word pairs iff each word pair $(x, y)$ has the same value of $\text{PMI}(x, y) + \log p(x, y)$. We call this expression the co-occurrence shifted pointwise mutual information (csPMI). By then framing vector addition as a kind of word analogy, we offer several new insights into the additive compositionality of words:
Past work has often cited the Pennington et al. (2014) conjecture as an intuitive explanation of why vector algebra works for analogy solving. The conjecture is that an analogy of the form $a$ is to $b$ as $x$ is to $y$ holds iff $p(w|a)/p(w|b) = p(w|x)/p(w|y)$ for every other word $w$ in the vocabulary. While this is an intuitive idea, it is not based on any theoretical derivation or empirical support. We provide a rigorous proof that this is indeed true.
Consider two words $x, y$ and their sum $\vec{z} = \vec{x} + \vec{y}$ in an SGNS embedding space with no reconstruction error. If $z$ were in the vocabulary, the similarity between $z$ and $x$ (as measured by the csPMI) would be the log probability of $y$ shifted by a model-specific constant. This implies that the addition of two words automatically down-weights the more frequent word. Since many weighting schemes are based on the principle that more frequent words should be down-weighted ad hoc (Robertson, 2004; Arora et al., 2017), the fact that this is done automatically provides novel justification for using addition to compose words.
Consider any two words $x, y$ in an SGNS embedding space with no reconstruction error. The squared Euclidean distance between $\vec{x}$ and $\vec{y}$ is the negative csPMI shifted by a model-specific constant. In other words, on average, the more similar two words are (as measured by csPMI) the smaller the distance between their vectors in the embedding space. Although this result is intuitive, it is also the first rigorous explanation of why the Euclidean distance in embedding space is a good proxy for word dissimilarity.
Although our main theorem only concerns embedding spaces with no reconstruction error, we also explain why, in practice, linear word analogies hold in embedding spaces with some noise. We conduct experiments that support the few assumptions we make and show that the transformations represented by various word analogies correspond to unique csPMI values. Without making the strong assumptions of past theories, we thus offer a rigorous explanation of why, and when, word analogies can be solved with vector algebra.
2 Related Work
Pointwise mutual information (PMI) is a common measure of word similarity. For two words $x, y$, it captures how much more frequently they co-occur than by chance: $\text{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$ (Church and Hanks, 1990).
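As a concrete illustration of the definition above, the following minimal sketch estimates PMI from raw counts. The function name `pmi`, the use of a single normalizer `total` for both word and pair counts, and the counts themselves are assumptions made for illustration; they are not taken from the paper.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Estimate PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ) from raw counts.

    Simplifying assumption: the same `total` normalizes both the
    co-occurrence count and the individual word counts.
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Hypothetical counts, for illustration only.
# Co-occurring exactly as often as chance predicts gives a PMI of about 0.
independent = pmi(count_xy=10, count_x=100, count_y=100, total=1000)
# Co-occurring five times more often than chance gives log(5).
associated = pmi(count_xy=50, count_x=100, count_y=100, total=1000)
```

A positive value means the two words co-occur more often than they would if they were independent; a negative value means less often.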
Word embeddings are distributed representations of words, typically in a low-dimensional continuous space. Also called word vectors, they can capture semantic and grammatical properties of words, even allowing relationships to be expressed algebraically (Mikolov et al., 2013b). Word vectors are generally obtained in two ways: (a) from neural networks that learn word representations by predicting co-occurrence patterns in the training corpus (Bengio et al., 2003; Mikolov et al., 2013b; Collobert and Weston, 2008); (b) from low-rank approximations of word-context matrices containing a co-occurrence statistic (Landauer and Dumais, 1997; Levy and Goldberg, 2014).
The objective of skip-gram with negative sampling (SGNS) is to maximize the probability of observed word-context pairs and to minimize the probability of $k$ randomly sampled negative examples. For an observed word-context pair $(w, c)$, the objective would be $\log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c' \sim P_n}[\log \sigma(-\vec{w} \cdot \vec{c'})]$, where $c'$ is the negative context, randomly sampled from a scaled distribution $P_n$. The objective is optimized over all the word-context pairs in the corpus. Words that appear in similar contexts will therefore have similar embeddings. Though no co-occurrence statistics are explicitly calculated, Levy and Goldberg (2014) proved that SGNS is in fact implicitly factorizing a word-context PMI matrix shifted by a constant.
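The per-pair objective described above can be sketched in a few lines. This is an illustrative sketch, not the reference word2vec implementation: the function names are hypothetical, the expectation over negative contexts is approximated by an average over a supplied list of sampled context vectors, and the vectors are toy values.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_pair_objective(word_vec, context_vec, negative_context_vecs, k):
    """log sigma(w.c) + k * E[log sigma(-w.c')] for one observed pair.

    The expectation is estimated by averaging over the given negative
    samples (an assumption of this sketch).
    """
    positive = log_sigmoid(dot(word_vec, context_vec))
    negative = sum(
        log_sigmoid(-dot(word_vec, c_neg)) for c_neg in negative_context_vecs
    ) / len(negative_context_vecs)
    return positive + k * negative

# Toy vectors: an aligned context with a dissimilar negative sample scores
# higher than an opposed context with a similar negative sample.
good = sgns_pair_objective([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]], k=1)
bad = sgns_pair_objective([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]], k=1)
```

Maximizing this quantity pushes the word vector toward its observed contexts and away from the sampled negative contexts.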
Latent Variable Model
The latent variable model (Arora et al., 2016) was the first attempt to rigorously explain why word analogies can be solved algebraically. It is a generative model that assumes that word vectors are generated by the random walk of a “discourse” vector on the unit sphere. Gittens et al.’s (2017) criticism of this proof is that it assumes that word vectors are known a priori and that they are generated by randomly scaling vectors uniformly sampled from the unit sphere (or having properties consistent with this sampling procedure). The proof also relies on a conjecture by Pennington et al. (2014) that linear relations can be expressed as a ratio of probabilities.
Paraphrase Model
The paraphrase model (Gittens et al., 2017) was the only other attempt to rigorously explain why word analogies can be solved algebraically. It proposes that any set of context words $C = \{c_1, \ldots, c_m\}$ is semantically equivalent to a single word $c$ if $\forall w$, $p(w|c_1, \ldots, c_m) = p(w|c)$. One problem with this idea is that the number of possible context sets far outnumbers the vocabulary size, precluding a one-to-one mapping of sets to words; the authors circumvent this problem by replacing exact equality with the minimization of KL divergence. Assuming that the words have a uniform distribution, the paraphrase of $C$ can then be written as an unweighted sum of its word vectors. However, this uniformity assumption is unrealistic – word frequencies obey a Zipf distribution, which is Pareto (Piantadosi, 2014).
3 The Structure of Word Analogies
3.1 Formalizing Analogies
A word analogy is a statement of the form “a is to b as x is to y”, which we will write as (a,b)::(x,y). It asserts that $a$ and $x$ can be transformed in the same way to get $b$ and $y$ respectively, and that $b$ and $y$ can be inversely transformed to get $a$ and $x$. A word analogy can hold over an arbitrary number of ordered pairs: e.g., “Berlin is to Germany as Paris is to France as Ottawa is to Canada …”. The elements in each ordered pair do not need to exist in the same space either – for example, (king,roi)::(queen,reine) is an analogy across English and French, where the transformation is English-to-French translation. For (king,queen)::(man,woman), the canonical analogy in the word embedding literature, the transformation corresponds to changing the gender. Therefore, to formalize the definition of an analogy, we will refer to it as a transformation.
Definition 1. An analogy $f$ is an invertible transformation that holds over a set of ordered pairs $S$ iff $\forall (x, y) \in S$, $f(x) = y \wedge f^{-1}(y) = x$.
The word embedding literature (Mikolov et al., 2013b; Pennington et al., 2014) has focused on a very specific type of transformation, the addition of a displacement vector. For example, for (king,queen)::(man,woman), the transformation would be $\vec{king} + (\vec{woman} - \vec{man}) = \vec{queen}$, where the displacement vector is expressed as the difference $\vec{woman} - \vec{man}$. To make a distinction with our general class of analogies in Definition 1, we will refer to these as linear analogies.
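Definition 2. A linear analogy $f$ is an invertible transformation of the form $\vec{x} \mapsto \vec{x} + \vec{r}$. $f$ holds over a set of ordered pairs $S$ iff $\forall (x, y) \in S$, $\vec{x} + \vec{r} = \vec{y}$.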
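To make the definition of a linear analogy concrete, the sketch below checks whether a single displacement vector maps the first element of every pair onto the second. The function names, the tolerance, and the two-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
def displacement(x_vec, y_vec):
    """Return the vector r such that x + r = y."""
    return [y_i - x_i for x_i, y_i in zip(x_vec, y_vec)]

def holds_linear_analogy(pairs, tol=1e-9):
    """True iff one displacement r satisfies x + r = y for every (x, y)."""
    if not pairs:
        return True  # vacuously true over the empty set
    first_x, first_y = pairs[0]
    r = displacement(first_x, first_y)
    for x_vec, y_vec in pairs[1:]:
        d = displacement(x_vec, y_vec)
        if any(abs(d_i - r_i) > tol for d_i, r_i in zip(d, r)):
            return False
    return True

# Toy 2-d vectors (hypothetical; real embeddings have hundreds of dimensions).
toy = {
    "king": [0.8, 0.1], "queen": [0.8, 0.9],
    "man": [0.2, 0.1], "woman": [0.2, 0.9],
    "apple": [0.5, 0.5],
}
gender_pairs = [(toy["king"], toy["queen"]), (toy["man"], toy["woman"])]
mixed_pairs = gender_pairs + [(toy["man"], toy["apple"])]
```

Here `holds_linear_analogy(gender_pairs)` is true because both pairs share the displacement obtained from the first pair, while `mixed_pairs` fails because the third pair has a different displacement.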
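Theorem 1 (Co-occurrence Shifted PMI Theorem)
Let $W$ be an SGNS or GloVe embedding space with no reconstruction error and $S$ be a set of ordered pairs such that $\forall (x, y) \in S$, $\vec{x}, \vec{y} \in W$. A linear analogy $f$ holds over $S$ iff $\exists \gamma \in \mathbb{R}$, $\forall (x, y) \in S$, $\text{PMI}(x, y) + \log p(x, y) = \gamma$.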
Throughout the rest of this paper, we will refer to $\text{PMI}(x, y) + \log p(x, y)$ as the co-occurrence shifted PMI (csPMI) of $x$ and $y$. In sections 3.2 and 3.3, we prove the csPMI Theorem. In section 3.4, we explain why, in practice, perfect reconstruction is not needed to solve word analogies using vector algebra. In section 4, we explore what the csPMI Theorem implies about vector addition and Euclidean distance in SGNS embedding spaces.
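The csPMI is simple to compute once the probabilities are known. The sketch below uses a hypothetical helper `cs_pmi` and made-up probabilities to show two word pairs with different PMI values that nevertheless share the same csPMI.

```python
import math

def cs_pmi(p_xy, p_x, p_y):
    """csPMI(x, y) = PMI(x, y) + log p(x, y)."""
    pmi_xy = math.log(p_xy / (p_x * p_y))
    return pmi_xy + math.log(p_xy)

# Hypothetical probabilities for two word pairs.
pair_one = cs_pmi(p_xy=0.01, p_x=0.1, p_y=0.1)  # PMI = log 1,   csPMI = log 0.01
pair_two = cs_pmi(p_xy=0.02, p_x=0.2, p_y=0.2)  # PMI = log 0.5, csPMI = log 0.01
```

In the setting of the csPMI Theorem, pairs such as these would be candidates for sharing one linear analogy even though their PMI values differ.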
3.2 Analogies as Parallelograms
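Lemma 1. Where $\langle \cdot, \cdot \rangle$ denotes the inner product, a linear analogy $f$ holds over a set of ordered pairs $S$ iff $\exists \gamma' \in \mathbb{R}$, $\forall (x, y) \in S$, $2\langle \vec{x}, \vec{y} \rangle - \|\vec{x}\|^2 - \|\vec{y}\|^2 = \gamma'$.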
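When $S$ is empty, Lemma 1 is vacuously true. For the remaining cases, let $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. When $n = 1$, Lemma 1 holds trivially. When $n > 1$, consider the two-element subsets of $S$, written without loss of generality as $\{(x_1, y_1), (x_2, y_2)\}$. $f$ holds over every such subset iff it holds over $S$. We start by noting that by Definition 2, $f$ holds over $(x_1, y_1)$ and $(x_2, y_2)$ iff:
$$\vec{x}_1 + \vec{r} = \vec{y}_1 \;\wedge\; \vec{x}_2 + \vec{r} = \vec{y}_2 \tag{1}$$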
By rearranging the equations in (1), we know that $\vec{y}_2 - \vec{y}_1 = \vec{x}_2 - \vec{x}_1$ and that $\vec{y}_2 - \vec{x}_2 = \vec{y}_1 - \vec{x}_1$. Put another way, $\vec{x}_1, \vec{y}_1, \vec{x}_2, \vec{y}_2$ form a quadrilateral in vector space whose opposite sides are parallel and equal in length. By definition, this quadrilateral is then a parallelogram. In fact, this is often how word analogies are visualized in the literature (see Figure 1).
The opposite angles of a parallelogram – and their cosines – must be equal in size. We write this in terms of the inner products of vector differences and simplify:
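Thus $f$ holds over any subset $\{(x_1, y_1), (x_2, y_2)\} \subseteq S$ iff $2\langle \vec{x}, \vec{y} \rangle - \|\vec{x}\|^2 - \|\vec{y}\|^2$ is the same for both pairs. Since this is true of every such subset, $f$ holds over $S$ iff each ordered pair $(x, y) \in S$ satisfies $2\langle \vec{x}, \vec{y} \rangle - \|\vec{x}\|^2 - \|\vec{y}\|^2 = \gamma'$, for some $\gamma' \in \mathbb{R}$.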
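The forward direction of this argument can be checked directly. The lines below are a sketch written for this purpose and are not a reproduction of the paper's own equation (2): if two pairs share a displacement $\vec{r}$, their displacement vectors have equal squared length, and expanding the squared norm yields the quantity that appears in Lemma 1.

```latex
\begin{align*}
\vec{y}_1 - \vec{x}_1 = \vec{r} = \vec{y}_2 - \vec{x}_2
  &\implies \|\vec{y}_1 - \vec{x}_1\|^2 = \|\vec{y}_2 - \vec{x}_2\|^2 = \|\vec{r}\|^2 \\
\|\vec{y} - \vec{x}\|^2
  &= \|\vec{x}\|^2 + \|\vec{y}\|^2 - 2\langle \vec{x}, \vec{y} \rangle \\
\text{hence}\quad
2\langle \vec{x}_1, \vec{y}_1 \rangle - \|\vec{x}_1\|^2 - \|\vec{y}_1\|^2
  &= 2\langle \vec{x}_2, \vec{y}_2 \rangle - \|\vec{x}_2\|^2 - \|\vec{y}_2\|^2
   = -\|\vec{r}\|^2
\end{align*}
```

Under this reading, the shared constant in Lemma 1 is simply the negative squared length of the displacement vector.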
3.3 Proof of the csPMI Theorem
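Lemma 2. Let $W$ be an SGNS or GloVe embedding space with no reconstruction error. For any two words $x, y$ in $W$, where $k$ is the number of negative samples, $X_{x,y}$ is the frequency, and $b_x, b_y$ are the learned biases for GloVe:
$$\text{SGNS}: \langle \vec{x}, \vec{y} \rangle = \text{PMI}(x, y) - \log k \qquad \text{GloVe}: \langle \vec{x}, \vec{y} \rangle = \log X_{x,y} - b_x - b_y \tag{3}$$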
The SGNS identity was adapted from Levy and Goldberg (2014), who showed that SGNS is implicitly factorizing the $\log k$-shifted word-context PMI matrix. The identity for GloVe in (3) is simply the local objective for a word pair (Pennington et al., 2014). In both cases, we have replaced the context vector of a word with its word vector. Since the matrices being factorized by both SGNS and GloVe are symmetric, the word and context matrices only differ due to random initialization.
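From Lemma 1, we know that a linear analogy $f$ holds over a set of ordered pairs $S$ iff $\exists \gamma' \in \mathbb{R}$, $\forall (x, y) \in S$, $2\langle \vec{x}, \vec{y} \rangle - \|\vec{x}\|^2 - \|\vec{y}\|^2 = \gamma'$. Because there is no reconstruction error, we can simplify this identity by expanding the inner products using the SGNS identity in (3):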
We get the same result by expanding the GloVe identity in (3), regardless of what the learned biases are. Thus, as stated in the csPMI Theorem, in an SGNS or GloVe embedding space with no reconstruction error, a linear analogy $f$ holds over a set of ordered pairs $S$ iff $\exists \gamma \in \mathbb{R}$, $\forall (x, y) \in S$, $\text{csPMI}(x, y) = \gamma$.
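For readers who want to see the SGNS expansion step by step, one way to carry it out is sketched below. It relies on an assumption that is specific to this sketch and is not spelled out in the text above: that a word's self co-occurrence probability satisfies $p(w, w) = p(w)$, so that $\text{PMI}(w, w) = -\log p(w)$.

```latex
\begin{align*}
\gamma' &= 2\langle \vec{x}, \vec{y} \rangle - \|\vec{x}\|^2 - \|\vec{y}\|^2 \\
  &= 2\,(\mathrm{PMI}(x, y) - \log k) - (\mathrm{PMI}(x, x) - \log k) - (\mathrm{PMI}(y, y) - \log k) \\
  &= 2\,\mathrm{PMI}(x, y) - \mathrm{PMI}(x, x) - \mathrm{PMI}(y, y) \\
  &= 2\,\mathrm{PMI}(x, y) + \log p(x) + \log p(y)
     && \text{(assuming } p(w, w) = p(w)\text{)} \\
  &= \mathrm{PMI}(x, y) + \log \frac{p(x, y)}{p(x)\,p(y)} + \log p(x) + \log p(y) \\
  &= \mathrm{PMI}(x, y) + \log p(x, y)
\end{align*}
```

Under that assumption the $\log k$ terms cancel and the constant from Lemma 1 coincides with the csPMI. If self co-occurrence is treated differently, an additional term depending on $p(x, x)$ and $p(y, y)$ remains and would have to be constant across pairs for the same conclusion to follow.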
3.4 Robustness to Noise
The csPMI Theorem does not explain why, in practice, linear word analogies hold in embedding spaces that have some reconstruction error. There are two key reasons that allow this: the looser definition of vector equality in practice and the lower variance in reconstruction error associated with more frequent word pairs. For one, in practice, a word analogy task of the form a:?::x:y is solved by finding the most similar vector to $\vec{a} + (\vec{y} - \vec{x})$, where dissimilarity is defined in terms of Euclidean or cosine distance. Therefore, vector algebra can be used to find the correct solution to a word analogy even when that solution is not exact.
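The nearest-neighbour procedure described above can be sketched as follows. The function name, the toy vectors, the choice of Euclidean distance, and the exclusion of the three query words from the candidates are assumptions of this sketch (the last being a common convention in analogy evaluation, not something stated in this paper).

```python
import math

def solve_analogy(a, x, y, vectors):
    """Solve a:?::x:y by returning the word nearest to a + (y - x)."""
    target = [
        a_i + (y_i - x_i)
        for a_i, x_i, y_i in zip(vectors[a], vectors[x], vectors[y])
    ]
    candidates = [w for w in vectors if w not in (a, x, y)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

# Toy 2-d vectors (hypothetical). 'queen' is deliberately not an exact match
# for king - man + woman = [0.9, 1.0], mimicking reconstruction error.
toy_vectors = {
    "king": [0.9, 0.1],
    "queen": [0.85, 0.95],
    "man": [0.1, 0.0],
    "woman": [0.1, 0.9],
    "apple": [-0.5, 0.4],
}
answer = solve_analogy("king", "man", "woman", toy_vectors)
```

Because only the nearest candidate matters, the analogy is still solved correctly even though the target vector and the vector for 'queen' differ slightly.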
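The second reason is that the variance of the noise $\epsilon_{x,y}$ for a word pair $(x, y)$ is a strictly decreasing function of the frequency $X_{x,y}$: i.e., more frequent word pairs are associated with less reconstruction error in both SGNS and GloVe. This is because the cost of deviating from the optimal value is higher for more frequent word pairs; this is implicit in the SGNS objective (Levy and Goldberg, 2014) and explicit in the GloVe objective (Pennington et al., 2014). We also show empirically that this is true in Section 5. Assuming $\epsilon_{x,y} \sim \mathcal{N}(0, h(X_{x,y}))$, where $h$ is the variance function and $\delta$ is the Dirac delta distribution:
$$\lim_{X_{x,y} \to \infty} h(X_{x,y}) = 0 \implies \lim_{X_{x,y} \to \infty} \mathcal{N}(0, h(X_{x,y})) = \delta \implies \lim_{X_{x,y} \to \infty} \epsilon_{x,y} = 0 \tag{5}$$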
As the frequency of a word pair increases, the probability that the noise is negligible increases; when the frequency is infinitely large, the noise is sampled from the Dirac delta distribution and is therefore 0. Even without the assumption of zero reconstruction error, an analogy that satisfies the identity in Theorem 1 will hold over a set of ordered pairs as long as the frequency of each pair is large enough for noise to be negligible.
A possible benefit of mapping lower frequencies to larger variances is that it reduces the probability that a linear analogy will hold over rare word pairs. One way of interpreting this is that the variance function essentially filters out the word pairs for which there is insufficient evidence, even if the csPMI of the word pair equals $\gamma$. This would explain why reducing the dimensionality of word vectors – up to a point – actually improves performance on word analogy completion tasks (Landauer and Dumais, 1997).
4 Vector Addition as a Word Analogy
4.1 Formalizing Addition
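Corollary 1. Let $\vec{z} = \vec{x} + \vec{y}$ be the sum of two words $x$ and $y$ in an SGNS vector space with no reconstruction error. If $z$ were a word in the vocabulary, where $\delta$ is a model-specific constant, $\text{csPMI}(x, z) = \log p(y) + \delta$.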
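To frame the addition of two words as a word analogy, we need to define a set of ordered pairs $S$ such that a linear analogy holds over $S$ iff $\vec{z} = \vec{x} + \vec{y}$. To this end, consider the set $\{(x, z), (\emptyset, y)\}$, where $z$ is a placeholder for the composition of $x$ and $y$ and the null word $\emptyset$ is a placeholder that maps to the zero vector for a given embedding space. Different embedding spaces have different null words and therefore different values of $\delta$. From Definition 2, we know that:
$$(\vec{x} + \vec{r} = \vec{z}) \;\wedge\; (\vec{\emptyset} + \vec{r} = \vec{y}) \iff \vec{z} = \vec{x} + \vec{y} \tag{6}$$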
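An inner product with the zero vector is always 0, so we can infer from the SGNS identity in (3) that $\text{PMI}(\emptyset, \cdot) = \log k$ for every word in the vocabulary. From the csPMI Theorem, we know that a linear analogy holds over $\{(x, z), (\emptyset, y)\}$ iff:
$$\text{csPMI}(x, z) = \text{csPMI}(\emptyset, y) = \log k + \log p(\emptyset, y) = \log p(y) + \delta, \quad \text{where } \delta = \log\left(k^2\, p(\emptyset)\right) \tag{7}$$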
Thus the csPMI of the sum and one word is equal to the log probability of the other word shifted by a constant. In embedding spaces with some reconstruction error, there are also two noise terms to consider. However, if we assume, as in 3.4, that the noise has a zero-centered Gaussian distribution, then both noise terms are zero in expectation. Even without the assumption of zero reconstruction error, on average, the csPMI of the sum and one word is therefore equal to the log probability of the other word shifted by a constant. We cannot repeat this derivation with GloVe because it is unclear what the optimal values of the biases would be, even with perfect reconstruction.
4.2 Automatically Weighting Words
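Corollary 2. In an SGNS embedding space, on average, the sum of two words has more in common with the rarer word, where commonality is measured by the csPMI.
For two words $x, y$, assume without loss of generality that $p(x) > p(y)$. By (7):
$$p(x) > p(y) \iff \log p(x) + \delta > \log p(y) + \delta \iff \text{csPMI}(z, y) > \text{csPMI}(z, x) \tag{8}$$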
Therefore the sum has more in common with the rarer word (i.e., addition automatically down-weights the more frequent word). For example, if the vectors for x = ‘the’ and y = ‘apple’ were added to create a vector for z = ‘the apple’, we would expect csPMI(‘the apple’, ‘apple’) > csPMI(‘the apple’, ‘the’); being a stopword, ‘the’ would on average be heavily down-weighted. Even in vector spaces with reconstruction error, if we assume that the noise follows a zero-centered Gaussian distribution, (8) holds true on average.
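A small numerical illustration of this down-weighting follows. It relies only on the statement that the csPMI of the sum and one word equals the log probability of the other word plus a constant; the helper name, the unigram probabilities, and the value chosen for the constant are all hypothetical.

```python
import math

def cspmi_with_sum(p_other_word, delta):
    """csPMI between the sum z = x + y and one of its words.

    Per the corollary on addition, this equals the log probability of the
    *other* word plus a model-specific constant delta.
    """
    return math.log(p_other_word) + delta

# Hypothetical unigram probabilities and constant, for illustration only.
p_the, p_apple = 0.05, 0.0001
delta = 0.0

cspmi_sum_apple = cspmi_with_sum(p_the, delta)    # csPMI('the apple', 'apple')
cspmi_sum_the = cspmi_with_sum(p_apple, delta)    # csPMI('the apple', 'the')
```

Since ‘the’ is far more frequent than ‘apple’ in this toy setting, `cspmi_sum_apple` exceeds `cspmi_sum_the`, and the ordering does not depend on the value chosen for `delta`.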
While the rarer word is not always the more informative one, weighting schemes like inverse document frequency (IDF) (Robertson, 2004), smoothed inverse frequency (SIF) (Arora et al., 2017), and unsupervised smoothed inverse frequency (uSIF) (Ethayarajh, 2018) are all based on the principle that more frequent words should be down-weighted because they are typically less informative. The fact that addition automatically down-weights the more frequent word thus provides novel theoretical justification for using addition to compose words.
4.3 Interpreting Euclidean Distance
Corollary 3. For any words $x, y$ in an SGNS embedding space with no reconstruction error, where $\lambda$ is a model-specific constant, $\|\vec{x} - \vec{y}\|^2 = -\text{csPMI}(x, y) + \lambda$.
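We derive this corollary by framing the difference between two words as a word analogy. Where $z$ is a placeholder for $\vec{x} - \vec{y}$ and $\emptyset$ is the null word defined in section 4.1, a linear analogy holds over the set $\{(x, z), (y, \emptyset)\}$ iff $\vec{z} = \vec{x} - \vec{y}$. Using the fact that $\|\vec{x} - \vec{y}\|^2 = \langle \vec{z}, \vec{z} \rangle$, the SGNS identity in (3), and the result from (7):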
Thus in an SGNS embedding space with no reconstruction error, the squared Euclidean distance between two word vectors is simply the negative csPMI shifted by a model-specific constant. This result is intuitive: the more similar two words are (as measured by csPMI), the smaller the distance between their vectors. In section 5, we show empirically that the squared Euclidean distance in an SGNS space (with reconstruction error) is a noisy approximation of the constant-shifted negative csPMI.
4.4 Interpreting Cosine Distance
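The csPMI Theorem is not needed to interpret the cosine distance between two SGNS word vectors; the proof by Levy and Goldberg (2014) is sufficient. However, since the interpretation has not been provided in past work, we provide it here. For any words $x, y$ in an SGNS embedding space with no reconstruction error, by (3), the cosine similarity is:
$$\cos(\vec{x}, \vec{y}) = \frac{\langle \vec{x}, \vec{y} \rangle}{\|\vec{x}\|\,\|\vec{y}\|} = \frac{\text{PMI}(x, y) - \log k}{\sqrt{\text{PMI}(x, x) - \log k}\,\sqrt{\text{PMI}(y, y) - \log k}} \tag{10}$$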
Note that increasing $p(x)$ while holding $p(y)$ and $p(x, y)$ constant decreases both the numerator and denominator. While these effects do not cancel out exactly, this explains why, empirically, cosine similarity has been found to be fairly robust to word frequencies (Schakel and Wilson, 2015). The cosine distance between two word vectors is therefore inversely related to the PMI of the word pair while being less sensitive to the frequencies of each word than PMI.
4.5 Are Relations Ratios?
Pennington et al. (2014) conjectured that linear relationships in the embedding space – which we call displacements – correspond to ratios of the form $p(w|x)/p(w|y)$, where $(x, y)$ is a pair of words such that $\vec{y} - \vec{x}$ is the displacement and $w$ is any other word in the vocabulary. This claim has since been repeated in other work (Arora et al., 2016). For example, according to this conjecture, the analogy (king,queen)::(man,woman) holds iff for every word $w$ in the vocabulary
$$\frac{p(w|king)}{p(w|queen)} = \frac{p(w|man)}{p(w|woman)} \tag{11}$$
However, as we noted earlier, this idea was neither derived from empirical results nor rigorous theory, and there has not been any rigorous work to suggest it would hold for models other than GloVe, which was designed around it. We now prove this conjecture for SGNS using the csPMI Theorem.
Pennington et al. Conjecture
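Let $S$ be a set of ordered word pairs $(x, y)$ with vectors in an SGNS embedding space with zero reconstruction error. A linear analogy $f$ holds over $S$ iff $\forall (x_1, y_1), (x_2, y_2) \in S$, $p(w|x_1)/p(w|y_1) = p(w|x_2)/p(w|y_2)$ for every word $w$ in the vocabulary.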
As with the corollaries, we prove this by re-framing it as an analogy. A linear analogy holds over iff for any word in the vocabulary, a linear analogy holds over , where is the relation defined by the word pair . In , is transformed into in each word pair; in , is transformed into the null word and then into the relation , which can be composed into a single linear transformation. From (4), we know that a linear analogy holds over iff for any :
Using the SGNS identity in (3), we can write this in terms of PMI and then the conditional probability:
Thus a linear analogy holds over for any word iff for every . Since a linear analogy holds over iff it holds over , the Pennington et al. Conjecture is true.
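The link between displacements and probability ratios can also be seen with a short direct calculation. The following is a sketch offered as a complementary argument under the SGNS identity in (3); it is not a reproduction of the equations in the proof above.

```latex
\begin{align*}
\langle \vec{w}, \vec{x} \rangle - \langle \vec{w}, \vec{y} \rangle
  &= (\mathrm{PMI}(w, x) - \log k) - (\mathrm{PMI}(w, y) - \log k) \\
  &= \log \frac{p(w, x)}{p(w)\,p(x)} - \log \frac{p(w, y)}{p(w)\,p(y)} \\
  &= \log \frac{p(w \mid x)}{p(w \mid y)}
\end{align*}
```

If every pair $(x, y)$ in the set shares the displacement $\vec{r} = \vec{y} - \vec{x}$, the left-hand side equals $-\langle \vec{w}, \vec{r} \rangle$ for each pair, so the ratio on the right is the same across pairs for every $w$. The converse direction additionally requires that the word vectors span the embedding space, which is an assumption of this sketch.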
5 Experiments
We uniformly sample word pairs in Wikipedia and estimate the noise (i.e., $\langle \vec{x}, \vec{y} \rangle - [\text{PMI}(x, y) - \log k]$) using SGNS vectors trained on the same corpus. As seen in Figure 2, the noise has an approximately zero-centered Gaussian distribution and the variance of the noise is lower at higher frequencies, supporting our assumptions in section 3.4. As previously mentioned, this is one reason why linear word analogies are robust to noise – the amount of noise is simply negligible at high frequencies.
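The noise estimate for a single word pair can be sketched as the gap between the observed inner product and the value that the SGNS identity would predict. The function names and the numbers below are hypothetical; in the experiment described above, the vectors and PMI values would come from the trained model and the corpus.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reconstruction_noise(x_vec, y_vec, pmi_xy, k):
    """Observed <x, y> minus the value PMI(x, y) - log k predicted by SGNS."""
    return dot(x_vec, y_vec) - (pmi_xy - math.log(k))

# Hypothetical values: the inner product of these toy vectors is 1.0.
exact = reconstruction_noise([1.0, 2.0], [0.5, 0.25], pmi_xy=1.0 + math.log(5), k=5)
noisy = reconstruction_noise([1.0, 2.0], [0.5, 0.25], pmi_xy=0.7 + math.log(5), k=5)
```

The first pair is reconstructed perfectly, so its noise is zero up to floating-point error; the second has a noise of about 0.3.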
In Table 1, we provide the mean csPMI values for various analogies in Mikolov et al. (2013a) over the set of word pairs for which they should hold (e.g., (Paris, France), (Berlin, Germany), (Ottawa, Canada) and others for capital-world). As expected, vector algebraic solutions to word analogies are much more accurate on analogies with lower csPMI variances. The lower the variance, the more likely it is that the displacement vectors of different word pairs will be the same, thus making it more likely that the analogy will hold.
Similar analogies (e.g., capital-world and capital-common-countries) also have similar mean csPMI values – our theory implies this, since similar analogies have similar displacements in the vector space. As the csPMI increases, the type of analogies gradually changes from geography (capital-world, city-in-state) to verb tense (gram5-present-participle, gram7-past-tense) to adjectives (gram4-superlative, gram1-adjective-to-adverb). We do not witness the same gradation with the mean PMI, implying that the transformation represented by an analogy does uniquely correspond to a csPMI value. However, because the “true” displacement vector for an analogy is unknown, the “true” csPMI value is also unknown; we can only estimate it as the mean csPMI.
[Table 1: columns are Analogy, Mean csPMI, Mean PMI, Median Word Pair Frequency, csPMI Variance, Accuracy]
Because the sum of two word vectors is not in the vocabulary, we cannot calculate co-occurrence statistics involving the sum, precluding us from testing Corollaries 1 and 2. We test Corollary 3 by uniformly sampling word pairs in Wikipedia and plotting, in Figure 3, the negative csPMI against the squared Euclidean distance between the SGNS word vectors. As we would expect, given Corollary 3, there is a moderately strong and positive correlation (Pearson’s $r$ = 0.437): the more similar two words are (as measured by csPMI) the smaller the Euclidean distance between their vectors.
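The correlation reported above can be computed with a few lines of code. The sketch below shows one way to do so; the helper names are hypothetical and the data are made up (and perfectly linear) purely to exercise the functions, so they say nothing about the actual experiment.

```python
import math

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: negative csPMI for four word pairs and the squared
# Euclidean distances between their vectors.
neg_cspmi = [1.0, 2.0, 3.0, 4.0]
sq_dist = [2.5, 4.5, 6.5, 8.5]
r = pearson_r(neg_cspmi, sq_dist)
```

With real embeddings, `sq_dist` would be filled by calling `squared_euclidean` on each sampled pair of word vectors, and the resulting coefficient would be well below 1 because of reconstruction error.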
Even though vector algebra is surprisingly effective at solving word analogies, the csPMI Theorem reveals two reasons for why an analogy may be unsolvable in a given embedding space: polysemy and corpus bias. Consider a polysemous word $w$ and its senses $\{w_1, \ldots, w_M\}$. Assuming zero reconstruction error, a linear analogy $f$ identified by csPMI $\gamma$ holds over $(w, x)$ iff:
Even though only one sense of $w$ may be relevant to $f$, (14) concerns all the senses of $w$, making it unlikely that $\text{csPMI}(w, x) = \gamma$.
A more prevalent problem is corpus bias. Even if (a,b)::(x,y) makes intuitive sense, there is no guarantee that csPMI$(a, b)$ will be approximately equal to csPMI$(x, y)$ for a given corpus. The less frequent a word pair is, the more pronounced this issue, since even small changes in frequency can have a large effect on the csPMI of the word pair. This is the main reason why the accuracy for the currency analogy is almost zero (see Table 1) – word pairs of currencies and their country (e.g., (Algeria, dinar)) rarely occur in Wikipedia and have a median frequency of only 19. The solution to this problem would be to use a more representative training corpus; for example, for the currency analogy, the Wall Street Journal corpus would be a better choice.
In this paper, we provided a rigorous explanation of why – and when – word analogies can be solved using vector algebra. More specifically, we proved that an analogy holds over a set of word pairs in an SGNS or GloVe embedding space with no reconstruction error iff the co-occurrence shifted PMI is the same for every word pair. Our theory had three implications. First, we provided a rigorous proof of the Pennington et al. (2014) conjecture, which had heretofore been the intuitive explanation for this phenomenon. Second, we provided novel theoretical justification for the addition of word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Third, we provided the first rigorous explanation of why the Euclidean distance between word vectors is a good proxy for word dissimilarity. Most importantly, our theory did not make the unrealistic assumptions that past theories have made about the word distribution and vector space, making it much more tenable than previous explanations.
- Arora et al. (2016) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399.
- Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.
- Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
- Church and Hanks (1990) Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
- Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.
- Ethayarajh (2018) Kawin Ethayarajh. 2018. Unsupervised random walk sentence embeddings: A strong but simple baseline. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 91–100. Association for Computational Linguistics.
- Firth (1957) John R Firth. 1957. A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis.
- Gittens et al. (2017) Alex Gittens, Dimitris Achlioptas, and Michael W Mahoney. 2017. Skip-gram – Zipf + uniform = vector additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 69–76.
- Landauer and Dumais (1997) Thomas K Landauer and Susan T Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2):211.
- Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg S Corrado, and Jeff Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Piantadosi (2014) Steven T Piantadosi. 2014. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5):1112–1130.
- Robertson (2004) Stephen Robertson. 2004. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation, 60(5):503–520.
- Rohde et al. (2006) Douglas LT Rohde, Laura M Gonnerman, and David C Plaut. 2006. An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8(627-633):116.
- Schakel and Wilson (2015) Adriaan MJ Schakel and Benjamin J Wilson. 2015. Measuring word significance using distributed representations of words. arXiv preprint arXiv:1508.02297.