Deconstructing and reconstructing word embedding algorithms
Uncontextualized word embeddings are reliable feature representations of words used to obtain high quality results for various NLP applications. Given the historical success of word embeddings in NLP, we propose a retrospective on some of the most well-known word embedding algorithms. In this work, we deconstruct Word2vec, GloVe, and others, into a common form, unveiling some of the necessary and sufficient conditions required for making performant word embeddings. We find that each algorithm: (1) fits vector-covector dot products to approximate pointwise mutual information (PMI); and, (2) modulates the loss gradient to balance weak and strong signals. We demonstrate that these two algorithmic features are sufficient conditions to construct a novel word embedding algorithm, Hilbert-MLE. We find that its embeddings obtain equivalent or better performance against other algorithms across 17 intrinsic and extrinsic datasets.
Word embeddings have been established as standard feature representations for words in most contemporary NLP tasks Kim (2014); Huang et al. (2015); Goldberg (2016). Their incorporation into larger models – from CNNs and LSTMs for sentiment analysis Zhang et al. (2018), to sequence-to-sequence models for machine translation Qi et al. (2018), to the input layer of deep contextualized embedders Peters et al. (2018) – enables high quality performance across a wide variety of problems.
Because word embeddings are the building blocks for many modern NLP applications, we argue that it is worthwhile to subject word embedding algorithms to close theoretical inspection. This work can be considered a retrospective analysis of the ground-breaking word embedding algorithms of the past, which simultaneously offers theoretical insights for how future, deeper models can be developed and understood. Indeed, analogous to a watchmaker who curiously scrutinizes the mechanical components comprising her watches’ oscillators, so too do we aim to uncover what makes word embeddings “tick”.
It is well-known that word embedding algorithms train two sets of embeddings: the vectors (“input” vectors) and the covectors (“output”, or, “context” vectors). However, the covectors tend to be regarded as an afterthought when used by NLP practitioners, either being thrown away Mikolov et al. (2013b), or averaged into the vectors Pennington et al. (2014); Levy et al. (2015).
Nonetheless, recent work has found that separately incorporating pretrained covectors into downstream models can improve performance in specific tasks. This includes lexical substitution Melamud et al. (2015); Roller and Erk (2016), information retrieval Nalisnick et al. (2016), state-of-the-art metaphor detection Mao et al. (2018) and generation Yu and Wan (2019), and more Press and Wolf (2017); Asr et al. (2018); Değirmenci et al. (2019). In this work, we contribute an engaged theoretical treatment of covectors, and later elucidate the different relationships learned separately by vectors and covectors (§6.5).
Training these vectors and covectors can be done by a variety of high-performing algorithms: the sampling-based shallow neural network of SGNS Mikolov et al. (2013b), GloVe’s weighted least squares over global corpus statistics Pennington et al. (2014), and matrix factorization methods Levy and Goldberg (2014b); Levy et al. (2015); Shazeer et al. (2016). In this work, we propose a framework for understanding these algorithms from a common vantage point. We deconstruct each algorithm into its constituent parts, and find that, despite their many different hyperparameters, the algorithms collectively intersect upon the following two key design features:
vector-covector dot products are learned to approximate pointwise mutual information (PMI) statistics in the corpus; and,
modulation of the loss gradient, directly or indirectly, to balance weak and strong signals arising from the highly imbalanced distribution of corpus statistics.
Having found these commonalities across algorithms, we ask whether these features are sufficient for reconstructing a new word embedding algorithm. Indeed, we derive and implement a novel embedding algorithm, Hilbert-MLE (the name is inspired by the intuitions behind this work concerning Hilbert spaces, and by the maximum likelihood estimate that defines the model’s loss function), by following these two principles to derive a corresponding global matrix factorization loss function based on the maximum likelihood estimate of the multinomial distribution of corpus statistics.
However, due to the infeasibility of matrix factorization objectives for large vocabulary sizes, we further derive a local sampling-based formulation by algebraically deconstructing Hilbert-MLE’s global objective function. As we abstractly depict in Figure 1, this derivation can be seen as a mirrored derivation of that which is presented by Levy and Goldberg (2014b), who derived the global matrix factorization for SGNS from the original local sampling formulation Mikolov et al. (2013b).
We find that Hilbert-MLE produces word embeddings that earn equivalent or better performance against SGNS and GloVe across 17 intrinsic and extrinsic datasets, therefore demonstrating the sufficiency of the two principles for designing a word embedding algorithm.
To summarize, this work offers the following contributions:
2 Fundamental concepts
In this section, we introduce notation and concepts that we will draw upon throughout this paper. This includes formally defining embeddings, their vectors and covectors, and pointwise mutual information (PMI).
In general topology, an embedding is understood as an injective “structure preserving map”, $f: X \to Y$, between two mathematical structures $X$ and $Y$. A word embedding algorithm ($f$) learns an inner-product space ($Y$) to preserve a linguistic structure within a reference corpus of text, $\mathcal{D}$ ($X$), based on the words in a vocabulary, $\mathcal{V}$. The structure in $\mathcal{D}$ is analyzed in terms of the relationships between words induced by their appearances in the corpus. In such an analysis, each word figures dually: (1) as a focal element inducing a local context; and (2) as elements of the local contexts induced by focal elements. To make these dual roles explicit, we distinguish two copies of the vocabulary: the focal words (or, terms), $\mathcal{V}_T$, and the context words, $\mathcal{V}_C$.
An embedding consists of two maps:
$$\mathcal{V}_C \to \mathbb{R}^{1 \times d},\quad i \mapsto \langle i|; \qquad \mathcal{V}_T \to \mathbb{R}^{d \times 1},\quad j \mapsto |j\rangle.$$
We use Dirac notation to distinguish vectors $|j\rangle$, associated to focal words, from covectors $\langle i|$, associated to context words. In matrix notation, $|j\rangle$ corresponds to a column vector and $\langle i|$ to a row vector. Their inner product is $\langle i|j\rangle$; this inner product completely characterizes the learned vector space. We will later demonstrate that many word embedding algorithms, intentionally or not, learn a vector space where the inner product between a focal word and context word aims to approximate their PMI in the reference corpus: $\langle i|j\rangle \approx \mathrm{PMI}(i,j)$.
Pointwise mutual information (PMI).
PMI is a commonly used measure of association in computational linguistics, and has been shown to be consistent and reliable for many tasks Terra and Clarke (2003). It measures the deviation of the cooccurrence probability between two words $i$ and $j$ from the product of their marginal probabilities:
$$\mathrm{PMI}(i,j) := \ln \frac{p_{ij}}{p_i \, p_j} \qquad (1)$$
where $p_{ij}$ is the probability of word $i$ and word $j$ cooccurring (for some notion of cooccurrence), and where $p_i$ and $p_j$ are marginal probabilities of words $i$ and $j$ occurring. The empirical PMI can be found by replacing probabilities with corpus statistics. Words are typically considered to cooccur if they are separated by no more than $w$ words; $N_{ij}$ is the number of counted cooccurrences between a context $i$ and a term $j$; $N_i$, $N_j$, and $N$ are computed by marginalizing over the $N_{ij}$ statistics.
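As a concrete illustration of the empirical PMI described above, the following minimal sketch counts cooccurrences in a toy token list and converts them to PMI values. The function names, the toy sentence, and the symmetric window are assumptions made for illustration; they are not taken from the released implementations of any of the algorithms discussed.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window):
    """Count N_ij: context word i appears within `window` positions of focal word j."""
    counts = Counter()
    for pos, focal in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                counts[(tokens[ctx_pos], focal)] += 1
    return counts

def empirical_pmi(counts):
    """PMI(i, j) = ln(N_ij * N / (N_i * N_j)) for every observed pair."""
    n_total = sum(counts.values())
    n_ctx, n_focal = Counter(), Counter()
    for (i, j), n_ij in counts.items():
        n_ctx[i] += n_ij      # N_i: marginal count of context word i
        n_focal[j] += n_ij    # N_j: marginal count of focal word j
    return {
        (i, j): math.log(n_ij * n_total / (n_ctx[i] * n_focal[j]))
        for (i, j), n_ij in counts.items()
    }

# Toy corpus (hypothetical); a real corpus would contain billions of tokens.
tokens = "the cat sat on the mat the cat ate".split()
counts = cooccurrence_counts(tokens, window=2)
pmi = empirical_pmi(counts)
```

Only observed pairs receive a value here; pairs with zero cooccurrences have an empirical PMI of negative infinity, which is one reason the algorithms discussed later treat unobserved pairs in different ways.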
3 Word embedding algorithms
We will now introduce the low rank embedder framework for deconstructing word embedding algorithms, inspired by the theory of generalized low rank models Udell et al. (2016). We unify several word embedding algorithms by observing them all from the common vantage point of their global loss function. Note that this framework is only used for theoretical analysis, not practical implementation.
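The global loss function for a low rank embedder takes the form:
$$\mathcal{L} = \sum_{(i,j) \in \mathcal{V}_C \times \mathcal{V}_T} f_{ij}\big(\psi(\langle i|, |j\rangle),\ \phi(i,j)\big) \qquad (2)$$
where $\psi$ is a kernel function, and $\phi$ is some scalar function (such as a measure of association based on how $i$ and $j$ appear in the corpus); $\psi_{ij}$ and $\phi_{ij}$ are abbreviations for the same.
The design variable $\phi_{ij}$ is some function of corpus statistics, and its purpose is to quantitatively measure some relationship between word $i$ and $j$. In apposition, the design variable $\psi_{ij}$ is a function of model parameters, and its purpose is to learn a succinct approximation:
$$\psi_{ij} \approx \phi_{ij}, \qquad \text{with} \quad \left.\frac{\partial f_{ij}}{\partial \psi_{ij}}\right|_{\psi_{ij} = \phi_{ij}} = 0 \qquad (3)$$
and so represent the relationship measured by $\phi$. Intuitively, we can think of a low rank embedder as trying to directly fit a kernel function of model parameters to some (statistical) relationship between words of our choosing. For example, SGNS takes $\phi_{ij} = \mathrm{PMI}(i,j) - \ln k$ (where $k$ is the number of negative samples) and $\psi_{ij} = \langle i|j\rangle$, and then learns parameter values that approximate $\langle i|j\rangle \approx \mathrm{PMI}(i,j) - \ln k$.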
Though the specific choice of $\phi$ varies slightly, existing low rank embedders generally base $\phi$ on cooccurrence of words within a linear window $w$ words wide. But it is worth pointing out that $\phi$ can in principle be any pairwise relationship encoded as a scalar function of corpus statistics.
As for the kernel function $\psi$, one simple choice is to take $\psi_{ij} = \langle i|j\rangle$. But the framework allows any function that is symmetric and positive definite. This allows the framework to include the use of bias parameters (e.g. in GloVe) and subword parameterization (e.g. in FastText).
To understand the range of models encompassed, it is helpful to see how the framework relates (but is not limited) to matrix factorization. We can think of $\phi$ and $\psi$ as providing the entries of two matrices:
$$M := [\phi_{ij}], \qquad \hat{M} := [\psi_{ij}].$$
For models that take $\psi_{ij} = \langle i|j\rangle$, we can write $\hat{M} = WV$, where $W$ is defined as having row $i$ equal to $\langle i|$, and $V$ as having column $j$ equal to $|j\rangle$. Then, the loss function can be rewritten as:
$$\mathcal{L} = \sum_{ij} f_{ij}\big((WV)_{ij},\ M_{ij}\big)$$
This loss function can be interpreted as matrix reconstruction error, because the constraint in Eq. 3 means that the gradient goes to zero as $WV \to M$.
Selecting a particular low rank embedder instance requires key design choices to be made: we must choose the embedding dimension $d$, the form of the loss terms $f_{ij}$, the kernel function $\psi$, and the association function $\phi$. Only the gradient of $f_{ij}$ actually affects the algorithm. The derivative of $f_{ij}$ with respect to $\psi_{ij}$, which we call the characteristic gradient, helps compare models because it exhibits the action of the gradient yet is symmetric in the parameters. Thus, we address a specific embedder by the tuple of these design choices.
3.1 SGNS as a low rank embedder
Levy and Goldberg (2014b) provided the important result that skip gram with negative sampling (SGNS) Mikolov et al. (2013b) was implicitly factorizing the shifted PMI matrix, $\mathrm{PMI} - \ln k$. However, Levy and Goldberg did not derive the loss function needed to explicitly pose SGNS as matrix factorization, and required additional assumptions for their derivation to hold. Moreover, empirically, they used SVD on the related (but different) positive-PMI matrix, and this did not reproduce the results of SGNS. In other work, Li et al. (2015) provided an explicit MF formulation of SGNS from a “representation learning” perspective. This derivation diverges from Levy and Goldberg’s result, and masks the connection between SGNS and other low rank embedders. In this work, we derive the complete global loss function for SGNS, free of additional assumptions.
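The loss function of SGNS is as follows:
$$\mathcal{L} = -\sum_{(i,j) \in \mathcal{D}} \Big[ \ln \sigma(\langle i|j\rangle) + \sum_{\ell=1}^{k} \mathbb{E}_{i'_\ell}\big[ \ln\big(1 - \sigma(\langle i'_\ell|j\rangle)\big) \big] \Big]$$
where $\sigma$ is the logistic sigmoid function, $\mathcal{D}$ is a list containing each cooccurrence of a context-word $i$ with a focal-word $j$ in the corpus, and the expectation is taken by drawing $i'_\ell$ from the (smoothed) unigram distribution to generate $k$ “negative samples” for a given focal-word Mikolov et al. (2013b).
We rewrite this by counting the number of times each pair occurs in the corpus, $N_{ij}$, and the number of times each pair is drawn as a negative sample, $N^-_{ij}$, while indexing the sum over the set $\mathcal{V}_C \times \mathcal{V}_T$ of all context-focal word pairs:
$$\mathcal{L} = -\sum_{(i,j) \in \mathcal{V}_C \times \mathcal{V}_T} \Big[ N_{ij} \ln \sigma(\langle i|j\rangle) + N^-_{ij} \ln\big(1 - \sigma(\langle i|j\rangle)\big) \Big]$$
We can now observe that the global loss is almost in the required form for a low rank embedder (Eq. 2), and that the appropriate setting for the model approximation function is $\psi_{ij} = \langle i|j\rangle$. The characteristic gradient is derived as, using the identity $a \equiv (a+b)\,\sigma\big(\ln \tfrac{a}{b}\big)$:
$$\frac{\partial f_{ij}}{\partial \langle i|j\rangle} = N^-_{ij}\,\sigma(\langle i|j\rangle) - N_{ij}\big(1 - \sigma(\langle i|j\rangle)\big) = \big(N_{ij} + N^-_{ij}\big)\Big[\sigma(\langle i|j\rangle) - \sigma\Big(\ln \frac{N_{ij}}{N^-_{ij}}\Big)\Big]$$
This provides that the association function for SGNS is $\phi_{ij} = \ln (N_{ij} / N^-_{ij})$, since the derivative will be equal to zero at that point (Eq. 3). However, recall that negative samples are drawn according to the unigram distribution (or a smoothed variant Levy et al. (2015)). This means that $N^-_{ij} = k N_i N_j / N$. Therefore, in agreement with Levy and Goldberg (2014b), we find that:
$$\phi_{ij} = \ln \frac{N_{ij}\, N}{k\, N_i N_j} = \mathrm{PMI}(i,j) - \ln k.$$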
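The SGNS argument above can be checked numerically for a single pair. In the sketch below, `n_pos` stands for the corpus count $N_{ij}$ and `n_neg` for the negative-sample count $N^-_{ij}$ of that pair; the counts themselves are hypothetical, and negative samples are assumed to follow the unsmoothed unigram distribution. The code compares the closed-form characteristic gradient with the loss term it is derived from and confirms that the optimum sits at PMI minus $\ln k$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_pair_loss(x, n_pos, n_neg):
    """Global SGNS loss term for one pair; x is the vector-covector dot product."""
    return -(n_pos * math.log(sigmoid(x)) + n_neg * math.log(1.0 - sigmoid(x)))

def sgns_characteristic_gradient(x, n_pos, n_neg):
    """Closed form: (N_ij + N^-_ij) * (sigmoid(x) - sigmoid(ln(N_ij / N^-_ij)))."""
    return (n_pos + n_neg) * (sigmoid(x) - sigmoid(math.log(n_pos / n_neg)))

# Hypothetical corpus statistics for one (context, focal) pair.
N, N_i, N_j, N_ij, k = 1_000_000, 2_000, 500, 40, 5

# Expected number of times the pair is drawn as a negative sample
# when k negatives per positive are drawn from the unigram distribution.
N_neg = k * N_i * N_j / N

pmi = math.log(N_ij * N / (N_i * N_j))
optimum = math.log(N_ij / N_neg)  # the dot product at which the gradient vanishes

print(optimum, pmi - math.log(k))  # the two values agree up to rounding
```

With unigram smoothing or subsampling, `N_neg` would no longer be exactly proportional to $N_i N_j$, so the optimum would be a distorted rather than an exact shifted PMI.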
3.2 GloVe as a low rank embedder
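GloVe (global vectors) was proposed as a method that strikes a halfway point between local sampling and global matrix factorization, taking the best parts from both solution methods Pennington et al. (2014). Its efficiency came from the fact that it only performs a partial factorization of the matrix, only considering samples where $N_{ij} > 0$. We will demonstrate that GloVe is not so different from SGNS, and that it too implicitly factorizes the PMI matrix.
GloVe’s loss function is defined as follows:
$$\mathcal{L} = \sum_{(i,j)} h(N_{ij}) \big( \langle i|j\rangle + b_i + b_j - \ln N_{ij} \big)^2$$
where $b_i$ and $b_j$ are learned bias parameters; $N_{\max}$ and $\alpha$ are empirically tuned hyperparameters for the weighting function $h(N_{ij}) = \min\big(1, (N_{ij}/N_{\max})^{\alpha}\big)$, which has $h(N_{ij}) = 0$ when $N_{ij} = 0$.
GloVe can be cast as a low rank embedder by using the model approximation function as a kernel with bias parameters, and setting the association measure to simply be the objective of the loss function:
$$\psi_{ij} = \langle i|j\rangle + b_i + b_j, \qquad \phi_{ij} = \ln N_{ij}.$$
Let us observe the optimal solution to the loss function, when $\partial f_{ij} / \partial \psi_{ij} = 0$:
$$\langle i|j\rangle + b_i + b_j = \ln N_{ij}$$
Multiplying the log operand by $1 = \frac{N_i N_j N}{N_i N_j N}$:
$$\langle i|j\rangle + b_i + b_j = \mathrm{PMI}(i,j) + \ln \frac{N_i}{\sqrt{N}} + \ln \frac{N_j}{\sqrt{N}} \qquad (7)$$
On the right side, we have two terms that depend respectively only on $i$ and $j$, which are candidates for the bias terms. Based on this equation alone, we cannot draw any conclusions. However, empirically the bias terms are in fact very near $\ln (N_i / \sqrt{N})$ and $\ln (N_j / \sqrt{N})$, and PMI is nearly centered at zero, as can be seen in Fig. 2. This means that Eq. 7 provides $\langle i|j\rangle \approx \mathrm{PMI}(i,j)$.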
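The decomposition of GloVe’s optimum can be verified with a few lines of arithmetic. The sketch below assumes the weighting function from the GloVe paper, $\min(1, (N_{ij}/N_{\max})^{\alpha})$, with that paper’s reported defaults of 100 and 0.75; the counts and the function names are hypothetical. It shows that a dot product equal to PMI, combined with biases equal to the log unigram terms, drives the per-pair loss to zero.

```python
import math

def glove_weight(n_ij, n_max=100.0, alpha=0.75):
    """Weighting function h: sublinear in N_ij, capped at 1, and zero when N_ij = 0."""
    return min(1.0, (n_ij / n_max) ** alpha)

def glove_pair_loss(dot, b_i, b_j, n_ij, n_max=100.0, alpha=0.75):
    """Weighted squared error between the biased dot product and ln N_ij."""
    if n_ij == 0:
        return 0.0  # unobserved pairs carry zero weight and are skipped
    return glove_weight(n_ij, n_max, alpha) * (dot + b_i + b_j - math.log(n_ij)) ** 2

# Hypothetical corpus statistics for one (context, focal) pair.
N, N_i, N_j, N_ij = 1_000_000, 2_000, 500, 40

pmi = math.log(N_ij * N / (N_i * N_j))
bias_i = math.log(N_i / math.sqrt(N))  # term depending only on the context word
bias_j = math.log(N_j / math.sqrt(N))  # term depending only on the focal word

# At this setting the biased dot product equals ln N_ij, so the loss vanishes.
print(glove_pair_loss(pmi, bias_i, bias_j, N_ij))
```

This only demonstrates that the PMI solution is one optimum; as the text notes, the equation alone does not force the learned biases to take these values, which is why the empirical observation matters.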
Analyzing the optimum of GloVe’s loss function yields important insights. First, GloVe can be added to the list of low rank embedders that learn a bilinear parameterization of PMI. Second, we can see why such a parameterization is advantageous. Generally, it helps to standardize features of low rank models Udell et al. (2016), and this is essentially what transforming cooccurrence counts into PMI achieves. Thus, PMI can be viewed as a parameterization trick, providing an approximately normal target association to be modelled.
3.3 Other algorithms as low rank embedders
We now present additional algorithms that can be cast as low rank embedders: LDS Arora et al. (2016) and FastText Joulin et al. (2017). The derivations for SVD Levy and Goldberg (2014b); Levy et al. (2015) and Swivel Shazeer et al. (2016) as low rank embedders are trivial, as both are already posed as matrix factorizations of PMI statistics.
Arora et al. (2016) introduced an embedding perspective based on generative modelling with random walks through a latent discourse space (LDS). While their only experiments were on analogy completion tasks (which do not correlate well with downstream performance Linzen (2016); Faruqui et al. (2016); Rogers et al. (2017)), LDS soon afterwards provided a theoretical basis for the surprisingly well-performing SIF document embedding algorithm Arora et al. (2017). We now demonstrate that LDS is also a low rank embedder.
The low rank learning objective for LDS follows directly from Corollary 2.3, in Arora et al. (2016):
can be found by straightforward differentiation of LDS’s loss function:
where is as defined by GloVe. The quadratic term is a valid kernel function because:
Proposed by Joulin et al. (2017), FastText’s motivation is orthogonal to the present work. Its purpose is to provide subword-based representation of words to improve vocabulary coverage and generalizability of word embeddings. Nonetheless, it can also be understood as a low rank embedder.
In FastText, the vector for each word is taken as the sum of embeddings for its character $n$-grams. Then the vector is given by the feature function $|j\rangle = \sum_{g \in G_j} |z_g\rangle$, where $|z_g\rangle$ is the vector for $n$-gram $g$, and $G_j$ is the set of $n$-grams in word $j$. Meanwhile covectors are accorded to words directly, rather than using $n$-gram covector embeddings. This provides $\psi_{ij} = \langle i| \sum_{g \in G_j} |z_g\rangle$, and, by virtue of using the SGNS loss function, $\phi_{ij} = \mathrm{PMI}(i,j) - \ln k$.
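The subword kernel described above can be sketched as follows. The boundary markers, the default n-gram range of 3 to 6, the randomly initialized n-gram vectors, and all function names are assumptions for illustration rather than a description of the FastText codebase. The sketch builds a word vector as the sum of its character n-gram vectors and takes its dot product with a word-level covector.

```python
import random

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word padded with boundary markers (assumed convention)."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams

def word_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Vector for a word: the sum of the vectors of the n-grams it contains."""
    vec = [0.0] * dim
    for gram in sorted(set(char_ngrams(word, n_min, n_max))):
        z = ngram_vectors.get(gram)
        if z is not None:  # n-grams without a learned vector contribute nothing
            vec = [a + b for a, b in zip(vec, z)]
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical parameters: random n-gram vectors and one word-level covector.
rng = random.Random(0)
dim = 4
ngram_vectors = {g: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
                 for g in char_ngrams("cats")}
covector = [rng.uniform(-0.1, 0.1) for _ in range(dim)]

# Kernel value for the pair (context word, focal word "cats").
psi = dot(covector, word_vector("cats", ngram_vectors, dim))
```

Because "cat" shares several n-grams with "cats", `word_vector("cat", ngram_vectors, dim)` is non-zero even though "cat" was never given parameters of its own, which is the vocabulary-coverage benefit the text mentions.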
4 Deconstructing the algorithms
Table 1 presents a summary of our derivations of existing algorithms as low rank embedders.
We observe several common features between each of the algorithms. In each case, $\partial f_{ij} / \partial \psi_{ij}$ takes the form $(\text{multiplier}) \cdot (\text{difference})$. The multiplier is always a “tempered” version of $N_{ij}$ (or $N_i N_j$), by which we mean that it increases sublinearly with $N_{ij}$ (or $N_i N_j$). (In SGNS, $N^-_{ij} \propto N_i N_j$; $N_i$ and $N_j$ are tempered by undersampling and unigram smoothing.)
Furthermore, for each algorithm, $\phi$ is equal to PMI or a scaled log of $N_{ij}$. Yet, the choice of $\phi$ in combination with $\psi$ provides that every model is optimized when $\langle i|j\rangle$ tends toward $\mathrm{PMI}(i,j)$ (with or without a constant shift or scaling). We have already seen that the optimum for SGNS is equivalent to the shifted PMI (§3.1). For GloVe, we theoretically and empirically showed that incorporation of the bias terms captures the unigram counts needed for PMI (§3.2). We observe this property similarly with regards to LDS’s incorporation of the L2 norm into its learning objective, where we suspect that the unigram probability is implicitly captured in the norms of the respective vectors and covectors (§3.3).
Therefore, we observe that these embedders converge on two key points: (1) an optimum in which model parameters are bilinearly related to PMI, and (2) the weighting of $\partial f_{ij} / \partial \psi_{ij}$ by some tempered form of $N_{ij}$ (or $N_i N_j$). In the next section, we introduce Hilbert-MLE, which is derived based on the shared principles observed between the algorithms in Table 1.
5 Reconstructing an algorithm
If the two basic principles that we have identified are sufficient, then the simplest low rank embedder should be one that derives from them without any other assumptions.
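We begin with principle (1), which prescribes a bilinear parameterization of PMI. The definition of PMI (Eq. 1) provides a log-bilinear parameterization of cooccurrence probability, $\hat{p}_{ij}$, if we presuppose that the aim of our model is to approximate the PMI with vector-covector dot products:
$$\langle i|j\rangle \approx \ln \frac{p_{ij}}{p_i p_j} \quad \Longrightarrow \quad \hat{p}_{ij} = p_i \, p_j \, e^{\langle i|j\rangle} \qquad (8)$$
In the expression above, $\hat{p}_{ij}$ represents the model’s estimate of the cooccurrence probability, provided by the parameterization which includes the unigram probabilities $p_i$ and $p_j$.
Accordingly, given the matrix of covectors $W$ and vectors $V$, the likelihood of the observed cooccurrence statistics, $\{N_{ij}\}$, is distributed like the multinomial, $\mathrm{Multinomial}(N, \{\hat{p}_{ij}\})$:
$$P(\{N_{ij}\} \mid W, V) = \frac{N!}{\prod_{ij} N_{ij}!} \prod_{ij} \hat{p}_{ij}^{\,N_{ij}}$$
where $\hat{p}_{ij}$ depends on $W$ and $V$ (whose rows and columns are respectively $\langle i|$ and $|j\rangle$) through Eq. 8. Taking the negative log likelihood as the loss:
$$\mathcal{L} = -\sum_{ij} N_{ij} \ln \hat{p}_{ij}$$
where we have dropped constant terms that do not affect the gradient.
The unitarity axiom of probability requires that $\sum_{ij} \hat{p}_{ij} = 1$. Including this constraint with a Lagrange multiplier, $\lambda$, we obtain:
$$\mathcal{L} = -\sum_{ij} N_{ij} \ln \hat{p}_{ij} + \lambda \Big( \sum_{ij} \hat{p}_{ij} - 1 \Big)$$
At the feasible optimum, the original loss and constraint gradients should balance:
$$\frac{N_{ij}}{\hat{p}_{ij}} = \lambda \quad \Longleftrightarrow \quad N_{ij} = \lambda \, \hat{p}_{ij} \qquad (12)$$
Eq. 12 represents $|\mathcal{V}_C| \times |\mathcal{V}_T|$ equations, one for each pair $(i, j)$. Summing these equations together,
$$N = \sum_{ij} N_{ij} = \lambda \sum_{ij} \hat{p}_{ij} = \lambda$$
The constrained loss function is therefore,
$$\mathcal{L} = -\sum_{ij} N_{ij} \ln \hat{p}_{ij} + N \Big( \sum_{ij} \hat{p}_{ij} - 1 \Big)$$
Reintroducing the bilinear parameterization (Eq. 8), and dividing through by $N$ to eliminate dependence on corpus size:
$$\mathcal{L} = \sum_{ij} \Big[ p_i \, p_j \, e^{\langle i|j\rangle} - p_{ij} \, \langle i|j\rangle \Big] \qquad (15)$$
where, again, we have dropped constant terms that do not affect the gradient. Finally by differentiating we obtain the characteristic gradient:
$$\frac{\partial \mathcal{L}}{\partial \langle i|j\rangle} = p_i \, p_j \, e^{\langle i|j\rangle} - p_{ij} = p_i \, p_j \Big[ e^{\langle i|j\rangle} - e^{\mathrm{PMI}(i,j)} \Big]$$
This yields a loss gradient closely resembling other members of the low rank embedders. Empirically, its performance is on par with the other low rank embedders (see §6).
The multiplier, $p_i p_j$, determines how errors in fitting individual pairs trade off. While it appropriately favors fitting statistics with lower standard error, the signal from rarer pairs will be weak for any non-divergent learning rate because $p_i p_j$ spans orders of magnitude. This slows down training. So, we apply a gradient conditioning measure as is done for the other low rank embedders: we apply a temperature parameter, $\tau$, that reduces differences in magnitude of the multiplier:
$$\frac{\partial f_{ij}}{\partial \langle i|j\rangle} = (p_i \, p_j)^{1/\tau} \Big[ e^{\langle i|j\rangle} - e^{\mathrm{PMI}(i,j)} \Big]$$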
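To make the effect of the temperature concrete, the sketch below evaluates the per-pair Hilbert-MLE loss term and its characteristic gradient. It writes `x` for the vector-covector dot product, `p_i` and `p_j` for the unigram probabilities, `p_ij` for the cooccurrence probability, and `tau` for the temperature; the probabilities and the choice `tau=2` are hypothetical, and the tempered gradient is taken to raise the unigram-product multiplier to the power `1 / tau`, which is one way of realizing the conditioning described above.

```python
import math

def hilbert_mle_pair_loss(x, p_i, p_j, p_ij):
    """Per-pair loss term p_i * p_j * exp(x) - p_ij * x, gradient-irrelevant constants dropped."""
    return p_i * p_j * math.exp(x) - p_ij * x

def characteristic_gradient(x, p_i, p_j, p_ij, tau=1.0):
    """(p_i p_j)^(1/tau) * (exp(x) - exp(PMI)); tau = 1 recovers the untempered gradient."""
    exp_pmi = p_ij / (p_i * p_j)  # exp(PMI), well defined even when p_ij = 0
    return (p_i * p_j) ** (1.0 / tau) * (math.exp(x) - exp_pmi)

# Hypothetical statistics for one rare pair.
p_i, p_j, p_ij = 2e-3, 5e-4, 4e-5
pmi = math.log(p_ij / (p_i * p_j))

for tau in (1.0, 2.0):
    at_one = characteristic_gradient(1.0, p_i, p_j, p_ij, tau)
    at_pmi = characteristic_gradient(pmi, p_i, p_j, p_ij, tau)
    print(f"tau={tau}: gradient at x=1 is {at_one:.3e}, at x=PMI is {at_pmi:.1e}")
```

Tempering leaves the location of the optimum unchanged, since the bracketed difference still vanishes at the PMI, but it enlarges the update for this rare pair by roughly three orders of magnitude.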
5.1 Solving the objective function
The objective function presented in Equation 15 is most straightforwardly solved via dense matrix factorization. This can be done relatively efficiently by using the sharding method presented by Shazeer et al. (2016) for matrix factorization. Such a solution is acceptable given a small vocabulary size, but does not scale to large vocabularies, due to the quadratic dependency. GloVe Pennington et al. (2014) handled this problem by only training on statistics where $N_{ij} > 0$. Levy and Goldberg (2014b); Levy et al. (2015) avoided the quadratic dependency by implementing sparse SVD on the positive-PMI matrix. However, both of these solutions may be missing out on important information that can be gained by “noticing what’s missing” Shazeer et al. (2016).
Yet, SGNS Mikolov et al. (2013b) was never confronted with the vocabulary size problem due to the fact that it uses local sampling over the corpus. While this yields a linear time complexity on the corpus size, this is generally preferable to a quadratic memory complexity on the vocabulary size. Levy and Goldberg (2014b) derived the global matrix factorization formulation of SGNS by moving in the algorithmic direction of local to global. Conversely, we will now move in the direction of global to local and derive the local sampling formulation of the Hilbert-MLE loss function.
Locally sampling Hilbert-MLE.
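Note how if we differentiate the loss function of Hilbert-MLE (Equation 15) relative to an arbitrary model parameter $\theta$, we obtain a difference between two expectations:
$$\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{ij} \big( \hat{p}_{ij} - p_{ij} \big) \frac{\partial \langle i|j\rangle}{\partial \theta} = \mathbb{E}_{(i,j) \sim \hat{p}} \Big[ \frac{\partial \langle i|j\rangle}{\partial \theta} \Big] - \mathbb{E}_{(i,j) \sim p} \Big[ \frac{\partial \langle i|j\rangle}{\partial \theta} \Big]$$
In words, the derivative of the loss function is the difference between the expected value of $\partial \langle i|j\rangle / \partial \theta$ when taken under the model distribution on one hand and under the corpus distribution on the other.
This leads to a remarkably simple training algorithm. Draw a sample of word pairs from the corpus (using a local sampling approach as in SGNS), and draw a sample of pairs from the model distribution. Compute $\langle i|j\rangle$ for both samples, and take their difference. The gradient of the result estimates the gradient of $\mathcal{L}$.
In the context of autodifferentiation libraries such as PyTorch Paszke et al. (2017) it is adequate to use a simplified loss function $\mathcal{L}'$,
$$\mathcal{L}' = \mathbb{E}_{(i,j) \sim \hat{p}} \big[ \langle i|j\rangle \big] - \mathbb{E}_{(i,j) \sim p} \big[ \langle i|j\rangle \big]$$
because in the first term, the autodifferential operator will ignore the appearance of model parameters in the distribution according to which the expectation is taken, but will recognize model parameters in the expectation’s operand $\langle i|j\rangle$. Thus the autodifferentiated gradient of $\mathcal{L}'$ equals $\partial \mathcal{L} / \partial \theta$.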
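The difference-of-expectations gradient can be illustrated on a two-word toy vocabulary, where the exact gradient is cheap to compute and can be compared with a Monte Carlo estimate. Everything in the sketch is hypothetical: the embeddings, the corpus distribution, and the function names. It differentiates by hand rather than with an autodifferentiation library, and it rescales the model-side term by the model’s total mass `Z`, which equals 1 only when the model satisfies the normalization constraint; that rescaling is an assumption needed for the toy parameters, which are far from the optimum.

```python
import math
import random

covectors = [[0.5, -0.2], [-0.3, 0.4]]  # one covector per context word i
vectors = [[0.1, 0.6], [0.7, -0.5]]     # one vector per focal word j
p_corpus = [[0.4, 0.1], [0.1, 0.4]]     # corpus cooccurrence probabilities p_ij

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

p_ctx = [sum(row) for row in p_corpus]
p_focal = [sum(p_corpus[i][j] for i in range(2)) for j in range(2)]
# Model estimate: p_i * p_j * exp(dot product), as in the log-bilinear parameterization.
p_model = [[p_ctx[i] * p_focal[j] * math.exp(dot(covectors[i], vectors[j]))
            for j in range(2)] for i in range(2)]
Z = sum(sum(row) for row in p_model)

def exact_gradient(j):
    """Gradient of the loss with respect to vector j: sum_i (model - corpus) * covector i."""
    return [sum((p_model[i][j] - p_corpus[i][j]) * covectors[i][d] for i in range(2))
            for d in range(2)]

def sample_pairs(table, n, rng):
    pairs = [(i, j) for i in range(2) for j in range(2)]
    weights = [table[i][j] for i, j in pairs]
    return rng.choices(pairs, weights=weights, k=n)  # weights are normalized internally

def sampled_gradient(j, n, rng):
    """Monte Carlo estimate: model-sample average minus corpus-sample average."""
    grad = [0.0, 0.0]
    for i, jj in sample_pairs(p_model, n, rng):
        if jj == j:  # the dot product depends on vector j only when the pair contains j
            for d in range(2):
                grad[d] += Z * covectors[i][d] / n
    for i, jj in sample_pairs(p_corpus, n, rng):
        if jj == j:
            for d in range(2):
                grad[d] -= covectors[i][d] / n
    return grad

rng = random.Random(0)
print(exact_gradient(0), sampled_gradient(0, 20000, rng))
```

In an autodifferentiation setting the hand-written `exact_gradient` and the indicator bookkeeping disappear: one would average the dot products of each sample and let the library differentiate the difference.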
Like SGNS, this uses positive samples drawn from the corpus, balanced against negative samples. But unlike SGNS, which draws negative samples according to the (distorted) unigram distribution, here we draw negative samples from the model distribution $\hat{p}$. This can be done efficiently using Gibbs sampling, making this a form of contrastive divergence Hinton (2002); Carreira-Perpinan and Hinton (2005). To approximately sample a cooccurring pair from the model distribution, we start from a corpus-derived pair, and repeatedly perform Gibbs sampling steps: randomly fix either $i$ or $j$ and re-sample the other from the model distribution conditioned on the fixed variable. E.g. if we fix $i$, then we draw a new $j$ from $\hat{p}(j \mid i)$. Sampling from the conditional distribution can be done in constant time using an adaptive softmax Grave et al. (2017). In theory, the model distribution is approximated after taking many Gibbs sample steps, but consistent with Hinton’s findings for contrastive divergence Hinton (2002); Carreira-Perpinan and Hinton (2005), we find that a single Gibbs sampling step supports efficient training.
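A single Gibbs sampling step of the kind described above might look like the following sketch. It assumes the conditional distribution over focal words given a context word is proportional to the focal unigram probability times the exponentiated dot product (and symmetrically for context words), and it uses a plain normalized draw over the whole vocabulary rather than the adaptive softmax mentioned in the text. The toy embeddings, unigram probabilities, and function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def resample(fixed, candidates, unigram, rng):
    """Draw an index with probability proportional to p * exp(<fixed, candidate>)."""
    weights = [p * math.exp(dot(fixed, c)) for p, c in zip(unigram, candidates)]
    return rng.choices(range(len(candidates)), weights=weights, k=1)[0]

def gibbs_step(pair, covectors, vectors, p_ctx, p_focal, rng):
    """Randomly fix one side of the pair and re-sample the other from the model."""
    i, j = pair
    if rng.random() < 0.5:
        j = resample(covectors[i], vectors, p_focal, rng)  # fix context i, draw focal j
    else:
        i = resample(vectors[j], covectors, p_ctx, rng)    # fix focal j, draw context i
    return i, j

# Hypothetical two-word vocabulary.
covectors = [[0.5, -0.2], [-0.3, 0.4]]
vectors = [[0.1, 0.6], [0.7, -0.5]]
p_ctx = [0.5, 0.5]
p_focal = [0.5, 0.5]

rng = random.Random(0)
positive = (0, 1)  # a pair observed in the corpus
negative = gibbs_step(positive, covectors, vectors, p_ctx, p_focal, rng)
```

Starting each chain from a corpus pair and taking one step is the contrastive-divergence shortcut; running `gibbs_step` repeatedly instead would bring the samples closer to the model distribution at proportionally higher cost.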
6 Experiments
We provide a simple set of experiments comparing the two characteristic models for word embeddings with ours: SGNS and GloVe against Hilbert-MLE. Our aim in these experiments is simply to verify the sufficiency of the principles we used to derive Hilbert-MLE (§5). In other words, we are testing the following hypothesis: if the principles we have proposed are sufficient for designing a word embedding algorithm, then Hilbert-MLE should perform equivalently to or better than SGNS and GloVe, which were proposed with different motivating principles Mikolov et al. (2013b); Pennington et al. (2014).
In our experiments, we use a matrix factorization implementation of Hilbert-MLE as the characteristic form of the model. During experimentation, we found that the Gibbs sampling implementation of Hilbert-MLE (§5.1) performed equivalently, as expected. We present results on word similarity (§6.1), analogical reasoning (§6.2), text classification (§6.3), and sequence labelling (§6.4).
Our reference corpus combines Gigaword 3 Graff et al. (2007) with a Wikipedia 2018 dump, lower-cased, yielding 5.4 billion tokens. We limit and to be the 50,000 most frequent tokens in . We use a -token context window, and . We use the released implementations and hyperparameter choices of SGNS and GloVe. Our implementation of Hilbert-MLE uses PyTorch to take advantage of GPU-acceleration, automatic differentiation Paszke et al. (2017), and the Adam gradient descent optimizer Kingma and Ba (2015). Practically, Hilbert-MLE was implemented by using sharding Shazeer et al. (2016). We use a single 12-GB GPU, and load -element shards to calculate each update. Training embeddings took less than 3 hours for a 50,000 word vocabulary.
6.1 Word similarity
A word similarity task involves interpreting the cosine similarity between embeddings as a measure of similarity or relatedness between words. Performance is computed with the Spearman rank correlation coefficient between the model’s scoring of all pairs of words versus the gold standard human scoring. These tasks can reflect the degree of linear structure captured in the embeddings, which can provide useful insights into differences between models. However, they do not always correlate with performance in downstream tasks Chiu et al. (2016); Faruqui et al. (2016).
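The evaluation protocol just described can be sketched in a few lines. The embeddings and gold scores below are hypothetical toy values, and the Spearman coefficient is implemented directly (Pearson correlation of average ranks) so that the example has no dependencies; pairs containing an out-of-vocabulary word are skipped, so that scores are computed only on covered pairs.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation between the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical embeddings and human similarity judgments.
embeddings = {
    "cat": [1.0, 0.1], "kitten": [0.9, 0.2], "dog": [0.6, 0.6], "car": [-0.2, 1.0],
}
gold = [("cat", "kitten", 9.0), ("cat", "dog", 7.0), ("cat", "car", 1.5),
        ("dog", "zebra", 3.0)]  # "zebra" is out of vocabulary

covered = [(a, b, s) for a, b, s in gold if a in embeddings and b in embeddings]
model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in covered]
rho = spearman(model_scores, [s for _, _, s in covered])
coverage = len(covered) / len(gold)
```

Because only the ranking matters, any monotone rescaling of the cosine scores leaves `rho` unchanged, which is why the metric is insensitive to the overall scale of the embeddings.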
We used the following word similarity datasets: Simlex999 (S999) Hill et al. (2015); Wordsim353 Finkelstein et al. (2002) divided into similarity (WS-S) and relatedness (WS-R) Agirre et al. (2009); the SemEval 2017 task (SE17) Camacho-Collados et al. (2017); Radinsky Mechanical Turk (RMT) Radinsky et al. (2011); Baker Verbs 143 (B143) Baker et al. (2014); Yang Powers Verbs 130 (Y130) Yang and Powers (2006); MEN divided into a 2000-sample development set (MENd) and 1000-sample test set (MENt) Bruni et al. (2012); Rare Words (RARE) Luong et al. (2013). We had an average of 96% coverage over all word-pairs in each dataset, excepting RARE; we had 31% coverage over RARE, yielding 620 word pairs (i.e., more samples than SE17, RMT, WS-S/-R, and Y130). Results are computed on these covered word-pairs.
Word similarity results.
Table 2 presents results across the 10 word similarity tasks. We observe that Hilbert-MLE obtains the best performance in 5 out of 10 tasks. In particular, Hilbert-MLE obtains substantially better scores on S999 than the other models, earning a Spearman correlation coefficient of 0.462, an relative improvement over the next best (SGNS). Note that S999 has been shown to have a high correlation with performance in extrinsic tasks such as Named Entity Recognition and NP-chunking, unlike the other word similarity datasets Chiu et al. (2016). On the tasks with worse performance, we observe that the differences between the three algorithms are relatively marginal. Based on these experiments, and the ones that follow, we can therefore conclude that our hypothesis (§6) is valid.
6.2 Analogical reasoning
We performed intrinsic evaluation of our embeddings using standard analogy tasks (e.g., “man” is to “woman” as “king” is to ___). We evaluated on the Google Analogy dataset (Google) Mikolov et al. (2013a) and the Balanced Analogy Test Set (BATS) Gladkova et al. (2016). We observed and coverage of the words in each dataset, respectively. Preliminary experiments using 3CosAdd and 3CosMul Levy and Goldberg (2014a) as selection rules showed 3CosMul was always superior, consistent with the findings of Levy and Goldberg.
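For readers unfamiliar with the two selection rules, the sketch below implements both as defined by Levy and Goldberg (2014a) for an analogy "a is to a* as b is to ?": 3CosAdd sums and subtracts cosine similarities, while 3CosMul multiplies and divides them after shifting each cosine into the range 0 to 1. The toy embeddings and function names are hypothetical, and the three query words are excluded from the candidates, as is common practice.

```python
import math

def cosine(u, v):
    num = sum(x * y for x, y in zip(u, v))
    return num / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def solve_analogy(a, a_star, b, embeddings, score):
    """Return the candidate word maximizing score(vec, vec_a, vec_a_star, vec_b)."""
    best_word, best_score = None, -math.inf
    for word, vec in embeddings.items():
        if word in (a, a_star, b):
            continue  # the query words themselves are not valid answers
        s = score(vec, embeddings[a], embeddings[a_star], embeddings[b])
        if s > best_score:
            best_word, best_score = word, s
    return best_word

def three_cos_add(vec, va, va_star, vb):
    return cosine(vec, vb) - cosine(vec, va) + cosine(vec, va_star)

def three_cos_mul(vec, va, va_star, vb, eps=1e-3):
    shifted = lambda u, v: (cosine(u, v) + 1.0) / 2.0  # map cosine into [0, 1]
    return shifted(vec, vb) * shifted(vec, va_star) / (shifted(vec, va) + eps)

# Hypothetical 3-dimensional embeddings.
embeddings = {
    "man": [1.0, 0.0, 0.1], "woman": [0.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0], "queen": [0.0, 1.0, 1.0],
    "apple": [-1.0, -1.0, 0.2],
}

answer_add = solve_analogy("man", "woman", "king", embeddings, three_cos_add)
answer_mul = solve_analogy("man", "woman", "king", embeddings, three_cos_mul)
```

The multiplicative form keeps one large similarity term from dominating the others, which is the motivation Levy and Goldberg give for preferring it.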
Table 2 presents results on the two analogy datasets in the final column. Hilbert-MLE performs somewhat worse than the other models on the Google Analogy dataset. However, there has been a considerable amount of work finding that performance on these tasks does not necessarily provide a reliable judgment for embedding quality (Faruqui et al., 2016; Linzen, 2016; Rogers et al., 2017). Indeed, we can see that performance on the Google Analogy dataset does not correspond with performance on the other larger analogy dataset (BATS), where Hilbert-MLE gets the best performance.
|Model||IMDB||AGNews|
|SGNS||.910 ± .001||.812 ± .003|
|GloVe||.905 ± .001||.807 ± .003|
|Hilbert-MLE||.911 ± .002||.812 ± .003|
6.3 Text classification
We performed extrinsic evaluation for classification tasks on two benchmark NLP classification datasets. First, the IMDB movie reviews dataset for sentiment analysis Maas et al. (2011), divided into train and test sets of 25,000 samples each. Second, the AGNews news classification dataset, as divided into 8 approximately 12,000-sample classes (such as Sports, Health, and Business) by Kenyon-Dean et al. (2019); here, we separate 30% of the samples as the final test set. On each, we separate 10% of the training set for validation tuning.
We use a standard BiLSTM-max sequence encoder for these tasks Conneau et al. (2017). This model produces a sequence representation by max-pooling over the forward and backward hidden states produced by a bidirectional LSTM. This representation is then passed through a MLP before final prediction. We found that validation performance was optimized with a 1-layer -d BiLSTM, followed by a -d MLP using a ReLU activation, a minibatch size of 64, dropout rate of , and normalizing the embeddings before input. We use a learning rate of with Adam, and divide the learning rate by a factor of if validation performance does not improve in 3 epochs, similar to Conneau et al. (2017); we schedule the learning rate in the same way for sequence labelling (§6.4).
In Table 3 we present the test set results from our classification experiments. We trained each BiLSTM-max 10 times with different random seeds for weight initialization and present the mean test accuracy plus/minus the standard deviation. These results show that each embedding model is similar, although GloVe is slightly worse than the others. Meanwhile, SGNS and Hilbert-MLE perform approximately the same, obtaining high quality results on both tasks.
6.4 Sequence labelling
Our final extrinsic evaluations are sequence labelling tasks on three datasets. The first task is supersense tagging (SST) Ciaramita and Altun (2006) on the Semcor 3.0 corpus Miller et al. (1993). SST is a coarse-grained semantic sequence labelling problem with 83 unique labels; we report results using the micro-F-score without the O-tag score due to the skew of label distribution, as is standard (Alonso and Plank, 2017; Changpinyo et al., 2018). We divide Semcor into a 70-30% train-test split, and use 10% of the training set for validation tuning. The second task is syntactic part-of-speech tagging (POS); we use the Penn TreeBank Wall Street Journal corpus (WSJ) Marcus et al. (1993) and the Brown corpus (as distributed by NLTK; see https://www.nltk.org/book/ch02.html). For the WSJ, we use the given 44-tag tagset, and for Brown we map the original tags to the 12-tag “universal tagset” Petrov et al. (2012). We use sections 22, 23, and 24 of the WSJ corpus as its test set, and separate out 30% of the sentences in Brown as its test set.
On each dataset, we train a standard sequence labelling model inspired by Huang et al. (2015): a 2-layer, -d bidirectional LSTM, using a minibatch size of 16, and a dropout rate of 0.5. Interestingly, we found that normalizing the embeddings substantially reduced validation performance, so we keep them in their original form.
Sequence labelling results.
To accompany our results in Table 4, we include results from a trivial most-frequent-tag baseline. This baseline returns that the tag of a token is the most frequently occurring tag for that token within the training set. In SST it is standard to include results from a most-frequent-supersense baseline, being inspired from the tradition of word sense disambiguation, which uses the most-frequent-sense baseline.
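The most-frequent-tag baseline is simple enough to state in code. The sketch below is a minimal version with hypothetical function names and toy data; the handling of tokens never seen in training (falling back to the overall most frequent tag) is an assumption, since the text does not specify it.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Map each training token to its most frequent tag; also return a fallback tag."""
    per_token = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for token, tag in sentence:
            per_token[token][tag] += 1
            overall[tag] += 1
    lookup = {token: tags.most_common(1)[0][0] for token, tags in per_token.items()}
    fallback = overall.most_common(1)[0][0]  # assumed policy for unseen tokens
    return lookup, fallback

def predict(tokens, lookup, fallback):
    return [lookup.get(token, fallback) for token in tokens]

# Toy training data (hypothetical).
train = [
    [("the", "DET"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN")],
    [("a", "DET"), ("run", "NOUN")],
]
lookup, fallback = train_most_frequent_tag(train)
predicted = predict(["the", "run", "zzz"], lookup, fallback)
```

Since the baseline ignores context entirely, the gap between it and the BiLSTM results indicates how much the sequence model and the embeddings contribute beyond token-level tag frequencies.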
The results for the BiLSTMs are the mean test set scores across 10 runs with different random seeds for weight initialization. The standard deviations were low and approximately the same for each model. As in the classification tasks, we find that the embeddings produced by each model obtain very similar results. Nonetheless, we observe that Hilbert-MLE offers marginal improvements over the others. Note that our performance on WSJ and Brown is expected since we use vanilla BiLSTMs that do not include any hand engineered character- or context-based features. Indeed, Huang et al. (2015) report results of 96.04% on the WSJ with their vanilla BiLSTM, which suggests that our embeddings possess strong syntactic properties.
6.5 Qualitative analysis
|Query vector|Argmax over|Top most similar embeddings|
|cat|vectors|kittens, cats, kitten, poodle|
|cat|covectors|burglar, siamese, schrödinger|
|money|vectors|funds, monies, billions, cash|
|money|covectors|laundering, launder, extort|
|cuba|vectors|cubans, cuban, anti-castro|
|cuba|covectors|gooding, guantanamo, havana|
We provide a final set of qualitative results in Table 5, using the vectors and covectors trained by the Hilbert-MLE model from our experiments. These results elucidate the difference between vector-vector similarity and vector-covector dot product similarity (results are practically the same when using cosine similarity). Vector-vector similarity is a well-known way to measure the semantic similarity between two concepts captured in word embeddings. As expected, the nearest vectors recover semantically similar words: “cat” is similar to “kitten”, “money” to “funds”, and so on.
However, when we instead obtain the most similar covector to the corresponding vector, the results are dramatically different. We see that the vector for “cat” is most similar to covectors for words with which it forms multi-word expressions: “cat burglar”, “siamese cat”, “schrödinger’s cat”. We see that “cuba” is most similar to the covector for “gooding”; this is because Cuba Gooding Jr. is a famous American actor whose Wikipedia page appears in our corpus. Indeed, the vector-covector recoveries are directly correlated with the PMIs between the terms in the corpus.
Overall, we see that vector-vector dot products recover semantic similarity, while vector-covector dot products recover co-occurrence similarity. Melamud et al. (2015) and Asr and Jones (2017) discuss these different statistical recoveries as paradigmatic (target-to-target) and syntagmatic (target-to-context) recoveries, respectively. However, to our knowledge, previous work has not explicitly explained the reason for these two different types of recoveries, namely that the learning objective for word embeddings is to approximate PMI. Therefore, these results qualitatively demonstrate exactly what our hypothesis anticipates: the dot product between trained vectors and covectors approximates the PMI between their corresponding words in the original corpus.
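To make the link to PMI concrete, the sketch below computes PMI from raw co-occurrence counts using the standard definition, PMI(i, j) = log(N_ij · N / (N_i · N_j)). The function name `pmi` and the toy counts are invented for illustration and are not statistics from our corpus.

```python
import math


def pmi(pair_counts, i, j):
    """Pointwise mutual information of term i with context j from pair counts."""
    total = sum(pair_counts.values())
    n_i = sum(c for (a, _), c in pair_counts.items() if a == i)
    n_j = sum(c for (_, b), c in pair_counts.items() if b == j)
    n_ij = pair_counts.get((i, j), 0)
    if n_ij == 0:
        return float("-inf")  # the pair was never observed
    return math.log(n_ij * total / (n_i * n_j))


# Toy (term, context) co-occurrence counts, invented for illustration.
counts = {
    ("cat", "burglar"): 4,
    ("cat", "the"): 6,
    ("dog", "burglar"): 1,
    ("dog", "the"): 9,
}
pmi_burglar = pmi(counts, "cat", "burglar")  # log(1.6), positive association
pmi_the = pmi(counts, "cat", "the")          # log(0.8), negative association
```

In this toy example, “cat” co-occurs with “the” more often in absolute terms, yet its PMI with “burglar” is higher because “burglar” is rare elsewhere; this is the kind of association that a multi-word expression produces.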
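The two kinds of query can be sketched with a single nearest-neighbour routine that ranks either the vector table or the covector table against a query vector. The two-dimensional embeddings below are hand-made toy values chosen only to illustrate the mechanics; they are not trained Hilbert-MLE embeddings, and the helper names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query, table, k=3, exclude=None):
    """Return the k words in `table` whose embeddings score highest against `query`."""
    scores = {word: dot(query, emb) for word, emb in table.items() if word != exclude}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hand-made toy embeddings (assumption: illustrative values only).
vectors = {
    "cat": [1.0, 0.2],
    "kitten": [0.9, 0.3],
    "burglar": [-0.2, 1.0],
    "funds": [-1.0, -0.5],
}
covectors = {
    "cat": [0.1, 0.3],
    "kitten": [0.0, 0.2],
    "burglar": [0.8, 0.9],
    "funds": [-0.5, 0.1],
}

# Paradigmatic query: rank the other vectors against the vector for "cat".
paradigmatic = top_k(vectors["cat"], vectors, k=1, exclude="cat")
# Syntagmatic query: rank the covectors against the vector for "cat".
syntagmatic = top_k(vectors["cat"], covectors, k=1)
```

The only difference between the two queries is which table is searched, which is why retaining the covectors after training makes both kinds of recovery available.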
In the past, probabilistic distributional semantic models for creating word embeddings surpassed the traditional count-based models Turney and Pantel (2010) that preceded them, as was well established by Baroni et al. (2014). At the same time, models like Word2vec (SGNS), GloVe, and SVD of PPMI Mikolov et al. (2013a, b); Pennington et al. (2014); Levy et al. (2015) offered strong improvements (in terms of performance and efficiency) over other probabilistically-motivated embedding models Collobert and Weston (2008); Mnih and Hinton (2009); Turian et al. (2010); Mnih and Kavukcuoglu (2013).
Today, NLP seems to be orienting toward deep contextualized models Peters et al. (2018); Devlin et al. (2018). Nonetheless, pretrained word embeddings are still highly relevant. Indeed, they have recently been used to help solve problems in materials science Tshitoyan et al. (2019), biomedical text mining Zhang et al. (2019), and law Chalkidis and Kampas (2019). Moreover, word embeddings are used in dynamic meta-embeddings to obtain state-of-the-art results Kiela et al. (2018), are used as inputs to ELMo Peters et al. (2018), and are crucial in memory-constrained NLP contexts (such as in mobile devices, which cannot store large deep neural networks Shu and Nakayama (2017)).
We believe a robust understanding of the “shallow” (or, non-deep), uncontextualized embedding models presented in this work is a prerequisite for informed development of deeper models. In this work, we advanced the theoretical understanding of word embeddings by proposing the low rank embedder framework. Cast under this framework, the similarities between many existing algorithms become apparent. After isolating two key principles shared by the low rank embedders (a probabilistically informed bilinear parameterization of PMI and a “tempered” gradient conditioning measure), we demonstrate that these ideas are sufficient to derive Hilbert-MLE, a model that is simpler yet performs as well as or better than the other models. This provides a parsimonious explanation for the success of the low rank embedders, demonstrating that many idiosyncratic features of other embedders are unnecessary.
The design parameters of our framework have not yet been fully explored. Our framework could be used to model more complex linguistic phenomena, by following the blueprint provided by our derivation of Hilbert-MLE. Moreover, based on our findings concerning the importance of covectors in parameterizing the model’s approximation of PMI, we believe that methods such as retrofitting Faruqui et al. (2015), subword-based embedding decomposition Stratos (2017), and dynamic meta-embedding Kiela et al. (2018) could benefit by incorporating covectors into their modelling designs. Likewise, just as the theoretical basis of LDS Arora et al. (2016) grounded the widely-used SIF document embedding algorithm Arora et al. (2017), we believe that the theoretical basis provided in this work can inform future development of document embedding techniques.
References
- A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of the 2009 Annual Conference of NAACL-HLT, pp. 19–27.
- When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Vol. 1, pp. 44–53.
- A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399.
- A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations.
- An artificial language evaluation of distributional semantic models. CoNLL 2017, pp. 134.
- Querying word embeddings for similarity and relatedness. In Proceedings of the 2018 Conference of NAACL-HLT, Volume 1 (Long Papers), pp. 675–684.
- An unsupervised model for instance level subcategorization acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 278–289.
- Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 238–247.
- Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 136–145.
- SemEval-2017 task 2: multilingual and cross-lingual semantic word similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 15–26.
- On contrastive divergence learning. In AISTATS, Vol. 10, pp. 33–40.
- Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27 (2), pp. 171–198.
- Multi-task learning for sequence tagging: an empirical study. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2965–2977.
- Intrinsic evaluation of word vectors fails to predict extrinsic performance. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 1–6.
- Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 594–602.
- A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
- Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680.
- Waste not: meta-embedding of word and context vectors. In International Conference on Applications of Natural Language to Information Systems, pp. 393–401.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of NAACL-HLT, pp. 1606–1615.
- Problems with evaluation of word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 30–35.
- Placing search in context: the concept revisited. ACM Transactions on Information Systems 20 (1), pp. 116–131.
- Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proceedings of the NAACL-HLT Student Research Workshop, pp. 8–15.
- A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57, pp. 345–420.
- English Gigaword third edition LDC2007T07. Web Download. Philadelphia: Linguistic Data Consortium.
- Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1302–1310.
- SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695.
- Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800.
- Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
- Bag of tricks for efficient text classification. European Association for Computational Linguistics 2017, pp. 427.
- Clustering oriented representation learning with attractive-repulsive loss. AAAI Workshop: Network Interpretability for Deep Learning.
- Dynamic meta-embeddings for improved sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1466–1477.
- Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751.
- Adam: a method for stochastic optimization. In Proceedings of the 2015 International Conference on Learning Representations.
- Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225.
- Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pp. 171–180.
- Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pp. 2177–2185.
- Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3650–3656.
- Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 13–18.
- Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 142–150.
- Word embedding and WordNet based metaphor identification and interpretation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
- Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19 (2), pp. 313–330.
- A simple word embedding model for lexical substitution. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 1–7.
- Efficient estimation of word representations in vector space. Proceedings of the 2013 International Conference on Learning Representations.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- A semantic concordance. In Proceedings of the Workshop on Human Language Technology, pp. 303–308.
- A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pp. 1081–1088.
- Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pp. 2265–2273.
- Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web, pp. 83–84.
- Automatic differentiation in PyTorch. In Neural Information Processing Systems 2017 Workshop on Automatic Differentiation.
- GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543.
- Deep contextualized word representations. In Proceedings of the 2018 Conference of NAACL-HLT, Volume 1 (Long Papers), Vol. 1, pp. 2227–2237.
- A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012).
- Using the output embedding to improve language models. EACL 2017, pp. 157.
- When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of NAACL-HLT, Volume 2 (Short Papers), pp. 529–535.
- A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, pp. 337–346.
- The (too many) problems of analogical reasoning with word vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pp. 135–148.
- PIC a different word: a simple model for lexical substitution in context. In Proceedings of the 2016 Conference of the NAACL-HLT, pp. 1121–1126.
- Swivel: improving embeddings by noticing what’s missing. arXiv preprint arXiv:1602.02215.
- Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068.
- Reconstruction of word embeddings from sub-word parameters. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pp. 130–135.
- Frequency estimates for statistical word similarity measures. In Proceedings of the 2003 Conference of NAACL-HLT - Volume 1, pp. 165–172.
- Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571 (7763).
- Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394.
- From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37, pp. 141–188.
- Generalized low rank models. Foundations and Trends® in Machine Learning 9 (1), pp. 1–118.
- Verb similarity on the taxonomy of WordNet. Masaryk University.
- How to avoid sentences spelling boring? Towards a neural approach to unsupervised metaphor generation. In Proceedings of the 2019 Conference of NAACL-HLT, Volume 1 (Long and Short Papers), pp. 861–871.
- Deep learning for sentiment analysis: a survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (4), pp. e1253.
- BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data 6 (1).