Incorporating Subword Information into Matrix
Factorization Word Embeddings
The positive effect of adding subword information to word embeddings has been demonstrated for predictive models. In this paper we investigate whether similar benefits can also be derived from incorporating subwords into counting models. We evaluate the impact of different types of subwords (n-grams and unsupervised morphemes), with results confirming the importance of subword information in learning representations of rare and out-of-vocabulary words.

This is a preprint of the paper that will be presented at the Second Workshop on Subword and Character Level Models in NLP (SCLeM) to be held at NAACL 2018.
Alexandre Salle, Aline Villavicencio
Institute of Informatics, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
firstname.lastname@example.org email@example.com
1 Introduction

Low dimensional word representations (embeddings) have become a key component in modern NLP systems for language modeling, parsing, sentiment classification, and many others. These embeddings are usually derived by employing the distributional hypothesis: that similar words appear in similar contexts (Harris1954).
The models that perform the word embedding can be divided into two classes: predictive, which learn a target or context word distribution, and counting, which use a raw, weighted, or factored word-context co-occurrence matrix (Baroni2014). The most well-known predictive model, which has become eponymous with word embedding, is word2vec (Mikolov2013). Popular counting models include PPMI-SVD (Levy2014), GloVe (Pennington2014), and LexVec (Salle2016).
These models all learn word-level representations, which presents two main problems: 1) Learned information is not explicitly shared among the representations as each word has an independent vector. 2) There is no clear way to represent out-of-vocabulary (OOV) words.
fastText (Bojanowski2017) addresses these issues in the Skip-gram word2vec model by representing a word as the sum of a unique word vector and a set of shared character n-gram vectors (from hereon simply referred to as n-grams). This addresses both issues above: learned information is shared through the n-gram vectors, and representations for OOV words can be constructed from them.
In this paper we propose incorporating subword information into counting models using a strategy similar to fastText.
We use LexVec as the counting model as it generally outperforms PPMI-SVD and GloVe on intrinsic and extrinsic evaluations (Salle2016a; cer2017semeval; Wohlgenannt2017UsingWE; Konkol2017GeographicalEO), but the method proposed here should transfer to GloVe unchanged.
The LexVec objective is modified such that a word’s vector is the sum of all its subword vectors. (Our implementation of subword LexVec is available at https://github.com/alexandres/lexvec.)
We compare 1) the use of n-gram subwords, like fastText, and 2) unsupervised morphemes identified using Morfessor (virpioja2013) to learn whether more linguistically motivated subwords offer any advantage over simple n-grams.
To evaluate the impact subword information has on in-vocabulary (IV) word representations, we run intrinsic evaluations consisting of word similarity and word analogy tasks. The incorporation of subword information results in gains (and losses) similar to those fastText exhibits over Skip-gram. Whereas incorporating n-gram subwords tends to capture more syntactic information, unsupervised morphemes better preserve semantics while also improving syntactic results. Given that intrinsic performance can correlate poorly with performance on downstream tasks (Tsvetkov2015EvaluationOW), we also conduct evaluation using the VecEval suite of tasks (Nayak2016), in which all subword models, including fastText, show no significant improvement over word-level models.
We verify the model’s ability to represent OOV words by qualitatively evaluating nearest neighbors. Results show that, like fastText, both LexVec n-gram and (to a lesser degree) unsupervised morpheme models give coherent answers.
2 Related Work
Word embeddings that leverage subword information were first introduced by schutze1993word, which represented a word as the sum of its four-gram vectors, obtained by running an SVD of a four-gram to four-gram co-occurrence matrix. Our model differs by learning the subword vectors and the resulting word representation jointly, as weighted factorization of a word-context co-occurrence matrix is performed.
There are many models that use character-level subword information to form word representations (Ling2015FindingFI; Cao2016AJM; Kim2016CharacterAwareNL; Wieting2016CharagramEW; Verwimp2017CharacterWordLL), as well as fastText (the model on which we base our work). Closely related are models that use morphological segmentation in learning word representations (Luong2013; Botha2014CompositionalMF; qiu2014co; mitchell2015orthogonality; Cotterell2015MorphologicalW; Bhatia2016MorphologicalPF). Our model also uses n-grams and morphological segmentation, but it performs explicit matrix factorization to learn subword and word representations, unlike these related models which mostly use neural networks.
Finally, Cotterell2016MorphologicalSA and Vulic2017 retrofit morphological information onto pre-trained models. These differ from our work in that we incorporate morphological information at training time, and that only Cotterell2016MorphologicalSA is able to generate embeddings for OOV words.
3 Subword LexVec
The LexVec (Salle2016a) model factorizes the PPMI-weighted word-context co-occurrence matrix using stochastic gradient descent:

$$PPMI_{wc} = \max\left(0, \log \frac{M_{wc}\, M_{**}}{M_{w*}\, M_{*c}}\right)$$

where $M$ is the word-context co-occurrence matrix constructed by sliding a window of fixed size centered over every target word $w$ in the subsampled (Mikolov2013) training corpus and incrementing cell $M_{wc}$ for every context word $c$ appearing within this window (forming a $(w,c)$ pair). LexVec adjusts the PPMI matrix using context distribution smoothing (Levy2014).

With the PPMI matrix calculated, the sliding window process is repeated and the following loss functions are minimized for every observed $(w,c)$ pair and target word $w$:

$$L_{wc} = \frac{1}{2} \left( w^\top \tilde{c} - PPMI_{wc} \right)^2$$

$$L_{w} = \frac{1}{2} \sum_{i=1}^{k} \mathbf{E}_{c_i \sim P_n(c)} \left( w^\top \tilde{c_i} - PPMI_{wc_i} \right)^2$$

where $w$ and $\tilde{c}$ are $d$-dimensional word and context vectors. The second loss function describes how, for each target word, $k$ negative samples (Mikolov2013) are drawn from the smoothed context unigram distribution $P_n(c)$.
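As a concrete illustration (not the authors' implementation), the PPMI weighting with context distribution smoothing can be sketched in NumPy as follows; the function name and the default smoothing exponent `alpha=0.75` are assumptions:

```python
import numpy as np

def ppmi(M, alpha=0.75):
    """Compute the PPMI matrix from a word-context co-occurrence matrix M.

    Context counts are raised to the power alpha (context distribution
    smoothing), so PMI is computed against the smoothed context
    distribution; negative values are clipped to zero.
    """
    M = np.asarray(M, dtype=np.float64)
    row = M.sum(axis=1, keepdims=True)           # #(w): word counts
    col = M.sum(axis=0, keepdims=True) ** alpha  # #(c)^alpha: smoothed context counts
    with np.errstate(divide="ignore"):
        pmi = np.log(M * col.sum() / (row * col))
    return np.maximum(pmi, 0.0)                  # positive PMI
```

With `alpha=1.0` this reduces to the standard PPMI formula, since `col.sum()` is then the total count $M_{**}$.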
Given a set of subwords $S_w$ for a word $w$, we follow fastText and replace $w$ in the loss functions by

$$w = \frac{1}{|S_w| + 1} \left( w + \sum_{s \in S_w} q_{hash(s)} \right)$$

such that a word is the sum of its word vector and its $d$-dimensional subword vectors $q_x$. The number of possible subwords is very large, so the function $hash(s)$ (we use the FNV hash, http://www.isthe.com/chongo/tech/comp/fnv/) hashes a subword to the interval $[1, buckets]$. For OOV words, which have no word vector, only the subword vectors are combined:

$$w = \frac{1}{|S_w|} \sum_{s \in S_w} q_{hash(s)}$$
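A minimal sketch of this composition, assuming a 32-bit FNV-1a hash and a small illustrative bucket count (real implementations use millions of buckets); the function names and averaging are our own framing of the scheme described above:

```python
import numpy as np

BUCKETS = 2 ** 16  # illustrative only; a real model uses far more buckets

def fnv1a(s):
    """32-bit FNV-1a hash of a string (one variant of the FNV family)."""
    h = 0x811C9DC5
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h

def compose(word_vec, subwords, subword_table):
    """IV word: average the word vector with its hashed subword vectors."""
    vecs = [word_vec] + [subword_table[fnv1a(s) % BUCKETS] for s in subwords]
    return np.mean(vecs, axis=0)

def compose_oov(subwords, subword_table):
    """OOV word: no word vector exists, so average subword vectors only."""
    vecs = [subword_table[fnv1a(s) % BUCKETS] for s in subwords]
    return np.mean(vecs, axis=0)
```

Hash collisions between rare subwords are tolerated by design; the bucket count trades memory for collision rate.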
We compare two types of subwords: simple n-grams (like fastText) and unsupervised morphemes. For example, given the word “cat”, we mark its beginning and end with angled brackets and use all n-grams of length 3 to 6 as subwords, yielding the subwords {<ca, cat, at>, <cat, cat>, <cat>}. Morfessor (virpioja2013) is used to probabilistically segment words into morphemes. The Morfessor model is trained on raw text, so it is entirely unsupervised. For the word “subsequent”, we get {<sub, sequent>}.
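The n-gram extraction just described can be sketched as follows; the function name and defaults are ours, with the length range 3 to 6 following fastText:

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Extract character n-grams of a word with boundary markers.

    The word is wrapped in angled brackets so that prefixes and suffixes
    are distinguishable from word-internal n-grams.
    """
    marked = "<" + word + ">"
    grams = []
    for n in range(nmin, nmax + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams
```

For short words like “cat” the marked form is only 5 characters long, so the longer n-gram lengths contribute nothing.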
4 Materials

Our experiments aim to measure if the incorporation of subword information into LexVec results in similar improvements as observed in moving from Skip-gram to fastText, and whether unsupervised morphemes offer any advantage over n-grams. For IV words, we perform intrinsic evaluation via word similarity and word analogy tasks, as well as downstream tasks. OOV word representation is tested through qualitative nearest-neighbor analysis.
All models are trained using a 2015 dump of Wikipedia, lowercased and using only alphanumeric characters. Vocabulary is limited to words that appear at least times for a total of words. Morfessor is trained on this vocabulary list.
We train the standard LexVec (LV), LexVec using n-grams (LV-N), and LexVec using unsupervised morphemes (LV-M) using the same hyper-parameters as Salle2016a (, , , , , ).
Both Skip-gram (SG) and fastText (FT) are trained using the reference implementation of fastText (https://github.com/facebookresearch/fastText) with the hyper-parameters given by Bojanowski2017 (, , , ).
All five models are run for iterations over the training corpus and generate dimensional word representations. LV-N, LV-M, and FT use buckets when hashing subwords.
For word similarity evaluations, we use the WordSim-353 Similarity (WS-Sim) and Relatedness (WS-Rel) (Finkelstein2001) and SimLex-999 (SimLex) (hill2015simlex) datasets, and the Rare Word (RW) (Luong2013) dataset to verify if subword information improves rare word representation. Relationships are measured using the Google semantic (GSem) and syntactic (GSyn) analogies (Mikolov2013) and the Microsoft syntactic analogies (MSR) dataset (Mikolov2013b).
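Analogy benchmarks such as GSem, GSyn, and MSR are conventionally scored with the 3CosAdd method over unit-normalized vectors; a minimal sketch (our own function names, and a toy vocabulary in the usage below) looks like this:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy_3cosadd(a, b, c, vocab):
    """Solve the analogy a : b :: c : ? by 3CosAdd.

    vocab maps words to vectors. The three query words are excluded from
    the candidate set, as is standard in the Google analogy evaluation.
    """
    target = normalize(normalize(vocab[b]) - normalize(vocab[a]) + normalize(vocab[c]))
    best_word, best_sim = None, -np.inf
    for w, v in vocab.items():
        if w in (a, b, c):
            continue
        sim = float(normalize(v) @ target)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word
```

Accuracy on a dataset is then simply the fraction of analogy questions for which the returned word matches the gold answer.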
We also evaluate all five models on downstream tasks from the VecEval suite (Nayak2016, https://github.com/NehaNayak/veceval), using only the tasks for which training and evaluation data are freely available: chunking, sentiment and question classification, and natural language identification (NLI). The default settings from the suite are used, but we run only the fixed setting, in which the embeddings themselves are not tunable parameters of the models, forcing the system to use only the information already present in the embeddings.
Finally, we use LV-N, LV-M, and FT to generate OOV word representations for the following words: 1) “hellooo”: a greeting commonly used in instant messaging which emphasizes a syllable. 2) “marvelicious”: a made-up word obtained by merging “marvelous” and “delicious”. 3) “louisana”: a misspelling of the proper name “Louisiana”. 4) “rereread”: recursive use of prefix “re”. 5) “tuzread”: made-up prefix “tuz”.
Table 2: Five nearest neighbors of each OOV word under the three subword models.

| Word | Model | 5 Nearest Neighbors |
| --- | --- | --- |
| “hellooo” | LV-N | hellogoodbye, hello, helloworld, helloween, helluva |
| | LV-M | kitsos, finos, neros, nonono, theodoroi |
| | FT | hello, helloworld, hellogoodbye, helloween, joegazz |
| “marvelicious” | LV-N | delicious, marveled, marveling, licious, marvellous |
| | LV-M | marveling, marvelously, marveled, marvelled, loquacious |
| | FT | delicious, deliciously, marveling, licious, marvelman |
| “louisana” | LV-N | luisana, pisana, belisana, chiisana, rosana |
| | LV-M | louisy, louises, louison, louiseville, louisiade |
| | FT | luisana, louisa, belisana, anabella, rosana |
| “rereread” | LV-N | reread, rereading, read, writeread, rerecord |
| | LV-M | alread, carreer, whiteread, unremarked, oread |
| | FT | reread, rereading, read, reiterate, writeread |
| “tuzread” | LV-N | tuzi, tuz, tuzla, prizren, momchilgrad, studenica |
| | LV-M | tuzluca, paczk, goldsztajn, belzberg, yizkor |
| | FT | pazaryeri, tufanbeyli, yenipazar, leskovac, berovo |
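The nearest-neighbor queries behind these results amount to a cosine-similarity search over the in-vocabulary word vectors; a minimal sketch (function name ours) is:

```python
import numpy as np

def nearest_neighbors(query_vec, vocab, k=5):
    """Return the k in-vocabulary words whose vectors have the highest
    cosine similarity to query_vec (e.g. an OOV vector built from
    subword vectors)."""
    qn = np.linalg.norm(query_vec)
    sims = []
    for w, v in vocab.items():
        sims.append((float(np.dot(query_vec, v) / (qn * np.linalg.norm(v))), w))
    sims.sort(reverse=True)
    return [w for _, w in sims[:k]]
```

A brute-force scan like this is adequate for qualitative inspection; large-scale retrieval would use an approximate nearest-neighbor index instead.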
5 Results

Like in FT, the use of subword information in both LV-N and LV-M results in 1) better representation of rare words, as evidenced by the increase in RW correlation, and 2) significant improvement on the GSyn and MSR tasks, evidence that subwords encode information about a word’s syntactic function (the suffix “ly”, for example, suggests an adverb).
There seems to be a trade-off between capturing semantics and syntax: in both LV-N and FT there is an accompanying decrease on the GSem tasks in exchange for gains on the GSyn and MSR tasks. Morphological segmentation in LV-M appears to favor syntax less strongly than simple n-grams do.
On the downstream tasks, we only observe a statistically significant improvement (under a random permutation test) on the chunking task, and it is a very small gain. We attribute this to both regular and subword models having very similar quality on frequent IV word representations. Statistically, these are the words that are most likely to appear in the downstream task instances, and so the superior representation of rare words has, due to their rarity, little impact on overall accuracy. Because in all tasks OOV words are mapped to the “unk” token, the subword models are not being used to the fullest, and in future work we will investigate whether generating representations for all words improves task performance.
In OOV representation (table 2), LV-N and FT work almost identically, as is to be expected. Both find highly coherent neighbors for the words “hellooo”, “marvelicious”, and “rereread”. Interestingly, the misspelling “louisana” leads to coherent name-like neighbors, although none is the expected correct spelling “louisiana”. All models stumble on the made-up prefix “tuz”. A possible fix would be to down-weight very rare subwords in the vector summation. LV-M is less robust than LV-N and FT on this task as it is highly sensitive to incorrect segmentation, as exemplified by “hellooo”.
Finally, we see that nearest neighbors are a mixture of similarly prefixed and suffixed words. If these affixes are semantic, the neighbors are semantically related; if syntactic, they have similar syntactic function. This suggests it should be possible to obtain tunable representations, driven more by either semantics or syntax, through a weighted summation of subword vectors, provided we can identify whether an affix is semantic or syntactic in nature and weigh it accordingly. This might be possible without supervision by using corpus statistics, since syntactic subwords are likely to be more frequent and so could be down-weighted to produce more semantic representations. This is something we will pursue in future work.
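One way such frequency-based down-weighting could look is sketched below; the inverse-log weighting and all names are hypothetical illustrations of the idea, not the paper's method:

```python
import math
import numpy as np

def weighted_compose(subwords, subword_vecs, subword_freqs):
    """Combine subword vectors with weights inversely related to corpus
    frequency, so that very frequent (often syntactic) subwords such as
    "ly>" contribute less and the representation leans more semantic.

    The 1 / log(2 + freq) weighting is an arbitrary illustrative choice.
    """
    weights = np.array([1.0 / math.log(2.0 + subword_freqs[s]) for s in subwords])
    weights = weights / weights.sum()  # normalize to a convex combination
    return sum(wt * subword_vecs[s] for wt, s in zip(weights, subwords))
```

Sweeping the weighting function between uniform and strongly frequency-penalizing would then give the semantics/syntax "tuning knob" described above.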
6 Conclusion and Future Work
In this paper, we incorporated subword information (simple n-grams and unsupervised morphemes) into the LexVec word embedding model and evaluated its impact on the resulting IV and OOV word vectors. Like fastText, subword LexVec learns better representations for rare words than its word-level counterpart. All models generated coherent representations for OOV words, with simple n-grams demonstrating more robustness than unsupervised morphemes. In future work, we will verify whether using OOV representations in downstream tasks improves performance. We will also explore the trade-off between semantics and syntax when subword information is used.