Incorporating Subword Information into Matrix Factorization Word Embeddings

Alexandre Salle  Aline Villavicencio
Institute of Informatics
Universidade Federal do Rio Grande do Sul
Porto Alegre, Brazil

The positive effect of adding subword information to word embeddings has been demonstrated for predictive models. In this paper we investigate whether similar benefits can also be derived from incorporating subwords into counting models. We evaluate the impact of different types of subwords (n-grams and unsupervised morphemes), with results confirming the importance of subword information in learning representations of rare and out-of-vocabulary words. (This is a preprint of the paper to be presented at the Second Workshop on Subword and Character LEvel Models in NLP (SCLeM), to be held at NAACL 2018.)

1 Introduction

Low dimensional word representations (embeddings) have become a key component in modern NLP systems for language modeling, parsing, sentiment classification, and many others. These embeddings are usually derived by employing the distributional hypothesis: that similar words appear in similar contexts (Harris1954).

The models that perform the word embedding can be divided into two classes: predictive, which learn a target or context word distribution, and counting, which use a raw, weighted, or factored word-context co-occurrence matrix (Baroni2014). The most well-known predictive model, which has become nearly synonymous with word embedding, is word2vec (Mikolov2013). Popular counting models include PPMI-SVD (Levy2014), GloVe (Pennington2014), and LexVec (Salle2016).

These models all learn word-level representations, which presents two main problems: 1) Learned information is not explicitly shared among the representations as each word has an independent vector. 2) There is no clear way to represent out-of-vocabulary (OOV) words.

fastText (Bojanowski2017) addresses these issues in the Skip-gram word2vec model by representing a word as the sum of a unique word vector and a set of shared character n-gram (from hereon simply referred to as n-gram) vectors. This addresses both issues above: learned information is shared through the n-gram vectors, and representations for OOV words can be constructed from them.

In this paper we propose incorporating subword information into counting models using a strategy similar to fastText.

We use LexVec as the counting model as it generally outperforms PPMI-SVD and GloVe on intrinsic and extrinsic evaluations (Salle2016a; cer2017semeval; Wohlgenannt2017UsingWE; Konkol2017GeographicalEO), but the method proposed here should transfer to GloVe unchanged.

The LexVec objective is modified such that a word's vector is the sum of all its subword vectors. (Our implementation of subword LexVec is publicly available.)

We compare 1) the use of n-gram subwords, like fastText, and 2) unsupervised morphemes identified using Morfessor (virpioja2013) to learn whether more linguistically motivated subwords offer any advantage over simple n-grams.

To evaluate the impact subword information has on in-vocabulary (IV) word representations, we run intrinsic evaluations consisting of word similarity and word analogy tasks. The incorporation of subword information results in gains (and losses) similar to those of fastText over Skip-gram. Whereas incorporating n-gram subwords tends to capture more syntactic information, unsupervised morphemes better preserve semantics while also improving syntactic results. Given that intrinsic performance can correlate poorly with performance on downstream tasks (Tsvetkov2015EvaluationOW), we also conduct evaluation using the VecEval suite of tasks (Nayak2016), in which all subword models, including fastText, show no significant improvement over word-level models.

We verify the model’s ability to represent OOV words by qualitatively evaluating nearest neighbors. Results show that, like fastText, both LexVec n-gram and (to a lesser degree) unsupervised morpheme models give coherent answers.

This paper discusses related work (§2), introduces the subword LexVec model (§3), describes experiments (§4), analyzes results (§5), and concludes with ideas for future work (§6).

2 Related Work

Word embeddings that leverage subword information were first introduced by schutze1993word, which represented a word as the sum of four-gram vectors obtained by running an SVD of a four-gram to four-gram co-occurrence matrix. Our model differs in that the subword vectors and the resulting word representations are learned jointly through weighted factorization of a word-context co-occurrence matrix.

There are many models that use character-level subword information to form word representations (Ling2015FindingFI; Cao2016AJM; Kim2016CharacterAwareNL; Wieting2016CharagramEW; Verwimp2017CharacterWordLL), as well as fastText (the model on which we base our work). Closely related are models that use morphological segmentation in learning word representations (Luong2013; Botha2014CompositionalMF; qiu2014co; mitchell2015orthogonality; Cotterell2015MorphologicalW; Bhatia2016MorphologicalPF). Our model also uses n-grams and morphological segmentation, but it performs explicit matrix factorization to learn subword and word representations, unlike these related models which mostly use neural networks.

Finally, Cotterell2016MorphologicalSA and Vulic2017 retrofit morphological information onto pre-trained models. These differ from our work in that we incorporate morphological information at training time, and that only Cotterell2016MorphologicalSA is able to generate embeddings for OOV words.

3 Subword LexVec

The LexVec (Salle2016a) model factorizes the PPMI-weighted word-context co-occurrence matrix using stochastic gradient descent:

$\mathrm{PPMI}_{wc} = \max\left(0, \log\frac{M_{wc}\, M_{**}}{M_{w*}\, M_{*c}}\right)$ (1)

where $M$ is the word-context co-occurrence matrix constructed by sliding a window of fixed size centered over every target word $w$ in the subsampled (Mikolov2013) training corpus and incrementing cell $M_{wc}$ for every context word $c$ appearing within this window (forming a $(w, c)$ pair). LexVec adjusts the PPMI matrix using context distribution smoothing (Levy2014), yielding $\mathrm{PPMI}^{*}$.
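For a dense count matrix, the PPMI computation with context distribution smoothing can be sketched as follows (function and argument names are illustrative, not from the reference implementation):

```python
import numpy as np

def ppmi_matrix(M, alpha=0.75):
    """Positive PMI of a dense word-context count matrix M, with
    context distribution smoothing: context counts are raised to the
    power alpha before normalizing (alpha=1 recovers plain PPMI)."""
    total = M.sum()
    p_w = M.sum(axis=1) / total            # target word probabilities P(w)
    ctx = M.sum(axis=0) ** alpha
    p_c = ctx / ctx.sum()                  # smoothed context probabilities
    p_wc = M / total                       # joint probabilities P(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / np.outer(p_w, p_c))
    pmi[~np.isfinite(pmi)] = 0.0           # zero counts contribute nothing
    return np.maximum(pmi, 0.0)            # clip negative PMI to zero
```

In practice the co-occurrence matrix is sparse and this computation is done cell by cell, but the dense form shows the arithmetic.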

With the $\mathrm{PPMI}^{*}$ matrix calculated, the sliding window process is repeated and the following loss functions are minimized for every observed $(w, c)$ pair and target word $w$:

$L_{wc} = \frac{1}{2}\left(W_w^\top \tilde{W}_c - \mathrm{PPMI}^{*}_{wc}\right)^2$ (2)

$L_{w} = \frac{1}{2}\sum_{i=1}^{k} \mathbf{E}_{c_i \sim P_n(c)}\left[\left(W_w^\top \tilde{W}_{c_i} - \mathrm{PPMI}^{*}_{wc_i}\right)^2\right]$ (3)

where $W_w$ and $\tilde{W}_c$ are $d$-dimensional word and context vectors. The second loss function describes how, for each target word, $k$ negative samples (Mikolov2013) are drawn from the smoothed context unigram distribution $P_n(c)$.
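A toy SGD update for this weighted factorization might look as follows. This is a sketch, not the reference implementation: negative context indices are passed in explicitly rather than sampled from the smoothed unigram distribution, and all names are illustrative.

```python
import numpy as np

def lexvec_step(W, Wc, ppmi, w, c, neg_ids, lr=0.025):
    """One SGD step: push the dot product W[w]·Wc[j] toward the PPMI
    cell ppmi[w, j], for the observed context c and for each
    negative-sampled context j (squared-error loss as in the text)."""
    for j in [c] + list(neg_ids):
        err = W[w].dot(Wc[j]) - ppmi[w, j]   # gradient of 1/2 * err^2
        grad_w = err * Wc[j]
        grad_c = err * W[w]
        W[w] -= lr * grad_w
        Wc[j] -= lr * grad_c
```

Repeated over the corpus, these updates drive the word/context dot products toward the corresponding PPMI values.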

Given a set of subwords $S_w$ for a word $w$, we follow fastText and replace $W_w$ in eqs. (2) and (3) by $v_w$ such that:

$v_w = W_w + \sum_{s \in S_w} q_{hash(s)}$ (4)

such that a word is the sum of its word vector $W_w$ and its $d$-dimensional subword vectors $q_{hash(s)}$. The number of possible subwords is very large, so a hash function $hash(\cdot)$ maps each subword to the interval $[1, buckets]$. For OOV words,

$v_w = \sum_{s \in S_w} q_{hash(s)}$ (5)
We compare two types of subwords: simple n-grams (like fastText) and unsupervised morphemes. For n-grams, given a word such as “cat”, we mark the beginning and end with angled brackets and use all character n-grams within a fixed length range as subwords. Morfessor (virpioja2013) is used to probabilistically segment words into morphemes; the Morfessor model is trained on raw text, so it is entirely unsupervised. For the word “subsequent”, for example, Morfessor produces an unsupervised segmentation into morphemes.
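The n-gram extraction can be sketched as follows; we assume fastText's usual length range of 3 to 6 for illustration, since the exact range is not restated here.

```python
def char_ngrams(word, nmin=3, nmax=6):
    """All character n-grams of the bracket-marked word, fastText-style:
    '<' and '>' mark the word's beginning and end."""
    marked = "<" + word + ">"
    return [marked[i:i + n]
            for n in range(nmin, min(nmax, len(marked)) + 1)
            for i in range(len(marked) - n + 1)]
```

For example, `char_ngrams("cat")` yields `['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']`; note that the boundary markers distinguish the full-word n-gram `<cat>` from occurrences of “cat” inside longer words.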

4 Materials

Our experiments aim to measure if the incorporation of subword information into LexVec results in similar improvements as observed in moving from Skip-gram to fastText, and whether unsupervised morphemes offer any advantage over n-grams. For IV words, we perform intrinsic evaluation via word similarity and word analogy tasks, as well as downstream tasks. OOV word representation is tested through qualitative nearest-neighbor analysis.

All models are trained using a 2015 dump of Wikipedia, lowercased and using only alphanumeric characters. Vocabulary is limited to words that appear a minimum number of times in the corpus. Morfessor is trained on this vocabulary list.
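The corpus cleanup described above can be sketched as follows; the regular expression is our assumption about what “lowercased and using only alphanumeric characters” entails.

```python
import re

def preprocess(line):
    """Lowercase a line and keep only runs of alphanumeric characters
    as tokens (a sketch of the described Wikipedia cleanup)."""
    return re.findall(r"[a-z0-9]+", line.lower())
```

For example, `preprocess("Hello, World! 42")` returns `['hello', 'world', '42']`.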

We train the standard LexVec (LV), LexVec using n-grams (LV-N), and LexVec using unsupervised morphemes (LV-M), all using the same hyper-parameters as Salle2016a.

Both Skip-gram (SG) and fastText (FT) are trained using the reference implementation of fastText with the hyper-parameters given by Bojanowski2017.

All five models are run for the same number of iterations over the training corpus and generate word representations of the same dimensionality. LV-N, LV-M, and FT use the same number of buckets when hashing subwords.

For word similarity evaluations, we use the WordSim-353 Similarity (WS-Sim) and Relatedness (WS-Rel) (Finkelstein2001) and SimLex-999 (SimLex) (hill2015simlex) datasets, and the Rare Word (RW) (Luong2013) dataset to verify if subword information improves rare word representation. Relationships are measured using the Google semantic (GSem) and syntactic (GSyn) analogies (Mikolov2013) and the Microsoft syntactic analogies (MSR) dataset (Mikolov2013b).

We also evaluate all five models on downstream tasks from the VecEval suite (Nayak2016), using only the tasks for which training and evaluation data are freely available: chunking, sentiment and question classification, and natural language identification (NLI). The default settings from the suite are used, but we run only the fixed settings, where the embeddings themselves are not tunable parameters of the models, forcing the system to use only the information already in the embeddings.

Finally, we use LV-N, LV-M, and FT to generate OOV word representations for the following words: 1) “hellooo”: a greeting commonly used in instant messaging which emphasizes a syllable. 2) “marvelicious”: a made-up word obtained by merging “marvelous” and “delicious”. 3) “louisana”: a misspelling of the proper name “Louisiana”. 4) “rereread”: recursive use of prefix “re”. 5) “tuzread”: made-up prefix “tuz”.
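Concretely, the probe composes an OOV vector from subword vectors (as in §3) and ranks vocabulary words by cosine similarity. A minimal sketch of the neighbor search, with illustrative names:

```python
import numpy as np

def nearest_neighbors(v, W, words, k=5):
    """k nearest in-vocabulary words to vector v by cosine similarity.
    W is the (n_words, dim) embedding matrix; words maps rows to strings."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # row-normalize
    sims = Wn @ (v / np.linalg.norm(v))                # cosine scores
    top = np.argsort(-sims)[:k]                        # best k indices
    return [words[i] for i in top]
```

The same routine serves IV and OOV queries; only the construction of `v` differs.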

5 Results

Evaluation LV LV-N LV-M SG FT
WS-Sim .749 .748 .746 .783 .778
WS-Rel .627 .627 .625 .683 .672
SimLex .359 .374 .366 .371 .367
RW .461 .522 .479 .481 .500
GSem 80.7 73.8 80.7 78.9 77.0
GSyn 62.8 68.6 63.8 68.2 71.1
MSR 49.6 55.0 53.8 57.8 59.6
Chunk 90.4 90.6 90.5 90.4 90.4
Sentiment 77.0 77.0 77.6 75.3 77.9
Questions 87.4 87.4 87.3 86.6 85.1
NLI 43.3 43.4 43.3 43.4 43.8
Table 1: Word similarity (Spearman’s rho), analogy (% accuracy), and downstream task (% accuracy) results. In downstream tasks, accuracy for the same model varies over different runs, so we report the mean over multiple runs; the only significantly different result (under a random permutation test) is in chunking.
Word Model 5 Nearest Neighbors
“hellooo” LV-N hellogoodbye, hello, helloworld, helloween, helluva
LV-M kitsos, finos, neros, nonono, theodoroi
FT hello, helloworld, hellogoodbye, helloween, joegazz
“marvelicious” LV-N delicious, marveled, marveling, licious, marvellous
LV-M marveling, marvelously, marveled, marvelled, loquacious
FT delicious, deliciously, marveling, licious, marvelman
“louisana” LV-N luisana, pisana, belisana, chiisana, rosana
LV-M louisy, louises, louison, louiseville, louisiade
FT luisana, louisa, belisana, anabella, rosana
“rereread” LV-N reread, rereading, read, writeread, rerecord
LV-M alread, carreer, whiteread, unremarked, oread
FT reread, rereading, read, reiterate, writeread
“tuzread” LV-N tuzi, tuz, tuzla, prizren, momchilgrad, studenica
LV-M tuzluca, paczk, goldsztajn, belzberg, yizkor
FT pazaryeri, tufanbeyli, yenipazar, leskovac, berovo
Table 2: We generate vectors for OOV words using subword information and search for the nearest (by cosine distance) words in the embedding space. We omit the LV-N and FT n-grams as they are trivial and too numerous to list.

Results for IV evaluation are shown in table 1, and for OOV in table 2.

As in FT, the use of subword information in both LV-N and LV-M results in 1) better representation of rare words, as evidenced by the increase in RW correlation, and 2) significant improvement on the GSyn and MSR tasks, evidence that subwords encode information about a word’s syntactic function (the suffix “ly”, for example, suggests an adverb).

There seems to be a trade-off between capturing semantics and syntax: in both LV-N and FT, gains on the GSyn and MSR tasks are accompanied by a decrease on the GSem task. Morphological segmentation in LV-M appears to favor syntax less strongly than simple n-grams do.

On the downstream tasks, we only observe statistically significant improvement (under a random permutation test) on the chunking task, and it is a very small gain. We attribute this to both regular and subword models having very similar quality on frequent IV word representations. Statistically, these are the words most likely to appear in downstream task instances, so the superior representation of rare words has, due to their rarity, little impact on overall accuracy. Because in all tasks OOV words are mapped to the “unk” token, the subword models are not being used to the fullest; in future work we will investigate whether generating representations for all words improves task performance.

In OOV representation (table 2), LV-N and FT work almost identically, as is to be expected. Both find highly coherent neighbors for the words “hellooo”, “marvelicious”, and “rereread”. Interestingly, the misspelling “louisana” leads to coherent name-like neighbors, although none is the expected correct spelling “louisiana”. All models stumble on the made-up prefix “tuz”. A possible fix would be to down-weight very rare subwords in the vector summation. LV-M is less robust than LV-N and FT on this task, as it is highly sensitive to incorrect segmentation, exemplified by the “hellooo” case.

Finally, we see that nearest neighbors are a mixture of similarly prefixed and suffixed words. If these pre/suffixes are semantic, the neighbors are semantically related; if syntactic, they share syntactic function. This suggests that it should be possible to obtain tunable representations, driven more by semantics or by syntax, through a weighted summation of subword vectors, provided we can identify whether a pre/suffix is semantic or syntactic in nature and weight it accordingly. This might be possible without supervision using corpus statistics, since syntactic subwords are likely to be more frequent, and so could be down-weighted to obtain more semantic representations. This is something we will pursue in future work.
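The frequency-based down-weighting idea above could be sketched as follows; the inverse-frequency weighting function is a hypothetical illustration of the proposal, not a tested design.

```python
import numpy as np

def weighted_compose(subword_vecs, subword_counts, a=1e-3):
    """Combine subword vectors with weights inversely related to their
    corpus frequency: frequent (likely syntactic) subwords are
    down-weighted, pushing the representation toward semantics."""
    total = sum(subword_counts)
    weights = np.array([a / (a + c / total) for c in subword_counts])
    vecs = np.asarray(subword_vecs, dtype=float)
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()
```

Raising or lowering `a` would tune how aggressively frequent subwords are suppressed, giving the semantics/syntax dial the text envisions.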

6 Conclusion and Future Work

In this paper, we incorporated subword information (simple n-grams and unsupervised morphemes) into the LexVec word embedding model and evaluated its impact on the resulting IV and OOV word vectors. Like fastText, subword LexVec learns better representations for rare words than its word-level counterpart. All models generated coherent representations for OOV words, with simple n-grams demonstrating more robustness than unsupervised morphemes. In future work, we will verify whether using OOV representations in downstream tasks improves performance. We will also explore the trade-off between semantics and syntax when subword information is used.

