A Critique of the Smooth Inverse Frequency Sentence Embeddings
We critically review the smooth inverse frequency (SIF) sentence embedding method of \citeauthor{arora2017simple} \shortcite{arora2017simple} and show inconsistencies in its setup, derivation, and evaluation.
The smooth inverse frequency (SIF) sentence embedding method of \citeauthor{arora2017simple} \shortcite{arora2017simple} has gained attention in the NLP community due to its simplicity and competitive performance. We recognize the strengths of this method, but we argue that its theoretical justification contains a number of flaws. In what follows, we show that there are contradictory arguments in the setup, derivation, and experimental evaluation of SIF.
We first recall the word production model used by the authors as the foundation of SIF: given the context vector $c_s$, the probability that a word $w$ is emitted in the context $c_s$ is modeled by
\begin{equation}
\Pr[w \mid c_s] = \alpha\, p(w) + (1-\alpha)\,\frac{\exp\left(\langle \tilde{c}_s, v_w\rangle\right)}{Z_{\tilde{c}_s}}, \qquad \tilde{c}_s = \beta c_0 + (1-\beta)\, c_s, \tag{1}
\end{equation}
where $\alpha$ and $\beta$ are scalar hyperparameters, $v_w \in \mathbb{R}^d$ is a word embedding for $w$, $c_0 \in \mathbb{R}^d$ is the so-called common discourse vector, and $Z_{\tilde{c}_s} = \sum_{w} \exp\left(\langle \tilde{c}_s, v_w\rangle\right)$ is the normalizing constant.
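As a concrete illustration, the emission probability in this mixture model can be computed directly. The following sketch uses randomly generated toy embeddings; the vocabulary size, dimension, and hyperparameter values are illustrative, not taken from the original experiments:

```python
import numpy as np

# Toy instantiation of the generative model (1):
#   Pr[w | c_s] = alpha * p(w) + (1 - alpha) * exp(<c~_s, v_w>) / Z_{c~_s}.
# All sizes and hyperparameter values below are illustrative.

rng = np.random.default_rng(0)
n, d = 1000, 50                       # vocabulary size, embedding dimension
V = rng.normal(size=(n, d))           # word embeddings v_w
p = rng.dirichlet(np.ones(n))         # unigram distribution p(w)
alpha, beta = 0.5, 0.5                # scalar hyperparameters
c0 = rng.normal(size=d); c0 /= np.linalg.norm(c0)     # common discourse vector
c_s = rng.normal(size=d); c_s /= np.linalg.norm(c_s)  # context vector of sentence s

def word_probs(c_s):
    """Emission distribution over the whole vocabulary under model (1)."""
    c_tilde = beta * c0 + (1 - beta) * c_s            # smoothed context vector
    scores = np.exp(V @ c_tilde)
    Z = scores.sum()                                  # normalizing constant Z_{c~_s}
    return alpha * p + (1 - alpha) * scores / Z

probs = word_probs(c_s)               # a valid distribution: sums to 1
```

Since both mixture components are normalized, the result is a proper distribution over the vocabulary regardless of the context vector.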
The authors empirically find (see their Section 4.1.1) that the optimal value of the weight parameter $a$ satisfies
\begin{equation}
10^{-4} \le a \le 10^{-3}, \tag{3}
\end{equation}
where $a = \frac{1-\alpha}{\alpha Z}$. In their previous work, \citeauthor{arora2016latent} \shortcite{arora2016latent} showed (see the proof sketch of their Lemma 2.1 on p.~398) that under the isotropy assumption on the $v_w$'s,
\begin{equation}
Z_{\tilde{c}_s} \approx Z, \tag{4}
\end{equation}
where $Z$ is a random variable upper bounded by a constant. From (4) we have $Z \le C$ for some constant $C$, and combining this with the right inequality from (3) we have $\frac{1-\alpha}{\alpha} = aZ \le 10^{-3}C$. For a typical vocabulary size this implies $\alpha \approx 1$, which means that the generative model (1) is essentially a unigram model that practically ignores the context: $\Pr[w \mid c_s] \approx p(w)$.
Treating any sentence $s$ as a sequence of words $w_1, \ldots, w_m$, the authors construct its log-likelihood given the smoothed context vector $\tilde{c}_s$ as $\ell(\tilde{c}_s) = \sum_{w \in s} f_w(\tilde{c}_s)$, where $f_w(\tilde{c}_s) = \log\left[\alpha\, p(w) + (1-\alpha)\,\frac{\exp\left(\langle \tilde{c}_s, v_w\rangle\right)}{Z}\right]$. Then this log-likelihood is linearized using the first-order Taylor expansion at $\tilde{c}_s = \mathbf{0}$:
\begin{equation}
f_w(\tilde{c}_s) \approx f_w(\mathbf{0}) + \langle \nabla f_w(\mathbf{0}), \tilde{c}_s \rangle. \tag{5}
\end{equation}
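Maximizing the linearized log-likelihood over the unit sphere yields the familiar SIF estimate, a weighted average of word vectors with weights $a/(p(w)+a)$, so that frequent words are downweighted. A toy sketch (all frequencies and vectors below are made up for illustration):

```python
import numpy as np

# Toy illustration of the SIF weighted average: each word vector is scaled by
# a / (p(w) + a). All unigram frequencies and embeddings are illustrative.

rng = np.random.default_rng(1)
p = {"the": 5e-2, "cat": 1e-4, "sat": 1e-4, "on": 2e-2, "mat": 1e-4}
V = {w: rng.normal(size=50) for w in p}   # toy word embeddings
a = 1e-3                                  # SIF weight parameter

def sif_embedding(sentence):
    """Weighted average of word vectors with SIF weights a / (p(w) + a)."""
    return np.mean([a / (p[w] + a) * V[w] for w in sentence], axis=0)

emb = sif_embedding(["the", "cat", "sat", "on", "the", "mat"])
# a frequent word like "the" gets weight a/(0.05 + a) ~ 0.02,
# while a rare word like "cat" gets a/(1e-4 + a) ~ 0.91
```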
It is possible to have a valid derivation of the SIF sentence embedding as the maximum a posteriori (MAP) estimate of $\tilde{c}_s$ given $w_1, \ldots, w_m$ once we assume the generative model
\begin{equation}
\Pr[w \mid \tilde{c}_s] = p(w) \cdot \frac{\exp\left(\frac{a}{p(w)+a}\langle \tilde{c}_s, v_w\rangle\right)}{N_{\tilde{c}_s}} \tag{6}
\end{equation}
instead of (1). The proof is similar to that of Lemma 3.1 in \citeauthor{arora2016latent} \shortcite{arora2016latent}, and is left as an exercise. Now, assume that $c$ is a single context word, and $u$ is its embedding. Taking the logarithm of both sides in (6), then solving for $\langle u, v_w \rangle$ and assuming that the normalizer in (6) concentrates well around a constant $N_0$, we have
\begin{equation}
\langle u, v_w \rangle = \frac{p(w)+a}{a}\left(\log\frac{p(w \mid c)}{p(w)} + \log N_0\right). \tag{7}
\end{equation}
This means that the word and context embeddings that underlie the language model (6) give a low-rank approximation of a matrix $\mathbf{M}$ in which the element in row $w$ and column $c$ is equal to the right-hand side of (7). It is well known that the word and context embeddings that underlie the SGNS training objective give a low-rank approximation of the shifted PMI matrix, and that factorizing the latter with truncated SVD gives embeddings of similar quality \shortcite{levy2014neural}. This means that if the model (6) is adequate, then the truncated SVD of $\mathbf{M}$ should give us good-quality word embeddings as well. We calculated the shifted PMI matrix and $\mathbf{M}$ on the text8 data\footnote{http://mattmahoney.net/dc/textdata.html} and then performed rank-200 approximation of each. The resulting embeddings were evaluated on standard similarity (WordSim) and analogy (Google and MSR) tasks. The hyperparameter $a$ was tuned using grid search. The results of the evaluation are given in Table 1.
[Table 1: quality of the embeddings obtained from the shifted PMI matrix and from $\mathbf{M}$ on the similarity and analogy tasks.]
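The pipeline just described (co-occurrence counts $\to$ shifted PMI $\to$ truncated SVD) can be sketched on a toy corpus as follows; the corpus, window size, shift $k$, and rank are stand-ins for the actual experimental settings, and we clip the shifted PMI at zero as is common for SGNS-style factorizations:

```python
import numpy as np

# Sketch of the factorization experiment; a tiny synthetic corpus stands in
# for text8, and window / shift / rank values are illustrative. We build
# word-context co-occurrence counts, form the shifted positive PMI matrix
# max(PMI - log k, 0), and factorize it with truncated SVD.

corpus = ("the cat sat on the mat the dog sat on the rug " * 50).split()
window, k, rank = 2, 5, 5

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

C = np.zeros((n, n))                       # co-occurrence counts
for i, w in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            C[idx[w], idx[corpus[j]]] += 1

total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total  # word marginals
pc = C.sum(axis=0, keepdims=True) / total  # context marginals
with np.errstate(divide="ignore"):
    pmi = np.log(C / total) - np.log(pw) - np.log(pc)
spmi = np.maximum(pmi - np.log(k), 0)      # shifted positive PMI

U, S, _ = np.linalg.svd(spmi)
W = U[:, :rank] * np.sqrt(S[:rank])        # truncated-SVD word embeddings
```

For the matrix $\mathbf{M}$, the same truncated-SVD step is applied after replacing the shifted PMI entries with the right-hand side of (7).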
In fact, the novelty of the method of \citeauthor{arora2017simple} \shortcite{arora2017simple} lies not only in using SIF weights but also in removing the first principal component from the resulting sentence embeddings. When the authors evaluate their method against a simple average (Avg) of word vectors, they do not consider principal component removal (PCR) as a separate factor, i.e., they do not compare against a simple average of word embeddings followed by principal component removal (Avg+PCR). We performed such a comparison on datasets from the SemEval Semantic Textual Similarity (STS) tasks\footnote{http://ixa2.si.ehu.es/stswiki/index.php/Main\_Page} with GloVe and SGNS embeddings, and the results are shown in Fig. 1. As we can see, SIF is indeed stronger than Avg, but this advantage diminishes when we remove the principal components from both. Looking at the boxplots, one may think that the difference between Avg+PCR and SIF+PCR is not significant; however, this is not the case: SIF+PCR demonstrates higher scores than Avg+PCR according to a paired one-sided Wilcoxon signed-rank test, for both GloVe and SGNS embeddings. Thus, we admit that the overall claim of the authors is valid: SIF outperforms Avg both with and without PCR.
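PCR itself is a small, self-contained post-processing step. A sketch on synthetic sentence embeddings, following the usual recipe of projecting out the first right singular vector of the stacked embeddings (the data here are stand-ins for Avg or SIF vectors):

```python
import numpy as np

# Sketch of principal component removal (PCR): compute the first right
# singular vector of the stacked sentence embeddings and subtract each
# embedding's projection onto it. Embeddings below are synthetic.

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))        # 100 sentence embeddings of dimension 20

u = np.linalg.svd(X, full_matrices=False)[2][0]   # first principal direction
X_pcr = X - np.outer(X @ u, u)                    # project it out of every row

# after PCR, every embedding is orthogonal to u
```

Because the same projection is applied to every row, PCR can be bolted onto any sentence-embedding baseline, which is why treating it as a separate experimental factor matters.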
The sentence embedding method of \citeauthor{arora2017simple} \shortcite{arora2017simple} is indeed a simple but tough-to-beat baseline, and it has a clear underlying intuition: the embeddings of too-frequent words should be downweighted when summed with those of less frequent ones. However, one does not need to tweak a previously developed mathematical theory to justify this empirical finding: in pursuit of mathematical validity, the SIF authors made a number of errors.
Levy, O., and Goldberg, Y. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, 2177–2185.