Autoencoding Improves Pre-trained Word Embeddings

Abstract

Prior work investigating the geometry of pre-trained word embeddings has shown that word embeddings are distributed in a narrow cone, and that centering and projecting them using principal component vectors can increase the accuracy of a given set of pre-trained word embeddings. We show that, theoretically, this post-processing step is equivalent to applying a linear autoencoder that minimises the squared reconstruction error. This result contradicts prior work [mu2018allbutthetop] that proposed to remove the top principal components from pre-trained embeddings. We experimentally verify our theoretical claims and show that retaining the top principal components is indeed useful for improving pre-trained word embeddings, without requiring access to additional linguistic resources or labelled data.


§ 1 Introduction

Pre-trained word embeddings have been successfully used as features for representing input texts in many NLP tasks [Dhillon:2015, Mnih:HLBL:NIPS:2008, Collobert:2011, Huang:ACL:2012, Milkov:2013, Pennington:EMNLP:2014]. \newcitemu2018allbutthetop showed that the accuracy of pre-trained word embeddings can be further improved in a post-processing step, without requiring additional training data, by removing the mean of the word embeddings (centering) computed over the set of words (i.e. vocabulary) and projecting onto the directions defined by the principal component vectors, excluding the top principal components. They empirically showed that pre-trained word embeddings are distributed in a narrow cone around the mean embedding vector, and centering and projection help to reinstate isotropy in the embedding space. This post-processing operation has been repeatedly proposed in different contexts such as with distributional (counting-based) word representations [sahlgren-etal-2016-gavagai] and sentence embeddings [Arora:ICLR:2017].

Independently of the above, autoencoders have been widely used for fine-tuning pre-trained word embeddings, such as for removing gender bias [kaneko-bollegala-2019-gender], meta-embedding [Bao:COLING:2018], cross-lingual word embedding [Wei:IJCAI:2017] and domain adaptation [Chen:ICML:2012]. However, it is unclear whether better performance is obtained simply by applying an autoencoder (a self-supervised task, requiring no labelled data) to pre-trained word embeddings, without performing any task-specific fine-tuning (which requires labelled data for the task).

A connection between principal component analysis (PCA) and linear autoencoders was first proved by \newciteBaldi:1989, extending the analysis by \newciteBourlard:1988. We revisit this analysis and theoretically prove that one must retain the largest principal components instead of removing them as proposed by \newcitemu2018allbutthetop in order to minimise the squared reconstruction loss.

Next, we experimentally show that by applying a non-linear autoencoder we can post-process a given set of pre-trained word embeddings and obtain more accurate word embeddings than those produced by the method proposed by \newcitemu2018allbutthetop. This is consistent with the results of \newciteraunak-etal-2020-dimensional, who argue that removing the top principal components does not necessarily lead to performance improvements. Although \newcitemu2018allbutthetop motivated the removal of the largest principal components as a method to improve the isotropy of the word embeddings, our empirical findings show that autoencoding automatically improves isotropy.

§ 2 Autoencoding as Centering and PCA Projection

Let us consider a set of $n$-dimensional pre-trained word embeddings, $\{\vec{x}_w\}_{w \in \mathcal{V}}$, for a vocabulary, $\mathcal{V}$, consisting of $|\mathcal{V}|$ words. We post-process these pre-trained word embeddings using an autoencoder consisting of a single $p$-dimensional hidden layer, an encoder (defined by $\mathbf{W}_E \in \mathbb{R}^{p \times n}$ and bias $\vec{b}_E \in \mathbb{R}^{p}$) and a decoder (defined by $\mathbf{W}_D \in \mathbb{R}^{n \times p}$ and bias $\vec{b}_D \in \mathbb{R}^{n}$). Let $\mathbf{X} \in \mathbb{R}^{n \times |\mathcal{V}|}$ be the embedding matrix. Using matrices $\mathbf{A}$, $\mathbf{H}$ and $\mathbf{Y}$, respectively denoting the activations, hidden states and reconstructed output embeddings, the autoencoder can be specified as follows.

$\mathbf{A} = \mathbf{W}_E \mathbf{X} + \vec{b}_E \vec{1}^\top$, \quad $\mathbf{H} = F(\mathbf{A})$, \quad $\mathbf{Y} = \mathbf{W}_D \mathbf{H} + \vec{b}_D \vec{1}^\top$

Here, $\vec{1} \in \mathbb{R}^{|\mathcal{V}|}$ is a vector consisting of ones and $F$ is an element-wise activation function. The squared reconstruction loss, $J$, for the autoencoder is given by (1).

$J(\mathbf{W}_E, \mathbf{W}_D, \vec{b}_E, \vec{b}_D) = \left\| \mathbf{Y} - \mathbf{X} \right\|^2_F$    (1)
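As a concrete illustration, the following is a minimal sketch of such an autoencoder in PyTorch. This is an illustrative re-implementation, not the authors' released code; the tanh activation and all class and function names are assumptions made for the example.

# Minimal sketch of the post-processing autoencoder described above.
# W_E/b_E correspond to the encoder, W_D/b_D to the decoder, F to the activation.
import torch
import torch.nn as nn

class PostProcessingAutoencoder(nn.Module):
    def __init__(self, n_dim: int, p_dim: int, nonlinear: bool = True):
        super().__init__()
        self.encoder = nn.Linear(n_dim, p_dim)   # W_E and b_E
        self.decoder = nn.Linear(p_dim, n_dim)   # W_D and b_D
        # tanh is an assumed, illustrative choice of element-wise activation F
        self.activation = torch.tanh if nonlinear else (lambda a: a)

    def forward(self, x):
        h = self.activation(self.encoder(x))     # H = F(A), with A = W_E X + b_E 1^T
        y = self.decoder(h)                      # Y = W_D H + b_D 1^T
        return h, y

def reconstruction_loss(x, y):
    # Squared reconstruction loss J = ||Y - X||_F^2 as in (1)
    return ((x - y) ** 2).sum()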

The reconstruction loss of the autoencoder is given by Lemma 1, which is proved in the supplementary.

Lemma 1.

Let $\bar{\mathbf{X}}$ and $\bar{\mathbf{H}}$ respectively denote the centred embedding and hidden state matrices. Then, (1) can be expressed using $\bar{\mathbf{X}}$ and $\bar{\mathbf{H}}$ as $J = \left\| \mathbf{W}_D \bar{\mathbf{H}} - \bar{\mathbf{X}} \right\|^2_F$, where the decoder's optimal bias vector is given by $\vec{b}_D = \frac{1}{|\mathcal{V}|}\left(\mathbf{X} - \mathbf{W}_D \mathbf{H}\right)\vec{1}$.

Lemma 1 holds even for non-linear autoencoders and shows that centering happens automatically during the minimisation of the reconstruction error. Following Lemma 1, we can assume the embedding matrix, $\mathbf{X}$, to be already centred and limit further discussion to this case. Moreover, after centering the input embeddings, the biases can be absorbed into the encoder/decoder matrices by appending an extra dimension that is always equal to 1 to the pre-trained word embeddings. This has the added benefit of simplifying the notation and proofs. Under these conditions, Theorem 1 shows an important connection between linear autoencoders and PCA.

Theorem 1.

Assume that the covariance matrix $\mathbf{\Sigma} = \bar{\mathbf{X}}\bar{\mathbf{X}}^\top$ is full-rank with $n$ distinct eigenvalues $\lambda_1 > \ldots > \lambda_n$. Let $\mathcal{I} = \{i_1, \ldots, i_p\}$ be any ordered $p$-index set, and let $\mathbf{U}_{\mathcal{I}} = [\vec{u}_{i_1}, \ldots, \vec{u}_{i_p}]$ denote the matrix formed by the orthogonal eigenvectors of $\mathbf{\Sigma}$ associated with the eigenvalues $\lambda_{i_1}, \ldots, \lambda_{i_p}$. Then, two full-rank matrices $\mathbf{W}_D$ and $\mathbf{W}_E$ define a critical point of (1) for a linear autoencoder if and only if there exists an ordered $p$-index set $\mathcal{I}$ and an invertible matrix $\mathbf{C} \in \mathbb{R}^{p \times p}$ such that

$\mathbf{W}_D = \mathbf{U}_{\mathcal{I}}\mathbf{C}$    (2)
$\mathbf{W}_E = \mathbf{C}^{-1}\mathbf{U}_{\mathcal{I}}^\top$    (3)

Moreover, the reconstruction error, $J$, can be expressed as

$J = \mathrm{tr}(\mathbf{\Sigma}) - \sum_{i \in \mathcal{I}} \lambda_i$    (4)

The proof of Theorem 1 and approximations for non-linear activations are given in the supplementary.

Because $\mathbf{\Sigma}$ is a covariance matrix, it is positive semi-definite. Strict positivity corresponds to it being full-rank, which is usually satisfied in practice for pre-trained word embeddings, which are dense and use a small number of independent dimensions for representing the semantics of words. Moreover, $\mathbf{W}_E$ and $\mathbf{W}_D$ are randomly initialised in practice, making them full-rank as assumed in Theorem 1.
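As a quick numerical sanity check of (4), the following sketch (on synthetic data, not pre-trained embeddings) verifies that projecting the centred embedding matrix onto the eigenvectors of the $p$ largest eigenvalues attains the reconstruction error in (4), and that keeping the $p$ smallest components instead gives a strictly larger error. The sizes used are arbitrary illustrative choices.

# Numerical check of (4) with synthetic data: the top-p principal directions
# attain J = tr(Sigma) - sum of the p largest eigenvalues, the minimum over all
# p-index sets I; the bottom-p directions give a larger reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
n, N, p = 50, 2000, 10                           # dimensionality, vocabulary size, hidden size
X = rng.standard_normal((n, N)) * rng.uniform(0.1, 2.0, size=(n, 1))

Xc = X - X.mean(axis=1, keepdims=True)           # centering (Lemma 1)
Sigma = Xc @ Xc.T                                # covariance matrix of the centred embeddings
eigvals, eigvecs = np.linalg.eigh(Sigma)         # eigenvalues in ascending order

U_top = eigvecs[:, -p:]                          # eigenvectors of the p largest eigenvalues
J_top = np.linalg.norm(Xc - U_top @ (U_top.T @ Xc)) ** 2
print(np.isclose(J_top, Sigma.trace() - eigvals[-p:].sum()))   # True: matches (4)

U_bottom = eigvecs[:, :p]                        # keeping the p smallest components instead
J_bottom = np.linalg.norm(Xc - U_bottom @ (U_bottom.T @ Xc)) ** 2
print(J_bottom > J_top)                          # True: strictly larger reconstruction error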

The connection between linear autoencoders and PCA was first proved by \newciteBaldi:1989, extending the analysis by \newciteBourlard:1988. Reconstructing the principal component vectors from an autoencoder has been discussed by \newcitePlaut:2018 without any formal proofs. However, to the best of our knowledge, a theoretical justification for post-processing pre-trained word embeddings by autoencoding has not been provided before.

According to Theorem 1, we can minimise (4) by selecting the $p$ largest eigenvalues as $\mathcal{I}$. This result contradicts the proposal by \newcitemu2018allbutthetop to project the word embeddings away from the largest principal component vectors, which is motivated as a method to improve isotropy in the word embedding space. They provided experimental evidence that the largest principal component vectors encode word frequency and that removing them is not detrimental to semantic tasks such as semantic similarity measurement and analogy detection. However, the frequency of a word is an important piece of information for tasks that require differentiating stop words from content words, such as information retrieval. Moreover, contextualised word embeddings such as BERT [BERT] and Elmo [Elmo] have been shown to be anisotropic despite their superior performance in a wide range of NLP tasks [ethayarajh-2019-contextual]. Therefore, it is not readily obvious whether removing the largest principal components to satisfy isotropy is a universally valid strategy. On the other hand, our experimental results show that by autoencoding we not only obtain better embeddings than \newcitemu2018allbutthetop, but also improve the isotropy of the pre-trained word embeddings.
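For reference, the ABTT operation discussed throughout amounts to centering followed by removing the projections onto the top principal directions. The sketch below is an illustrative re-implementation, not the authors' code; the number of removed components D is a hyperparameter of ABTT, and the default shown here is only an assumption for the example.

# Illustrative sketch of all-but-the-top (ABTT) post-processing.
import numpy as np

def abtt(X: np.ndarray, D: int = 3) -> np.ndarray:
    """Centre the embeddings and remove the projections onto the top-D
    principal directions. X has shape (n_dims, n_words)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    # Principal directions = eigenvectors of the covariance with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T)
    U_top = eigvecs[:, -D:]
    return Xc - U_top @ (U_top.T @ Xc)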

§ 3 Experiments

Parameter Value
Optimizer Adam
Learning rate 0.0002
Dropout rate 0.2
Batch size 256
Activation function
Table 1: Hyperparameter values of the autoencoder.

To evaluate the proposed post-processing method, we use the following pre-trained word embeddings: Word2Vec (300-dimensional embeddings for ca. 3M words learnt from the Google News corpus), GloVe (300-dimensional word embeddings for ca. 2.1M words learnt from the Common Crawl), and fastText (300-dimensional embeddings for ca. 2M words learnt from the Common Crawl).

We use the following benchmark datasets to evaluate word embeddings: for semantic similarity, WS-353 [Agirre:ACL:2009], SIMLEX-999 [SimLex], RG-65 [RG], MTurk-287 [Radinsky:WWW:2011], MTurk-771 [Halawi:KDD:2012] and MEN [MEN]; for word analogy, Google, MSR [Milkov:2013] and SemEval [SemEavl2012:Task2]; and for concept categorisation, BLESS [BLESS:2011] and ESSLI [ESSLLI].

Table 1 lists the hyperparameters and their values for the autoencoder-based post-processing method used in the experiments. We used the syntactic analogies in the MSR [Milkov:2013] dataset for setting the hyperparameters. We input each set of embeddings separately to an autoencoder with one hidden layer and minimise the squared reconstruction error using Adam as the optimiser. The pre-trained embeddings are then sent through the trained autoencoder and its hidden layer outputs are used as the post-processed word embeddings. We train an autoencoder (denoted as AE) with a nonlinear activation in its hidden layer. Moreover, to study the effect of nonlinearities, we train a linear autoencoder (LAE) with a hidden layer of the same size but without any nonlinear activation functions. Due to space limitations, we show results for autoencoders with different hidden layer sizes in the supplementary. We compare against the embeddings post-processed using ABTT (all-but-the-top) [mu2018allbutthetop], which removes the top principal components from the pre-trained embeddings.
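The training procedure can be sketched as follows, reusing the PostProcessingAutoencoder sketch above together with the Table 1 hyperparameters (Adam, learning rate 0.0002, dropout 0.2, batch size 256). The hidden layer size, the number of epochs and the placement of dropout on the inputs are illustrative assumptions, not values reported in the paper.

# Sketch of the post-processing procedure with the Table 1 hyperparameters.
import torch
from torch.utils.data import DataLoader, TensorDataset

def post_process(embeddings: torch.Tensor, hidden_dim: int = 300, epochs: int = 10):
    n_words, n_dim = embeddings.shape
    model = PostProcessingAutoencoder(n_dim, hidden_dim)        # sketch defined earlier
    dropout = torch.nn.Dropout(p=0.2)                           # dropout rate from Table 1
    optimiser = torch.optim.Adam(model.parameters(), lr=0.0002) # optimiser and learning rate from Table 1
    loader = DataLoader(TensorDataset(embeddings), batch_size=256, shuffle=True)

    for _ in range(epochs):
        for (x,) in loader:
            optimiser.zero_grad()
            _, y = model(dropout(x))
            loss = ((x - y) ** 2).sum()          # squared reconstruction error (1)
            loss.backward()
            optimiser.step()

    with torch.no_grad():                        # hidden layer outputs = post-processed embeddings
        h, _ = model(embeddings)
    return h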

Embedding               Word2Vec                        GloVe                           fastText
Dataset          Original  ABTT  LAE   AE       Original  ABTT  LAE   AE       Original  ABTT  LAE   AE
WS-353           62.4      61.2  61.8  61.8     60.6      61.5  64.0  65.8     65.9      67.7  69.0  69.0
SIMLEX-999       44.7      45.4  45.5  45.5     39.5      41.5  40.8  42.2     46.2      47.4  48.8  48.8
RG-65            75.4      76.0  76.2  76.3     68.1      68.0  71.4  72.3     78.4      81.4  80.4  80.5
MTurk-287        69.0      68.9  69.0  68.9     71.8      71.9  73.6  74.4     73.3      73.8  74.7  74.7
MTurk-771        63.1      63.7  63.8  63.9     62.7      63.7  66.2  67.7     69.6      71.8  72.3  72.4
MEN              68.1      68.3  69.2  69.3     67.7      69.5  73.0  74.8     71.1      75.7  75.9  76.0
MSR              73.6      73.2  73.5  73.4     73.8      73.2  74.3  74.4     87.1      88.0  87.3  87.3
Google           74.0      74.8  74.3  74.3     76.8      76.9  77.2  77.1     85.3      88.0  86.4  86.4
SemEval          20.0      19.9  20.4  20.3     15.4      17.2  17.2  17.6     21.0      23.2  23.2  23.3
BLESS            70.5      71.0  68.5  70.0     76.5      76.5  75.0  79.5     75.5      79.0  79.5  80.5
ESSLI            75.5      73.7  73.8  76.2     72.2      72.2  73.0  73.0     74.7      76.2  76.1  77.0

Table 2: Results are shown for the original embeddings and their post-processed versions by ABTT, linear autoencoder (LAE) and nonlinear autoencoder (AE) for pre-trained Word2Vec, GloVe and fastText embeddings.

Table 2 compares the performance of the original embeddings against the embeddings post-processed using ABTT, LAE and AE. For the semantic similarity task, a higher Spearman correlation between human similarity ratings and the cosine similarity scores computed using the word embeddings is better. From Table 2 we see that AE improves the word embeddings and outperforms ABTT on almost all semantic similarity datasets. For the word analogy task, we use the PairDiff method [Levy:CoNLL:2014] to predict the fourth word needed to complete a proportional analogy, and report the accuracy of the prediction. We see that AE reports the best performance for the GloVe embeddings, whereas ABTT performs better for fastText. Overall, the improvements due to post-processing are less prominent in the word analogy task. This behaviour was also observed by \newcitemu2018allbutthetop and is explained by the fact that analogy solving is done using vector differences, which are not influenced by centering.
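The two evaluation protocols used above can be sketched as follows, assuming SciPy for the Spearman correlation. The function names and the emb mapping are illustrative; PairDiff is taken here to score a candidate d by the cosine similarity between the difference vectors b − a and d − c, which is an interpretation of the cited method rather than the authors' exact implementation.

# Sketch of the semantic similarity and analogy evaluation measures.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_score(emb, pairs, human_ratings):
    # Spearman correlation between predicted cosine similarities and human ratings
    predicted = [cosine(emb[a], emb[b]) for a, b in pairs]
    return spearmanr(predicted, human_ratings).correlation

def pairdiff_predict(emb, a, b, c, candidates):
    # a is to b as c is to ?: choose d maximising cos(b - a, d - c)
    rel = emb[b] - emb[a]
    return max((w for w in candidates if w not in (a, b, c)),
               key=lambda w: cosine(emb[w] - emb[c], rel))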

In the concept categorisation task, we use the Euclidean distance between word embeddings as the distance measure and apply the $k$-means clustering algorithm to group words into clusters, separately in each benchmark dataset. Cluster purity [Manning:IR] is computed as the evaluation measure using the gold category labels provided in each benchmark dataset. High purity values indicate that the word embeddings capture information related to the semantic classes of words. From Table 2 we see that AE outperforms ABTT in all cases, except on BLESS with Word2Vec embeddings.
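The concept categorisation evaluation can be sketched as follows, assuming scikit-learn for $k$-means; the function name is illustrative and gold labels are assumed to be integer-encoded category ids.

# Sketch of clustering word vectors with k-means and scoring with cluster purity.
import numpy as np
from sklearn.cluster import KMeans

def cluster_purity(vectors: np.ndarray, gold_labels: list, n_clusters: int) -> float:
    """vectors: (n_words, n_dims); gold_labels: integer-encoded category ids."""
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    gold = np.asarray(gold_labels)
    correct = 0
    for k in range(n_clusters):
        members = gold[assignments == k]
        if len(members) > 0:
            # each cluster is credited with its most frequent gold category
            correct += np.max(np.bincount(members))
    return correct / len(gold)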

From Table 2 we see that for the pre-trained GloVe and fastText embeddings, both the linear (LAE) and non-linear (AE) autoencoders yield consistently better post-processed embeddings than the original embeddings. For the pre-trained Word2Vec embeddings, using LAE or AE produces better embeddings in seven out of the eleven benchmark datasets. However, according to the Fisher transformation test, the performance difference between LAE and AE is not statistically significant on most datasets. Considering the theoretical equivalence between PCA and linear autoencoders, this result shows that it is more important to perform centering and apply PCA than to use a non-linear activation in the hidden layer of the autoencoder.
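The significance test mentioned above can be sketched as a standard comparison of two correlation coefficients via the Fisher z-transformation. This is a generic sketch; the exact test variant used for the reported results is not specified here.

# Sketch of comparing two Spearman correlations via the Fisher z-transformation.
import numpy as np
from scipy.stats import norm

def fisher_z_test(r1: float, r2: float, n1: int, n2: int) -> float:
    """Two-sided p-value for the difference between correlations r1 and r2,
    measured on samples of sizes n1 and n2."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)          # Fisher transformation
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return 2 * (1 - norm.cdf(abs(z1 - z2) / se))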

Embedding   Original  ABTT   LAE    AE
Word2Vec    0.489     0.981  0.963  0.976
GloVe       0.018     0.943  0.782  0.884
fastText    0.773     0.995  0.992  0.990
Table 3: The isotropy measure of the original embeddings and of the embeddings post-processed using ABTT, LAE and AE.

Following the definition given by \newcitemu2018allbutthetop, we empirically estimate the isotropy of a set of embeddings as $I(\mathbf{X}) = \frac{\min_{\vec{c} \in \mathcal{C}} Z(\vec{c})}{\max_{\vec{c} \in \mathcal{C}} Z(\vec{c})}$, where $\mathcal{C}$ is the set of principal component vectors computed for the given set of pre-trained word embeddings and $Z(\vec{c}) = \sum_{w \in \mathcal{V}} \exp\left(\vec{c}^\top \vec{x}_w\right)$ is the normalisation coefficient in the partition function defined by \newciteArora:TACL:2016. $I(\mathbf{X})$ values close to one indicate a high level of isotropy in the embedding space. From Table 3 we see that, compared to the original embeddings, ABTT, LAE and AE all improve isotropy.
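This isotropy estimate can be sketched as follows. The function name is illustrative, and taking the principal directions as the right singular vectors of the (uncentred) embedding matrix is an assumption about the definition above.

# Sketch of the isotropy estimate I(X) = min_c Z(c) / max_c Z(c) used in Table 3.
import numpy as np

def isotropy(X: np.ndarray) -> float:
    """X: embedding matrix of shape (n_words, n_dims)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt: principal directions c
    Z = np.exp(X @ Vt.T).sum(axis=0)                   # Z(c) = sum_w exp(c . x_w)
    return float(Z.min() / Z.max())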

An alternative approach to verify isotropy is to check whether $Z(\vec{c})$ is a constant independent of $\vec{c}$, which is also known as the self-normalisation property [andreas-klein:2015:NAACL-HLT]. Figure 1 shows the histogram of $Z(\vec{c})$ for the original pre-trained embeddings and the embeddings post-processed using ABTT and AE, for pre-trained (a) Word2Vec, (b) GloVe and (c) fastText embeddings, computed over a set of 1,000 randomly chosen vectors $\vec{c}$ of unit norm. Horizontal axes are normalised by the mean of the $Z(\vec{c})$ values. From Figure 1, we see that the original Word2Vec, GloVe and fastText embeddings are all far from being isotropic. On the other hand, the AE word embeddings are isotropic, similar to the ABTT word embeddings, for all of Word2Vec, GloVe and fastText. This result shows that isotropy materialises automatically during autoencoding and does not require special processing such as removing the top principal components as done by ABTT.

In addition to the theoretical and empirical advantages of autoencoding as a post-processing method, it is also practically attractive. For example, unlike PCA, which must be computed using the embeddings for all the words in the vocabulary, autoencoders can be run in an online fashion using only a small mini-batch of words at a time. Moreover, non-linear transformations and regularisation (e.g. in the form of dropout) can be easily incorporated into autoencoders, which can also be stacked for further post-processing. Although online [NIPS2006_3144, NIPS2013_5135, Feng:2013] and non-linear [Scholz_2005] variants of PCA have been proposed, they have not been popular among practitioners due to their computational complexity, limited scalability and lack of availability in deep learning frameworks.

Figure 1: The histogram of $Z(\vec{c})$ for (a) Word2Vec, (b) GloVe and (c) fastText, computed over 1,000 random vectors of unit norm. The x-axis is normalised by the mean of the $Z(\vec{c})$ values.

§ 4 Conclusion

We showed that autoencoding improves pre-trained word embeddings and outperforms the prior proposal of removing the top principal components. Unlike PCA, which must be computed using the embeddings for all the words in the vocabulary, autoencoders can be run in an online fashion using only a small mini-batch of words at a time. Moreover, non-linear transformations and regularisation (e.g. in the form of dropout) can be easily incorporated into autoencoders, which can also be stacked for further post-processing. Although online [NIPS2006_3144, NIPS2013_5135, Feng:2013] and non-linear [Scholz_2005] variants of PCA have been proposed, they are less attractive due to their computational complexity, limited scalability and lack of availability in deep learning frameworks.

References
