Incremental Sense Weight Training for the Interpretation of Contextualized Word Embeddings
We present a novel online algorithm that learns the essence of each dimension in word embeddings by minimizing the within-group distance of contextualized embedding groups. Three state-of-the-art neural language models, Flair, ELMo, and BERT, are used to generate contextualized word embeddings, so that different embeddings are generated for the same word type; these embeddings are grouped by their senses, which are manually annotated in the SemCor dataset. We hypothesize that not all dimensions are equally important for downstream tasks, so our algorithm can detect unessential dimensions and discard them without hurting performance. To verify this hypothesis, we first mask the dimensions that our algorithm determines to be unessential, apply the masked word embeddings to a word sense disambiguation (WSD) task, and compare the performance against that achieved by the original embeddings. We experiment with several KNN approaches to establish strong baselines for WSD. Our results show that the masked word embeddings do not hurt performance and can improve it by up to 3%. Our work can be used to conduct future research on the interpretability of contextualized embeddings.
Contextualized word embeddings have played an essential role in many NLP tasks. One could expect considerable performance boosts by simply substituting distributional word embeddings with Flair Akbik et al. (2018), ELMo Peters et al. (2018), and BERT Devlin et al. (2019) embeddings. What sets contextualized word embeddings apart is that different representations are generated for the same word type when it appears with different topical senses. This work focuses on interpreting embedding representations for word senses. We propose an algorithm (Section 3) that learns the importance of each dimension in representing sense information and then masks, by setting to 0, the unessential dimensions that are deemed less meaningful in word sense representations. The effectiveness of our approach is validated on a word sense disambiguation (WSD) task, which aims to identify the correct senses of words in different contexts, as well as by two intrinsic evaluations of embedding groups on the masked embeddings.
In addition to the final outputs of the Flair, ELMo and BERT embeddings, hidden layer outputs from ELMo and BERT are also extracted and compared. Our results show that masking unessential dimensions of word embeddings does not impair performance on WSD; moreover, discarding those dimensions can improve performance by up to 3%, which suggests a new method of embedding distillation for more efficient neural network modeling.
2 Related Work
2.1 Word Embedding Interpretability
In earlier work, Murphy et al. (2012) suggest a variant of sparse matrix factorization, which generates highly interpretable word representations. Based on that work, Jang and Myaeng (2017) introduce a method for analyzing the dimensions that characterize categories by linking concepts with types and comparing dimension values within concept groups with the average of dimension values within category groups. Other works have investigated ways to enrich embedding interpretability by modifying the training process of word embedding models Luo et al. (2015); Koç et al. (2018). Others make use of pre-trained embeddings and apply post-processing techniques to obtain more interpretable embeddings. Past research uses matrix transformation methods on pre-trained embeddings Zobnin (2017); Park et al. (2017); Shin et al. (2018). Zobnin (2017) utilizes canonical orthogonal transformations to map current embeddings to a new vector space where the vectors are more interpretable.
Similarly, Park et al. (2017) propose an approach that rotates pre-trained embeddings by minimizing a complexity function, so that the dimensions after rotation become more interpretable. Another type of method applies sparse encoding techniques to word embeddings and maps them to sparse vectors Subramanian et al. (2018); Arora et al. (2018).
2.2 Contextualized Word Embedding Models
Three popular contextualized word embedding models, each available in several dimension sizes, are used in our experiments: ELMo, Flair, and BERT. ELMo is a deep word-level bidirectional LSTM language model with character-level convolutional networks and a final linear projection output layer Peters et al. (2018). Flair is a character-level bidirectional LSTM language model trained on sequences of characters Akbik et al. (2018). BERT has the architecture of a multi-layer bidirectional transformer encoder Devlin et al. (2019).
2.3 Word Sense Disambiguation (WSD)
This work evaluates the proposed algorithm on WSD, the task of determining which sense a target word belongs to in a sentence. We adopt a supervised approach that makes use of sense-annotated training data. The Most Frequent Sense (MFS) heuristic is the most common baseline; it selects the most frequent sense in the training data for the target word Raganato et al. (2017). The state of the art in WSD varies with the evaluation dataset. Raganato et al. (2017) utilize bi-LSTM networks with an attention mechanism and a softmax layer. Melamud et al. (2016) and Peters et al. (2018) also adopt bi-LSTM networks, combined with KNN classifiers. Later work incorporates word features such as gloss and POS information into memory networks Luo et al. (2018); Papandrea et al. (2017).
3 Sense Weight Training (SWT)
Given a large embedding dimension size, our hypothesis is that not every embedding dimension plays a role in representing a sense. Here we propose a new algorithm to determine the importance of dimensions. With word embeddings grouped by their senses as annotated in the SemCor dataset Miller et al. (1994), the objective of this algorithm is to maximize the average pair-wise cosine similarity within all sense groups. A weight matrix of the same size as the word embedding is initialized for each sense. Each entry of the weight matrix represents the importance of the corresponding embedding dimension to that sense.
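The objective described above can be illustrated with a minimal sketch. It assumes that each sense has one weight per embedding dimension and that the weights rescale each dimension before the similarity is computed; the function names `cosine` and `avg_pairwise_cosine` are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def avg_pairwise_cosine(group, weights=None):
    """Average cosine similarity over all unordered pairs in one sense group.

    `weights` (assumption: one value per dimension) rescales every
    embedding before the pairs are compared.
    """
    if weights is not None:
        group = [[w * x for w, x in zip(weights, vec)] for vec in group]
    pairs = list(combinations(group, 2))
    if not pairs:  # a group with fewer than two embeddings has no pairs
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy sense group: the second dimension disagrees between the two embeddings.
group = [[1.0, 1.0], [1.0, -1.0]]
unweighted = avg_pairwise_cosine(group)
down_weighted = avg_pairwise_cosine(group, weights=[1.0, 0.0])
```

In this toy group, setting the weight of the disagreeing dimension to zero raises the within-group similarity, which is the effect the objective rewards.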
During training, a mask matrix is generated and applied to the weight matrix. The gradient of the algorithm is defined as the difference between the current and the previous similarity scores, multiplied by the mask matrix minus one. The weight matrix is updated during training using the gradients and a learning rate.
The mask matrix has the same size as the weight matrix, with a subset of its entries set to zero and the rest set to one. The generation of the mask matrix involves two phases. In the first phase, SWT generates the positions of the zeros at random to ensure that enough dimensions are covered. After a certain number of epochs, training enters the second phase, where an exploration-exploitation policy is employed. Under this policy, with a fixed probability the positions are again generated at random. Otherwise, the generation of positions depends on the weight matrix: the higher the value of a dimension in the weight matrix, the lower the probability that it is selected. Furthermore, regularization is applied for feature selection, and AdaGrad Duchi et al. (2011) is used to encourage convergence. Pseudo-code for SWT is given in Algorithm 1, whose hyperparameters are the number of epochs for exploration, the regularization parameter, and a small constant that prevents zero denominators in AdaGrad. After the weights are learned, we set the embedding dimensions with low importance to zero and test whether the remaining dimensions are enough to represent the word sense group.
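The two-phase mask generation and the weight update described above might look roughly like the sketch below. It is not Algorithm 1. Several details are assumptions made for illustration: the number of zeros per mask (`n_zero`), the inverse-weight scoring used in the exploitation branch, the sign convention of the update (a masked dimension loses weight when masking it raised the similarity score), and all parameter names (`explore_epochs`, `epsilon`, `lr`, `delta`). Regularization is omitted.

```python
import math
import random


def sample_mask(weights, n_zero, epoch, explore_epochs, epsilon, rng):
    """Return a 0/1 mask with exactly n_zero zeros.

    Phase 1 (epoch < explore_epochs): zero positions are uniformly random.
    Phase 2: with probability epsilon they are still uniformly random;
    otherwise low-weight dimensions are more likely to be zeroed.
    """
    dim = len(weights)
    if epoch < explore_epochs or rng.random() < epsilon:
        zero_idx = rng.sample(range(dim), n_zero)
    else:
        # Weighted sampling without replacement: draw key = log(u) / score
        # per dimension and keep the n_zero largest keys. The score grows as
        # the weight shrinks (assumed form, not the paper's formula).
        top = max(weights)
        scores = [top - w + 1e-12 for w in weights]
        keys = [math.log(1.0 - rng.random()) / s for s in scores]
        zero_idx = sorted(range(dim), key=keys.__getitem__, reverse=True)[:n_zero]
    mask = [1.0] * dim
    for i in zero_idx:
        mask[i] = 0.0
    return mask


def swt_update(weights, grad_sq_sum, mask, score, prev_score, lr, delta):
    """One AdaGrad-scaled update, applied in place.

    Gradient per dimension: (score - prev_score) * (mask - 1), which is
    zero for every dimension that was not masked.
    """
    for i, m in enumerate(mask):
        g = (score - prev_score) * (m - 1.0)
        grad_sq_sum[i] += g * g
        weights[i] += lr * g / (math.sqrt(grad_sq_sum[i]) + delta)


rng = random.Random(0)
explore_mask = sample_mask([1.0] * 8, 3, epoch=0, explore_epochs=5, epsilon=0.1, rng=rng)
exploit_mask = sample_mask([1.0, 1.0, 0.0, 1.0], 1, epoch=10, explore_epochs=5, epsilon=0.0, rng=rng)

w, g2 = [1.0, 1.0], [0.0, 0.0]
swt_update(w, g2, mask=[0.0, 1.0], score=0.9, prev_score=0.5, lr=0.1, delta=1e-8)
```

In the last call, masking the first dimension coincided with a higher similarity score, so under the assumed sign convention its weight drops while the unmasked dimension is left unchanged.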
First, all the experiments are run using the original word embeddings. Then, using the trained weight matrices from Section 3, the same tests are run again on the masked embeddings for comparison.
4.1 Datasets and Word Embeddings
Our proposed baselines and algorithms are trained on SemCor Miller et al. (1994) and evaluated on Senseval-2 Edmonds and Cotton (2001), Senseval-3 Snyder and Palmer (2004), SemEval’07 Pradhan et al. (2007), SemEval’13 Navigli et al. (2013) and SemEval’15 Moro and Navigli (2015).
Only pre-trained contextualized word embeddings are used and compared. The pre-trained models tested are: three ELMo models with dimension sizes of 256, 512 and 1,024 (all with a 2-layer bi-LSTM); two BERT models, BERT-Base with a dimension size of 768 and 12 output layers and BERT-Large with a dimension size of 1,024 and 24 layers; and two Flair single-layer bi-LSTM models with dimension sizes of 2,048 and 4,096.
4.2 KNN Methods
Adapted from ELMo Peters et al. (2018), words that have the same sense are clustered together, and the average of each cluster is used as the sense vector; the sense vectors are then fitted with a 1-nearest-neighbor classifier. Unseen words from the test corpus fall back to the first sense from WordNet Fellbaum (1998).
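A minimal sketch of this sense-vector approach is shown below. It assumes cosine similarity as the nearest-neighbor metric, which the text does not specify, and it uses a plain dictionary `first_sense` as a hypothetical stand-in for a WordNet first-sense lookup. The sense labels in the toy data are made up.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_sense_vectors(train):
    """train: iterable of (lemma, sense, embedding) triples.

    Returns {lemma: {sense: centroid}}, where each centroid is the
    dimension-wise mean of the embeddings annotated with that sense.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for lemma, sense, emb in train:
        grouped[lemma][sense].append(emb)
    return {
        lemma: {
            sense: [sum(col) / len(vecs) for col in zip(*vecs)]
            for sense, vecs in senses.items()
        }
        for lemma, senses in grouped.items()
    }


def predict_sense(lemma, embedding, sense_vectors, first_sense):
    """Pick the sense whose centroid is nearest to the embedding.

    Lemmas unseen in training fall back to first_sense[lemma].
    """
    candidates = sense_vectors.get(lemma)
    if not candidates:
        return first_sense.get(lemma)
    return max(candidates, key=lambda s: cosine(embedding, candidates[s]))


train = [
    ("bank", "bank.finance", [1.0, 0.0]),
    ("bank", "bank.finance", [0.8, 0.2]),
    ("bank", "bank.river", [0.0, 1.0]),
]
sense_vectors = build_sense_vectors(train)
seen = predict_sense("bank", [0.9, 0.1], sense_vectors, {})
unseen = predict_sense("shore", [0.0, 1.0], sense_vectors, {"shore": "shore.first"})
```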
Following context2vec Melamud et al. (2016), a cluster is formed from the occurrences of each lemma in the training set. Each word has a distinct classifier, which assigns labels based on the k nearest neighbors of the target embedding. Unseen words from the test corpus fall back to the first sense from WordNet.
4.3 Masked Embeddings
Each sense has a trained weight matrix from Section 3. We process the weight matrix by experimenting with four percentages (5%, 10%, 15%, 20%) to find the best threshold for masking out dimensions: the embedding dimensions whose weights rank below the given percentage are set to 0. For evaluation, each target word tries the masks of all the senses observed for it and selects the mask that produces the smallest distance, defined as the sum of the distances from the masked word to its k nearest neighbors.
As shown in Table 1, BERT-Large and ELMo tend to achieve higher F1 with all four methods, and word-based KNN with fallback works better in general. Therefore, KNN-WF is used to conduct all subsequent tasks.
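The thresholding and mask selection described above can be sketched as follows. The sketch makes several assumptions: distances are Euclidean, the training neighbors are masked with the same candidate mask as the target, and the names `mask_from_weights`, `apply_mask`, `knn_distance_sum` and `select_mask` are hypothetical.

```python
import math


def mask_from_weights(weights, percent):
    """Zero the `percent` % of dimensions with the lowest weights."""
    n_zero = len(weights) * percent // 100
    lowest = sorted(range(len(weights)), key=lambda i: weights[i])[:n_zero]
    mask = [1.0] * len(weights)
    for i in lowest:
        mask[i] = 0.0
    return mask


def apply_mask(vec, mask):
    """Element-wise product of an embedding and a 0/1 mask."""
    return [x * m for x, m in zip(vec, mask)]


def knn_distance_sum(query, neighbors, k):
    """Sum of Euclidean distances from query to its k nearest neighbors."""
    return sum(sorted(math.dist(query, n) for n in neighbors)[:k])


def select_mask(target, masks_by_sense, train_vectors, k):
    """Try every candidate sense mask and keep the one with the smallest
    summed distance to the k nearest (identically masked) training vectors."""
    best_sense, best_dist = None, math.inf
    for sense, mask in masks_by_sense.items():
        query = apply_mask(target, mask)
        masked_train = [apply_mask(v, mask) for v in train_vectors]
        dist = knn_distance_sum(query, masked_train, k)
        if dist < best_dist:
            best_sense, best_dist = sense, dist
    return best_sense, best_dist


# 20 dimensions, one of them with a clearly low weight; mask 5% of them.
weights = [0.5] * 20
weights[7] = 0.01
mask = mask_from_weights(weights, 5)

masks_by_sense = {"s1": [1.0, 0.0], "s2": [1.0, 1.0]}
chosen, dist = select_mask([1.0, 5.0], masks_by_sense, [[1.0, 0.0], [1.1, 0.0]], k=1)
```

Passing the threshold as an integer percentage keeps the number of masked dimensions exact and avoids floating-point rounding when it is computed.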
The embeddings from all output layers of ELMo, BERT and Flair are evaluated. Table 2 shows that for ELMo and Flair-2048, masking does not hurt performance much, and for single layers it even yields improvements. Figure 1 shows the results when 5% of the embedding dimensions are masked. Half of the layer outputs improve when masked with the 5% threshold, especially the last 10 layer outputs. Surprisingly, the score of the last layer output is boosted by 3%.
In Figure 2, the first 3 layer outputs are improved by masking with the 5% threshold. Why the deeper layer outputs are not improved requires further research. In Figure 3, both 5% and 10% masking are reported because for layers 1-1 and 3-1 the 10% threshold works better than 5%. For the last layer of each model, masking with the 5% threshold surpasses the performance of the original embeddings.
ELMo performance varies more across output layers than BERT performance does. The BERT-Base output layers exhibit more stable performance than those of the BERT-Large model. Furthermore, an interesting pattern for ELMo is that masking out 5% of the dimensions causes a more considerable performance drop for layers with worse original scores. One possible explanation is that embeddings from output layers closer to the input layer contain fewer insignificant dimensions.
Experiments have also been run for the Flair models, which show similar results: performance remains stable after 5% of the embedding dimensions are masked to zero, as shown in Table 4.4. In summary, masking 5% of the dimensions does not hurt performance much, and for half of the embeddings, masking improves the score by at most 3 percent. The 10% threshold sometimes outperforms the 5% threshold in ELMo hidden layers.
Further analysis is conducted to investigate the number of negligible dimensions in word embeddings. Figure 3(a) shows a projected graph of selected sense groups, each with 100 embeddings from one ELMo model. Figure 3(b) shows the same word embeddings with dimensions masked to 0 if their corresponding weights are smaller than 0.5. The masked groups display a smaller within-group distance and a greater separation between sense groups.
Spearman’s rank-order correlation coefficient between the pair-wise cosine similarity of sense vectors (the average embedding of each embedding group classified by word sense) and the pair-wise path similarity scores between senses provided by WordNet Landes et al. (1998) is evaluated for the original word embeddings and the masked embeddings whose group sizes are larger than 100. The average pair-wise cosine similarity within sense groups is also calculated before and after masking. The table with all the test results is in the Appendix. Overall, the average cosine similarities within sense groups increase after dimensions are masked out for all models, which indicates that the dimension weights are learned according to our objective function. The correlation test shows no significant decrease (and in some cases an increase), which suggests that the masked dimensions do not contribute to the relations between sense groups.
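The correlation test can be sketched in plain Python as below. The helper names are hypothetical, ties are given average ranks, and the two input lists are assumed to be aligned, so that entry i of each list refers to the same pair of senses. The numbers in the example are made up for illustration and are not results from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only: cosine similarity of sense-vector pairs and the
# WordNet path similarity of the same sense pairs.
cosine_sims = [0.10, 0.40, 0.90]
path_sims = [0.05, 0.20, 0.50]
rho = spearman(cosine_sims, path_sims)
```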
Table 3 contains the correlation test results and the number of dimensions masked out for BERT (the second-to-last output layer), ELMo (the last output layer), and Flair. The number of dimensions masked is averaged over all sense groups. For ELMo and Flair, with the insignificant embedding dimensions masked out, the sense groups show a better correlation score. For the ELMo models, the number of dimensions that can be discarded increases with the distance of the output layer from the input layer. This result corresponds to ELMo’s claim that the embeddings from output layers closer to the input layer are semantically richer Peters et al. (2018).
Another pattern is that the verb sense groups tend to have fewer dimensions masked out, because verb sense groups have more possible token forms belonging to the same sense group. A table with relevant examples can be found in the Appendix. We also attempted to mask out the embedding dimensions with higher weights. In other words, we kept only the dimensions that were masked in the evaluations above, to examine what information the discarded dimensions contain. We ranked the cosine similarities between the resulting embedding pairs and picked out the 100 most similar ones, but this failed to reveal any patterns, which points to a direction for future research in this domain.
This paper demonstrates a novel approach to interpreting word embeddings. Focusing mainly on the ability of context-based word embeddings to distinguish and learn relationships between word senses, we propose an algorithm for learning the importance of dimension weights in sense groups. After the weights for the embedding dimensions are trained, the dimensions with less importance are masked out, and the masked embeddings are tested on a word sense disambiguation task and two other evaluations. The results lead to the conclusion that some dimensions do not contribute to the representation of sense groups and that our algorithm can distinguish their importance.
- Contextual String Embeddings for Sequence Labeling. In Proceedings of COLING, Santa Fe, New Mexico, USA, pp. 1638–1649.
- Linear Algebraic Structure of Word Senses, with Applications to Polysemy. TACL 6, pp. 483–495.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL: HLT, Minneapolis, Minnesota, pp. 4171–4186.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
- SENSEVAL-2: overview. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL ’01, Stroudsburg, PA, USA, pp. 1–5.
- WordNet: an electronic lexical database. Bradford Books.
- Elucidating conceptual properties from word embeddings. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pp. 91–95.
- Imparting interpretability to word embeddings. CoRR abs/1807.07279.
- Building Semantic Concordances. In WordNet: An Electronic Lexical Database, pp. 199–216.
- Incorporating glosses into neural word sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 2473–2482.
- Online learning of interpretable word embeddings. In EMNLP.
- Context2vec: learning generic context embedding with bidirectional lstm. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 51–61.
- Using a semantic concordance for sense identification. In Human Language Technology, Proceedings of a Workshop held at Plainsboro, New Jersey, USA, March 8-11, 1994.
- SemEval-2015 task 13: multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, pp. 288–297.
- Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pp. 1933–1950.
- SemEval-2013 task 12: multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 222–231.
- SupWSD: a flexible toolkit for supervised word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Copenhagen, Denmark, pp. 103–108.
- Rotated Word Vector Representations and their Interpretability. In Proceedings of EMNLP, Copenhagen, Denmark, pp. 401–411.
- Deep Contextualized Word Representations. In Proceedings of NAACL:HLT, New Orleans, Louisiana, pp. 2227–2237.
- SemEval-2007 task 17: english lexical sample, srl and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ’07, Stroudsburg, PA, USA, pp. 87–92.
- Word sense disambiguation: a unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 99–110.
- Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1156–1167.
- Interpreting word embeddings with eigenvector analysis. In NIPS 2018 Interpretability and Robustness in Audio, Speech, and Language (IRASL) Workshop: 32nd Annual Conference on Neural Information Processing Systems, Montreal, Canada.
- The english all-words task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, pp. 41–43.
- SPINE: SParse Interpretable Neural Embeddings. In Proceedings of AAAI, pp. 4921–4928.
- Rotations and Interpretability of Word Embeddings: the Case of the Russian Language. AIST 10716, pp. 116–128.