Incremental Sense Weight Training for the Interpretation of Contextualized Word Embeddings


We present a novel online algorithm that learns the essence of each dimension in word embeddings by minimizing the within-group distance of contextualized embedding groups. Three state-of-the-art neural language models, Flair, ELMo, and BERT, are used to generate contextualized word embeddings, so that different embeddings are generated for the same word type; these embeddings are grouped by their senses, which are manually annotated in the SemCor dataset. We hypothesize that not all dimensions are equally important for downstream tasks, so our algorithm can detect unessential dimensions and discard them without hurting performance. To verify this hypothesis, we first mask the dimensions that our algorithm determines to be unessential, apply the masked word embeddings to a word sense disambiguation (WSD) task, and compare the performance against that achieved by the original embeddings. We experiment with several KNN approaches to establish strong baselines for WSD. Our results show that the masked word embeddings do not hurt performance and can improve it by up to 3%. Our work can be used to conduct future research on the interpretability of contextualized embeddings.


1 Introduction

Contextualized word embeddings have played an essential role in many NLP tasks. One can expect considerable performance boosts by simply substituting distributional word embeddings with Flair Akbik et al. (2018), ELMo Peters et al. (2018), or BERT Devlin et al. (2019) embeddings. What distinguishes contextualized word embeddings is that different representations are generated for the same word type when it is used with different senses. This work focuses on interpreting embedding representations for word senses. We propose an algorithm (Section 3) that learns the importance of each dimension in representing sense information and then masks to 0 the unessential dimensions that are deemed less meaningful for word sense representations. The effectiveness of our approach is validated by a word sense disambiguation (WSD) task, which aims to identify the correct senses of words in different contexts, as well as by two intrinsic evaluations of embedding groups on the masked embeddings.

In addition to the final outputs of Flair, ELMo and BERT embeddings, hidden layer outputs from ELMo and BERT are also extracted and compared. Our results show that masking unessential dimensions of word embeddings does not impair the performance on WSD; moreover, discarding those dimensions can improve the performance by up to 3%, which suggests a new method of embedding distillation for more efficient neural network modeling.

2 Related Work

2.1 Word Embedding Interpretability

In earlier work, Murphy et al. (2012) suggest a variant of sparse matrix factorization that generates highly interpretable word representations. Building on that work, Jang and Myaeng (2017) introduce a method for analyzing the dimensions that characterize categories by linking concepts with types and comparing dimension values within concept groups with the average of dimension values within category groups. Other work has investigated ways to enrich embedding interpretability by modifying the training process of word embedding models Luo et al. (2015); Koç et al. (2018). Others make use of pre-trained embeddings and apply post-processing techniques to obtain more interpretable embeddings. Past research applies matrix transformation methods to pre-trained embeddings Zobnin (2017); Park et al. (2017); Shin et al. (2018). Zobnin (2017) utilizes canonical orthogonal transformations to map the embeddings to a new vector space in which the vectors are more interpretable.

Similarly, Park et al. (2017) propose an approach that rotates pre-trained embeddings by minimizing a complexity function, so that the dimensions after rotation become more interpretable. Another type of method applies sparse encoding techniques to word embeddings and maps them to sparse vectors Subramanian et al. (2018); Arora et al. (2018).

2.2 Contextualized Word Embedding Models

Three popular word embedding algorithms are used for our experiments with various dimensions: ELMo, Flair, and BERT. ELMo is a deep word-level bidirectional LSTM language model with character level convolution networks along with a final linear projection output layer Peters et al. (2018). Flair is a character-level bidirectional LSTM language model on sequences of characters Akbik et al. (2018). BERT has an architecture of a multi-layer bidirectional transformer encoder Devlin et al. (2019).

2.3 Word Sense Disambiguation (WSD)

This work uses WSD, the task of determining which sense a target word takes in a sentence, to evaluate the proposed algorithm. We adopt a supervised approach that makes use of sense-annotated training data. The Most Frequent Sense (MFS) heuristic is the most common baseline; it selects the most frequent sense of the target word in the training data Raganato et al. (2017). The state of the art in WSD varies with the evaluation dataset. Raganato et al. (2017) utilize bi-LSTM networks with an attention mechanism and a softmax layer. Melamud et al. (2016) and Peters et al. (2018) also adopt bi-LSTM networks with KNN classifiers. Later work incorporates word features such as gloss and POS information into memory networks Luo et al. (2018); Papandrea et al. (2017).
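As a small illustration of the MFS baseline mentioned above, the following sketch picks the most frequent training sense for each lemma. The function name `train_mfs` and the toy (lemma, sense) pairs are illustrative assumptions, not data taken from SemCor.

```python
from collections import Counter, defaultdict


def train_mfs(tagged_tokens):
    """Map each lemma to the sense it carries most often in the training data."""
    counts = defaultdict(Counter)
    for lemma, sense in tagged_tokens:
        counts[lemma][sense] += 1
    # most_common(1) returns a one-element list [(sense, count)].
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


# Hypothetical sense-annotated training tokens.
train = [
    ("bank", "bank.n.01"),
    ("bank", "bank.n.02"),
    ("bank", "bank.n.02"),
    ("ask", "ask.v.01"),
]
mfs = train_mfs(train)
```

At test time, a lemma is simply looked up in `mfs`; lemmas missing from the dictionary would need a separate fall-back.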

3 Sense Weight Training (SWT)

Given a large embedding dimension size, the hypothesis is that not every embedding dimension plays a role in representing a sense. Here we propose a new algorithm to determine the importance of dimensions. With word embedding groups classified by their senses annotated in the SemCor dataset Miller et al. (1994), the objective of this algorithm is to maximize the average pair-wise cosine similarity within all sense groups. A weight matrix with the same size as the word embedding is initialized for each sense. Each entry of the weight matrix represents the importance of a specific embedding dimension to that sense.
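The within-group score that the objective refers to can be written out as follows. This is one way to formalize the sentence above, not a formula from the paper, and the notation is assumed here for illustration: G_s is the set of embeddings e_1, ..., e_{n_s} annotated with sense s.

```latex
S(G_s) \;=\; \frac{2}{n_s (n_s - 1)} \sum_{1 \le i < j \le n_s}
\frac{e_i^{\top} e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}
```

Under this reading, training seeks, for each sense group, the dimension weights under which this score is as large as possible.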

  for each sense group do
     initialize the weights, the learning rate, and the AdaGrad weight matrix
     for each epoch do
        if the epoch is within the exploration phase then
           randomly generate N numbers
        else
           generate N numbers based on the policy
        end if
     end for
  end for
Algorithm 1: Incremental Sense Weight Training

During training, a mask matrix is generated and applied to the weight matrix. The gradient is defined as the difference between the current similarity score and the previous similarity score, multiplied by the mask matrix minus one. The weight matrix is then updated with these gradients and a learning rate.

The mask matrix has the same size as the weight matrix; N of its dimensions are set to zero and the rest to one. The generation of the mask matrix involves two phases. In the first phase, SWT randomly generates the positions of the zeros to ensure that enough dimensions are covered. After a certain number of epochs, the training enters the second phase, in which an exploration-exploitation policy is employed. Under this policy, with a fixed probability the N numbers are generated randomly. Otherwise, their generation depends on the weight matrix: the higher the value of a dimension in the weight matrix, the lower the probability that the dimension is selected. Furthermore, regularization is applied for feature selection, and AdaGrad Duchi et al. (2011) is used to encourage convergence. Pseudo-code for SWT is given in Algorithm 1; its hyperparameters are the number of epochs for exploration, the regularization parameter, and a small constant that prevents zero denominators in AdaGrad. After the weights are learned, we set the embedding dimensions with low importance to zero and test whether the remaining dimensions are enough to represent the word sense group.
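The training loop described above can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: the function and parameter names (`train_sense_weights`, `n_zero`, `explore_epochs`, `explore_prob`) are hypothetical, the exploitation step samples dimensions with probability proportional to (largest weight minus own weight), the update is assumed to add the AdaGrad-scaled gradient to the weights, and the regularization term is omitted for brevity.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_similarity(group, mask):
    """Average pair-wise cosine similarity of a sense group under a 0/1 mask."""
    masked = [[x * m for x, m in zip(emb, mask)] for emb in group]
    sims = [
        cosine(masked[i], masked[j])
        for i in range(len(masked))
        for j in range(i + 1, len(masked))
    ]
    return sum(sims) / len(sims)


def sample_zero_positions(weights, n_zero, explore, rng):
    """Pick n_zero distinct dimensions to set to zero in the mask."""
    dims = list(range(len(weights)))
    if explore:
        return rng.sample(dims, n_zero)
    # Exploitation (assumed form): higher weight -> lower selection probability.
    top = max(weights)
    scores = [top - w + 1e-6 for w in weights]
    chosen = []
    for _ in range(n_zero):
        pick = rng.choices(dims, weights=scores)[0]
        chosen.append(pick)
        scores[pick] = 0.0  # sample without replacement
    return chosen


def train_sense_weights(group, n_zero=1, epochs=100, explore_epochs=30,
                        explore_prob=0.1, lr=0.1, eps=1e-8, seed=0):
    rng = random.Random(seed)
    dim = len(group[0])
    weights = [1.0] * dim
    grad_sq_sum = [0.0] * dim  # AdaGrad accumulator of squared gradients
    prev_sim = group_similarity(group, [1.0] * dim)
    for epoch in range(epochs):
        # Phase 1: always explore. Phase 2: explore with probability explore_prob.
        explore = epoch < explore_epochs or rng.random() < explore_prob
        zeros = sample_zero_positions(weights, n_zero, explore, rng)
        mask = [0.0 if d in zeros else 1.0 for d in range(dim)]
        sim = group_similarity(group, mask)
        for d in zeros:
            # (current - previous similarity) * (mask - 1); mask - 1 is -1 on
            # masked dimensions and 0 elsewhere, so only masked dimensions move.
            grad = (sim - prev_sim) * (mask[d] - 1.0)
            grad_sq_sum[d] += grad * grad
            # Assumed update direction: add the AdaGrad-scaled gradient.
            weights[d] += lr * grad / (math.sqrt(grad_sq_sum[d]) + eps)
        prev_sim = sim
    return weights


# Toy sense group: dimensions 0 and 1 agree, dimension 2 is noise.
group = [[1.0, 1.0, 2.0], [1.0, 1.0, -2.0]]
weights = train_sense_weights(group)
```

In this toy group, zeroing dimension 2 raises the within-group similarity, so its weight can only shrink, while the weights of the two consistent dimensions can only grow. Ranking dimensions by the learned weights then singles out dimension 2 as the least important.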

4 Experiments

Figure 1: BERT-Large embeddings with 24 hidden layers. Certain layers such as the last 10 layers perform better if 5% of the dimensions are masked.

Firstly, all the experiments using the original word embeddings are run. Then, using the trained weight matrix from Section 3, the same tests are run on masked embeddings again for comparison.

4.1 Datasets and Word Embeddings

Our proposed baselines and algorithms are trained on SemCor Miller et al. (1994) and evaluated on Senseval-2 Edmonds and Cotton (2001), Senseval-3 Snyder and Palmer (2004), SemEval’07 Pradhan et al. (2007), SemEval’13 Navigli et al. (2013) and SemEval’15 Moro and Navigli (2015).

Only pre-trained contextualized word embeddings are used and compared. Pre-trained ELMo, BERT and Flair models are tested. The models include three ELMo models with dimension sizes of 256, 512 and 1,024 (all with a 2-layer bi-LSTM); two BERT models, BERT-Base with a dimension size of 768 and 12 output layers and BERT-Large with a dimension size of 1,024 and 24 layers; and Flair’s single-layer bi-LSTM models with dimension sizes of 2,048 and 4,096.

4.2 KNN Methods

The K-Nearest Neighbor (KNN) approach is adopted from both ELMo Peters et al. (2018) and context2vec Melamud et al. (2016) to establish strong baselines.

Sense-based KNN

Adapted from ELMo Peters et al. (2018), words that have the same sense are clustered together, and the average of each cluster is used as the sense vector, which is then fitted using a 1-nearest-neighbor classifier. Unseen words from the test corpus fall back to the first sense from WordNet Fellbaum (1998).
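The sense-based variant can be sketched as below. The sketch assumes cosine similarity as the nearness measure and restricts the search to candidate senses of the target word; both choices, the function names, and the toy two-dimensional embeddings are illustrative assumptions rather than details given in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def build_sense_vectors(train):
    """Average the embeddings of each sense; train is a list of (sense, embedding)."""
    sums, counts = {}, {}
    for sense, emb in train:
        if sense not in sums:
            sums[sense] = [0.0] * len(emb)
            counts[sense] = 0
        sums[sense] = [s + x for s, x in zip(sums[sense], emb)]
        counts[sense] += 1
    return {s: [x / counts[s] for x in total] for s, total in sums.items()}


def predict_sense(emb, candidate_senses, sense_vectors, first_sense):
    """Return the candidate sense with the nearest sense vector (1-NN)."""
    seen = [s for s in candidate_senses if s in sense_vectors]
    if not seen:
        return first_sense  # fall back when no candidate was seen in training
    return max(seen, key=lambda s: cosine(emb, sense_vectors[s]))


# Hypothetical training embeddings.
train = [
    ("bank.n.01", [1.0, 0.0]),
    ("bank.n.01", [0.8, 0.2]),
    ("bank.n.02", [0.0, 1.0]),
]
sense_vectors = build_sense_vectors(train)
seen_pred = predict_sense([0.7, 0.3], ["bank.n.01", "bank.n.02"],
                          sense_vectors, "bank.n.01")
unseen_pred = predict_sense([1.0, 0.0], ["ask.v.01"], sense_vectors, "ask.v.01")
```

Here `first_sense` stands in for the first WordNet sense of the lemma, which would be looked up separately.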

Word-based KNN

Following context2vec Melamud et al. (2016), the occurrences of each lemma in the training set form a cluster. Each word has a distinct classifier, which assigns labels based on the word's nearest neighbors. Unseen words from the test corpus fall back to the first sense from WordNet.
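A per-lemma classifier of this kind might look like the sketch below. The majority vote over the k most similar training occurrences, the use of cosine similarity, and all names and toy vectors are assumptions made for illustration; the neighborhood size is left as a free parameter `k`.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def knn_word_sense(emb, lemma, train_by_lemma, k, first_sense):
    """Vote among the k training occurrences of `lemma` closest to `emb`."""
    occurrences = train_by_lemma.get(lemma)
    if not occurrences:
        return first_sense  # unseen lemma: fall back to the first sense
    ranked = sorted(occurrences, key=lambda item: cosine(emb, item[1]),
                    reverse=True)
    votes = Counter(sense for sense, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training occurrences, grouped by lemma.
train_by_lemma = {
    "bank": [
        ("bank.n.01", [1.0, 0.0]),
        ("bank.n.01", [0.9, 0.1]),
        ("bank.n.02", [0.0, 1.0]),
    ]
}
nearest = knn_word_sense([0.2, 0.8], "bank", train_by_lemma, 1, "bank.n.01")
voted = knn_word_sense([0.2, 0.8], "bank", train_by_lemma, 3, "bank.n.01")
fallback = knn_word_sense([0.2, 0.8], "ask", train_by_lemma, 1, "ask.v.01")
```

The toy query shows how the choice of `k` matters: the single nearest occurrence carries one sense, while a vote over all three occurrences favors the other.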

4.3 Masked Embeddings

Each sense has a trained weight matrix from Section 3. We experiment with four percentages (5%, 10%, 15%, 20%) to find the best threshold for masking out dimensions: the embedding dimensions whose weights rank within the lowest such percentage are masked to 0. For evaluation, each target word tries the masks of all the senses it appears with and selects the mask that produces the smallest distance, defined as the sum of the distances from the masked word to its k nearest neighbors.
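The thresholding and mask selection steps can be sketched as follows. The sketch assumes cosine distance, applies each candidate mask to both the target word and its neighbors, and shares one neighbor pool across candidate senses; these choices and all names and toy values are illustrative assumptions.

```python
import math


def mask_from_weights(weights, fraction):
    """Return a 0/1 mask that zeroes the lowest-weighted fraction of dimensions."""
    n_zero = int(len(weights) * fraction)
    lowest = sorted(range(len(weights)), key=lambda d: weights[d])[:n_zero]
    zeros = set(lowest)
    return [0.0 if d in zeros else 1.0 for d in range(len(weights))]


def apply_mask(embedding, mask):
    return [x * m for x, m in zip(embedding, mask)]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0.0 else 1.0


def select_mask(embedding, masks, neighbors, k):
    """Pick the sense whose mask yields the smallest summed distance
    from the masked target to its k nearest masked neighbors."""
    def summed_distance(mask):
        target = apply_mask(embedding, mask)
        dists = sorted(cosine_distance(target, apply_mask(n, mask))
                       for n in neighbors)
        return sum(dists[:k])
    return min(masks, key=lambda sense: summed_distance(masks[sense]))


# Hypothetical learned weights for one sense; mask the lowest 20%.
weights = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.6, 0.4, 1.0, 0.2]
mask = mask_from_weights(weights, 0.2)

# Hypothetical candidate masks and neighbor embeddings for one target word.
masks = {"sense_a": [1.0, 1.0, 0.0], "sense_b": [1.0, 1.0, 1.0]}
neighbors = [[1.0, 0.0, -5.0], [0.9, 0.1, -4.0]]
best = select_mask([1.0, 0.0, 5.0], masks, neighbors, k=2)
```

In the toy example the third dimension disagrees between the target and its neighbors, so the mask that removes it gives the smaller summed distance.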

Model WF W SF S
ELMo* - - 69.0 -
context2vec* - 69.6 - -
Flair 63.7 61.4 60.0 55.1
ELMo 63.8 61.5 63.9 59.0
BERT 67.3 65.2 59.0 54.1
Table 1: Results using the 4 proposed KNN methods described in Section 4.2. *: published results. WF: Word-based KNN with fall back using WordNet. W: Word-based. SF: Sense-based with fall back. S: Sense-based. Flair: forward and backward. ELMo: both biLM layers. BERT: concatenation of last 4 layers.

4.4 Results

Model Original Masked
Flair-4096 63.7 62.1
Flair-2048 60.5 60.7
BERT 67.3 64.5
ELMo 63.8 63.0
ELMo-256 61.5 62.3
ELMo-512 62.7 63.0
ELMo-1024 62.5 63.4
Table 2: Results for the original embeddings and for the embeddings with 5% of the dimensions masked.


As shown in Table 1, BERT-Large and ELMo tend to achieve higher F1 with all four methods, and word-based KNN with fall back works better in general. Therefore, KNN-WF is used for all subsequent experiments.

Masked Results

The embeddings from all output layers of ELMo, BERT and Flair are evaluated. Table 2 shows that for ELMo and Flair-2048, masking does not hurt the performance much, and for single layers it even yields improvements. Figure 1 shows the results when 5% of the dimensions are masked. About half of the layer outputs improve when masked with the 5% threshold, especially the last 10 layer outputs. Surprisingly, the score of the last layer output is boosted by 3%.

Figure 2: BERT-Base embeddings with 12 layers. The blue columns are original and the green ones are results after 5% of the dimensions are masked.

In Figure 2, the first 3 layer outputs are improved by masking with the 5% threshold. Why the deeper layer outputs are not improved requires further research. In Figure 3, both 5% and 10% masking are reported because, for layers 1-1 and 3-1, the 10% threshold works better than the 5% threshold. For the last layer of each model, the 5% threshold surpasses the performance of the original embeddings.

ELMo performance varies more across output layers than BERT performance does. BERT-Base output layers exhibit more stable performance than those of the BERT-Large model. Furthermore, an interesting pattern for ELMo is that masking out 5% of the dimensions causes a more considerable performance drop for layers with worse original scores. One possible explanation is that embeddings from output layers closer to the input layer contain fewer insignificant dimensions.

Figure 3: ELMo embeddings with 3 models and 3 layers each. 1: 1024. 2: 512. 3: 256. Blue columns are original embeddings, green columns are when 5% of the dimensions are masked, and orange columns are when 10% of dimensions are masked.

Experiments have also been run for the Flair models, which show similar results: the performance remains stable after 5% of the embedding dimensions are masked to zero, as shown in Table 2. In summary, masking 5% of the dimensions does not hurt the performance much, and for about half of the embeddings, masking improves the score, by up to 3 percent. The 10% threshold sometimes outperforms the 5% threshold for ELMo hidden layers.

(a) original
(b) masked (threshold of 0.5)
Figure 4: Graphs of 20 selected sense groups with 100 embeddings each for ELMo with a dimension size of 512 (third output layer). The projection of dimensions from 512 to 2 is done by Linear Discriminant Analysis.

4.5 Analysis

Further analysis is made to investigate the number of negligible dimensions in word embeddings. Figure 4(a) shows a projected graph of selected sense groups, each with 100 embeddings from one ELMo model. Figure 4(b) shows the same word embeddings with the dimensions masked to 0 if their corresponding weights are smaller than 0.5. The masked groups display a smaller within-group distance and a greater separation between sense groups.

Spearman’s rank-order correlation coefficient between the pair-wise cosine similarity of sense vectors (the average embedding of each embedding group classified by word sense) and the pair-wise path similarity scores between senses provided by WordNet Landes et al. (1998) is evaluated for the original and the masked word embeddings, using sense groups with more than 100 embeddings. The average pair-wise cosine similarity within sense groups is also calculated before and after masking. The table with all the test results is in the Appendix (Table 5). Overall, the average cosine similarities within sense groups increase after dimensions are masked out for all models, which indicates that the dimension weights are learned in accordance with our objective function. The correlation test shows no significant decrease (some scores even increase), which suggests that the masked dimensions do not contribute to the relations between sense groups.
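The correlation test can be sketched with a small, dependency-free Spearman implementation (Pearson correlation of average ranks). The two input lists below are hypothetical stand-ins: one holds pair-wise cosine similarities between sense vectors, the other the path similarity scores for the same sense pairs; neither is taken from the experiments.

```python
import math


def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank-order correlation coefficient."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for three sense pairs.
cosine_sims = [0.9, 0.4, 0.1]   # cosine similarity between sense vectors
path_sims = [0.5, 0.25, 0.1]    # path similarity between the same senses
rho = spearman(cosine_sims, path_sims)
```

Running the same computation once with sense vectors built from the original embeddings and once with sense vectors built from the masked embeddings gives the two coefficients that are compared.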

Model Dim Dims-masked Corr-original Corr-masked
BERT 768 125 0.26814 0.26286
BERT 1024 146 0.27423 0.26575
ELMo 256 218 0.2852 0.3042
ELMo 512 281 0.29577 0.36943
ELMo 1024 608 0.28406 0.30675
Flair 2048 670 0.24891 0.28516
Table 3: Correlation coefficient test results for the original and masked word embeddings, together with the average number of dimensions masked out. The BERT embeddings are from the second-to-last output layer, and the ELMo and Flair embeddings are from the last output layer.

Table 3 contains the correlation test results and the number of dimensions masked out for BERT (the second-to-last output layer), ELMo (the last output layer), and Flair. The number of dimensions masked is averaged over all sense groups. For ELMo and Flair, with the insignificant embedding dimensions masked out, the sense groups show a better correlation score. For the ELMo models, the number of dimensions that can be discarded increases with the distance of the output layer from the input layer. This result corresponds to ELMo’s claim that the embeddings with output layers closer to the input layer are semantically richer Peters et al. (2018).

Another pattern is that verb sense groups tend to have fewer dimensions masked out, because verb sense groups have more possible token forms belonging to the same sense group. A table with relevant examples can be found in the Appendix (Table 4). We also attempted to mask out the embedding dimensions with higher weights; in other words, we kept only the dimensions that were masked in the evaluations above, to examine what information the discarded dimensions contain. We ranked the cosine similarities between the resulting embedding pairs and picked out the 100 most similar ones, but this did not reveal any patterns; we leave this direction to future research.

5 Conclusion

This paper demonstrates a novel approach to interpreting word embeddings. Focusing mainly on the ability of contextualized word embeddings to distinguish and learn relationships among word senses, we propose an algorithm for learning the importance of embedding dimensions in sense groups. After the weights are trained, the less important dimensions are masked out, and the masked embeddings are tested on a word sense disambiguation task and two other evaluations. The results lead to the conclusion that some dimensions do not contribute to the representation of sense groups and that our algorithm can identify them.

6 Appendix

Model Dim Out Sense Masked Sense Masked Sense Masked
BERT 768 -2 ask.v.01 75 three.s.01 208 man.n.01 58
BERT 1024 -2 ask.v.01 40 three.s.01 211 man.n.01 116
ELMo 256 0 ask.v.01 78 three.s.01 182 man.n.01 242
ELMo 256 1 ask.v.01 78 three.s.01 99 man.n.01 212
ELMo 256 2 ask.v.01 241 three.s.01 243 man.n.01 212
ELMo 512 0 ask.v.01 103 three.s.01 28 man.n.01 295
ELMo 512 1 ask.v.01 144 three.s.01 105 man.n.01 156
ELMo 512 2 ask.v.01 334 three.s.01 300 man.n.01 311
ELMo 1024 0 ask.v.01 174 three.s.01 44 man.n.01 331
ELMo 1024 1 ask.v.01 60 three.s.01 106 man.n.01 220
ELMo 1024 2 ask.v.01 568 three.s.01 827 man.n.01 708
Flair 2048 -1 ask.v.01 193 three.s.01 862 man.n.01 1883
Table 4: Embedding models with specific word sense groups (Sense) and the number of embedding dimensions masked out in each group (the number following each sense): “ask.v.01” is a verb sense meaning “inquire about”; “three.s.01” is an adjective sense meaning “being one more than two”; “man.n.01” is a noun sense meaning “an adult person who is male (as opposed to a woman)”.
Model Dim Out Dims-masked Corr-original Corr-masked Cos-original Cos-masked
BERT 768 -2 125 0.26814 0.26286 0.4971 0.5187
BERT 1024 -2 146 0.27423 0.26575 0.5811 0.5983
ELMo 256 0 95 0.12016 0.16017 0.5959 0.6134
ELMo 256 1 105 0.30903 0.37377 0.5119 0.5787
ELMo 256 2 218 0.2852 0.3042 0.4507 0.6914
ELMo 256 av 199 0.26553 0.36945 0.4932 0.6729
ELMo 512 0 136 0.17058 0.17336 0.5957 0.6051
ELMo 512 1 181 0.27967 0.25318 0.5414 0.5908
ELMo 512 2 281 0.29577 0.36943 0.4404 0.5346
ELMo 512 av 207 0.2949 0.30047 0.4930 0.5470
ELMo 1024 0 179 0.18504 0.17263 0.5930 0.5945
ELMo 1024 1 198 0.30897 0.30175 0.4783 0.4971
ELMo 1024 2 608 0.28406 0.30675 0.3927 0.4915
ELMo 1024 av 406 0.28331 0.27204 0.4542 0.5086
Flair 2048 -1 670 0.24891 0.28516 0.5560 0.6084
Table 5: Cosine similarity and correlation test results for unmasked and masked word embeddings. The columns are, in order: the embedding model (Model), the dimension size (Dim), the output layer (Out, where av denotes the average embedding of the three output layers), the average number of dimensions masked to zero in the embedding sense groups, the correlation coefficient between the original embedding sense group centers and the sense relations, the same correlation coefficient after dimensions are masked to 0, the average within-group cosine similarity for the original embeddings, and the average within-group cosine similarity after the dimensions are masked out. Only sense groups with more than 100 embeddings are considered.
Figure 5: Flair and Flair-Fast unmasked and masked result.


  1. Akbik et al. (2018). Contextual String Embeddings for Sequence Labeling. In Proceedings of COLING, Santa Fe, New Mexico, USA, pp. 1638–1649.
  2. Arora et al. (2018). Linear Algebraic Structure of Word Senses, with Applications to Polysemy. TACL 6, pp. 483–495. arXiv:1601.03764.
  3. Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL: HLT, Minneapolis, Minnesota, pp. 4171–4186.
  4. Duchi et al. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12, pp. 2121–2159. ISSN 1532-4435.
  5. Edmonds and Cotton (2001). SENSEVAL-2: Overview. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL ’01, Stroudsburg, PA, USA, pp. 1–5.
  6. Fellbaum (1998). WordNet: An Electronic Lexical Database. Bradford Books.
  7. Jang and Myaeng (2017). Elucidating Conceptual Properties from Word Embeddings. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pp. 91–95.
  8. Koç et al. (2018). Imparting Interpretability to Word Embeddings. CoRR abs/1807.07279.
  9. Landes et al. (1998). Building Semantic Concordances. In WordNet: An Electronic Lexical Database, pp. 199–216. ISBN 9780262272551.
  10. Luo et al. (2018). Incorporating Glosses into Neural Word Sense Disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 2473–2482.
  11. Luo et al. (2015). Online Learning of Interpretable Word Embeddings. In EMNLP.
  12. Melamud et al. (2016). Context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 51–61.
  13. Miller et al. (1994). Using a Semantic Concordance for Sense Identification. In Human Language Technology, Proceedings of a Workshop held at Plainsboro, New Jersey, USA, March 8-11, 1994.
  14. Moro and Navigli (2015). SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, pp. 288–297.
  15. Murphy et al. (2012). Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding. In Proceedings of COLING 2012, pp. 1933–1950.
  16. Navigli et al. (2013). SemEval-2013 Task 12: Multilingual Word Sense Disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 222–231.
  17. Papandrea et al. (2017). SupWSD: A Flexible Toolkit for Supervised Word Sense Disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Copenhagen, Denmark, pp. 103–108.
  18. Park et al. (2017). Rotated Word Vector Representations and their Interpretability. In Proceedings of EMNLP, Copenhagen, Denmark, pp. 401–411.
  19. Peters et al. (2018). Deep Contextualized Word Representations. In Proceedings of NAACL: HLT, New Orleans, Louisiana, pp. 2227–2237.
  20. Pradhan et al. (2007). SemEval-2007 Task 17: English Lexical Sample, SRL and All Words. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ’07, Stroudsburg, PA, USA, pp. 87–92.
  21. Raganato et al. (2017). Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 99–110.
  22. Raganato et al. (2017). Neural Sequence Learning Models for Word Sense Disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1156–1167.
  23. Shin et al. (2018). Interpreting Word Embeddings with Eigenvector Analysis. In NIPS 2018 Interpretability and Robustness in Audio, Speech, and Language (IRASL) Workshop: 32nd Annual Conference on Neural Information Processing Systems, Montreal, Canada.
  24. Snyder and Palmer (2004). The English All-Words Task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, pp. 41–43.
  25. Subramanian et al. (2018). SPINE: SParse Interpretable Neural Embeddings. In Proceedings of AAAI, pp. 4921–4928.
  26. Zobnin (2017). Rotations and Interpretability of Word Embeddings: the Case of the Russian Language. AIST 10716, pp. 116–128. arXiv:1707.04662.