Multiple Word Embeddings for Increased Diversity of Representation


Most state-of-the-art models in natural language processing (NLP) are neural models built on top of large, pre-trained, contextual language models that generate representations of words in context and are fine-tuned for the task at hand. The improvements afforded by these “contextual embeddings” come with a high computational cost. In this work, we explore a simple technique that substantially and consistently improves performance over a strong baseline with a negligible increase in run time. We concatenate multiple pre-trained embeddings to strengthen our representation of words. We show that this concatenation technique works across many tasks, datasets, and model types. We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that the representational diversity between different pre-trained embeddings is the main reason this technique works. We provide open source implementations of our models in both TensorFlow and PyTorch.


1 Introduction

Much of the recent work in NLP has focused on better feature representations via contextual word embeddings Peters et al. (2018, 2017); Radford et al. (2018); Akbik et al. (2018); Devlin et al. (2018). These models vary in architecture and pre-training objective but they all encode the input based on the surrounding context in some way. These papers normally compare to baselines like a bidirectional LSTM-CRF (biLSTM-CRF) where words are represented by a single pre-trained word embedding.

Peters et al. (2018, 2017) and Akbik et al. (2018) pre-train large language models based on LSTMs. Task-specific architectures are then built on top of these pre-trained models. Peters et al. (2018) introduce a technique for extracting word representations as a linear combination of layers in the pre-trained model. Gradient updates are only applied to this weighting factor, which simplifies the training to some extent, but forward propagation is still required for the full network which makes the model slow to train and evaluate.

Radford et al. (2018), followed by Devlin et al. (2018), pre-train deep transformers Vaswani et al. (2017) on massive corpora. They both use a simple output layer on top of the pre-trained model and tune the parameters of the whole model. In this case, training requires the forward and backward pass of the entire pre-trained model, which has a significant impact on size and speed. Devlin et al. (2018) used specialized hardware which may be unrealistic for many inference scenarios.

The prevailing wisdom is that, because these pre-trained models are contextual, they can create representations of a word that differ across contexts. For example, a polysemous word can be represented by different vectors when its context suggests a different sense of the word, while context-independent word vectors need to represent a mix of all the senses of a word. The majority of NLP models have a similar “contextualization” step, typically done via a biLSTM, convolutional layers, or self-attention, but it is learned only from a smaller, task-specific corpus, in contrast to the massive corpora used by contextual embeddings.

Contextual embeddings and transfer learning architectures are slow to train and evaluate, which may make them infeasible for many types of deployments. Using multiple pre-trained embeddings trained on different datasets, we can exploit the bias in different datasets that results in different representations of the same word. By combining these embeddings, we can create richer representations of the word without the high computational overhead required by contextual alternatives. We find that the concatenation of multiple pre-trained word embeddings shows consistent improvements over single embeddings, yielding results much closer to contextual alternatives.

2 Experiments & Results

Task            Dataset    Model       Embeddings          mean   std   min    max
NER             CoNLL      biLSTM-CRF  6B                  91.12  0.21  90.62  91.37
                                       Senna               90.48  0.27  90.02  90.81
                                       6B, Senna           91.61  0.25  91.15  92.00
                WNUT-17    biLSTM-CRF  27B                 39.20  0.71  37.98  40.33
                                       27B, w2v-30M        39.52  0.83  38.09  40.39
                                       27B, w2v-30M, 840B  40.33  1.13  38.38  41.99
                OntoNotes  biLSTM-CRF  6B                  87.02  9.15  86.75  87.24
                                       6B, Senna           87.41  0.16  87.14  87.74
Slot Filling    Snips      biLSTM-CRF  6B                  95.84  0.29  95.39  96.21
                                       GN                  95.28  0.41  94.51  95.81
                                       6B, GN              96.04  0.28  95.39  96.35
POS             TW-POS     biLSTM-CRF  w2v-30M             89.21  0.28  88.72  89.74
                                       27B                 89.63  0.19  89.35  89.92
                                       27B, w2v-30M        90.35  0.20  89.99  90.60
                                       27B, w2v-30M, 840B  90.75  0.14  90.53  91.02
Classification  SST2       LSTM        840B                88.39  0.45  87.42  89.07
                                       GN                  87.58  0.54  86.16  88.19
                                       840B, GN            88.57  0.44  87.59  89.24
                AG-NEWS    LSTM        840B                92.53  0.45  87.42  89.07
                                       GN                  92.20  0.18  91.80  92.40
                                       840B, GN            92.60  0.20  92.30  92.86
                Snips      Conv        840B                97.47  0.33  97.01  97.86
                                       GN                  97.40  0.27  97.00  97.86
                                       840B, GN            97.63  0.52  97.00  98.29
Table 1: Results using multiple embeddings applied to several tasks and datasets. NER and Slot Filling tasks report entity-level F1. POS tagging and Classification report token-level and example-level accuracy respectively. Using multiple pre-trained embeddings helps across a wide range of tasks and datasets as well as across different model architectures within a given task. All results are reported across 10 runs.

We use three sequential prediction tasks to test the performance of our concatenated embeddings: NER (CoNLL 2003 Tjong Kim Sang and De Meulder (2003), WNUT-17 Derczynski et al. (2017), and OntoNotes Hovy et al. (2006)), slot filling (Snips Coucke et al. (2018)), and POS tagging (TW-POS). We also show results on three classification datasets: SST2 Socher et al. (2013), Snips intent classification Coucke et al. (2018), and AG-NEWS. For each (task, dataset) pair we use the embedding most commonly used in the literature; for example, GloVe embeddings were used for CoNLL 2003 in Ma and Hovy (2016) and Senna embeddings in Chiu and Nichols (2016); Peters et al. (2018). Embeddings were also chosen based on how well the embedding training data fit the task; for example, we used GloVe vectors trained on Twitter for the Twitter part-of-speech tagging task. After developing the tests in Section 3 for which embeddings work well together, we checked whether there were further embedding combinations worth trying but did not find any. For all tagging tasks, a biLSTM-CRF model with convolutional character compositional inputs, following Ma and Hovy (2016), is used. For all classification tasks, a single-layer LSTM model is used, except for the Snips classification dataset, where a convolutional word-based model Kim (2014) is used. The hyperparameters are omitted here for brevity but can be found in our implementation.

The results are presented in Table 1. 6B, 27B, and 840B are well-known pre-trained GloVe embeddings Pennington et al. (2014) distributed via the authors' site. w2v-30M Pressel et al. (2018) and GN Mikolov et al. (2013) are Word2Vec embeddings trained on a corpus of 30 million tweets and on Google News, respectively. The Senna embeddings were trained by Collobert et al. (2011).

We leverage multiple pre-trained embeddings in a model by creating one embedding table per pre-trained embedding. Each input token is embedded into each vector space and the resulting vectors are concatenated into a single vector. This means that a type may be unattested in one pre-trained embedding vocabulary but present in the other, in which case a pre-trained vector from one embedding is concatenated with a randomly initialized vector from the other embedding space.
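
The lookup-and-concatenate step can be illustrated with a minimal sketch in plain Python. This is not the released TensorFlow or PyTorch implementation; the helper names `make_table` and `embed`, the toy vectors, and the uniform initialization range are all assumptions made for illustration.

```python
import random


def make_table(pretrained, vocab, dim, rng):
    """Build one embedding table over the task vocabulary.

    Types attested in `pretrained` keep their pre-trained vector; unattested
    types get a small random vector (the range is an arbitrary assumption).
    """
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table


def embed(token, tables):
    """Look the token up in every table and concatenate the vectors."""
    vector = []
    for table in tables:
        vector.extend(table[token])
    return vector


# Toy, hypothetical pre-trained embeddings (2-d and 3-d).
emb_a = {"bank": [0.1, 0.2], "river": [0.3, 0.4]}
emb_b = {"bank": [0.5, 0.6, 0.7], "loan": [0.8, 0.9, 1.0]}
vocab = ["bank", "river", "loan"]

rng = random.Random(0)
tables = [make_table(emb_a, vocab, 2, rng), make_table(emb_b, vocab, 3, rng)]

bank_vec = embed("bank", tables)    # pre-trained in both spaces
river_vec = embed("river", tables)  # pre-trained in emb_a, random in emb_b
```

Here "river" is unattested in the second toy vocabulary, so its 5-dimensional vector is a pre-trained 2-d part followed by a randomly initialized 3-d part.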

As hypothesized, we see improvements across tasks, datasets, and model architectures when using multiple embeddings.

Models using the concatenation of pre-trained and randomly initialized embeddings do % worse on average compared to models that only use a single pre-trained embedding. This demonstrates that the performance gains are from the combination of different pre-trained embeddings rather than the increase in the number of parameters in the model. In some cases we were able to improve results further by adding several sets of additional embeddings.

Task          Domain            Relative change
NER           General NER       0.51
Slot Filling  Automotive        0.14
              Cyber Security    0.06
              Customer Service  0.34
Intent        Automotive        0.52
              Cyber Security    0.03
              Customer Service  0.16
Table 2: Performance using multiple embeddings on internal datasets. Although smaller than well-known datasets, we see consistent improvements across internal tasks and domains.

Task            Dataset    Relative difference (%)
NER             CoNLL      0.54
                WNUT-17    2.88
                OntoNotes  0.45
Slot Filling    Snips      0.21
Classification  SST2       0.20
                AG-NEWS    0.08
                Snips      0.16
Table 3: Relative difference for well-known datasets to help frame the results in Table 2

Table 2 summarizes the results of using the multiple embedding approach on internal datasets. These datasets are drawn from the tasks defined earlier and span a variety of specialized domains. Due to the nature of the datasets the results are presented as the relative change in performance. Table 3 is provided to help frame the relative performance numbers from the internal datasets.
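
As a worked check of what "relative change" means here, the sketch below recomputes the Table 3 numbers from the mean scores in Table 1. It assumes, as an inference from the numbers rather than something the text states, that each value is the percentage change from the best single-embedding mean to the best multi-embedding mean; the function name `relative_change` is illustrative.

```python
def relative_change(baseline, combined):
    """Relative change in percent from the baseline score to the combined score."""
    return 100.0 * (combined - baseline) / baseline


# (best single-embedding mean, best multi-embedding mean) taken from Table 1.
table1 = {
    "CoNLL": (91.12, 91.61),
    "WNUT-17": (39.20, 40.33),
    "OntoNotes": (87.02, 87.41),
    "Snips (slot filling)": (95.84, 96.04),
    "SST2": (88.39, 88.57),
    "AG-NEWS": (92.53, 92.60),
    "Snips (classification)": (97.47, 97.63),
}

relative = {
    name: round(relative_change(base, comb), 2)
    for name, (base, comb) in table1.items()
}
```

Under that assumption the rounded values agree with Table 3, for example 0.54 for CoNLL and 2.88 for WNUT-17.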

The models were trained with MEAD/Baseline Pressel et al. (2018), an open-source framework for developing, training, and deploying NLP models.

3 Analysis

Dataset  Embeddings                   mean   std   max
SST2     GN                           87.58  0.54  88.19
         GN + Random init             87.62  0.22  88.64
         GN + 840B complement to GN   87.72  0.23  88.02
         GN + 840B matched to GN      88.53  0.55  89.45
         GN + 840B                    88.57  0.44  89.24
CoNLL    6B                           91.12  0.21  91.37
         6B + Random init             90.77  0.17  91.11
         6B + Senna complement to 6B  90.73  0.29  91.19
         6B + Senna matched to 6B     91.47  0.18  91.78
         6B + Senna                   91.61  0.25  92.00
Table 4: An ablation to explain why multiple embeddings work. The majority of the improvement comes from the case where we take only the words from the second pre-trained embedding that appear in the first vocabulary (the matched row). This suggests that having different representations for a word is much more important than increased model capacity (tested in the Random init row) or increased coverage in the pre-trained vocabulary (represented by the complement row).

                   Overlap       Attested      Performance
Embeddings         train  dev    train  dev    mean    std
Senna              18.9   20.8   74.3   80.3   91.610  0.247
GloVe twitter 27B  24.9   27.2   68.1   76.1   91.098  0.135
GloVe 840B         41.7   40.6   83.2   88.5   91.011  0.228
GloVe 42B          45.5   45.3   90.4   93.8   91.163  0.146
GoogleNews         25.2   26.8   55.9   65.1   90.948  0.180
Table 5: Embedding similarity as defined by average Jaccard similarity of the 10 nearest neighbors on the top 200 words in CoNLL 2003. Performance is the entity-level F1 score of each embedding when paired with GloVe 6B 100-dimension embeddings. Using pairs of dissimilar embeddings correlates with better performance, as long as the embeddings have enough coverage to be effectively leveraged.

There are three logical places where the observed improvements could come from. 1) The use of multiple pre-trained embeddings creates a slightly larger model, increasing the network capacity—the embeddings are larger and therefore the projection from the embeddings to the first layer of the model will also be slightly bigger. 2) The use of a second pre-trained embedding increases the vocabulary size and more words are attested. A word that has a pre-trained representation will start the model in a better spot than a randomly initialized representation. 3) The second set of pre-trained embeddings gives a different perspective of the words. Most pre-trained embeddings are trained on different data and encode different biases and senses into the embedding that reflect the quirks and unique contexts found in the pre-training data. This representational diversity will allow a model to capitalize on different senses, or the combination of senses, that would not be present when using a single embedding.

In order to tease apart which of these factors are at play we designed a series of models that aim to isolate each effect and report results in Table 4. First, we train a model that uses a single pre-trained embedding and a second set of vectors that are initialized randomly. If the main improvement is due to increased model capacity this configuration should perform well. The second model uses a special version of the second pre-trained embedding where we remove all the words that already appear in the original pre-trained vocabulary. In this second set of embeddings, randomly initialized vectors are used for the words that are covered in the original vocabulary in order to keep the embeddings size consistent with the previous model. If the main reason for improvement is the increased vocabulary coverage, this model should perform well. The final version of the model also uses a customized version of the second pre-trained embedding. This time we only keep embeddings that are already represented in the original vocabulary. This is designed to test if the main source of improvement is the difference in the representations each pre-trained embedding brings to the table.
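
The three ablation variants of the second embedding described above can be sketched as follows. This is an illustrative reconstruction in plain Python, not the code used for the experiments; `build_second_table`, `ablation_tables`, the toy vectors, and the random initialization range are assumptions.

```python
import random


def random_vector(dim, rng):
    """Small random vector standing in for a randomly initialized embedding."""
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def build_second_table(second, vocab, dim, keep, rng):
    """Build the second embedding table over the task vocabulary.

    `keep(word)` decides whether a word attested in `second` keeps its
    pre-trained vector; every other word is randomly initialized, so all
    variants end up the same size.
    """
    table = {}
    for word in vocab:
        if word in second and keep(word):
            table[word] = list(second[word])
        else:
            table[word] = random_vector(dim, rng)
    return table


def ablation_tables(first, second, vocab, dim, seed=0):
    """Return the second-embedding variants used in the ablation."""
    rng = random.Random(seed)
    return {
        # Capacity only: nothing pre-trained in the second table.
        "random": build_second_table(second, vocab, dim, lambda w: False, rng),
        # Coverage only: keep words the first embedding does NOT attest.
        "complement": build_second_table(second, vocab, dim, lambda w: w not in first, rng),
        # Diversity only: keep words the first embedding already attests.
        "matched": build_second_table(second, vocab, dim, lambda w: w in first, rng),
        # Unmodified second embedding.
        "full": build_second_table(second, vocab, dim, lambda w: True, rng),
    }


# Toy, hypothetical embeddings.
first = {"bank": [0.1, 0.2], "river": [0.3, 0.4]}
second = {"bank": [0.5, 0.6, 0.7], "loan": [0.8, 0.9, 1.0]}
vocab = ["bank", "river", "loan", "tree"]

variants = ablation_tables(first, second, vocab, dim=3)
```

In the toy data, "loan" keeps its pre-trained vector only in the complement and full variants, while "bank" keeps it only in the matched and full variants.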

Our ablation studies, using the above variations on both the SST2 and CoNLL datasets, show that the most important factor is the representational diversity in the pre-trained embeddings. This dovetails nicely with our observation that embeddings trained on distinct datasets tend to perform well together. To further test this hypothesis, we look at the “similarity” of various pre-trained embeddings. We define “similarity” using the overlap of nearest neighbors in the embedding space as in Wendlandt et al. (2018). Specifically, we use the average Jaccard overlap percentage between the 10 nearest neighbors for each of the top 200 words in the dataset by frequency. Table 5 shows the overlap of different embeddings with the GloVe 6B 100-dimension embedding and how their combination affects performance. As can be seen, Senna has the lowest overlap and yields the biggest performance gain.
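
The similarity measure can be sketched in plain Python as below. Several details are assumptions not specified in the text: cosine similarity is used to rank neighbors, neighbors are searched over each embedding's full vocabulary, and query words missing from either embedding are skipped. The function names and toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, emb, k):
    """Set of the k words closest to `word` in `emb` (excluding itself)."""
    scored = [(cosine(emb[word], vec), other) for other, vec in emb.items() if other != word]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}


def average_jaccard(emb_a, emb_b, words, k=10):
    """Average Jaccard overlap (in percent) of k-nearest-neighbor sets."""
    scores = []
    for word in words:
        if word not in emb_a or word not in emb_b:
            continue  # assumption: skip words not attested in both embeddings
        near_a = nearest_neighbors(word, emb_a, k)
        near_b = nearest_neighbors(word, emb_b, k)
        union = near_a | near_b
        scores.append(len(near_a & near_b) / len(union) if union else 0.0)
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


# Toy, hypothetical 2-d embeddings that disagree about which words are close.
emb_a = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.1, 0.9]}
emb_b = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [0.9, 0.1], "bus": [0.1, 0.9]}

same = average_jaccard(emb_a, emb_a, ["cat", "car"], k=2)
different = average_jaccard(emb_a, emb_b, ["cat", "car"], k=2)
```

In the paper's setting, `words` would be the 200 most frequent words of the dataset and `k` would be 10; a lower score indicates more dissimilar embeddings.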

However, this does not hold for the GoogleNews embedding which also has a low overlap yet the combination actually causes a drop in performance. This can be explained by coverage—the percentage of unique types in the data that are attested in the pre-trained vocabulary. That number is surprisingly low for GoogleNews and causes the GoogleNews representations to be used so rarely they actually cause a drop in performance.
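
Coverage as defined above (the percentage of unique types in the data that are attested in the pre-trained vocabulary) is simple to compute. The sketch below assumes exact, case-sensitive string matching between dataset types and the pre-trained vocabulary, which the text does not specify; the function name and toy data are illustrative.

```python
def coverage(dataset_tokens, pretrained_vocab):
    """Percentage of unique types in the data that have a pre-trained vector."""
    types = set(dataset_tokens)
    if not types:
        return 0.0
    attested = sum(1 for t in types if t in pretrained_vocab)
    return 100.0 * attested / len(types)


# Toy, hypothetical data: 5 unique types, 4 of them attested.
tokens = ["the", "bank", "of", "the", "river", "Thames"]
vocab = {"the", "bank", "of", "river"}

attested_pct = coverage(tokens, vocab)
```

Note that the measure counts types rather than tokens, so a frequent word such as "the" contributes only once.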

In summary, one should look for two characteristics when combining embeddings: the word representations should have low “similarity” and the unique types in the dataset should be highly attested in both pre-trained vocabularies.

4 Conclusion

Recent large-scale, contextual, pre-trained models are exciting but produce relatively slow models. We propose a simple, lightweight technique: concatenation of pre-trained embeddings. We show that this technique has a significant impact on error reduction and a negligible effect on speed.

However, the concatenation of any two arbitrary pre-trained embeddings is not guaranteed to work well. From our analysis, we are able to suggest a recipe for finding an effective combination: there should be a high degree of coverage of the unique types in each of the pre-trained embedding vocabularies and the word vectors should exhibit representational diversity. In future work, we intend to try other methods of combining embeddings that remain computationally cheap. We also plan to find more principled ways to quantify the diversity in pre-trained embeddings, which can suggest ways to induce representational diversity into the embedding pre-training procedure itself.

Appendix A Reproducibility

A.1 Hyperparameters

MEAD/Baseline is a configuration-file-driven model training framework. All hyperparameters are fully specified in the configuration files included with the source code for our experiments.

A.2 Computational Resources

Each model was trained on a single NVIDIA 1080Ti. Multiple GPUs were used only to train many models in parallel, which made it practical to test many datasets and to estimate the variability of the method; any individual model can easily be trained on a single GPU.

A.3 Evaluation

Entity-level F1 is used for NER and slot filling. To compute it, entities are first created from the token-level labels and compared to the gold entities. Entities that match on both type and boundaries are considered correct, while a mismatch in either counts as an error. The F1 score is then calculated from these entities. Accuracy is used for classification and part-of-speech tagging, and is defined as the proportion of correct elements to all elements. In classification a single example is an element. In part-of-speech tagging each token is an element, so accuracy is the number of correct tokens divided by the number of tokens in the dataset. We use the evaluation code that ships with MEAD/Baseline, the framework we use, which we have bundled with the source code of our experiments.
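
The entity-level F1 computation described above can be sketched as follows for IOBES-tagged sequences. This is not the MEAD/Baseline evaluation code; the function names are illustrative, and the handling of ill-formed tag sequences (any open entity is simply dropped) is a simplifying assumption.

```python
def iobes_to_spans(tags):
    """Convert IOBES tags to a set of (type, start, end) spans, end inclusive."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.add((label, i, i))
            start, current = None, None
        elif prefix == "B":
            start, current = i, label
        elif prefix == "I" and current == label:
            continue  # still inside the open entity
        elif prefix == "E" and current == label:
            spans.add((label, start, i))
            start, current = None, None
        else:
            # "O" or an ill-formed sequence: drop any open entity (assumption).
            start, current = None, None
    return spans


def entity_f1(gold_sequences, pred_sequences):
    """F1 over entities that match the gold data on both type and boundaries."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans = iobes_to_spans(gold)
        pred_spans = iobes_to_spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One sentence: the PER entity matches, the second entity has the wrong type.
gold = [["B-PER", "E-PER", "O", "S-LOC"]]
pred = [["B-PER", "E-PER", "O", "S-ORG"]]
f1 = entity_f1(gold, pred)
```

In this example one of two predicted entities is correct and one of two gold entities is recovered, so precision, recall, and F1 are all 0.5.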

Task            Dataset    Model       Embeddings          Number of parameters
NER             CoNLL      biLSTM-CRF  6B                  3,234,440
                                       Senna               1,810,690
                                       6B, Senna           4,658,190
                WNUT-17    biLSTM-CRF  27B                 3,849,632
                                       27B, w2v-30M        6,499,532
                                       27B, w2v-30M, 840B  12,090,032
                OntoNotes  biLSTM-CRF  6B                  5,569,382
                                       6B, Senna           7,673,632
Slot Filling    Snips      biLSTM-CRF  6B                  1,819,466
                                       GN                  4,567,066
                                       6B, GN              5,940,866
POS             TW-POS     biLSTM-CRF  w2v-30M             1,241,332
                                       27B                 1,788,982
                                       27B, w2v-30M        2,908,132
                                       27B, w2v-30M, 840B  5,408,332
Classification  SST2       LSTM        840B                6,456,702
                                       GN                  6,456,702
                                       840B, GN            12,109,002
                AG-NEWS    LSTM        840B                20,842,604
                                       GN                  20,842,604
                                       840B, GN            41,522,804
                Snips      Conv        840B                4,003,807
                                       GN                  4,003,807
                                       840B, GN            8,005,207
Table 6: The number of parameters for different models.

A.4 Dataset Information

Dataset              Train      Dev      Test     Total
CoNLL      Examples  14,987     3,466    3,674    22,137
           Tokens    204,567    51,578   46,666   302,811
WNUT-17    Examples  3,394      1,009    1,287    5,690
           Tokens    62,730     15,733   23,394   101,857
OntoNotes  Examples  59,924     8,528    8,262    76,714
           Tokens    1,088,503  147,724  152,728  1,388,955
Snips      Examples  13,084     700      700      14,484
           Tokens    117,700    6,384    6,354    130,438
TW-POS     Examples  1,000      327      500      1,827
           Tokens    14,619     4,823    7,152    26,594
SST2       Examples  76,961     872      1,821    79,654
           Tokens    717,127    17,046   35,023   769,196
AG-NEWS    Examples  110,000    10,000   7,600    127,600
           Tokens    4,806,909  433,659  329,617  5,570,185
Table 7: Example and token count statistics for public datasets used.

Relevant information about the datasets can be found in Table 7. The majority of the data is used as distributed, except that we convert NER and slot-filling datasets to the IOBES format. All public datasets used are included in the supplementary material. A quick overview of each dataset follows:

CoNLL: A NER dataset based on news text. We converted the IOB labels into the IOBES format. There are 4 entity types: MISC, LOC, PER, and ORG.

WNUT-17: A NER dataset of new and emerging entities based on noisy user text. We converted the BIO labels into the IOBES format. There are 6 entity types, corporation, creative-work, group, location, person, and product.

OntoNotes: A much larger NER dataset. We converted the labels into the IOBES format. There are 18 entity types, CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, and WORK_OF_ART.

Snips: A slot-filling dataset focusing on commands one would give a virtual assistant. We converted the dataset from its normal format of two associated files, one containing surface terms and one containing labels to the more standard CoNLL file format and converted the labels to the IOBES format. There are 39 entity types, album, artist, best_rating, city, condition_description, condition_temperature, country, cuisine, current_location, entity_name, facility, genre, geographic_poi, location_name, movie_name, movie_type, music_item, object_location_type, object_name, object_part_of_series_type, object_select, object_type, party_size_description, party_size_number, playlist, playlist_owner, poi, rating_unit, rating_value, restaurant_name, restaurant_type, served_dish, service, sort, spatial_relation, state, timeRange, track, and year.

TW-POS: A Twitter part-of-speech dataset. There are 25 parts of speech, !, #, $, &, ,, @, A, D, E, G, L, M, N, O, P, R, S, T, U, V, X, Y, Z, ^, and ~.

SST2: A binary sentiment analysis dataset based on movie reviews. We use the version where the training data is made up of phrases.

AG-NEWS: A four class text classification dataset for categorizing news data based on the 4 most common categories. There is not a standardized train and development split (there is a defined test set) so we created our own split which is included in the supplementary material.

Snips-Intent: The intent classification portion of the Snips dataset. Again, the intents pertain to requests one would make to a virtual assistant. There are 7 intents: SearchScreeningEvent, PlayMusic, AddToPlaylist, BookRestaurant, RateBook, SearchCreativeWork, and GetWeather.




  1. Contextual string embeddings for sequence labeling. In COLING.
  2. Named entity recognition with bidirectional LSTM-CNNs. TACL 4, pp. 357–370.
  3. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537.
  4. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
  5. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy, User-generated Text.
  6. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  7. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, Stroudsburg, PA, USA, pp. 57–60.
  8. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1746–1751.
  9. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1064–1074.
  10. Efficient estimation of word representations in vector space.
  11. GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  12. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1756–1765.
  13. Deep contextualized word representations. In Proc. of NAACL.
  14. Baseline: a library for rapid modeling, experimentation and development of deep learning algorithms targeting NLP. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pp. 34–40.
  15. Improving language understanding by generative pre-training.
  16. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642.
  17. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, Stroudsburg, PA, USA, pp. 142–147.
  18. Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 5998–6008.
  19. Factors influencing the surprising instability of word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2092–2102.