Multiple Word Embeddings for Increased Diversity of Representation
Most state-of-the-art models in natural language processing (NLP) are neural models built on top of large, pre-trained, contextual language models that generate representations of words in context and are fine-tuned for the task at hand. The improvements afforded by these “contextual embeddings” come with a high computational cost. In this work, we explore a simple technique that substantially and consistently improves performance over a strong baseline with negligible increase in run time. We concatenate multiple pre-trained embeddings to strengthen our representation of words. We show that this concatenation technique works across many tasks, datasets, and model types. We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that the representational diversity between different pre-trained embeddings is the driving force of why this technique works. We provide open source implementations of our models in both TensorFlow and PyTorch.
Much of the recent work in NLP has focused on better feature representations via contextual word embeddings Peters et al. (2018, 2017); Radford et al. (2018); Akbik et al. (2018); Devlin et al. (2018). These models vary in architecture and pre-training objective but they all encode the input based on the surrounding context in some way. These papers normally compare to baselines like a bidirectional LSTM-CRF (biLSTM-CRF) where words are represented by a single pre-trained word embedding.
Peters et al. (2018, 2017) and Akbik et al. (2018) pre-train large language models based on LSTMs. Task-specific architectures are then built on top of these pre-trained models. Peters et al. (2018) introduce a technique for extracting word representations as a linear combination of layers in the pre-trained model. Gradient updates are only applied to this weighting factor, which simplifies the training to some extent, but forward propagation is still required for the full network which makes the model slow to train and evaluate.
Radford et al. (2018), followed by Devlin et al. (2018), pre-train deep transformers Vaswani et al. (2017) on massive corpora. They both use a simple output layer on top of the pre-trained model and tune the parameters of the whole model. In this case, training requires the forward and backward pass of the entire pre-trained model, which has a significant impact on size and speed. Devlin et al. (2018) used specialized hardware which may be unrealistic for many inference scenarios.
The prevailing wisdom is that, because these pre-trained models are contextual, they can create representations of a word that differ across contexts. For example, a polysemous word can be represented by different vectors when its context suggests a different sense of the word, while context-independent word vectors need to represent a mix of all the senses of a word. The majority of NLP models have a similar “contextualization” step, typically done via a biLSTM, convolutional layers, or self-attention, but it is only learned from a smaller, task-specific corpus in contrast to the massive corpora used by contextual embeddings.
Contextual embeddings and transfer learning architectures are slow to train and evaluate, which may make them infeasible for many types of deployments. Using multiple pre-trained embeddings trained on different datasets, we can exploit the bias in different datasets that results in different representations of the same word. By combining these embeddings, we can create richer representations of the word without the high computational overhead required by contextual alternatives. We find that the concatenation of multiple pre-trained word embeddings shows consistent improvements over single embeddings, yielding results much closer to contextual alternatives.
2 Experiments & Results
|27B, w2v-30M, 840B||40.33||1.13||38.38||41.99|
|27B, w2v-30M, 840B||90.75||0.14||90.53||91.02|
We use three sequential prediction tasks to test the performance of our concatenated embeddings: NER (CoNLL 2003 Tjong Kim Sang and De Meulder (2003), WNUT-17 Derczynski et al. (2017), and OntoNotes Hovy et al. (2006)), slot filling (Snips Coucke et al. (2018)), and POS tagging (TW-POS). We also show results on three classification datasets: SST2 Socher et al. (2013), Snips intent classification Coucke et al. (2018), and AG-NEWS.
The results are presented in Table 1. 6B, 27B, and 840B are well-known pre-trained GloVe embeddings Pennington et al. (2014) distributed via the authors' site; w2v-30M Pressel et al. (2018) and GN Mikolov et al. (2013) are Word2Vec embeddings trained on a corpus of 30 million tweets and on Google News, respectively; and the Senna embeddings were trained by Collobert et al. (2011).
We leverage multiple pre-trained embeddings in a model by creating one embedding table per pre-trained embedding. Each input token is embedded into each vector space and the resulting vectors are concatenated into a single vector. This means that a type can be unattested in one pre-trained embedding vocabulary but present in the other. In that case, a pre-trained vector from one embedding is concatenated with a randomly initialized vector from the other embedding space.
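The lookup-and-concatenate step described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the MEAD/Baseline implementation: the helper names `build_table` and `embed`, the toy vectors, and the uniform initialization range for unattested types are all assumptions made for the example.

```python
import random


def build_table(pretrained, vocab, dim, seed=0):
    """Build one embedding table covering the whole task vocabulary.

    Types missing from `pretrained` receive a small random vector.
    The uniform(-0.1, 0.1) range is an assumed initialization scheme.
    """
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table


def embed(tokens, tables):
    """Look each token up in every table and concatenate the vectors."""
    return [[x for table in tables for x in table[tok]] for tok in tokens]


# Toy 2-d and 3-d "pre-trained" embeddings with partially overlapping vocabularies.
emb_a = {"play": [0.1, 0.2], "music": [0.3, 0.4]}
emb_b = {"play": [1.0, 1.1, 1.2], "jazz": [2.0, 2.1, 2.2]}
vocab = ["play", "music", "jazz"]

tables = [
    build_table(emb_a, vocab, dim=2, seed=1),
    build_table(emb_b, vocab, dim=3, seed=2),
]
vectors = embed(["play", "jazz"], tables)
```

Here "play" is attested in both sources, so its 5-dimensional vector is fully pre-trained, while "jazz" pairs a random 2-dimensional part with a pre-trained 3-dimensional part.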
As hypothesized, we see improvements across tasks, datasets, and model architectures when using multiple embeddings.
Models using the concatenation of pre-trained and randomly initialized embeddings do % worse on average compared to models that only use a single pre-trained embedding. This demonstrates that the performance gains are from the combination of different pre-trained embeddings rather than the increase in the number of parameters in the model. In some cases we were able to improve results further by adding several sets of additional embeddings.
Table 2 summarizes the results of using the multiple embedding approach on internal datasets. These datasets are drawn from the tasks defined earlier and span a variety of specialized domains. Due to the nature of the datasets the results are presented as the relative change in performance. Table 3 is provided to help frame the relative performance numbers from the internal datasets.
The models were trained with MEAD/Baseline Pressel et al. (2018), an open-source framework for developing, training, and deploying NLP models.
|GN + Random init||87.62||0.22||88.64|
|GN + 840B complement to GN||87.72||0.23||88.02|
|GN + 840B matched to GN||88.53||0.55||89.45|
|GN + 840B||88.57||0.44||89.24|
|6B + Random init||90.77||0.17||91.11|
|6B + Senna complement to 6B||90.73||0.29||91.19|
|6B + Senna matched to 6B||91.47||0.18||91.78|
|6B + Senna||91.61||0.25||92.00|
|GloVe twitter 27B||24.9||27.2||68.1||76.1||91.098||0.135|
There are three logical places where the observed improvements could come from. 1) The use of multiple pre-trained embeddings creates a slightly larger model, increasing the network capacity—the embeddings are larger and therefore the projection from the embeddings to the first layer of the model will also be slightly bigger. 2) The use of a second pre-trained embedding increases the vocabulary size and more words are attested. A word that has a pre-trained representation will start the model in a better spot than a randomly initialized representation. 3) The second set of pre-trained embeddings gives a different perspective of the words. Most pre-trained embeddings are trained on different data and encode different biases and senses into the embedding that reflect the quirks and unique contexts found in the pre-training data. This representational diversity will allow a model to capitalize on different senses, or the combination of senses, that would not be present when using a single embedding.
In order to tease apart which of these factors are at play we designed a series of models that aim to isolate each effect and report results in Table 4. First, we train a model that uses a single pre-trained embedding and a second set of vectors that are initialized randomly. If the main improvement is due to increased model capacity this configuration should perform well. The second model uses a special version of the second pre-trained embedding where we remove all the words that already appear in the original pre-trained vocabulary. In this second set of embeddings, randomly initialized vectors are used for the words that are covered in the original vocabulary in order to keep the embeddings size consistent with the previous model. If the main reason for improvement is the increased vocabulary coverage, this model should perform well. The final version of the model also uses a customized version of the second pre-trained embedding. This time we only keep embeddings that are already represented in the original vocabulary. This is designed to test if the main source of improvement is the difference in the representations each pre-trained embedding brings to the table.
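The three ablation variants differ only in which words keep their pre-trained vector in the second embedding table. The sketch below makes that selection rule explicit. The function name `ablation_table`, the variant labels, the toy data, and the random initialization range are illustrative assumptions, not taken from the released code.

```python
import random


def random_vector(dim, rng):
    # Assumed initialization scheme for words without a kept pre-trained vector.
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def ablation_table(variant, first, second, vocab, dim, seed=0):
    """Build the second embedding table for one ablation variant.

    random     : no pre-trained vectors at all (tests model capacity)
    complement : keep only words NOT in the first vocabulary (tests coverage)
    matched    : keep only words ALSO in the first vocabulary (tests diversity)
    full       : keep every attested word (the unmodified second embedding)
    """
    rng = random.Random(seed)
    keep = {
        "random": lambda w: False,
        "complement": lambda w: w in second and w not in first,
        "matched": lambda w: w in second and w in first,
        "full": lambda w: w in second,
    }[variant]
    return {
        w: list(second[w]) if keep(w) else random_vector(dim, rng)
        for w in vocab
    }


# Toy data: "play" is shared, "music" is only in the first, "jazz" only in the second.
first = {"play": [0.1, 0.2], "music": [0.3, 0.4]}
second = {"play": [1.0, 1.0], "jazz": [2.0, 2.0]}
vocab = ["play", "music", "jazz"]

tables = {
    v: ablation_table(v, first, second, vocab, dim=2)
    for v in ("random", "complement", "matched", "full")
}
```

Every variant yields a table of the same shape, so the number of parameters is held constant and only the source of the vectors changes.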
Our ablation studies, which apply the above variations to both the SST2 and CoNLL datasets, show that the most important factor is the representational diversity in the pre-trained embeddings. This dovetails nicely with our observation that embeddings trained on distinct datasets tend to perform well together. To further test this hypothesis, we look at the “similarity” of various pre-trained embeddings. We define “similarity” using the overlap of nearest neighbors in the embedding space, as in Wendlandt et al. (2018). Specifically, we use the average Jaccard overlap percentage between the 10 nearest neighbors for each of the top 200 words in the dataset by frequency. Table 5 shows the overlap of different embeddings with the GloVe 6B 100-dimensional embedding and how their combination affects performance. As can be seen, Senna has the lowest overlap and yields the biggest performance gain.
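The similarity measure can be illustrated with a small sketch. Several details here are assumptions rather than facts from the paper: neighbors are ranked by cosine similarity, each embedding is searched over its own vocabulary, and query words missing from either embedding are skipped. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, emb, k):
    """Return the k words closest to `word` by cosine similarity."""
    others = [w for w in emb if w != word]
    others.sort(key=lambda w: cosine(emb[word], emb[w]), reverse=True)
    return set(others[:k])


def neighbor_overlap(emb_a, emb_b, query_words, k=10):
    """Average Jaccard overlap, as a percentage, of k-nearest-neighbor sets."""
    scores = []
    for word in query_words:
        if word not in emb_a or word not in emb_b:
            continue  # assumption: skip words missing from either embedding
        na = nearest_neighbors(word, emb_a, k)
        nb = nearest_neighbors(word, emb_b, k)
        union = na | nb
        scores.append(len(na & nb) / len(union) if union else 0.0)
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


# Toy embeddings: same vocabulary, but "b" and "c" swap places in the second one.
emb_a = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
emb_b = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1], "d": [0.1, 0.9]}

same = neighbor_overlap(emb_a, emb_a, ["a"], k=1)
different = neighbor_overlap(emb_a, emb_b, ["a"], k=1)
```

In the setting described above, `query_words` would be the 200 most frequent words in the dataset and `k` would be 10.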
However, this does not hold for the GoogleNews embedding which also has a low overlap yet the combination actually causes a drop in performance. This can be explained by coverage—the percentage of unique types in the data that are attested in the pre-trained vocabulary. That number is surprisingly low for GoogleNews and causes the GoogleNews representations to be used so rarely they actually cause a drop in performance.
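Coverage, as defined above, is simple to compute. The following is a minimal sketch; the function name is hypothetical, and any case folding or other normalization applied before lookup is left out because the text does not specify it.

```python
def coverage(dataset_tokens, pretrained_vocab):
    """Percentage of unique types in the data attested in the pre-trained vocabulary."""
    types = set(dataset_tokens)
    if not types:
        return 0.0
    attested = sum(1 for t in types if t in pretrained_vocab)
    return 100.0 * attested / len(types)


# Toy example: three unique types, two of which are attested.
tokens = ["the", "cat", "the", "sat"]
vocab = {"the", "sat", "dog"}
pct = coverage(tokens, vocab)
```

Because the measure counts types rather than tokens, a frequent word contributes no more than a rare one.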
In summary, one should look for two characteristics when combining embeddings: the word representations should have low “similarity” and the unique types in the dataset should be highly attested in both pre-trained vocabularies.
Recent large-scale, contextual, pre-trained models are exciting but produce relatively slow models. We propose a simple, lightweight technique: concatenation of pre-trained embeddings. We show that this technique has a significant impact on error reduction and a negligible effect on speed.
However, the concatenation of any two arbitrary pre-trained embeddings is not guaranteed to work well. From our analysis, we are able to suggest a recipe for finding an effective combination: there should be a high degree of coverage of the unique types in each of the pre-trained embedding vocabularies and the word vectors should exhibit representational diversity. In future work, we intend to try other methods of combining embeddings that remain computationally cheap. We also plan to find more principled ways to quantify the diversity in pre-trained embeddings, which can suggest ways to induce representational diversity in the embedding pre-training procedure itself.
Appendix A Reproducibility
MEAD/Baseline is a configuration-file-driven model training framework. All hyperparameters are fully specified in the configuration files included with the source code for our experiments.
A.2 Computational Resources
All models were trained on a single NVIDIA 1080Ti. While multiple GPUs were used to train many models in parallel, which facilitated testing on many datasets and estimating the variability of the method, the actual model can easily be trained on a single GPU.
To calculate metrics, entity-level F1 is used for NER and slot-filling. In entity-level F1, entities are first created from the token-level labels and compared to the gold ones. Entities that match on both type and boundaries are considered correct, while a mismatch in either causes an error. The F1 score is then calculated from these entities. Accuracy is used for classification and part-of-speech tagging. Accuracy is defined as the proportion of correct elements to all elements. In classification, a single example is an element. In part-of-speech tagging, each token is an element, so our accuracy is the number of correct tokens divided by the number of tokens in the dataset. We use the evaluation code that ships with the framework we use, MEAD/Baseline, which we have bundled with the source code of our experiments.
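The entity-level F1 described above can be sketched as follows. This is an illustration, not the MEAD/Baseline evaluation code: the function names are hypothetical, and the handling of ill-formed tag sequences (any open entity is simply dropped) is an assumption, since evaluation scripts differ on that point.

```python
def iobes_spans(tags):
    """Extract (type, start, end) entities from one IOBES-tagged sentence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.add((label, i, i))
            start = current = None
        elif prefix == "B":
            start, current = i, label
        elif prefix == "E" and current == label and start is not None:
            spans.add((label, start, i))
            start = current = None
        elif prefix == "I" and current == label:
            continue
        else:
            # "O" or an ill-formed sequence: drop any open entity (assumption).
            start = current = None
    return spans


def entity_f1(gold_sents, pred_sents):
    """F1 over entities that match on both type and boundaries."""
    gold, pred = set(), set()
    for idx, (g, p) in enumerate(zip(gold_sents, pred_sents)):
        gold |= {(idx,) + s for s in iobes_spans(g)}
        pred |= {(idx,) + s for s in iobes_spans(p)}
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "E-PER", "O", "S-LOC"]]
pred = [["B-PER", "E-PER", "O", "S-ORG"]]
f1 = entity_f1(gold, pred)
```

In this toy case the person entity matches and the single-token entity has the right boundaries but the wrong type, so precision and recall are both one half.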
|Task||Dataset||Model||Embeddings||Number of parameters|
|27B, w2v-30M, 840B||12,090,032|
|27B, w2v-30M, 840B||5,408,332|
A.4 Dataset Information
Relevant information about datasets can be found in Table 7. The majority of the data is used as distributed, except that we convert the NER and slot-filling datasets to the IOBES format. All public datasets used are included in the supplementary material. A quick overview of each dataset follows:
CoNLL: A NER dataset based on news text. We converted the IOB labels into the IOBES format. There are 4 entity types, MISC, LOC, PER, and ORG.
WNUT-17: A NER dataset of new and emerging entities based on noisy user text. We converted the BIO labels into the IOBES format. There are 6 entity types, corporation, creative-work, group, location, person, and product.
OntoNotes: A much larger NER dataset. We converted the labels into the IOBES format. There are 18 entity types, CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, and WORK_OF_ART.
Snips: A slot-filling dataset focusing on commands one would give a virtual assistant. We converted the dataset from its normal format of two associated files, one containing surface terms and one containing labels, to the more standard CoNLL file format and converted the labels to the IOBES format. There are 39 entity types, album, artist, best_rating, city, condition_description, condition_temperature, country, cuisine, current_location, entity_name, facility, genre, geographic_poi, location_name, movie_name, movie_type, music_item, object_location_type, object_name, object_part_of_series_type, object_select, object_type, party_size_description, party_size_number, playlist, playlist_owner, poi, rating_unit, rating_value, restaurant_name, restaurant_type, served_dish, service, sort, spatial_relation, state, timeRange, track, and year.
TW-POS: A Twitter part-of-speech dataset. There are 25 parts of speech, !, #, $, &, ,, @, A, D, E, G, L, M, N, O, P, R, S, T, U, V, X, Y, Z, ^, and ~.
SST2: A binary sentiment analysis dataset based on movie reviews. We use the version where the training data is made up of phrases.
AG-NEWS: A four-class text classification dataset for categorizing news data based on the 4 most common categories. There is no standardized train and development split (there is a defined test set), so we created our own split, which is included in the supplementary material.
Snips-Intent: The intent classification portion of the Snips dataset. Again, the intents pertain to requests one would make to a virtual assistant. There are 7 intents, SearchScreeningEvent, PlayMusic, AddToPlaylist, BookRestaurant, RateBook, SearchCreativeWork, and GetWeather.
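The IOBES conversion applied to the NER and slot-filling datasets above can be sketched as below. The sketch assumes the input is in the BIO scheme, where every entity starts with a `B-` tag; data in the older IOB1 scheme would need to be normalized to BIO first. The function name is hypothetical and this is not the conversion script used for the experiments.

```python
def bio_to_iobes(tags):
    """Convert one BIO-tagged sentence to IOBES.

    An entity's last token becomes E-, and a single-token entity becomes S-.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label  # does the same entity go on?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out


bio = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"]
iobes = bio_to_iobes(bio)
```

Splitting on the first hyphen only keeps labels that themselves contain hyphens, such as creative-work in WNUT-17, intact.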
- Contextual string embeddings for sequence labeling. In COLING.
- Named entity recognition with bidirectional LSTM-CNNs. TACL 4, pp. 357–370.
- Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537.
- Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
- Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy, User-generated Text.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, Stroudsburg, PA, USA, pp. 57–60.
- Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1746–1751.
- End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1064–1074.
- Efficient estimation of word representations in vector space.
- GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1756–1765.
- Deep contextualized word representations. In Proc. of NAACL.
- Baseline: a library for rapid modeling, experimentation and development of deep learning algorithms targeting NLP. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pp. 34–40.
- Improving language understanding by generative pre-training.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642.
- Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, Stroudsburg, PA, USA, pp. 142–147.
- Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 5998–6008.
- Factors influencing the surprising instability of word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2092–2102.