Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initializing with embeddings trained on in-domain data.
1 Introduction

Language modeling is an essential part of many NLP applications, including predictive keyboards, speech recognition, and translation. Recent work has focused on (1) small constrained datasets, such as the Penn Treebank (Marcus et al., 1993) and WikiText-103 (Merity et al., 2017b), and (2) vast resources with billions of words used to train enormous models with significant computational requirements (Radford et al., 2019). This leaves a gap: the case where a substantial amount of in-domain data is available but computational power is limited.
We explore how initialising word embeddings using in-domain data can improve language modeling in English. Testing all valid configurations of weight tying, embedding freezing, and initialisation, we find that the standard configuration is not optimal when rare words are present. Instead, the best approach is to initialise with in-domain data, untie the input and output, and freeze the input.
To understand this difference, we run a series of experiments to measure the impact of changing (a) the threshold for replacing rare words with a special symbol; (b) the source of data for initialisation; (c) the amount of training data for the language model; and (d) the hyperparameters for both the baseline and our proposed approach. We find that the improvement comes from improved representation of rare words. These findings are confirmed through experiments on four additional domains, with similar trends.
We also compare our approach to an n-gram language model and a large-scale transformer model. We find that if a large-scale transformer is inappropriate either for computational or modeling reasons, it is best to train an LSTM-based language model with as much data as possible and initialise the embeddings on all available in-domain data.
2 Proposed Approach
We propose initialising the language model’s word embeddings with vectors trained on additional in-domain data.
To make this most effective, we make two other key changes to training.
First, we prevent embeddings from shifting during training.
Without this, the embedding space could become inconsistent as vectors for words seen in training shift while those for words seen only in the additional data stay the same.
Second, we do not tie the weights of the input embeddings and final output layer.
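As a rough illustration, the initialisation step can be sketched in plain Python. The names (`vocab`, `pretrained`) and the random fallback range are hypothetical stand-ins, not the paper's implementation; in practice the vectors come from GloVe trained on the additional in-domain data and are loaded into the model's input embedding layer with updates disabled.

```python
import random

def init_input_embeddings(vocab, pretrained, dim, seed=0):
    """Build an input embedding matrix from in-domain pretrained vectors.

    Words covered by the pretrained vectors receive those vectors; the
    rest get small random vectors. The returned matrix is intended to be
    frozen during language model training, while a separate (untied)
    output matrix is trained as usual.
    """
    rng = random.Random(seed)
    matrix, covered = [], set()
    for idx, word in enumerate(vocab):
        if word in pretrained:
            matrix.append(list(pretrained[word]))
            covered.add(idx)
        else:
            # Fallback for words with no in-domain vector (range is illustrative).
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, covered
```

In a PyTorch model, such a matrix could be installed with `nn.Embedding.from_pretrained(tensor, freeze=True)`, while the untied output projection keeps its own trainable weights.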
To understand the impact of these factors, we train models with every valid combination of weight tying, freezing, and pretraining.
We experiment with the AWD-LSTM of Merity et al. (2017a), a high-performing model that can be trained in under a day on a single GPU (without fine-tuning).
We train embeddings using GloVe on Gigaword.
Table 1 shows the results, with icons to concisely describe the different configurations.
There are also four clear sections of performance in the table: (a) frozen random output embeddings; (b) frozen pretrained output embeddings; (c) frozen random input embeddings; (d) various configurations. These results show an asymmetry. Freezing the output embeddings consistently leads to poor performance, even when they are pretrained. In contrast, freezing pretrained input embeddings leads to some of the best results. We expected freezing with random initialisation to perform poorly, but the drop is modest for input freezing and dramatic for output freezing. This suggests that the two embedding matrices serve different purposes in the model. The results do support the practice of tying when the input embeddings are random, but the benefit is half as large when they are pretrained.
Table 1 legend: tied vs. untied parameters; frozen vs. unfrozen in training; random vs. pretrained initialisation.
For the dataset with rare words we see mostly the same trends. The exception is the bottom six rows. Once rare words are present, random initialisation of the input embeddings is considerably worse than pretraining (third last row). Again, there is an asymmetry between input and output, with the top five models all using pretrained input embeddings, but only three of them using pretrained output embeddings. Tying is also no longer the best approach, with the top three models not tying. Our proposed approach, using pretrained untied embeddings and freezing the input, has the best results.
The only difference between Std and Rare is the lack of UNKs in Rare. This impacts 5.1% of tokens in the validation set (33% of types). While our pretrained embeddings do not cover all of these rare words, they do cover most. The vocabulary from Gigaword that we build vectors for covers 99.5% of the validation word tokens in Std (98% of word types), and 98.8% of the validation word tokens in Rare (84% of word types).
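Coverage statistics like those above are simple to compute. The sketch below is a hypothetical helper (not our released scripts): `validation_tokens` is the list of tokens in the validation set and `pretrain_vocab` is the set of words we built vectors for.

```python
from collections import Counter

def coverage(validation_tokens, pretrain_vocab):
    """Return (token coverage, type coverage) of a token list against a
    pretraining vocabulary, both as fractions in [0, 1]."""
    counts = Counter(validation_tokens)
    covered_tokens = sum(c for w, c in counts.items() if w in pretrain_vocab)
    covered_types = sum(1 for w in counts if w in pretrain_vocab)
    return covered_tokens / len(validation_tokens), covered_types / len(counts)
```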
3 When & Why Does Pretraining Help?
To understand the strengths and limitations of this new approach, we consider a series of experiments, each probing a specific variable. To simulate our target scenario, we use 44 million words of Wall Street Journal data from the North American News Corpus (NANC, Graff, 1995). This provides enough data for pretraining, training, validation, and test sets all in the exact same domain (not even varying the newspaper). We apply similar pre-processing as in the previous section, but break the data down into articles rather than sentences and keep rare words.
We compare the six best configurations from Table 1. In all cases, output embeddings are not frozen, so we leave out the freezing symbol. We use only one initialisation symbol for pretrained/random because the input and output embeddings are initialised the same way in most cases; the exceptions use two symbols to indicate pretrained input and random output.
- Our approach, but with random output embeddings and without freezing.
- Standard approach + pretraining.
- Our approach, but without freezing.
- Our approach, but with random output embeddings.
Other Domains Show the Same Pattern. First we consider varying the domain to make sure this is not an artifact of news data. Table 2 shows results on Covid-19 research (Wang et al., 2020), Ubuntu IRC chat (Kummerfeld et al., 2019), Reddit, and Wikipedia, tokenised with either ScispaCy (Neumann et al., 2019) or Stanza (Qi et al., 2020). Pretraining consistently helps, while freezing is best on all but Wikipedia. Our approach is consistently either the best or very close to the best.
The Improvement is Due to Rare Words. To probe the impact of rare words, we explore replacing them with UNK (using the same UNK symbol as used in embedding pretraining). We consider four variations, each constructed in two steps. First, we make a list of the words in the original training set and how many times each one occurs. Second, we make modified versions of the training and validation sets, replacing words with UNK if their count in our list is lower than K. For this step, any word that does not appear in our list is treated as having a count of zero. We consider K = 0, 1, 2, and 5. K is 0 for all other experiments in this section, which means no words are replaced with UNK. When K is 1, 2, or 5, the introduction of UNKs means all words in the validation set are seen during language model training.
|                   | K=0 | K=1  | K=2  | K=5  |
| UNK Types, Dev    | 0%  | 13%  | 21%  | 33%  |
| UNK Tokens, Dev   | 0%  | 2.3% | 3.4% | 5.5% |
| UNK Types, Train  | 0%  | 0%   | 40%  | 68%  |
| UNK Tokens, Train | 0%  | 0%   | 1.4% | 4.1% |
Table 4 column headings: Train | Pretrain | Train in Pre.
Table 3 shows a clear trend: the benefit of our approach grows as more rare words are present (i.e., as K decreases). It may seem odd that perplexity is higher when K=1 than when K=0, since we have removed rare words; this is probably because when K is 1 there are UNKs in the validation set but none in the language model training set.
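The two-step replacement procedure described above can be sketched as follows (a minimal illustration, with `<unk>` as an assumed UNK symbol; words absent from the training counts are treated as count zero, so K=0 leaves every word intact):

```python
from collections import Counter

def unk_below_threshold(train, other_sets, k, unk="<unk>"):
    """Step 1: count words in the original training set.
    Step 2: in the training set and every other set, replace any word
    whose training count is below k with the UNK symbol. Counter returns
    0 for unseen words, so they are replaced whenever k >= 1."""
    counts = Counter(train)
    def convert(tokens):
        return [w if counts[w] >= k else unk for w in tokens]
    return convert(train), [convert(s) for s in other_sets]
```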
Table 4 shows statistics about rare words in the datasets. 71-83% of word types in the training sets occur fewer than five times, but most of these appear frequently in the pretraining sets (compare the first column with the second last column). The same pattern occurs for word tokens. Comparing the statistics for the training set and the pretraining set, the percentage of rare word types is fairly consistent while the percentage of rare tokens consistently goes down.
Pretraining Data Needs to be from a Similar Domain. We would expect that the effectiveness of pretraining will depend on how similar the data is. Table 5 shows results with different embeddings, and indicates the number of words used in pretraining. We see that the value of additional data depends on the domain. Gigaword is also news text and is able to improve performance. The larger GloVe datasets use Wikipedia and CommonCrawl data, which is a poorer match and so does not improve performance. For GloVe we did have to change the embedding dimensions from 400 to 300, which may impact performance slightly.
The Effect Persists When Language Model Training Data is Increased. So far we have only used the additional in-domain data for pretraining. In this experiment, we expand the training set for the language model. We try two variations, one where the data is an exact domain match (NANC) and one where it is also news, but from different newspapers and from a different year (Gigaword). Table 6 shows that as we increase the amount of data our approach and the variant with random output embeddings continue to do best, but the margin shrinks between them and the standard approach. Note, however, that these results are with hyperparameters tuned for the baseline configuration. With tuning the 0.7 gap between our proposal and the baseline for 4xNANC widens to 6.6.
Hyperparameter Tuning Further Improves Results. All of the previous experiments were slightly tipped in favour of the baseline as we used the hyperparameters from Merity et al. (2017a). We do not have the resources to tune for every condition, so instead we focus on a final set of experiments with the 4xNANC condition from Table 6. We run 37 configurations with randomly sampled hyperparameters, using the same configurations for the baseline and our proposed approach (see the supplementary material for details). Figure 1 shows that our approach is even stronger after tuning, with a score that is 6.6 better than the baseline. Comparing the baseline and tuned hyperparameters, some shifted substantially more than others: the learning rate was halved; word dropout was halved; and the number of layers was increased from 3 to 4. The other parameters shifted by 15-30%.
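A random search of this kind can be sketched as below. The parameter names and ranges here are illustrative placeholders rather than the exact search space listed in the supplementary material.

```python
import random

# Hypothetical search space: (low, high) bounds per hyperparameter.
SEARCH_SPACE = {
    "learning_rate": (10.0, 40.0),
    "dropout": (0.1, 0.6),
    "word_dropout": (0.05, 0.2),
    "layers": (3, 4),
}

def sample_config(space, seed=None):
    """Draw one configuration uniformly at random from the search space:
    integers when both bounds are ints, floats otherwise."""
    rng = random.Random(seed)
    config = {}
    for name, (lo, hi) in space.items():
        if isinstance(lo, int) and isinstance(hi, int):
            config[name] = rng.randint(lo, hi)  # inclusive of both ends
        else:
            config[name] = rng.uniform(lo, hi)
    return config
```

The same sampled configurations are then used to train both the baseline and the proposed approach, so the comparison stays fair.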
Test Results Confirm Our Observations. Using the best configuration, we train the baseline and our proposed approach using 8xNANC (the most our GPU could support). We compare to an n-gram language model trained on all of the NANC data (Heafield et al., 2013), and to a transformer-based model trained on a massive dataset, GPT-2 (Radford et al., 2019). While GPT-2 cannot be retrained in a low-compute scenario, it can be used as-is, so we compare to GPT-2 without fine-tuning. We evaluate byte-pair encoding (BPE) separately because with BPE tokenisation, models have additional information when predicting the second or later piece of a token (Merity, 2019).
Table 7 shows that for word-level prediction, our approach improves over the baseline and an n-gram language model. BPE breaks up rare words, leading to no improvement over the baseline and while we do better than the 112m parameter GPT-2, we do not do as well as the 774m parameter one (both untuned). Overall, this indicates that for users who require word-level scores and have limited computational resources our approach is an effective way to use additional data when training LSTM language models.
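For readers reproducing the word-level comparison, one common way to report a word-level perplexity for a subword model is to normalise the total negative log-likelihood by the number of words rather than the number of subword pieces. The sketch below assumes natural-log probabilities; it describes this standard conversion, not our exact evaluation scripts.

```python
import math

def word_level_perplexity(subword_logprobs, pieces_per_word):
    """Aggregate per-subword natural-log probabilities into a word-level
    perplexity: sum all log-probs, then normalise by the word count
    instead of the subword-piece count."""
    assert sum(pieces_per_word) == len(subword_logprobs)
    total_nll = -sum(subword_logprobs)
    return math.exp(total_nll / len(pieces_per_word))
```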
4 Related Work
Embedding Tying. Tying input and output matrices has consistently increased performance while reducing the number of model parameters (Press and Wolf, 2017; Inan et al., 2017). The improvement is thought to arise because otherwise only one input embedding is updated at each step, and the gradient has to propagate a long way through the model to reach it. Subsequent work has explored more advanced forms of tying, recognising that the roles of the input and output matrices are not exactly the same (Pappas et al., 2018). This asymmetry has been found in the actual embedding spaces learned and shown to have a negative effect on performance (Gao et al., 2019; Demeter et al., 2020). These observations match the patterns we observe and provide theoretical justification for not tying when possible.
In-Domain Data Pretraining and Freezing. Word vectors are frequently used in downstream tasks, and recent work has shown that their effectiveness depends on domain similarity (Peters et al., 2019; Arora et al., 2020). For language modeling, Kocmi and Bojar (2017) explored random and pretrained embeddings and found improvements, but did not consider tying and freezing. In-domain data is also useful for continuing to train contextual embedding models before fine-tuning (Gu et al., 2020; Gururangan et al., 2020), and for monolingual pretraining in machine translation (Neishi et al., 2017; Qi et al., 2018; Artetxe et al., 2018). This matches our observations, but does not cover the interactions between freezing and tying that we consider.
Handling Rare Words. Rare words remain challenging even for large transformer models (Schick and Schütze, 2020). Recent work has explored copying mechanisms and character-based generation (Kawakami et al., 2017), with some success. These ideas are complementary to the results of our work, extending coverage to the open-vocabulary case. Due to space and computational constraints we only consider English. For other languages, inflectional morphology and other factors may impact the effectiveness of our approach (Shareghi et al., 2019; Cotterell et al., 2018). Our work is also complementary to concurrent work on producing rare words as output (Pappas and Mulcaire, 2020).
Language Model Types. We focus on a single model type for computational budget reasons. We chose an LSTM because, while transformer-based models such as GPT-2 now dominate transfer learning, LSTMs continue to be competitive in language modeling (Du et al., 2020; Li et al., 2020; Melis et al., 2018; Merity et al., 2017a). Our ideas are orthogonal to this prior work, and our findings may apply to transformers as well, but confirming that would require additional experiments.
5 Conclusion

Initialising embeddings with vectors trained on in-domain data can improve performance by providing better representations for rare words. This effect persists even as more in-domain data is used to train the language model. Our work also suggests that standard model components like embedding tying should be retested as we continue to explore the space of language modeling.
We would like to thank Greg Durrett for helpful feedback on an earlier draft of this paper and the anonymous reviewers for their helpful suggestions. This material is based in part on work supported by DARPA (grant #D19AP00079), Bloomberg (Data Science Research Grant), the National Science Foundation (grant #1815291), and the John Templeton Foundation (grant #61156).
- Note, for frozen output embeddings the bias is not frozen.
- Embedding size 400 and rare word cutoff 5, the same as in the original AWD-LSTM model and GloVe respectively. All other GloVe hyperparameters were set as specified in the original GloVe paper and trained using the released code.
- The script to generate our Rare data from the LDC release is available at: http://jkk.name/emnlp20lm/.
- Dice Icon by Andrew Doane from the Noun Project. Fire and Snowflake Icons by Freepik from www.flaticon.com.
References

- Contextual embeddings: when are they worth it? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2650–2663.
- Unsupervised neural machine translation. In International Conference on Learning Representations (ICLR).
- Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 536–541.
- Stolen probability: a structural weakness of neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2191–2197.
- Exploiting syntactic structure for better language modeling: a syntactic distance approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6611–6628.
- Representation degeneration problem in training natural language generation models. In International Conference on Learning Representations (ICLR).
- North American News Text Corpus LDC95T21. Web download. Philadelphia: Linguistic Data Consortium.
- Train no evil: selective masking for task-guided pre-training. arXiv:2004.09733.
- Don't stop pretraining: adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360.
- Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 690–696.
- Tying word vectors and word classifiers: a loss framework for language modeling. In International Conference on Learning Representations (ICLR).
- Learning to create and reuse words in open-vocabulary neural language modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1492–1502.
- An exploration of word embedding initialization in deep-learning tasks. In Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), pp. 56–64.
- A large-scale corpus for conversation disentanglement. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3846–3856.
- Learning architectures from an extended search space for language modeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6629–6639.
- Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19(2), pp. 313–330.
- On the state of the art of evaluation in neural language models. In International Conference on Learning Representations (ICLR).
- Regularizing and optimizing LSTM language models. In International Conference on Learning Representations (ICLR).
- Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR).
- Single headed attention RNN: stop thinking with your head. arXiv:1911.11423.
- A bag of useful tricks for practical neural machine translation: embedding layer initialization and large batch size. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pp. 99–109.
- ScispaCy: fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 319–327.
- Beyond weight tying: learning joint input-output embeddings for neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 73–83.
- Grounded compositional outputs for adaptive language modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
- To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 7–14.
- Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163.
- Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 101–108.
- When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 529–535.
- Language models are unsupervised multitask learners.
- BERTRAM: improved word embeddings have big impact on contextualized model performance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3996–4007.
- Show some love to your n-grams: a bit of progress and stronger n-gram language modeling baselines. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4113–4118.
- CORD-19: the COVID-19 open research dataset. In ACL NLP-COVID Workshop.