RobBERT: a Dutch RoBERTa-based Language Model
Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT (Bidirectional Encoder Representations from Transformers), which was released as an English as well as a multilingual version. Although multilingual BERT performs well on many tasks, recent studies showed that BERT models trained on a single language significantly outperform the multilingual version. Training a Dutch BERT model thus has a lot of potential for a wide range of Dutch NLP tasks. While previous approaches have used earlier implementations of BERT to train their Dutch BERT, we used RoBERTa, a robustly optimized BERT approach, to train a Dutch language model called RobBERT. We show that RobBERT improves state-of-the-art results in Dutch-specific language tasks, and also outperforms other existing Dutch BERT-based models in sentiment analysis. These results indicate that RobBERT is a powerful pre-trained model for fine-tuning on a large variety of Dutch language tasks. We publicly release this pre-trained model in the hope of supporting further downstream Dutch NLP applications.
The advent of neural networks in natural language processing (NLP) has significantly improved state-of-the-art results within the field. While recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) initially dominated the field, recent models started incorporating attention mechanisms and then later dropped the recurrent part and just kept the attention mechanisms in so-called transformer models (Vaswani et al., 2017). This latter type of model caused a new revolution in NLP and led to popular language models like GPT-2 (Radford et al., 2018, 2019) and ELMo (Peters et al., 2018). BERT (Devlin et al., 2019) improved over previous transformer models and recurrent networks by allowing the system to learn from input text in a bidirectional way, rather than only from left-to-right or the other way around. This model was later re-implemented, critically evaluated and improved in the RoBERTa model (Liu et al., 2019).
These large-scale transformer models provide the advantage of being able to solve NLP tasks by having a common, expensive pre-training phase, followed by a smaller fine-tuning phase. The pre-training happens in an unsupervised way by providing large corpora of text in the desired language. The second phase only needs a relatively small annotated data set for fine-tuning to outperform previous popular approaches in one of a large number of possible language tasks.
While language models are usually trained on English data, some multilingual models also exist.
These are usually trained on a large quantity of text in different languages.
For example, Multilingual-BERT is trained on a collection of corpora in 104 different languages (Devlin et al., 2019), and generalizes language components well across languages (Pires et al., 2019).
However, models trained on data from one specific language usually outperform multilingual models for this particular language (Martin et al., 2019; de Vries et al., 2019).
Training a RoBERTa model (Liu et al., 2019) on a Dutch dataset thus has a lot of potential for increasing performance for many downstream Dutch NLP tasks.
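In this paper, we introduce RobBERT, a Dutch RoBERTa-based pre-trained language model, and evaluate its performance on Dutch language tasks against other Dutch language models.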
2 Related Work
Transformer models have been successfully used for a wide range of language tasks. Initially, transformers were introduced for use in machine translation, where they vastly improved state-of-the-art results for English to German in an efficient manner (Vaswani et al., 2017). This transformer model architecture resulted in a new paradigm in NLP with the migration from sequence-to-sequence recurrent neural networks to transformer-based models by removing the recurrent component and only keeping attention. This cornerstone was used for BERT, a transformer model that obtained state-of-the-art results for eleven natural language processing tasks, such as question answering and natural language inference (Devlin et al., 2019). BERT is pre-trained with large corpora of text using two unsupervised tasks. The first task is word masking (also called the Cloze task (Taylor, 1953) or masked language model (MLM)), where the model has to guess which word is masked in a certain position in the text. The second task is next sentence prediction, where the model predicts whether two sentences are subsequent in the corpus or randomly sampled from it. These tasks allow the model to create internal representations of a language, which can thereafter be reused for different language tasks. This architecture has been shown to be a general language model that can be fine-tuned with little data in a relatively efficient way for a wide range of distinct tasks and still outperform previous architectures (Devlin et al., 2019).
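The word-masking task described above can be illustrated with a minimal sketch. The whitespace tokenization, the helper name `mask_tokens`, and the `[MASK]` placeholder are assumptions made for illustration. The default masking rate of 15% is the rate reported for the original BERT, whose actual procedure also replaces some selected tokens with random tokens or leaves them unchanged, which this sketch omits.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact string depends on the tokenizer


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask each token independently with probability mask_prob.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model has to recover.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


tokens = "the trophy does not fit in the brown suitcase".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked, targets)
```

The training objective is then to predict `targets` from `masked`, so that no manual annotation is needed.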
Transformer models are also capable of generating contextualized word embeddings. These contextualized embeddings were presented by Peters et al. (2018) and addressed the well-known issue of a word’s meaning being defined by its context (e.g. “a stick” versus “let’s stick to”). Traditional word embeddings like word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) lack this context, whereas BERT automatically incorporates the context a word occurs in.
Another advantage of transformer models is that attention allows them to better resolve coreferences between words (Joshi et al., 2019). A typical example for the importance of coreference resolution is “The trophy doesn’t fit in the brown suitcase because it’s too big.”, where the word “it” would refer to the suitcase instead of the trophy if the last word was changed to “small” (Levesque et al., 2012). Being able to resolve these coreferences is important, for example, when translating to languages with grammatical gender, as suitcase and trophy have different genders in French.
Although BERT has been shown to be a useful language model, its training and pre-processing have also received some scrutiny. As mentioned before, BERT uses next sentence prediction (NSP) as one of its two training tasks. In NSP, the model has to predict whether two sentences follow each other in the training text, or are just randomly selected from the corpora. The authors of RoBERTa (Liu et al., 2019) showed that while this task made the model achieve a better performance, it was not due to its intended reason, as it might merely predict relatedness rather than subsequent sentences. That Devlin et al. (2019) trained a better model when using NSP than without NSP is likely due to the model learning long-range dependencies in text from its inputs, which are longer than just a single sentence by itself. As such, the RoBERTa model uses only the MLM task, and uses multiple full sentences in every input. Other research improved the NSP task by instead making the model predict the correct order of two sentences, where the model thus has to predict whether the sentences occur in the given order in the corpus, or occur in flipped order (Lan et al., 2019).
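The difference between next sentence prediction and sentence order prediction (SOP) is easiest to see in how training pairs are built. The following is a minimal sketch with hypothetical helper names (`nsp_example`, `sop_example`) and a 50/50 sampling ratio chosen for illustration. It draws negatives from the same list of sentences, which is a simplification of how real pre-training pipelines sample them.

```python
import random


def nsp_example(sentences, i, rng):
    """NSP: pair sentence i with its true successor (label 1)
    or with a randomly chosen other sentence (label 0)."""
    if rng.random() < 0.5:
        return sentences[i], sentences[i + 1], 1
    candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
    return sentences[i], sentences[rng.choice(candidates)], 0


def sop_example(sentences, i, rng):
    """SOP: always use two consecutive sentences; label 1 if they are
    in corpus order, label 0 if they were flipped."""
    first, second = sentences[i], sentences[i + 1]
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0


sentences = ["Zin een.", "Zin twee.", "Zin drie.", "Zin vier."]
rng = random.Random(0)
print(nsp_example(sentences, 1, rng))
print(sop_example(sentences, 1, rng))
```

An NSP negative can often be detected from topic mismatch alone, whereas an SOP negative contains the same two sentences as the positive, so only their order distinguishes the classes.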
Devlin et al. (2019) also presented a multilingual model (mBERT) with the same architecture as BERT, but trained on Wikipedia corpora in 104 languages. Unfortunately, the quality of these multilingual embeddings is often considered worse than that of their monolingual counterparts. Rönnqvist et al. (2019) illustrated this difference in quality for German and English models in a generative setting. The authors of the monolingual French CamemBERT model (Martin et al., 2019) also compared their model to mBERT, which performed worse on all tasks. More recently, de Vries et al. (2019) showed similar results for Dutch using their BERTje model, outperforming multilingual BERT in a wide range of tasks, such as sentiment analysis and part-of-speech tagging. Since this work is concurrent with ours, we compare our results with BERTje in this paper.
3 Pre-training RobBERT
This section describes the data and training regime we used to train our Dutch RoBERTa-based language model called RobBERT.
We pre-trained our model on the Dutch section of the OSCAR corpus, a large multilingual corpus which was obtained by language classification in the Common Crawl corpus (Ortiz Suárez et al., 2019). This Dutch corpus has 6.6 billion words, totalling 39 GB of text. It contains 126,064,722 lines of text, where each line can contain multiple sentences. Subsequent lines are however not related to each other, due to the shuffled nature of the OSCAR data set. For comparison, the French RoBERTa-based language model CamemBERT (Martin et al., 2019) has been trained on the French portion of OSCAR, which consists of 138 GB of scraped text.
Our data differs in several ways from the data used to train BERTje, a BERT-based Dutch language model (de Vries et al., 2019). Firstly, they trained the model on an assembly of multiple Dutch corpora totalling only 12 GB. Secondly, they used WordPiece for subword tokenization, since this is what the original BERT architecture uses. RobBERT on the other hand uses Byte Pair Encoding (BPE), which is also used by GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019).
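To make the idea behind Byte Pair Encoding concrete, the following toy sketch learns merges from a small word list by repeatedly fusing the most frequent adjacent symbol pair. It is a character-level illustration with hypothetical helper names (`learn_bpe`, `merge_pair`, `encode`) and an assumed end-of-word marker `</w>`. GPT-2 and RoBERTa use a byte-level variant with a far larger vocabulary, so this is not the tokenizer used for RobBERT.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` by one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merges, most frequent pair first."""
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Ties are broken on the pair itself to keep the result deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into subword units by replaying the learned merges."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["die"] * 5 + ["dat"] * 3 + ["de"] * 2
merges = learn_bpe(words, num_merges=4)
print(merges)
print(encode("die", merges))
```

Frequent words end up as single units while rare words are split into smaller pieces, which keeps the vocabulary bounded without producing out-of-vocabulary tokens.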
RobBERT shares its architecture with RoBERTa’s base model, which itself is a replication and improvement over BERT (Liu et al., 2019). The architecture of our language model is thus equal to the original BERT model with 12 self-attention layers with 12 heads (Devlin et al., 2019). One difference with the original BERT is due to the different pre-training task specified by RoBERTa, using only the MLM task and not the NSP task. The training thus only uses word masking, where the model has to predict which words were masked in certain positions of a given line of text. The training process uses the Adam optimizer (Kingma and Ba, 2017) with polynomial decay of the learning rate and a ramp-up period of 1000 iterations, with parameters β1 = 0.9 (a common default) and RoBERTa’s default β2 = 0.98. Additionally, we used a weight decay of 0.1 as well as a small dropout of 0.1 to help prevent the model from overfitting (Srivastava et al., 2014).
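A learning-rate schedule with a ramp-up followed by polynomial decay can be sketched as below. Only the ramp-up length of 1000 iterations comes from the text. The function name, the peak learning rate, the total number of steps, the decay power, and the final learning rate are hypothetical values chosen for illustration, and the linear ramp-up is an assumption about the shape of the warm-up.

```python
def polynomial_decay_lr(step, peak_lr, warmup_steps=1000, total_steps=16000,
                        end_lr=0.0, power=1.0):
    """Learning rate at a given update step.

    Linear ramp-up from 0 to peak_lr during warmup_steps, then polynomial
    decay from peak_lr to end_lr at total_steps (power=1.0 is linear decay).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (peak_lr - end_lr) * remaining ** power + end_lr


# Hypothetical peak learning rate, for illustration only.
for step in (0, 500, 1000, 8500, 16000):
    print(step, polynomial_decay_lr(step, peak_lr=1e-4))
```

The ramp-up avoids large, noisy updates while the optimizer's moment estimates are still poorly calibrated, and the decay lets the model settle towards the end of training.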
We used a computing cluster in order to efficiently pre-train our model. More specifically, the pre-training was executed on a computing cluster with 20 nodes with 4 Nvidia Tesla P100 GPUs (16 GB VRAM each) and 2 nodes with 8 Nvidia V100 GPUs (32 GB VRAM each). This pre-training happened in fixed batches of 8192 sentences by rescaling each GPU’s batch size depending on the number of GPUs available, in order to maximally utilize the cluster without blocking it entirely for other users. The model trained for two epochs, which is over 16k batches in total. With the large batch size of 8192, this equates to 0.5M updates for a traditional BERT model. At this point, the perplexity did not decrease any further.
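The batch arithmetic in this paragraph can be checked with a few lines. The sketch below assumes a reference batch size of 256 sequences for a traditional BERT model, which is the value reported for the original BERT and is not stated in this text. The helper names are hypothetical, and the even split across GPUs is a simplification, since the cluster mixes GPUs with different amounts of memory.

```python
import math

EFFECTIVE_BATCH = 8192  # fixed batch size from the text


def per_gpu_share(num_gpus, effective_batch=EFFECTIVE_BATCH):
    """Sequences each GPU must process per update when the effective
    batch is split evenly over the available GPUs (rounded up)."""
    return math.ceil(effective_batch / num_gpus)


def equivalent_updates(num_batches, batch_size, reference_batch_size=256):
    """Number of updates a model with reference_batch_size would need
    to see the same number of sequences."""
    return num_batches * batch_size // reference_batch_size


p100_gpus = 20 * 4  # 20 nodes with 4 P100 GPUs each
v100_gpus = 2 * 8   # 2 nodes with 8 V100 GPUs each
print(per_gpu_share(p100_gpus + v100_gpus))
print(per_gpu_share(v100_gpus))
print(equivalent_updates(16_000, EFFECTIVE_BATCH))
```

With 16k batches of 8192 sequences, the equivalent number of 256-sequence updates is 512,000, in line with the 0.5M figure given above.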
We evaluated RobBERT in several different settings on multiple downstream tasks. First, we compare its performance with other BERT models and state-of-the-art systems in sentiment analysis, to show its performance for classification tasks. Second, we compare its performance in a recent Dutch language task, namely the disambiguation of demonstrative pronouns, which allows us to additionally compare the zero-shot performance of RobBERT and other BERT models, i.e. using only the pre-trained model without any fine-tuning.
Table 1: Results of RobBERT and other models fine-tuned on 10k training examples and on the full training set, reported as accuracy (ACC) with 95% confidence interval (CI) and F1 score.
|Task + model|10k: ACC (95% CI) [%]|10k: F1 [%]|Full dataset: ACC (95% CI) [%]|Full dataset: F1 [%]|
|Sentiment Analysis (DBRD)|
|van der Burgh and Verberne (2019)|—|—|93.8*|—|
|BERTje (de Vries et al., 2019)|—|—|93.0**|—|
|RobBERT (ours)|86.730 (85.32, 88.14)|86.729|94.422 (93.47, 95.38)|94.422|
|Die/Dat Disambiguation (Europarl)|
|Baseline (Allein et al., 2020)|—|—|75.03***|—|
|mBERT (Devlin et al., 2019)|92.157 (92.06, 92.25)|90.898|98.285 (98.24, 98.33)|98.033|
|BERTje (de Vries et al., 2019)|93.096 (92.84, 93.36)|91.279|98.268 (98.22, 98.31)|98.014|
|RobBERT (ours)|97.006 (96.95, 97.07)|96.571|98.406 (98.36, 98.45)|98.169|
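The confidence intervals reported for the sentiment analysis task in Table 1 are numerically consistent with a normal-approximation (Wald) interval for a binomial proportion on the 2,224 test examples of Section 4.1. This is an observation about the numbers, not a statement about how the intervals were actually computed. The function name below is hypothetical, and z = 1.96 is the usual two-sided 95% quantile.

```python
import math


def wald_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured
    on n independent test examples."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


n_test = 2224  # number of labeled DBRD test examples
for accuracy in (0.86730, 0.94422):
    low, high = wald_interval(accuracy, n_test)
    print(f"{100 * accuracy:.3f} ({100 * low:.2f}, {100 * high:.2f})")
```

Under this approximation, the 93.8% accuracy reported for ULMFiT lies inside the interval around RobBERT's accuracy on the full dataset, which is why the difference between the two cannot be called statistically significant.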
4.1 Sentiment Analysis
We replicated the high-level sentiment analysis task used to evaluate BERTje (de Vries et al., 2019) to be able to compare our methods. This task uses a dataset called Dutch Book Reviews Dataset (DBRD), in which book reviews scraped from hebban.nl are labeled as positive or negative (van der Burgh and Verberne, 2019). Although the dataset contains 118,516 reviews, only 22,252 of these reviews are actually labeled as positive or negative. The DBRD dataset is already split into a balanced 10% test set and 90% training set, allowing us to easily compare to other models trained for solving this task. This dataset was released in a paper analysing the performance of a ULMFiT model (Universal Language Model Fine-tuning for Text Classification) (van der Burgh and Verberne, 2019).
We fine-tuned RobBERT on the first 10,000 training examples as well as on the full data set. While the ULMFiT model is first fine-tuned using the unlabeled reviews before training the classifier (van der Burgh and Verberne, 2019), it is unclear whether BERTje also first fine-tuned on the unlabeled reviews or only used the labeled data for fine-tuning the pre-trained model. It is also unclear how it dealt with reviews being longer than the maximum number of tokens allowed as input in BERT models, as the average book review length is 547 tokens, with 40% of the documents being longer than our RobBERT model can handle. For a safe comparison, we thus decided to discard the unlabeled data and only use the labeled data for training and test purposes (20,028 and 2,224 examples respectively), and compare approaches for dealing with too long input sequences. We trained our model for 2000 iterations with a batch size of 128 and a learning-rate warm-up of 500 iterations. We found that our model performed better when trained on the last part of the book reviews than on the first part. This is likely due to this part containing concluding remarks summarizing the overall sentiment. While BERTje was slightly outperformed by ULMFiT (de Vries et al., 2019; van der Burgh and Verberne, 2019), we can see that RobBERT achieves better performance than both on the test set, although the difference with the ULMFiT model is not statistically significant, as can be seen in Table 1.
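The two ways of shortening an overly long review, keeping the first part or keeping the last part, can be sketched as follows. The function name is hypothetical, and the limit of 512 tokens is an assumption based on the usual maximum input length of BERT-style base models. A real pipeline would also reserve room for the special start and end tokens.

```python
def truncate(tokens, max_len=512, keep="tail"):
    """Shorten a token sequence to at most max_len tokens.

    keep="head" keeps the first max_len tokens,
    keep="tail" keeps the last max_len tokens.
    """
    if keep not in ("head", "tail"):
        raise ValueError("keep must be 'head' or 'tail'")
    if len(tokens) <= max_len:
        return list(tokens)
    if keep == "head":
        return list(tokens[:max_len])
    return list(tokens[-max_len:])


review = [f"tok{i}" for i in range(547)]  # a review of average length
print(truncate(review, keep="head")[-1])
print(truncate(review, keep="tail")[0])
```

Keeping the tail preserves the closing sentences of a review, which matches the observation above that the last part of a review is the more informative one for sentiment.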
4.2 Die/Dat Disambiguation
Aside from the classic natural language processing task in the previous subsection, we also evaluated RobBERT’s performance on a task that is specific to Dutch, namely disambiguating “die” and “dat” (= “that” in English). In Dutch, depending on the sentence, both terms can be either demonstrative or relative pronouns; in addition they can also be used in a subordinating conjunction, i.e. to introduce a clause. The use of either of these words depends on the gender of the word it refers to. Distinguishing these words is a task introduced by Allein et al. (2020), who presented multiple models trained on the Europarl (Koehn, 2005) and SoNaR corpora (Oostdijk et al., 2013). The results ranged from an accuracy of 75.03% on Europarl to 84.56% on SoNaR.
For this task, we use the Dutch version of the Europarl corpus (Koehn, 2005), which we split into 1.3M utterances for training, 319k for validation, and 399k for testing. We then process every sentence by checking if it contains “die” or “dat”, and if so, add a training example for every occurrence of this word in the sentence, where a single occurrence is masked. For the test set, for example, this resulted in about 289k masked sentences. We then test two different approaches for solving this task on this dataset. The first approach is making the BERT models use their MLM task to guess which word should be filled in at this spot, and checking whether they have more confidence in “die” or in “dat” (by checking the first 2,048 guesses at most, as this seemed sufficiently large). This allows us to compare the zero-shot BERT models, i.e. without any fine-tuning after pre-training, for which the results can be seen in Table 2. The second approach uses the same data, but creates two sentences by filling in the mask with both “die” and “dat”, joining them with the [SEP] token and making the model predict which of the two sentences is correct. The fine-tuning was performed using 4 Nvidia GTX 1080 Ti GPUs and evaluated against the same test set of 399k utterances. As before, we fine-tuned the model twice: once with the full training set and once with a subset of 10k utterances from the training set, to illustrate the benefits of pre-training on low-resource tasks.
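The data preparation and the two approaches described above can be sketched in a few functions. All helper names, the simple regular-expression tokenization, and the `<mask>` placeholder are assumptions made for illustration. The list of ranked guesses stands in for the output of a masked language model, which is not called here.

```python
import re

MASK = "<mask>"  # placeholder; the actual mask token depends on the model
TARGETS = ("die", "dat")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def make_examples(sentence):
    """Create one (masked_sentence, label) pair per occurrence of
    'die' or 'dat', masking a single occurrence at a time."""
    tokens = TOKEN_RE.findall(sentence)
    examples = []
    for i, token in enumerate(tokens):
        if token.lower() in TARGETS:
            masked = tokens[:i] + [MASK] + tokens[i + 1:]
            examples.append((" ".join(masked), token.lower()))
    return examples


def zero_shot_choice(ranked_guesses, top_k=2048):
    """First approach: return whichever of 'die'/'dat' ranks highest
    among the model's top_k guesses, or None if neither occurs."""
    for guess in ranked_guesses[:top_k]:
        word = guess.strip().lower()
        if word in TARGETS:
            return word
    return None


def pair_input(masked_sentence, sep=" [SEP] "):
    """Second approach: fill the mask with both candidates and join the
    two resulting sentences with the separator token."""
    return sep.join(masked_sentence.replace(MASK, word, 1) for word in TARGETS)


examples = make_examples("Ik denk dat de man die daar staat mijn buurman is.")
for masked_sentence, label in examples:
    print(label, "|", masked_sentence)
print(zero_shot_choice(["de", "dat", "die"]))
print(pair_input(examples[0][0]))
```

The first approach needs no labeled data at all, since the label is recovered from the original sentence, while the second turns the same examples into a binary classification input for fine-tuning.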
Table 2: Zero-shot accuracy on the die/dat disambiguation task.
|Model|ACC [%]|
|ZeroR (majority class)|66.70|
|mBERT (Devlin et al., 2019)|90.21|
|BERTje (de Vries et al., 2019)|94.94|
RobBERT outperforms previous models as well as other BERT models, both with and without fine-tuning (see Table 1 and Table 2). It is also able to reach similar performance using less data. The fact that zero-shot RobBERT outperforms other zero-shot BERT models is also an indication that the base model has internalised more knowledge about Dutch than the other two have. The reason RobBERT and other BERT models outperform the previous RNN-based approach is likely the transformer’s ability to deal better with coreference resolution (Joshi et al., 2019), and by extension to better decide which word the “die” or “dat” belongs to.
The training and evaluation code of this paper as well as the RobBERT model and the fine-tuned models are publicly available for download on https://github.com/iPieter/RobBERT.
6 Future Work
There are several possible improvements as well as interesting future directions for this research, for example in training similar models. First, as BERT-based models are a very active field of research, it is interesting to experiment with replacing the pre-training tasks with new unsupervised tasks when they are discovered, such as sentence order prediction (Lan et al., 2019). Second, while RobBERT is trained on lines that contain multiple sentences, it does not put subsequent lines of the corpus after each other due to the shuffled nature of the OSCAR corpus (Ortiz Suárez et al., 2019). This is unlike RoBERTa, which does put full sentences next to each other if they fit, in order to learn the long-range dependencies between words that the original BERT learned using its controversial NSP task. It could be interesting to use the processor used to create OSCAR in order to create an unshuffled version to train on, such that this technique can be used on the data set. Third, RobBERT uses the same tokenizer as RoBERTa, meaning it uses a tokenizer built for the English language. Training a new model using a custom Dutch tokenizer, e.g. using the newly released HuggingFace tokenizers library (Wolf et al., 2019), could increase the performance even further. On the same note, incorporating more Unicode glyphs as separate tokens can also be beneficial, for example for tasks related to conversational agents (Delobelle and Berendt, 2019).
RobBERT itself could also be used in new settings to help future research. First, RobBERT could be used in different settings thanks to the renewed interest in sequence-to-sequence models due to their results on a vast range of language tasks (Raffel et al., 2019; Lewis et al., 2019). These models use a BERT-like transformer stack for the encoder and, depending on the task, a generative model as a decoder. These advances once again highlight the flexibility of the self-attention mechanism, and it might be interesting to research the re-usability of RobBERT in these types of architectures. Second, there are many Dutch language tasks that we did not examine in this paper, for which it may also be possible to achieve state-of-the-art results when fine-tuned on this pre-trained model.
We introduced a new language model for Dutch based on RoBERTa, called RobBERT, and showed that it outperforms earlier approaches for Dutch language tasks, as well as other BERT-based language models. We thus hope this model can serve as a base for fine-tuning on other tasks, and thus help foster new models that might advance results for Dutch language tasks.
Pieter Delobelle was supported by the Research Foundation - Flanders under EOS No. 30992574 and received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme. Thomas Winters is a fellow of the Research Foundation - Flanders (FWO-Vlaanderen). Most computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government - department EWI. We are especially grateful to Luc De Raedt for his guidance as well as for providing the facilities to complete this project. We are thankful to Liesbeth Allein and her supervisors for inspiring us to use the die/dat task. We are also grateful to Ott et al. (2019); Paszke et al. (2019); Haghighi et al. (2018); Wolf et al. (2019) for their software packages.
- The model named itself RobBERT when it was prompted with “Ik heet <mask>BERT.” (“My name is <mask>BERT.”), which we found quite a suitable name.
- Allein et al. (2020). Binary and Multitask Classification Model for Dutch Anaphora Resolution: Die/Dat Prediction. arXiv:2001.02943 [cs].
- de Vries et al. (2019). BERTje: A Dutch BERT Model. arXiv:1912.09582 [cs].
- Delobelle and Berendt (2019). Time to take emoji seriously: They vastly improve casual conversational models. In Proceedings of the Reference AI & ML Conference for Belgium, Netherlands & Luxemburg, Brussels, Belgium, pp. 1–7.
- Devlin et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- Haghighi et al. (2018). PyCM: Multiclass confusion matrix library in Python. Journal of Open Source Software 3 (25), pp. 729.
- Joshi et al. (2019). SpanBERT: Improving pre-training by representing and predicting spans.
- Kingma and Ba (2017). Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs].
- Koehn (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit, Vol. 5, pp. 79–86.
- Lan et al. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942.
- Levesque et al. (2012). The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
- Lewis et al. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
- Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs].
- Martin et al. (2019). CamemBERT: a Tasty French Language Model. arXiv:1911.03894 [cs].
- Mikolov et al. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
- Oostdijk et al. (2013). The construction of a 500-million-word reference corpus of contemporary written Dutch. In Essential Speech and Language Technology for Dutch: Results by the STEVIN-Programme.
- Ortiz Suárez et al. (2019). Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Cardiff, United Kingdom.
- Ott et al. (2019). Fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Paszke et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
- Pennington et al. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
- Pires et al. (2019). How multilingual is multilingual BERT? arXiv:1906.01502.
- Radford et al. (2018). Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
- Raffel et al. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683 [cs, stat].
- Rönnqvist et al. (2019). Is multilingual BERT fluent in language generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, Turku, Finland, pp. 29–36.
- Srivastava et al. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Taylor (1953). “Cloze procedure”: A new tool for measuring readability. Journalism Bulletin 30 (4), pp. 415–433.
- van der Burgh and Verberne (2019). The merits of Universal Language Model Fine-tuning for Small Datasets – a case with Dutch book reviews. arXiv:1910.00896 [cs].
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 5998–6008.
- Wolf et al. (2019). HuggingFace’s transformers: State-of-the-art natural language processing. arXiv:1910.03771.