Improving Pre-Trained Multilingual Models with Vocabulary Expansion

Hai Wang1*    Dian Yu2    Kai Sun3*    Jianshu Chen2    Dong Yu2
1Toyota Technological Institute at Chicago, Chicago, IL, USA
2Tencent AI Lab, Bellevue, WA, USA
3Cornell, Ithaca, NY, USA

Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential.

In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on a pre-trained multilingual model BERT for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that using mixture mapping is more promising. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.

* This work was done when H. W. and K. S. were at Tencent AI Lab, Bellevue, WA.

1 Introduction

It has been shown that performance on many natural language processing tasks drops dramatically on held-out data when a significant percentage of words do not appear in the training data, i.e., out-of-vocabulary (OOV) words sogaard2012robust; madhyastha2016mapping. A higher OOV rate (i.e., the percentage of the unseen words in the held-out data) may lead to a more severe performance drop kaljahi2015foreebank. OOV problems have been addressed in previous works under monolingual settings, through replacing OOV words with their semantically similar in-vocabulary words madhyastha2016mapping; kolachina2017replacing or using character/word information kim2016character; kim2018learning; chen2018combining or subword information like byte pair encoding (BPE) sennrich2016neural; stratos2017sub.

Recently, fine-tuning a pre-trained deep language model, such as Generative Pre-Training (GPT) radford2018improving and Bidirectional Encoder Representations from Transformers (BERT) devlin2018bert, has achieved remarkable success on various downstream natural language processing tasks. Instead of pre-training many monolingual models like the existing English GPT, English BERT, and Chinese BERT, a more natural choice is to develop a powerful multilingual model such as the multilingual BERT.

However, all those pre-trained models rely on language modeling, where a common trick is to tie the weights of softmax and word embeddings press2017using. Due to the expensive computation of softmax yang2017breaking and data imbalance across different languages, the vocabulary size for each language in a multilingual model is relatively small compared to the monolingual BERT/GPT models, especially for low-resource languages. Even for a high-resource language like Chinese, its vocabulary size k in the multilingual BERT is only half the size of that in the Chinese BERT. Just as in monolingual settings, the OOV problem also hinders the performance of a multilingual model on tasks that are sensitive to token-level or sentence-level information. For example, in the POS tagging problem (Table 2), 11 out of 16 languages have significant OOV issues (OOV rate ) when using multilingual BERT.

According to previous work radford2018improving; devlin2018bert, it is time-consuming and resource-intensive to pre-train a deep language model over large-scale corpora. To address the OOV problems, instead of pre-training a deep model with a large vocabulary, we aim at enlarging the vocabulary size when we fine-tune a pre-trained multilingual model on downstream tasks.

We summarize our contributions as follows: (i) We investigate and compare two methods to alleviate the OOV issue. To the best of our knowledge, this is the first attempt to address the OOV problem in multilingual settings. (ii) By using English as an interlingua, we show that bilingual information helps alleviate the OOV issue, especially for low-resource languages. (iii) We conduct extensive experiments on a variety of token-level and sentence-level downstream tasks to examine the strengths and weaknesses of these methods, which may provide key insights into future directions.

Footnote 1: Improved models will be available at

2 Approach

We use the multilingual BERT as the pre-trained model. We first introduce the pre-training procedure of this model (Section 2.1) and then introduce two methods we investigate to alleviate the OOV issue by expanding the vocabulary (Section 2.2). Note that these approaches are not restricted to BERT but also applicable to other similar models.

2.1 Pre-Trained BERT

BERT devlin2018bert uses a bidirectional transformer. In contrast, GPT radford2018improving pre-trains a left-to-right transformer liu2018generating, and ELMo peters2018deep independently trains left-to-right and right-to-left LSTMs peters2017semi to generate representations as additional features for end tasks.

In the pre-training stage, devlin2018bert use two objectives: masked language model (LM) and next sentence prediction (NSP). In masked LM, they randomly mask some input tokens and then predict these masked tokens. Compared to unidirectional LM, masked LM enables representations to fuse the context from both directions. In the NSP task, given a certain sentence, the model aims to predict the next sentence. The NSP objective is added because many downstream tasks such as question answering and language inference require sentence-level understanding, which is not directly captured by LM objectives.

After pre-training on large-scale corpora like Wikipedia and BookCorpus zhu2015aligning, we follow recent work radford2018improving; devlin2018bert to fine-tune the pre-trained model on different downstream tasks with minimal architecture adaptation. We show how to adapt BERT to different downstream tasks in Figure 1 (left).

Figure 1: Left: fine-tuning BERT on different kinds of end tasks. Right: illustration of joint and mixture mapping (in this example, during mixture mapping, we represent ).

2.2 Vocabulary Expansion


devlin2018bert pre-train the multilingual BERT on Wikipedia in languages, with a shared vocabulary that contains k subwords calculated from the WordPiece model wu2016google. If we ignore the shared subwords between languages, on average, each language has a k vocabulary, which is significantly smaller than that of a monolingual pre-trained model such as GPT (k). The OOV problem tends to be less serious for languages (e.g., French and Spanish) that belong to the same language family as English. However, this is not always true, especially for morphologically rich languages such as German ataman2018compositional; lample2018phrase. The OOV problem is much more severe in low-resource scenarios, especially when a language (e.g., Japanese and Urdu) uses an entirely different character set from high-resource languages.

We focus on addressing the OOV issue at subword level in multilingual settings. Formally, suppose we have an embedding $E_{bert}$ extracted from the (non-contextualized) embedding layer in the multilingual BERT (i.e., the first layer of BERT). And suppose we have another set of (non-contextualized) subword embeddings $\{E_{l_1}, E_{l_2}, \ldots, E_{l_n}\} \cup \{E_{en}\}$, which are pre-trained on large corpora using any standard word embedding toolkit. Specifically, $E_{en}$ represents the pre-trained embedding for English, and $E_{l_i}$ represents the pre-trained embedding for non-English language $l_i$ at the subword level. We denote the vocabulary of $E_{bert}$, $E_{en}$, and $E_{l_i}$ by $V_{bert}$, $V_{en}$, and $V_{l_i}$, respectively. For each subword $w$ in $V_{bert}$, we use $E_{bert}(w)$ to denote the pre-trained embedding of $w$ in $E_{bert}$. $E_{l_i}(\cdot)$ and $E_{en}(\cdot)$ are defined in a similar way as $E_{bert}(\cdot)$. For each non-English language $l_i$, we aim to enrich $E_{bert}$ with more subwords from the vocabulary in $E_{l_i}$ since $E_{l_i}$ contains a larger vocabulary of language $l_i$ compared to $E_{bert}$.

As there is no previous work that addresses multilingual OOV issues, we investigate the following two methods, inspired by previous solutions designed for monolingual settings. Both can be applied at either the word or the subword level, though we find that the subword level works better (Section 3).

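Joint Mapping For each non-English language $l_i$, we first construct a joint embedding space $E'_{l_i}$ through mapping its pre-trained embedding $E_{l_i}$ to the English embedding $E_{en}$ by an orthogonal mapping matrix $B^{l_i}$ (i.e., $E'_{l_i} = E_{l_i} B^{l_i}$). When a bilingual dictionary $f_{l_i}: V_{l_i} \to V_{en}$ is available or can be constructed based on the shared common subwords (Section 3.1), we obtain $B^{l_i}$ by minimizing:

$$\sum_{w \in \mathrm{dom}(f_{l_i})} \left\| E_{l_i}(w)\, B^{l_i} - E_{en}(f_{l_i}(w)) \right\|_F^2$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Otherwise, for a language pair (e.g., English-Urdu) that meets neither of the above two conditions, we obtain $B^{l_i}$ by an unsupervised word alignment method from MUSE conneau2017word.

We then map $E'_{l_i}$ to the BERT embedding $E_{bert}$ by an orthogonal mapping matrix $A'^{l_i}$, which is obtained by minimizing

$$\sum_{w \in f_{l_i}(V_{l_i}) \cap V_{bert}} \left\| E'_{l_i}(w)\, A'^{l_i} - E_{bert}(w) \right\|_F^2$$

We denote this method by $M_J$ in our discussion below, where the subscript $J$ stands for “joint”.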
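The orthogonal mapping step can be illustrated with a small sketch. It assumes the matrix is found with the closed-form orthogonal Procrustes solution (an SVD), which is one standard way to solve such a constrained least-squares problem; the text does not state which solver was actually used. The function name `orthogonal_map` and the toy data are hypothetical.

```python
import numpy as np


def orthogonal_map(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    src, tgt: (n_pairs, dim) arrays; row i of src is the embedding of a
    dictionary entry and row i of tgt is the embedding of its translation.
    """
    # Closed-form orthogonal Procrustes solution via the SVD of src^T tgt.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Toy data (hypothetical): the "target" space is an exact rotation of the source.
rng = np.random.default_rng(0)
dim = 4
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = rng.normal(size=(50, dim))
tgt = src @ true_rotation

w = orthogonal_map(src, tgt)
residual = np.linalg.norm(src @ w - tgt)
```

In joint mapping, a routine of this kind would be applied twice: once to bring the non-English embedding into the English space using dictionary pairs, and once to bring the resulting joint space into the BERT embedding space using subwords present in both.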

Mixture Mapping Following the work of gu2018universal, where English is used as “universal tokens” and all other languages are mapped to English to obtain the subword embeddings, we represent each subword in the joint space $E'_{l_i}$ (described in joint mapping) as a mixture of English subwords where those English subwords are already in the BERT vocab $V_{bert}$. This method, denoted by $M_M$, is also a joint mapping without the need for learning the mapping from $E'_{l_i}$ to the BERT embedding $E_{bert}$. Specifically, for each $w \in V_{l_i}$, we obtain its embedding $e(w)$ in the BERT embedding space as follows.

$$e(w) = \sum_{u \in T(w)} p(u \mid w)\, E_{bert}(u)$$

where $T(w)$ is a set to be defined later, and the mixture coefficient $p(u \mid w)$ is defined by

$$p(u \mid w) = \frac{\exp\left(\mathrm{CSLS}(E'_{l_i}(w), E_{en}(u))\right)}{\sum_{v \in T(w)} \exp\left(\mathrm{CSLS}(E'_{l_i}(w), E_{en}(v))\right)}$$

where CSLS refers to the distance metric Cross-domain Similarity Local Scaling conneau2017word. We select the five $u \in V_{en} \cap V_{bert}$ with the highest $\mathrm{CSLS}(E'_{l_i}(w), E_{en}(u))$ to form set $T(w)$. In all our experiments, we set the number of nearest neighbors in CSLS to . We refer readers to conneau2017word for details. Figure 1 (right) illustrates the joint and mixture mapping.
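A minimal sketch of mixture mapping follows. The CSLS score uses the definition from conneau2017word (twice the cosine similarity minus the mean similarity of each side to its k nearest neighbours in the other space). The mixture coefficient is assumed here to be a softmax over the CSLS scores of the five selected English subwords; the function names, the value of `k`, and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def normalize(m):
    """Scale every row to unit length so dot products are cosines."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def csls_scores(src, tgt, k):
    """CSLS similarity between every source row and every target row."""
    cos = normalize(src) @ normalize(tgt).T              # (n_src, n_tgt)
    # Mean cosine of each row to its k nearest neighbours in the other space.
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)    # one value per source row
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)    # one value per target row
    return 2 * cos - r_src[:, None] - r_tgt[None, :]


def mixture_embeddings(new_vecs, en_vecs, en_bert_vecs, k=2, top=5):
    """Embed new subwords as weighted mixes of BERT vectors of English subwords.

    new_vecs:     (n_new, d) new subwords, already mapped into the English space
    en_vecs:      (n_en, d) English subwords that also exist in the BERT vocab
    en_bert_vecs: (n_en, d_bert) BERT embeddings of those same English subwords
    """
    scores = csls_scores(new_vecs, en_vecs, k)
    top_idx = np.argsort(-scores, axis=1)[:, :top]       # best `top` English subwords
    top_scores = np.take_along_axis(scores, top_idx, axis=1)
    # Assumed softmax weighting (shifted by the row maximum for stability).
    weights = np.exp(top_scores - top_scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("nt,ntd->nd", weights, en_bert_vecs[top_idx])


# Toy data (hypothetical sizes): 3 new subwords, 8 English anchors.
rng = np.random.default_rng(0)
new_vecs = rng.normal(size=(3, 4))
en_vecs = rng.normal(size=(8, 4))
en_bert_vecs = rng.normal(size=(8, 6))
mixed = mixture_embeddings(new_vecs, en_vecs, en_bert_vecs, k=2, top=5)
```

Because the weights are non-negative and sum to one, every new embedding lies inside the convex hull of existing BERT embeddings, so no second mapping into the BERT space has to be learned.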

3 Experiment

3.1 Experiment Settings

We obtain the pre-trained embeddings of a specific language by training fastText bojanowski2017enriching on Wikipedia articles in that language, with context window and negative sampling . Before training, we first apply BPE sennrich2016neural to tokenize the corpus with subword vocabulary size k. For joint mapping method , we use bilingual dictionaries provided by conneau2017word. For a language pair where a bilingual dictionary is not easily available, if two languages share a significant number of common subwords (this often happens when two languages belong to the same language family), we construct a bilingual dictionary based on the assumption that identical subwords have the same meaning sogaard2018limitations. We add all unseen subwords from k vocabulary to BERT. We define a word as an OOV word if it cannot be represented as a single entry of the vocabulary. For example, in BERT, the sentence “Je sens qu’ entre ça et les films de médecins et scientifiques” is represented as “je sens qu ##’ entre [UNK] et les films de [UNK] et scientifiques”. Here qu’ is an OOV word, since it can only be represented by two subword units (qu and ##’), but it is not OOV at the subword level. In contrast, ça and médecins cannot be represented by any single word or combination of subword units, and thus they are OOV at both the word and subword level.
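The two OOV definitions above can be made concrete with a toy example. The sketch below uses a simplified greedy longest-match splitter and a tiny hand-made vocabulary, both hypothetical stand-ins for the real BERT tokenizer and vocabulary, and counts a word as word-level OOV when it is not a single vocabulary entry and as subword-level OOV when it cannot be split into known pieces at all.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def oov_rates(words, vocab):
    """Return (word-level OOV rate, subword-level OOV rate)."""
    word_oov = subword_oov = 0
    for w in words:
        pieces = wordpiece(w, vocab)
        if pieces != [w]:
            word_oov += 1       # not a single vocabulary entry
        if "[UNK]" in pieces:
            subword_oov += 1    # not representable even with subwords
    return word_oov / len(words), subword_oov / len(words)


# Toy vocabulary (hypothetical) reproducing the French example in the text.
vocab = {"je", "sens", "qu", "##’", "entre", "et", "les", "films", "de", "scientifiques"}
words = "je sens qu’ entre ça et les films de médecins et scientifiques".split()
rates = oov_rates(words, vocab)
```

With this vocabulary, three of the twelve words (qu’, ça, médecins) are OOV at the word level and two (ça, médecins) are OOV at the subword level.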

We use the multilingual BERT with default parameters in all our experiments, except that we tune the batch size and training epochs. To thoroughly examine the pros and cons of the explored methods, we conduct experiments on a variety of token-level and sentence-level classification tasks: part-of-speech (POS) tagging, named entity recognition (NER), machine translation quality estimation, and machine reading comprehension. See more details in each subsection.

3.2 Discussions about Mapping Methods

Previous work typically assumes a linear mapping exists between embedding spaces of different languages if their embeddings are trained using similar techniques xing2015normalized; madhyastha2016mapping. However, it is difficult to map embeddings learned with different methods sogaard2018limitations. Considering the differences between BERT and fastText (e.g., the objectives, the way to differentiate between different subwords, and the much deeper architecture of BERT), it is very unlikely that the (non-contextualized) BERT embedding and fastText embedding reside in the same geometric space. Besides, because BERT has a relatively small vocabulary for each language, when we map a pre-trained vector to its portion in BERT indirectly as in previous methods, the supervision signal is relatively weak, especially for low-resource languages. In our experiment, we find that the accuracy of the mapping from our pre-trained English embedding to multilingual BERT embedding (English portion) is lower than . In contrast, the accuracy of the mapping between two regular English embeddings that are pre-trained using similar methods (e.g., CBOW or SkipGram mikolov2013distributed) could be above  conneau2017word.

Besides the joint mapping method (Section 2.2), another possible method for the OOV problem in a multilingual setting is to map, for each language $l_i$, its pre-trained embedding space $E_{l_i}$ directly to the BERT embedding $E_{bert}$ by an orthogonal mapping matrix $A^{l_i}$, which is obtained by minimizing $\sum_{w \in V_{l_i} \cap V_{bert}} \| E_{l_i}(w)\, A^{l_i} - E_{bert}(w) \|_F^2$. This approach is similar to madhyastha2016mapping, and is referred to as the independent mapping method below. However, we use examples to demonstrate why this kind of method is less promising. In Table 1, the first two rows are results obtained by mapping our pre-trained English embedding (using fastText) to the (non-contextualized) BERT embedding. In this new unified space, we align words with the CSLS metric, and for each subword that appears in English Wikipedia, we seek its closest neighbor in the BERT vocabulary. Ideally, each word should find itself if it exists in the BERT vocabulary. However, this is not always true. For example, although “however” exists in the BERT vocabulary, independent mapping fails to find it as its own closest neighbor. Instead, it incorrectly maps it to irrelevant Chinese words “盘” (“plate”) and “龙” (“dragon”). A similar phenomenon is observed for Chinese. For example, “册” is incorrectly aligned to Chinese words “书” and “卷”.

Source Lang Source Target probability
English however 盘 (plate) 0.91
however 龙 (dragon) 0.05
Chinese 册 (booklet) 书 (book) 0.49
册 (booklet) 卷 (volume) 0.46
Table 1: Alignment from Independent Mapping.
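The sanity check described above (each mapped subword should retrieve itself in the BERT vocabulary) can be expressed as a self-retrieval accuracy. The sketch below is a simplification that uses plain cosine similarity instead of CSLS; the function name and the toy data are hypothetical.

```python
import numpy as np


def self_retrieval_accuracy(mapped, target):
    """Fraction of rows whose nearest target row (by cosine) has the same index.

    mapped: (n, d) embeddings of n shared subwords after mapping
    target: (n, d) embeddings of the same n subwords in the target space
    """
    a = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b = target / np.linalg.norm(target, axis=1, keepdims=True)
    nearest = (a @ b.T).argmax(axis=1)
    return float((nearest == np.arange(len(mapped))).mean())


# Toy data (hypothetical): a perfect mapping versus a completely shifted one.
rng = np.random.default_rng(0)
target = rng.normal(size=(6, 5))
perfect = self_retrieval_accuracy(target.copy(), target)
shifted = self_retrieval_accuracy(np.roll(target, 1, axis=0), target)
```

A low score on shared subwords, as with “however” in Table 1, signals that the mapped space and the BERT space are poorly aligned.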

Furthermore, our POS tagging experiments (Section 3.3) show that joint mapping does not improve (and can even hurt) the performance of the multilingual BERT. Therefore, we use mixture mapping to address and discuss OOV issues in the remaining sections.

ar - 98.23 90.06 53.34 56.70 56.57 56.23 89.8 70.6
bg 97.84 98.23 90.06 98.70 98.22 94.41 97.21 45.7 1.2
da 95.52 96.16 96.35 97.16 96.53 94.15 94.85 38.9 2.8
de 92.87 93.51 93.38 93.58 93.81 91.77 93.12 43.2 5.6
es 95.80 95.67 95.74 96.04 96.92 95.10 95.77 29.4 6.0
fa 96.82 97.60 97.49 95.62 94.90 94.35 95.82 35.6 6.5
fi 95.48 95.74 95.85 87.72 93.35 84.75 89.39 64.9 10.4
fr 95.75 96.20 96.11 95.17 96.59 94.84 95.24 33.9 10.3
hr - 96.27 96.82 95.03 96.49 89.87 93.48 49.5 8.3
it 97.56 97.90 97.95 98.22 98.00 97.63 97.85 30.3 2.3
nl - 92.82 93.30 93.89 92.89 91.30 91.71 35.5 0.3
no - 98.06 98.03 97.25 95.98 94.21 95.83 38.7 4.4
pl - 97.63 97.62 91.62 95.95 87.50 92.56 56.5 13.6
pt - 97.94 97.90 96.66 97.63 95.93 96.90 34.0 8.3
sl - 96.97 96.84 95.02 96.91 89.55 93.97 49.2 7.8
sv 95.57 96.60 96.69 91.23 96.66 90.45 91.92 48.2 17.7
average - 96.60 95.64 92.27 93.60 90.15 92.20 45.2 11.0
Table 2: POS tagging accuracy (%) on the Universal Dependencies v1.2 dataset. Columns after the language code, from left to right: three previously reported results (from gillick2016multilingual and plank2016multilingual); BERT; BERT with mixture mapping; BERT with randomly picked embeddings from BERT; BERT with joint mapping; word-level OOV rate (%); subword-level OOV rate (%).

Approach Precision Recall F1 score
DomainMask peng2017multi 60.8 44.9 51.7
Linear Projection peng2017multi 67.2 41.2 51.1
Updates peng2017supplementary - - 56.1
Updates peng2017supplementary - - 59.0
BERT 56.6 61.7 59.0
Expanded BERT 60.2 62.8 61.4
Chinese BERT 63.4 70.8 66.9
Table 3: Performance of various models on the test set of Weibo NER. Expanded BERT: multilingual BERT with vocabulary expansion. Chinese BERT: BERT pre-trained over Chinese Wikipedia. We use the conlleval script for evaluation on NER.

3.3 Monolingual Sequence Labeling Tasks

POS Tagging: We use the Universal Dependencies v1.2 data mcdonald2013universal. For languages with token segmentation ambiguity, we use the gold segmentation following plank2016multilingual. We consider languages that have sufficient training data and filter out languages that have unsatisfying embedding alignments with English (accuracy is lower than measured by word alignment accuracy or by unsupervised metric in MUSE conneau2017word). Finally, we keep 16 languages. We use the original multilingual BERT (without using CRF Lafferty:2001:CRF:645530.655813 on top of it for sequence labeling) to tune hyperparameters on the dev set and use the fixed hyperparameters for the expanded multilingual model. We do not tune the parameters for each model separately. As shown in Table 2, at both the word and subword level, the OOV rate in this dataset is quite high. Mixture mapping improves the accuracy on 10 out of 16 languages, raising the average accuracy from 92.27 to 93.60. We discuss the influence of alignments in Section 3.6.

Chinese NER: We are also interested in investigating the performance gap between the expanded multilingual model and a monolingual BERT that is pre-trained on a large-scale monolingual corpus. Currently, pre-trained monolingual BERT models are available in English and Chinese. As English has been used as the interlingua, we compare the expanded multilingual BERT and the Chinese BERT on a Chinese NER task, evaluated on the Weibo NER dataset constructed from social media by peng2015named. In the training set, the token-level OOV rate is , and the subword-level OOV rate is . We tune the hyperparameters of each model based on the development set separately and then use the best hyperparameters of each model for evaluation on the test set.

As shown in Table 3, the expanded model outperforms the multilingual BERT on the Weibo NER dataset. We boost the F1 score from 59.0 to 61.4. Compared to the Chinese BERT (66.9), there still exists a noticeable performance gap. One possible reason could be the grammatical differences between Chinese and English. As BERT uses the language model loss function for pre-training, the pre-trained Chinese BERT could better capture the language-specific information compared to the multilingual BERT.

3.4 Code-Mixed Sequence Labeling Tasks

As the multilingual BERT is pre-trained over languages, it should be able to handle code-mixed texts. Here we examine its performance and the effectiveness of the expanded model in mixed language scenarios, using two tasks as case studies.

Code-Switch Challenge: We first evaluate on the CALCS-2018 code-switched task calcs2018shtask, which contains two NER tracks on Twitter social data: mixed English&Spanish (en-es) and mixed Modern Standard Arabic&Egyptian (ar-eg). Compared to traditional NER datasets constructed from news, the dataset contains a significant portion of uncommon tokens like hashtags and abbreviations, making it quite challenging. For example, in the en-es track, the token-level OOV rate is , and the subword-level OOV rate is ; in the ar-eg track, the token-level OOV rate is , and the subword-level OOV rate is . As shown in Table 4, on ar-eg, we boost the F1 score from 74.7 to 77.3. However, we do not see similar gains on the en-es dataset, probably because English and Spanish share a large number of subwords, and adding too many new subwords might prevent the model from utilizing the well pre-trained subword embeddings. See Section 3.6 for more discussions.

en-es ar-eg
Model Prec Rec F1 Prec Rec F1
- - 62.4 - - 71.6
- - 63.8 - - -
- - 67.7 - - 81.4
BERT 72.7 63.6 67.8 73.8 75.6 74.7
Expanded BERT 74.2 60.9 66.9 76.9 77.8 77.3
Table 4: Accuracy (%) on the code-switch challenge. The top two rows are based on the test set, and the bottom three rows are based on the development set. : results from calcs2018shtask. : results from wang2018code.

Machine Translation Quality Estimation: All previous experiments are based on well-curated data. Here we evaluate the expanded model on the output of a language generation system, where the generated sentences are sometimes out of control.

We choose the automatic Machine Translation Quality Estimation task and use Task 2 (word-level quality estimation) in WMT18 wmt2018. Given a source sentence and its translation (i.e., target), this task aims to estimate the translation quality (“BAD” or “OK”) at each position: e.g., each token in the source and target sentence, each gap in the target sentence. We use English to German (en-de) SMT translation. On all three categories, the expanded model consistently outperforms the original multilingual BERT (Table 5).

Footnote 2: Our evaluation is based on the development set since the test set is only available to participants, and we could not find the submission teams’ performance on the development set.

Words in MT Gaps in MT Words in SRC
Model F1-BAD F1-OK F1-multi F1-BAD F1-OK F1-multi F1-BAD F1-OK F1-multi
fan2018bilingual 0.68 0.92 0.62 - - - - - -
fan2018bilingual 0.66 0.92 0.61 0.51 0.98 0.50 - - -
0.51 0.85 0.43 0.29 0.96 0.28 0.42 0.80 0.34
BERT 0.58 0.91 0.53 0.47 0.98 0.46 0.48 0.90 0.43
Expanded BERT 0.60 0.91 0.55 0.50 0.98 0.49 0.49 0.90 0.44
Table 5: WMT18 Quality Estimation Task 2 for the en-de SMT dataset. : result from specia2018findings. MT: machine translation, i.e., target sentence. SRC: source sentence. F1-OK: F1 score for “OK” class; F1-BAD: F1 score for “BAD” class; F1-multi: multiplication of F1-OK and F1-BAD.
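As a small worked example of the metric defined in the caption, the sketch below computes the per-class F1 scores from gold and predicted tags and multiplies them to obtain F1-multi. The function names and the toy tag sequences are hypothetical.

```python
def f1_for(label, gold, pred):
    """F1 score of one class, treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_multi(gold, pred):
    """Product of the F1 scores of the OK and BAD classes."""
    return f1_for("OK", gold, pred) * f1_for("BAD", gold, pred)


# Toy tags (hypothetical) for six positions of one translated sentence.
gold = ["OK", "OK", "BAD", "OK", "BAD", "OK"]
pred = ["OK", "BAD", "BAD", "OK", "OK", "OK"]
score = f1_multi(gold, pred)  # F1-OK = 0.75, F1-BAD = 0.5
```

Because the two scores are multiplied, a model cannot obtain a high F1-multi by predicting only the majority “OK” class.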

3.5 Sequence Classification Tasks

Finally, we evaluate the expanded model on sequence classification in a mixed-code setting, where results are less sensitive to unseen words.

Code-Mixed Machine Reading Comprehension: We consider the mixed-language machine reading comprehension task. Since no such dataset is publicly available, we construct a new Chinese-English code-mixed machine reading comprehension dataset based on 37,436 unduplicated utterances obtained from the transcriptions of a Chinese and English mixed speech recognition corpus King-ASR-065-1. We generate a multiple-choice machine reading comprehension problem (i.e., a question and four answer options) for each utterance. A question is an utterance with an English text span removed (we randomly pick one if there are multiple English spans) and the correct answer option is the removed English span. Distractors (i.e., wrong answer options) come from the top three closest English text spans, which appear in the corpus, based on the cosine similarity of word embeddings trained on the same corpus. For example, given a question “突然听到 21     ,那强劲的鼓点,那一张张脸。” (“Suddenly I heard 21     , and the powerful drum beats reminded me of the players.”) and four answer options { “forever”, “guns”, “jay”, “twins” }, the task is to select the correct answer option “guns” (“21 Guns” is a song by the American rock band Green Day). We split the dataset into training, development, and testing sets of size 36,636, 400, and 400, respectively. Annotators manually clean the problems in the development and testing sets and improve their quality by writing more confusing distractors, to guarantee that these problems are error-free and challenging.
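The automatic part of the problem construction can be sketched as follows. The sketch assumes that every English span has a single embedding vector; how span vectors are derived from the word embeddings is not specified here, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def make_problem(utterance, answer, span_vectors, blank="____"):
    """Blank out `answer` and add the three most similar spans as distractors."""
    candidates = [s for s in span_vectors if s != answer]
    candidates.sort(key=lambda s: cosine(span_vectors[answer], span_vectors[s]),
                    reverse=True)
    options = sorted([answer] + candidates[:3])
    return {
        "question": utterance.replace(answer, blank, 1),
        "options": options,
        "answer": answer,
    }


# Toy span embeddings (hypothetical values, two dimensions for readability).
span_vectors = {
    "guns": [1.0, 0.0],
    "forever": [0.9, 0.1],
    "jay": [0.8, 0.3],
    "twins": [0.7, 0.5],
    "hello": [-1.0, 0.2],
}
problem = make_problem("突然听到 21 guns,那强劲的鼓点,那一张张脸。", "guns", span_vectors)
```

Choosing the nearest spans as distractors keeps the wrong options plausible, which is what makes the task non-trivial.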

In this experiment, for each BERT model, we follow its default hyperparameters. As shown in Table 6, the expanded model improves the test accuracy of the multilingual BERT from 38.1% to 39.3%. Human performance () indicates that this is not an easy task even for human readers.

Model Development Test
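English BERT 38.2 37.3
BERT 38.7 38.1
Expanded BERT 39.4 39.3
Chinese BERT 40.0 45.0
Table 6: Accuracy (%) of models on the code-mixed reading comprehension dataset. English BERT: pre-trained English BERT. Chinese BERT: pre-trained Chinese BERT. Expanded BERT: multilingual BERT with vocabulary expansion.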

3.6 Discussions

In this section, we first briefly investigate whether the performance boost indeed comes from the reduction of OOV and then discuss the strengths and weaknesses of the methods we investigate.

First, we argue that it is essential to alleviate the OOV issue in multilingual settings. Taking the POS tagging task as an example, we find that most errors occur at the OOV positions (Table 7 in Section 3.3). In the original BERT, the accuracy of OOV words is much lower than that of non-OOV words, and we significantly boost the accuracy of OOV words with the expanded BERT. All these results indicate that the overall improvement mostly comes from the reduction of OOV.

Lang BERT (non-OOV) BERT (OOV) Expanded BERT (non-OOV) Expanded BERT (OOV)
fi 98.1 81.3 98.5 90.2
fr 97.0 90.2 97.2 95.6
hr 97.8 91.9 97.7 94.5
pl 98.8 84.6 99.0 93.2
pt 98.8 91.5 98.6 94.8
sl 98.6 91.6 98.7 95.1
sv 97.4 82.9 98.2 94.8
average 98.1 87.7 98.3 94.0
Table 7: POS tagging accuracy (%) for OOV tokens and non-OOV tokens on the Universal Dependencies v1.2 dataset, where the OOV/non-OOV are defined at word level with the original BERT vocabulary.
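The breakdown reported in Table 7 amounts to computing tagging accuracy separately over OOV and non-OOV tokens. A minimal sketch, with a hypothetical function name and toy tags:

```python
def accuracy_by_oov(gold, pred, is_oov):
    """Return tagging accuracy separately for OOV and non-OOV tokens."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for g, p, oov in zip(gold, pred, is_oov):
        totals[oov] += 1
        hits[oov] += int(g == p)
    return {
        "OOV": hits[True] / totals[True] if totals[True] else None,
        "non-OOV": hits[False] / totals[False] if totals[False] else None,
    }


# Toy example (hypothetical): six tokens, two of them OOV at the word level.
gold = ["DET", "NOUN", "VERB", "ADJ", "NOUN", "PUNCT"]
pred = ["DET", "NOUN", "NOUN", "ADJ", "VERB", "PUNCT"]
is_oov = [False, True, True, False, False, False]
result = accuracy_by_oov(gold, pred, is_oov)
```

Here the OOV flags would come from the original BERT vocabulary, so the same token partition is used for both the original and the expanded model.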

We also observe that the following factors may influence the performance of the expanded model.

Subwords: When expanding the vocabulary, it is critical to add only frequent subwords. Currently, we add all unseen subwords from the k vocabulary (Section 3.1), which may not be an optimal choice. Adding too many subwords may prevent the model from utilizing the information from the pre-trained subword embeddings in BERT, especially when there is a low word-level overlap between the training and test sets.

Language: We also find that the language influences the performance of vocabulary expansion in two respects: the alignment accuracy and the closeness between the language and English. For languages that are closely related to English such as French and Dutch, it is relatively easy to align their embeddings to English as most subword units are shared sogaard2018limitations; conneau2017word. In such cases, the BERT embedding already contains sufficient information, and therefore adding additional subwords may hurt the performance. On the other hand, for a distant language such as Polish (Slavic family), which shares some subwords with English (Germanic family), adding subwords to BERT brings performance improvements. In the meantime, as Slavic and Germanic are two subdivisions of the Indo-European languages, we find that the embedding alignment methods perform reasonably well. For these languages, vocabulary expansion is usually more effective, as indicated by POS tagging accuracies for Polish, Portuguese, and Slovenian (Table 2). For more distant languages like Arabic (Semitic family) that use different character sets, it is necessary to add additional subwords. However, as the grammar of such a language is very different from that of English, how to accurately align their embeddings is the main bottleneck.

Task: We see more significant performance gains on NER, POS tagging, and MT Quality Estimation, possibly because token-level understanding is more critical for these tasks, so alleviating OOV helps more. In comparison, for sequence-level classification tasks such as machine reading comprehension (Section 3.5), the OOV issue is less severe since the result is based on the entire sentence.

4 Related Work

OOV poses challenges for many tasks pinter2017mimicking such as machine translation razmara2013graph; sennrich2016neural and sentiment analysis kaewpitakkun2014sentiment. Even for tasks such as machine reading comprehension that are less sensitive to the meanings of each word, OOV still hurts the performance chu2017broad; zhang2018subword. We now discuss previous methods in two settings.

4.1 Monolingual Setting

Most previous work addresses the OOV problem in monolingual settings. Before more fine-grained encoding schemes such as BPE sennrich2016neural were proposed, prior work mainly focused on OOV for token-level representations taylor2011towards; kolachina2017replacing. Besides simply assigning random embeddings to unseen words dhingra2017gated or using a unique symbol to replace all these words with a shared embedding hermann2015teaching, a thread of research focuses on refining the OOV representations based on word-level information, such as using similar in-vocabulary words luong2015addressing; chousing2015; tafforeau2015adapting; li2016towards, mapping initial embedding to task-specific embedding rothe2016ultradense; madhyastha2016mapping, using definitions of OOV words from auxiliary data long2016leveraging; bahdanau2017learning, and tracking contexts to build/update representations henaff2016tracking; kobayashi2017neural; ji2017dynamic; zhao2018addressing.

Meanwhile, there have been efforts in representing words by utilizing character-level zhang2015character; ling2015finding; ling2015character; kim2016character; gimpelcharagram2016 or subword-level representations sennrich2016neural; bojanowski2017enriching. To leverage the advantages of both character-level and (sub)word-level representations, some previous work combines (sub)word- and character-level representations santos2014learning; dos2015boosting; yu2017joint or develops hybrid word/subword-character architectures chung2016character; luong2016achieving; pinter2017mimicking; bahdanau2017learning; matthews2018using; li2018subword. However, all those approaches assume a monolingual setting, which is different from ours.

4.2 Multilingual Setting

Addressing OOV problems in a multilingual setting is relatively under-explored, probably because most multilingual models use separate vocabularies jaffe2017generating; platanios2018contextual. While there is no direct precedent, previous work shows that incorporating multilingual contexts can improve monolingual word embeddings zou2013bilingual; andrew2013deep; faruqui2014improving; lu2015deep; ruder2017survey.


madhyastha2017learning increase the vocabulary size for statistical machine translation (SMT). Given an OOV source word, they generate a translation list in the target language and integrate this list into the SMT system. Although they also generate a translation list (similar to ours), their approach is still in a monolingual setting with SMT. cotterell2017cross train char-level taggers to predict morphological tags for high- and low-resource languages jointly, alleviating OOV problems to some extent. In contrast, we focus on dealing with the OOV issue at subword level in the context of the pre-trained BERT model.

5 Conclusion

We investigated two methods (i.e., joint mapping and mixture mapping) inspired by monolingual solutions to alleviate the OOV issue in multilingual settings. Experimental results on several benchmarks demonstrate the effectiveness of mixture mapping and the usefulness of bilingual information. To the best of our knowledge, this is the first work to address and discuss OOV issues at the subword level in multilingual settings. Future work includes investigating other embedding alignment methods such as Gromov-Wasserstein alignment alvarez2018gromov on more languages, and investigating approaches to dynamically choose the subwords to be added.


We thank the anonymous reviewers for their encouraging and helpful feedback.

