Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources
For languages with no annotated resources, transferring knowledge from rich-resource languages is an effective solution for named entity recognition (NER). While all existing methods directly transfer a source-learned model to target languages, in this paper we propose to fine-tune the learned model with a few similar examples for each given test case, which benefits prediction by leveraging the structural and semantic information conveyed in such similar examples. To this end, we present a meta-learning algorithm to find a good model parameter initialization that can quickly adapt to a given test case, and propose to construct multiple pseudo-NER tasks for meta-training by computing sentence similarities. To further improve the model's generalization ability across different languages, we introduce a masking scheme and augment the loss function with an additional maximum term during meta-training. We conduct extensive experiments on cross-lingual named entity recognition with minimal resources over five target languages. The results show that our approach significantly outperforms existing state-of-the-art methods across the board.
Named entity recognition (NER) is the task of locating and classifying text spans into pre-defined categories such as locations, organizations, etc. It is a fundamental component in many downstream tasks. Most state-of-the-art NER systems employ neural architectures [huang2015bidirectional, lample2016neural, ma2016CNNBLSTMCRF, chiu2016named, peters2017semi, peters2018deep], and thus, depend on a large amount of manually annotated data, which prevents their adaptation to low-resource languages due to the high annotation cost. An effective solution to this problem, which we refer to as cross-lingual named entity recognition, is transferring knowledge from a high-resource source language with abundant annotated data to a low-resource target language with limited or even no annotated data.
In this paper, we attempt to address the extreme scenario of cross-lingual transfer with minimal resources, where there is only one source language with rich labeled data while no labeled data is available in target languages. To tackle this problem, some approaches convert the cross-lingual NER task into a monolingual NER task by performing annotation projection using bilingual parallel text and word alignment information [ni2017weakly]. To eliminate the requirement of parallel texts, some methods propose to translate the labeled data of the source language at the phrase/word level, which inherently provides alignment information for label projection [mayhew2017cheap, xie2018neural]. Instead of generating labeled data in target languages, other works explore language-independent features and perform cross-lingual NER in a direct-transfer manner, where the model trained on the labeled data of the source language is directly tested on target languages [tsai2016cross, ni2017weakly]. Among these methods, cross-lingual word representations are the most prevalent language-independent features. For example, the multilingual version of BERT [devlin2019bert] utilizes WordPiece modeling strategy to project word embeddings of different languages into a shared space and achieved state-of-the-art performance [wu2019beto]. In this paper, we leverage the multilingual BERT [devlin2019bert] as a base model to produce cross-lingual word representations.
While all existing direct-transfer methods evaluate the source-trained model on target languages as-is, we argue that the source-trained model can be further improved effectively. Indeed, recent developments in learning cross-lingual sentence representations suggest that any sentence can be encoded into a shared space by building universal cross-lingual encoders [wu2019beto, lample2019cross]. By simply calculating cosine similarity between sentences in different languages with multilingual BERT [devlin2019bert], we find that it is possible to retrieve a few source examples that are quite similar to a given target example in structure or semantics, as shown in Table 1. In Example #1, both sentences have a structure of "Location - Date", while in Example #2, both sentences are about people talking about sports. Intuitively, reviewing the structural and semantic information conveyed by similar examples should benefit prediction. Therefore, given a test example in a target language, we propose to first retrieve a small set of similar examples from the source language, and then use these retrieved examples to fine-tune the model before testing.
However, if the retrieved similar set is too large, too much noise is introduced via relatively distant examples. Thus, to avoid misleading the model with distant examples, the cardinality of the retrieved set is typically kept small. In such a scenario, the model is expected to achieve higher performance on a test example after only one or a few fine-tuning steps using the limited-size set of retrieved examples. This inspires us to apply meta-learning, which aims to learn a model that facilitates fast adaptation to new tasks with a minimal amount of training examples [andrychowicz2016learning, vinyals2016matching, finn2017model].
In this paper, we follow the recently proposed model-agnostic meta-learning approach [finn2017model] and extend it to the cross-lingual NER task with minimal resources, where no labeled data is provided in target languages. We construct a set of pseudo-meta-NER tasks using the labeled data from the source language and propose a meta-learning algorithm to find a good model parameter initialization that could fast adapt to new tasks. When it comes to the adaptation phase, we regard each test example as a new task, build a pseudo training set for it, and fine-tune the meta-trained model before testing.
Table 1: Target-language test examples (top of each pair) and similar examples retrieved from the English training data (bottom of each pair, shaded in the original).

#1 Target: Ginebra [B-LOC] , 23 may ( EFECOM [B-ORG] ) .
   Retrieved: PRESS DIGEST - Israel [B-LOC] - Aug 25 .

#2 Target: Flores [B-PER] afirmó: "con él intentaremos ganar en velocidad, que es una de las mejores virtudes que tiene este equipo." (Flores said: "With him, we will try to win faster, which is one of the best advantages of this team.")
   Retrieved: "Things fell in for us," said Sorrento [B-PER], who has six career grand slams and hit the ninth of the season for the Mariners [B-ORG] .
When adapting meta-learning to cross-lingual NER, we notice that most mispredictions occur on language-specific infrequent entities. It is known that an NER system makes predictions through the word features of an entity itself and the syntactic or semantic information of its context. However, most entities are of low frequency in the training corpora of the base model, and thus entity representations across different languages are not well-aligned in the shared embedding space. That is, for the prediction of a low-frequency entity, over-dependence on its own features inhibits the model from transferring across languages. Therefore, we introduce a masking scheme on named entities during meta-training to weaken the dependence on entities and promote prediction through contextual information. Meanwhile, the commonly used average loss over all tokens treats each token equally, even though some tokens are more difficult to learn and easier to mispredict. We therefore add a maximum term to the original loss function, which makes the model focus more on such tokens and thus reduces mispredictions, so that the meta-knowledge of these mispredictions is not transferred to target languages.
To summarize our contributions:
We propose a model-agnostic meta-learning-based approach to tackle cross-lingual NER with minimal resources. To the best of our knowledge, this is the first successful attempt at adapting meta-learning to NER.
We propose a masking scheme on named entities and augment the loss function with an additional maximum term during meta-training, to facilitate the model’s ability to generalize across different languages.
We evaluate our approach over 5 target languages, i.e., Spanish, Dutch, German, French, and Chinese. We show that the proposed approach significantly outperforms existing state-of-the-art methods across the board.
Cross-lingual NER with Minimal Resources
There are two major branches of work in cross-lingual NER with minimal resources: methods based on annotation projection and methods based on direct transfer.
One of the typical approaches in the annotation projection category is to take bilingual parallel corpora, annotate the source side, and project the annotations to the target side using learned word alignment information [ni2017weakly]. However, these methods depend on parallel texts, as well as annotations on at least one side, which are unavailable in many cases. To eliminate the requirement of parallel data, some approaches first translate source-language labeled data at the word/phrase level and then directly copy labels across languages [xie2018neural, mayhew2017cheap]. Yet, this can introduce substantial noise due to sense ambiguity and word order differences. Differently, most approaches based on direct transfer leverage language-independent features to train a model on the source language and then directly apply it to target languages. Cross-lingual word embeddings are the most widely used such features [ni2017weakly, devlin2019bert], while other approaches also introduce word clusters [tackstrom2012] and Wikifier [tsai2016cross] as cross-lingual features.
In this paper, we use a contextual cross-lingual word embedding [devlin2019bert] as the language-independent feature. Rather than directly transferring the source-learned model to the target, we propose to fine-tune the model by converting the minimal-resource cross-lingual transfer problem into a low-resource learning problem, and furthermore present an enhanced meta-learning algorithm to tackle it. To the best of our knowledge, we are the first to extend the idea of meta-learning to cross-lingual NER with minimal resources.
Meta-learning has a long history [naik1992meta] and has recently re-emerged as a way to quickly adapt to new tasks with very limited data. It has been applied to various tasks such as image classification [koch2015siamese, ravi2017optimization], neural machine translation [gu2018meta], text generation [huang2018natural, qian2019domain], and reinforcement learning [finn2017model, li2018learning].
There are three categories of meta-learning algorithms: learning a metric space which can be used to compare low-resource examples with rich-resource examples [vinyals2016matching, sung2018learning], learning an optimizer to update the parameters of a model [andrychowicz2016learning, chen2018meta], and learning a good parameter initialization of a model [finn2017model, mi2019meta].
Our approach falls into the last category. We extend the idea of model-agnostic meta-learning (MAML) [finn2017model] to cross-lingual NER with minimal resources by constructing multiple pseudo-meta-NER tasks. Furthermore, we employ a masking scheme and enhance the loss function with an additional maximum term during meta-training to improve the model's ability to transfer across languages.
Named entity recognition is commonly formulated as a sequence labeling problem. Given a sequence of tokens $x = (x_1, x_2, \dots, x_n)$, an NER system is expected to produce a label sequence $y = (y_1, y_2, \dots, y_n)$, where $x_i$ is the $i$-th token and $y_i$ is the corresponding label of $x_i$. Denote the labeled training data of the source language as $\mathcal{D}_{src}$ and the test data of a target language as $\mathcal{D}_{tgt}$. Minimal-resource cross-lingual NER aims to train a model with $\mathcal{D}_{src}$ such that the model performs well on $\mathcal{D}_{tgt}$.
In this section, we give a brief introduction to multilingual BERT [devlin2019bert] (mBERT), which we leverage as the base model in our approach, since it produces an effective cross-lingual word representation. To ease the explanation, we start with BERT [devlin2019bert] here.
BERT is a language model learned with the Transformer encoder [Vaswani2017attention]. It reads the input sequence at once and learns via two strategies, i.e., masked language modeling and next sentence prediction.
mBERT follows the same model architecture and training procedure as BERT except that it is pre-trained on concatenated Wikipedia data of 104 languages. For tokenization, mBERT utilizes WordPiece embeddings [wu2016google] with a 110k shared vocabulary to facilitate embedding space alignment across different languages.
Following [devlin2019bert] and [wu2019beto], we address cross-lingual NER by adding a linear classification layer with softmax on top of the pre-trained mBERT, which can be formulated as:

$h_1, h_2, \dots, h_n = \mathrm{mBERT}([CLS], x_1, x_2, \dots, x_n, [SEP])$ (1)

$p_i = \mathrm{softmax}(W h_i + b)$ (2)

where [CLS] and [SEP] are two special tokens as in [devlin2019bert], $h_i$ denotes the output of the pre-trained mBERT that corresponds to the input token $x_i$, $p_i$ denotes the predicted probability distribution for $x_i$, and $W$ and $b$ are trainable parameters.
The learning loss w.r.t. an input sequence $x$ is modeled as the average cross-entropy between the predicted label distribution and the ground-truth one over all tokens:

$\mathcal{L}(x) = -\frac{1}{n} \sum_{i=1}^{n} \tilde{y}_i^{\top} \log p_i$ (3)

where $\tilde{y}_i$ is a one-hot vector of the ground-truth label for the $i$-th input token $x_i$. The total loss for learning is the summation of losses over all training examples. It should be noted that, if a word is split into several subwords after tokenization, only the label of the first subword is considered.
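The token-level cross-entropy with the first-subword rule can be sketched as follows. This is an illustrative NumPy version (the paper's implementation is in PyTorch); `token_ce_loss` and the `keep_mask` argument, which is 1 only for a word's first subword, are assumed names introduced here for illustration:

```python
import numpy as np

def token_ce_loss(logits, gold, keep_mask):
    """Average cross-entropy over labeled tokens (Equation 3).
    Continuation subwords (keep_mask == 0) are excluded, so only
    the first subword of each word carries its label."""
    # numerically stable log-softmax over the label dimension
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # per-token negative log-likelihood of the gold label
    losses = -log_probs[np.arange(len(gold)), gold]
    keep = np.asarray(keep_mask, dtype=bool)
    return losses[keep].mean()
```

For instance, with uniform logits over two labels, the loss on each kept token is ln 2, matching the standard cross-entropy of a maximally uncertain classifier.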
Enhanced Meta-Learning for Cross-Lingual NER with Minimal Resources
In this section, we elaborate on the proposed approach. First, we clarify how to construct multiple pseudo-meta-NER tasks with the labeled data of the source language. Then, we describe the meta-training algorithm of our approach. Next, we illustrate the proposed masking mechanism and the augmented loss involved in the meta-training phase. Finally, we show how to adapt the meta-learned model to test examples of target languages. The whole procedure of our algorithm is summarized in Algorithm 1.
In a typical meta-learning scenario, a model is trained on a set of tasks in the meta-training phase, such that the trained model can quickly adapt to new tasks using only a small number of examples. Thus to tackle the minimal-resource cross-lingual NER via meta-learning, we first construct a set of pseudo-meta-NER tasks using the labeled data of the source language.
Assume there are $N$ examples in $\mathcal{D}_{src}$. We take each example $x^{(i)}$ as the test set $\mathcal{D}^{test}_i$ of an individual meta task $\mathcal{T}_i$, and create a pseudo training set $\mathcal{D}^{train}_i$ for it by retrieving the most similar examples of $x^{(i)}$ from $\mathcal{D}_{src}$. The pseudo-meta-NER tasks can be denoted as:

$\{ \mathcal{T}_i = (\mathcal{D}^{train}_i, \mathcal{D}^{test}_i) \}_{i=1}^{N}$ (4)
Specifically, we first compute the sentence representation $s^{(i)}$ for each $x^{(i)}$:

$s^{(i)} = f(x^{(i)})$ (5)

where $f(\cdot)$ could be any function that is able to produce cross-lingual sentence representations. Here, we employ the multilingual BERT [devlin2019bert] and use the final hidden vector corresponding to the first input token ([CLS]) as the sentence representation.
Then, we construct $\mathcal{D}^{train}_i$ by selecting the top-$K$ similar examples from $\mathcal{D}_{src}$. The metric used to measure the similarity between $x^{(i)}$ and $x^{(j)}$ is:

$\mathrm{sim}(x^{(i)}, x^{(j)}) = \frac{s^{(i)} \cdot s^{(j)}}{\| s^{(i)} \| \, \| s^{(j)} \|}$ (6)

where $s^{(i)} = f(x^{(i)})$ and $s^{(j)} = f(x^{(j)})$.
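The top-K retrieval under cosine similarity can be sketched as follows, assuming the [CLS] sentence embeddings have already been computed by mBERT; `top_k_similar` is a hypothetical helper name introduced here:

```python
import numpy as np

def top_k_similar(query_vec, candidate_vecs, k):
    """Return indices of the k candidates most similar to the query
    under cosine similarity (the metric in Equation 6)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                   # cosine similarity to each candidate
    return np.argsort(-sims)[:k]   # indices of the k highest scores
```

The same routine serves both pseudo-task construction during meta-training and support-set retrieval in the adaptation phase.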
In the meta-training phase, we train a model by repeatedly simulating the adaptation phase, where the meta-trained model is fine-tuned with a minimal amount of training data of a new task and then tested on the test data.
Specifically, given the created pseudo-meta-NER tasks and a model $f_{\theta}$ parameterized by $\theta$, we first randomly sample a task $\mathcal{T}_i$ to derive new model parameters $\theta'_i$ via gradient updates on the original model parameters $\theta$, which we refer to as the inner-update:

$\theta'_i = U^{m}_{\mathcal{T}_i}(\theta)$ (7)

where $U^{m}_{\mathcal{T}_i}$ is the operator that performs gradient descent $m$ times with the learning rate $\alpha$ to minimize the loss $\mathcal{L}_{\mathcal{T}_i}$ computed on $\mathcal{D}^{train}_i$. For example, when applying a single gradient update,

$\theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})$ (8)
We then evaluate the updated parameters $\theta'_i$ on $\mathcal{D}^{test}_i$ and further update the meta model by minimizing the loss with respect to the original parameters $\theta$, which is referred to as the meta-update. When aggregating multiple pseudo-meta-NER tasks, the meta-objective is:

$\min_{\theta} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i}) = \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}\big(f_{U^{m}_{\mathcal{T}_i}(\theta)}\big)$ (9)
Taking a single gradient update with the learning rate $\beta$, the meta-update can be formulated as:

$\theta \leftarrow \theta - \beta \sum_{\mathcal{T}_i} g_i$ (10)

where $g_i$ is the meta-gradient on task $\mathcal{T}_i$, which can be expanded to:

$g_i = \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i}) = \nabla_{\theta'_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i}) \cdot \frac{\partial U^{m}_{\mathcal{T}_i}(\theta)}{\partial \theta}$ (11)

In Equation 11, $\partial U^{m}_{\mathcal{T}_i}(\theta) / \partial \theta$ is the Jacobian matrix of the update operation, which introduces higher-order gradients. To reduce the computational cost, we use a first-order approximation that replaces the Jacobian with the identity matrix, as in [finn2017model]. Therefore, $g_i$ can be computed as:

$g_i \approx \nabla_{\theta'_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i})$ (12)
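The inner-update and first-order meta-update described above can be sketched on a toy model as follows. This is a minimal NumPy illustration, not the paper's PyTorch implementation: each task is represented by a pair of hypothetical gradient callables standing in for the gradients of the NER loss on $\mathcal{D}^{train}_i$ and $\mathcal{D}^{test}_i$:

```python
import numpy as np

def inner_update(theta, train_grad, alpha, m):
    """U^m_{T_i}: m gradient-descent steps on the task's training loss
    (Equations 7-8)."""
    for _ in range(m):
        theta = theta - alpha * train_grad(theta)
    return theta

def fomaml_meta_step(theta, tasks, alpha, beta, m):
    """One first-order meta-update (Equations 10 and 12): the meta-gradient
    on each task is the test-loss gradient evaluated at the adapted
    parameters, i.e. the Jacobian in Equation 11 is replaced by identity."""
    meta_grad = np.zeros_like(theta)
    for train_grad, test_grad in tasks:
        theta_i = inner_update(theta, train_grad, alpha, m)  # adapt
        meta_grad += test_grad(theta_i)                      # first-order g_i
    return theta - beta * meta_grad
```

With quadratic toy losses centered at different targets, one meta-step moves the initialization toward a point from which each task can be reached quickly, which is the behavior the meta-objective (Equation 9) encourages.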
Compared with the common training scheme, the meta-learned model is more sensitive to the changes among different tasks, which promotes the learning of common internal representations rather than the distinctive features of the source language training data $\mathcal{D}_{src}$. In the adaptation phase, the model is thus more sensitive to the features of new tasks, and hence only one or a few fine-tuning steps on a minimal amount of data can make rapid progress without overfitting [finn2017model].
Masking on Named Entities
For cross-lingual NER with minimal resources, the alignments of the entity representations in the shared space are particularly important as this task focuses on understanding entities across languages. However, compared with commonly used words, most entities are of low frequency in the pre-training corpora of the base model. As a result, the learned entity representations across languages are not well-aligned in the shared space.
In order to reduce the dependence on target entity representations and encourage the model to predict through context information, we employ the [MASK] token as introduced in [devlin2019bert] to mask entities at the token level in each training example, i.e., each token inside an entity is randomly masked with a given probability. Then, the masked examples are fed as input data for the model. Note that we re-perform the masking scheme at the beginning of each training epoch.
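The token-level entity masking can be sketched as follows; `mask_entities` is a hypothetical helper name, and as described above it would be re-applied at the start of each training epoch so masks are resampled:

```python
import random

def mask_entities(tokens, labels, mask_prob, mask_token="[MASK]"):
    """Randomly replace tokens inside entities (label != 'O') with [MASK]
    at the given probability, forcing the model to rely on context."""
    masked = []
    for tok, lab in zip(tokens, labels):
        if lab != "O" and random.random() < mask_prob:
            masked.append(mask_token)  # hide the entity token itself
        else:
            masked.append(tok)
    return masked
```

Only entity tokens are ever masked; context words are always kept, since they carry the signal the model should learn to use.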
In Equation 3, the loss for each token is uniformly weighted so that all tokens contribute equally when training the model. Nonetheless, this results in insufficient learning of tokens with relatively high losses. In order to force the model to put more effort into learning from such tokens, we modify the loss function as:

$\mathcal{L}_{max}(x) = -\frac{1}{n} \sum_{i=1}^{n} \tilde{y}_i^{\top} \log p_i + \lambda \max_{1 \le i \le n} \big( -\tilde{y}_i^{\top} \log p_i \big)$ (13)
where $\lambda$ is a weighting factor. In this way, the potential mispredictions of high-loss tokens are more likely to be corrected during meta-training. The benefit of such correction is that less meta-knowledge about these mispredictions is transferred to target tasks, so the model achieves better performance after transferring.
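The max-augmented loss of Equation 13 can be sketched as follows, assuming `log_probs` holds the per-token log label distributions produced by the classifier; `max_augmented_loss` is an illustrative name:

```python
import numpy as np

def max_augmented_loss(log_probs, gold, lam):
    """Average token cross-entropy plus a lambda-weighted maximum term
    (Equation 13), so the hardest token in the sequence gets extra weight."""
    token_losses = -log_probs[np.arange(len(gold)), gold]  # CE per token
    return token_losses.mean() + lam * token_losses.max()
```

Setting `lam` to zero recovers the plain average loss of Equation 3, which is exactly the variant used later in the adaptation phase.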
In summary, the task losses $\mathcal{L}_{\mathcal{T}_i}(f_{\theta})$ and $\mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i})$ in the Meta-Training part of Algorithm 1 are computed with the masking scheme and the max loss.
When it comes to the adaptation phase, i.e., applying the model to target languages, we take each test example $x^{(t)} \in \mathcal{D}_{tgt}$ as the test set of a target task $\mathcal{T}_t$. We then construct a pseudo training set $\mathcal{D}^{train}_t$ for each $\mathcal{T}_t$ by retrieving the top-$K$ similar examples of $x^{(t)}$ from the source language training data $\mathcal{D}_{src}$ using the metric in Equation 6. Subsequently, we fine-tune the meta-learned model on the pseudo training set with the loss in Equation 3 via one gradient update, and then use the fine-tuned model to predict labels for the test set.
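The per-test-example adaptation can be sketched abstractly as follows; `retrieve`, `grad_fn`, and `predict` are hypothetical callables standing in for Equation 6 retrieval, the gradient of the Equation 3 loss, and model inference respectively:

```python
def adapt_and_predict(theta, retrieve, grad_fn, predict, x_test, alpha):
    """Per-test-example adaptation: retrieve similar source examples,
    take one gradient step on them, then predict on the test example."""
    support = retrieve(x_test)                          # top-K similar source examples
    theta_t = theta - alpha * grad_fn(theta, support)   # one fine-tuning step
    return predict(theta_t, x_test)
```

Note that adaptation always starts again from the meta-learned initialization `theta`, so fine-tuning on one test example never contaminates another.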
It should be noted that in the adaptation phase, we do not perform the masking scheme to avoid information loss of target entities. Besides, since the size of the pseudo training set is very small, we employ the loss function as in Equation 3 rather than Equation 13 to prevent over-adjusting on uncertain or mispredicted tokens. In fact, when using Equation 13 for adaptation, the model could achieve slightly better performance in some cases but also get worse performance in others due to the mentioned over-adjustment.
In this section, we evaluate our enhanced meta-learning approach for cross-lingual NER with minimal resources and compare our approach to current state-of-the-art methods.
We conduct experiments on four benchmark datasets: CoNLL-2002 Spanish and Dutch NER [tjong2002introduction], CoNLL-2003 English and German NER [tjong2003introduction], Europeana Newspapers French NER [neudecker2016corpus], and MSRA Chinese NER [cao2018adversarial]. Table 2 shows the statistics of all datasets.
CoNLL-2002/2003 is annotated with four entity types: PER, LOC, ORG, and MISC. All datasets are split into a training set, a development set (testa) and a test set (testb).
Europeana Newspapers is annotated with three types: PER, LOC, and ORG. We randomly sample a portion of sentences from the whole data to build a test set.
MSRA is also annotated with three types: PER, LOC, and ORG. Since gold word segmentation is not provided in the test set, we use word segmentation from [zhang2018chinese].
For all experiments, we use English as the source language and the others as target languages, i.e., the model is trained on the training set of English data and evaluated on the test sets of each other language. When transferring to French and Chinese, we relabel the MISC entities in English training data into non-entities for meta-training as there is no MISC in the French and Chinese test sets. Following [wu2019beto], we use the BIO labeling scheme.
We implement our approach with PyTorch 1.0.1. We use the cased multilingual BERT with 12 Transformer blocks, 768 hidden units, 12 self-attention heads, GELU activations [dan2016bridging], a dropout rate of 0.1, and learned positional embeddings. We employ WordPiece embeddings [wu2016google] to split words into subwords, which are then fed directly into the model without any other pre-processing. We select the hyper-parameters empirically and use them in all experiments. For sequence length, we employ a sliding window with a fixed maximum length; when an input exceeds this length, the last subwords of the first window are kept as the context for the subsequent window. Following [huang2018natural], we use the same number of similar examples for both pseudo-NER task construction and the adaptation phase. The mask ratio, the weighting factor $\lambda$ in Equation 13, the number of update steps $m$ in Equation 7, the number of sampled pseudo-NER tasks used for one meta-update, and the maximum number of meta-update steps are all chosen empirically. Following [wu2019beto], we freeze the parameters of the embedding layer and the bottom three layers of the base model. Following the hyper-parameter suggestions in [devlin2019bert], we use Adam [kingma2014adam] for both the inner-update and the meta-update, while for gradient updates during adaptation we set the learning rate to 1e-5. Following [tjong2002introduction], we use the phrase-level F1-score as the evaluation metric. To reduce model bias, we carry out multiple runs and report the average performance.
Table 2 presents our results on transferring from English to five other languages, alongside results from previous works. The results show that our approach significantly outperforms the previous state-of-the-art methods across the board, with relative F1-score improvements over the base model on every language, ranging from the smallest gain on Dutch up to 8.76% on French, which demonstrates the effectiveness of the proposed enhanced meta-learning algorithm.
In particular, compared with the base model, our approach achieves especially significant improvements on German and French, which can be attributed to our model's stronger ability to predict through context information. In English, proper nouns for LOCATION, PERSON, etc. usually begin with a capital letter while most general nouns do not. As a result, without effective extraction of context information, the base model tends to mislabel capitalized general nouns as entities. This phenomenon is especially pronounced when adapting the model to French and German, where capitalization rules differ from English for general nouns, titles, etc., or due to noise in the datasets. In contrast, our approach is more robust in such cases thanks to the masking scheme and the max loss, which encourage the model to label general nouns as non-entities based more on context.
We propose several strategies to enhance the base model, including the meta-training and adaptation procedure, the masking scheme, and the augmented loss. In this section, we conduct ablation study experiments to investigate the influence of these factors. Table 4 shows the results.
Ours w/o max loss, which removes the additional maximum term in the loss function. The F1-score decreases on average across all languages. We conjecture that, without the maximum term, the meta-knowledge from mispredictions is transferred to target tasks along with the meta-model, which hurts performance.
Ours w/o masking, which wipes out the masking scheme during the meta-training phase. This causes a performance drop across all languages, with the largest drop observed in Spanish, which further demonstrates the necessity of predicting through contextual information.
Ours w/o max loss/masking, which cuts out both the masking scheme and the max loss at once. In this case, our approach degrades into the base model trained with plain model-agnostic meta-learning. This results in a further performance drop on average, indicating that both the masking scheme and the max loss enhance meta-learning for cross-lingual NER with minimal resources.
Ours w/o meta-train/max loss/masking, which further eliminates the meta-training and the adaptation phase from Ours w/o max loss/masking. In this case, our approach degenerates into the base model. From Table 4, we can see that this leads to a significant and consistent performance drop on all five target languages, which demonstrates the effectiveness of the meta-learning employed in our approach.
We give a case study to analyze the quality of the results produced by our approach and the base model. Table 5 demonstrates that our approach has a stronger ability to transfer semantic information.
In example #1, the base model fails to identify "Secretaría General" as ORG, probably because its most similar phrase "secretary general" in the English dataset is usually labeled as non-entities. However, our approach can recognize it according to the learned semantic pattern "a PER was selected to replace another PER at the head of an ORG". Similarly, in example #2, the base model incorrectly labels "Edmond Thieffrylaan" as PER. We suspect that this is because Edmond appears as part of a person name "Jim Edmond" in the English training data. Notably, the proposed approach labels it as LOC correctly according to the context "clean up the playground on LOC". Moreover, the baseline model mispredicts the labels of "Krauses" in Example #3 and "奥纳西斯" (Onassis) in Example #4, two entities unseen in the English training data, while our approach gives the right prediction on the basis of context information. Due to limited space, we provide more cases in Table S1 of the supplementary material.
CL: Cross-lingual transfer using a shared character embedding layer [yang2017transer].
ML: The multi-lingual framework as in [lin2018amulti].
MLMT: The multi-lingual multi-task framework as in [li2018learning].
Discussion: Extend to Low-Resource Cross-Lingual NER
Here, we extend the proposed approach to the task of low-resource cross-lingual NER. To simulate a low-resource setting, we use randomly sampled subsets of the training data of a target language. We keep the same meta-training procedure as in the minimal-resource setting. For the adaptation phase, for efficiency we directly use the entire subset to fine-tune the meta-learned model, and then test on the test data of the target language.
We compare our meta-learning based approach with other multi-lingual and multi-task based approaches. For the results not reported in [yang2017transer] and [lin2018amulti], we re-implement their methods based on the open-source GitHub repositories (https://github.com/kimiyoung/transfer and https://github.com/limteng-rpi/mlmt); we re-implement only Spanish and Dutch, as the original repositories only provide aligned word embeddings for these two languages. As presented in Table 6, our approach significantly outperforms the other approaches across all target languages with different percentages of labeled data. Compared with the base model, there is an average improvement of 1.47 F1-score. We also conduct a factor analysis of the enhanced meta-learning algorithm under the low-resource setting; due to limited space, the details are in Table S2 of the supplementary material. As in the minimal-resource setting, removing any factor of our proposed approach leads to a performance drop, which further demonstrates that our design is reasonable.
In this paper, we propose an enhanced meta-learning algorithm for cross-lingual NER with minimal resources, considering that the model could achieve better results after a few fine-tuning steps over a very limited set of structurally/semantically similar examples from the source language. To this end, we propose to construct multiple pseudo-NER tasks for meta-training by computing sentence similarities. Moreover, in order to improve the model’s capability to transfer across different languages, we present a masking scheme and augment the loss function with an additional maximum term during meta-training. Experiments on five target languages show that the proposed approach leads to new state-of-the-art results with a relative F1-score improvement of up to 8.76%. We also extend the approach to low-resource cross-lingual NER, and it also achieves state-of-the-art results.