Zero-Shot Cross-Lingual Transfer with Meta Learning


Learning what to share between tasks has been a topic of high importance recently, as strategic sharing of knowledge has been shown to improve the performance of downstream tasks. The same applies to sharing between languages, and is especially important given that most languages in the world are under-resourced. In this paper, we consider the setting of training models on multiple different languages at the same time, when little or no data is available for languages other than English. We show that this challenging setup can be approached using meta-learning, where, in addition to training a source language model, another model learns to select which training instances are the most beneficial. We experiment using standard supervised, zero-shot cross-lingual, as well as few-shot cross-lingual settings for different natural language understanding tasks (natural language inference, question answering). Our extensive experimental setup demonstrates the consistent effectiveness of meta-learning on a total of 16 languages. We improve upon the state of the art on zero-shot and few-shot NLI and QA tasks on the XNLI and X-WikiRE datasets, respectively. We further conduct a comprehensive analysis, which indicates that the correlation of typological features between languages can further explain when parameter sharing learned via meta-learning is beneficial.


1 Introduction

There are more than 7,000 languages spoken in the world, over 90 of which have more than 10 million native speakers each (Eberhard et al., 2019). In spite of this, very few languages have adequate linguistic resources for natural language understanding tasks. Although there is a growing awareness in the field, as evidenced by the release of datasets such as XNLI (Conneau et al., 2018), most NLP research still only considers English (Bender, 2019).

While one solution to this issue is to collect annotated data for all languages, this process is both too time-consuming and too expensive to be feasible. Additionally, it is not trivial to train a model for a task at hand in a particular language (e.g., English) and apply it directly to another language where limited training data is available (i.e., low-resource languages). Because of this, it is important to investigate strategies that allow one to leverage the large amount of training data available for English for the benefit of other languages.

Meta-learning has recently been shown to be beneficial for several machine learning tasks (Koch et al., 2015; Vinyals et al., 2016; Santoro et al., 2016; Finn et al., 2017; Ravi and Larochelle, 2017; Nichol et al., 2018). In the case of NLP, recent work has also shown the benefits of this type of sharing between tasks and domains (Dou et al., 2019; Gu et al., 2018; Qian and Yu, 2019). Although meta-learning for cross-lingual transfer has been investigated in the context of machine translation (Gu et al., 2018), this paper is, to the best of our knowledge, the first attempt to study the effect of meta-learning on natural language understanding tasks. In this work, we investigate cross-lingual meta-learning using two challenging evaluation setups, namely, i) few-shot learning, where only a limited amount of training data is available for the target domain, and ii) zero-shot learning, where no training data is available for the target domain. Specifically, in Section 4, we evaluate the performance of our model on two natural language understanding tasks: cross-lingual natural language inference on the XNLI corpus (Conneau et al., 2018), and cross-lingual question answering on the X-WikiRE dataset (Levy et al., 2017; Abdou et al., 2019).
To summarize, the contribution of our model (detailed in Section 3) is four-fold. In particular, we (i) exploit the use of meta-learning methods for two different natural language understanding tasks (i.e., natural language inference, question answering), (ii) evaluate the performance of the proposed architecture on various scenarios: cross-domain, cross-lingual, standard supervised as well as zero-shot, across a total of 16 languages (i.e., 15 languages in XNLI and 4 languages in X-WikiRE datasets), (iii) demonstrate the consistent improvement of our cross-lingual meta-learning architecture (X-MAML) over several training scenarios (i.e., improvement over state-of-the-art on zero-shot cross-lingual NLI, as well as zero-shot and few-shot QA), and (iv) perform an extensive error analysis which reveals that cross-lingual trends can partially be explained by typological commonality between languages.

2 Meta Learning

Figure 1: MAML in a supervised learning setting

Meta-learning tries to tackle the problem of fast adaptation to a handful of new training data instances. It discovers the structure among multiple tasks such that learning new tasks can be done quickly. This is done by repeatedly simulating the learning process on low-resource tasks using many high-resource ones (Gu et al., 2018). There are several ways of performing meta-learning: (i) metric-based (Koch et al., 2015; Vinyals et al., 2016), (ii) model-based (Santoro et al., 2016), and (iii) optimisation-based (Finn et al., 2017; Ravi and Larochelle, 2017; Nichol et al., 2018). Metric-based methods aim to learn a similarity or distance measure between instances. For model-based architectures, the focus has been on adapting models that learn fast (e.g., memory networks) for meta-learning (Santoro et al., 2016). In this work, we focus on optimisation-based methods due to their superiority in several tasks (e.g., computer vision; Finn et al., 2017) over the other meta-learning architectures mentioned above. These optimisation-based methods are able to find good initialisation parameter values and adapt to new tasks quickly. To the best of our knowledge, we are the first to exploit the idea of meta-learning for transferring zero-shot knowledge in a cross-lingual setting for natural language understanding, in particular for the tasks of natural language inference (NLI) and question answering (QA). Specifically, we use Model-Agnostic Meta-Learning (MAML), which uses gradient descent and achieves good generalisation for a variety of tasks (Finn et al., 2017). MAML is able to quickly adapt to new target tasks by using only a few instances at test time, assuming that these new target tasks are drawn from the same distribution.

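Formally, MAML (see Figure 1) assumes that there is a distribution $p(\mathcal{T})$ of tasks $\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$. The parameters $\theta$ of model M for a particular task $\mathcal{T}_i$, sampled from the distribution $p(\mathcal{T})$, are updated to $\theta'_i$. In particular, the parameter vector $\theta$ is updated using one or a few iterations of gradient descent steps on the training examples (i.e., $D_i^{train}$) of task $\mathcal{T}_i$. For example, for one gradient update,

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(M_\theta) \qquad (1)$$

where $\alpha$ is the step size, $M_\theta$ is the model learnt by the neural network and $\mathcal{L}_{\mathcal{T}_i}$ is the loss on the specific task $\mathcal{T}_i$. The parameters $\theta$ of the model are trained to optimise the performance of $M_{\theta'_i}$ on the unseen test examples (i.e., $D_i^{test}$) across tasks $p(\mathcal{T})$. The meta-learning objective is:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(M_{\theta'_i}) = \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(M_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(M_\theta)}\big) \qquad (2)$$

The MAML algorithm aims to optimise the model parameters via a small number of gradient steps on a new task, which we refer to as the meta-update. The meta-update across all involved tasks is performed for the parameters $\theta$ of the model using stochastic gradient descent (SGD) as:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(M_{\theta'_i}) \qquad (3)$$

where $\beta$ is the meta-update step size.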
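To make the two levels of optimisation concrete, the following is a minimal sketch on a toy problem, not the setup used in the paper. It assumes scalar parameters and hypothetical tasks with loss 0.5 * (theta - c_i)^2, so that the inner step (Eq. 1) and the gradient needed for the meta-update (Eq. 3) can be written by hand. The names `maml_step` and `run_maml`, and the values of the step size `alpha` and meta step size `beta`, are illustrative assumptions. For simplicity the same toy loss serves as both training and test loss of a task.

```python
def maml_step(theta, centers, alpha, beta):
    """One meta-update for toy tasks with loss L_i(theta) = 0.5 * (theta - c_i) ** 2."""
    meta_grad = 0.0
    for c in centers:
        # Inner update (Eq. 1): one gradient step on task i.
        theta_i = theta - alpha * (theta - c)
        # Gradient of L_i(theta_i) with respect to theta via the chain rule:
        # dL_i/dtheta_i = (theta_i - c) and dtheta_i/dtheta = (1 - alpha).
        meta_grad += (theta_i - c) * (1.0 - alpha)
    # Meta-update (Eq. 3): move theta against the summed task gradients.
    return theta - beta * meta_grad


def run_maml(centers, alpha=0.1, beta=0.1, iterations=200, theta=0.0):
    """Repeat the meta-update; returns the meta-learned initialisation."""
    for _ in range(iterations):
        theta = maml_step(theta, centers, alpha, beta)
    return theta


if __name__ == "__main__":
    # Three hypothetical tasks whose optima are 1, 2 and 6.
    print(run_maml([1.0, 2.0, 6.0]))
```

In this toy case the meta-learned initialisation settles at the point from which a single inner step gets closest to all task optima at once, which here is their mean.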

3 Cross-Lingual Meta-Learning

The underlying idea of using MAML in NLP downstream tasks (Gu et al., 2018; Dou et al., 2019; Qian and Yu, 2019) is to employ a set of high-resource auxiliary tasks/languages to find an optimal initialisation from which learning a target task/language can be done using only a small number of training instances. In a cross-lingual setting (e.g., XNLI, X-WikiRE), where only an English dataset is available as a high-resource language and a small number of instances are available for other languages, the training procedure for MAML requires some changes. For this purpose, we introduce a cross-lingual meta-learning framework (henceforth X-MAML), which uses the following training steps:

  1. Pre-training on the high-resource language h (i.e., English). Given all the training samples in the high-resource language h, we first train the model M on h to initialise the model parameters $\theta$.

  2. Meta-learning using low-resource languages. This step consists of choosing one or more auxiliary languages from the low-resource set. Using the development set of each auxiliary language, we construct a randomly sampled batch of tasks $\mathcal{T}_i$. Then, we update the model parameters using the data points $D_i^{train}$ of $\mathcal{T}_i$ by one gradient descent step (Eq. 1). After this step, we can calculate the loss value using the $D_i^{test}$ examples in each task. It should be noted that the data used as $D_i^{train}$ is different from that used as $D_i^{test}$. We sum up the loss values from all tasks to minimise the meta objective function and perform the meta-update using Eq. 3. This step is performed in multiple iterations.

  3. Zero-shot or few-shot learning on the target languages. In the last step of X-MAML, we first initialise the model parameters with those learned during meta-learning. We then either evaluate the model directly on the test set of the target languages (i.e., zero-shot learning), or fine-tune the model parameters with standard supervised learning using the development set of the target languages and evaluate on the test set (i.e., few-shot learning).

A more formal description of X-MAML is given in Algorithm 1.
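The three training steps can be sketched on a toy problem as follows. This is an illustrative sketch under strong assumptions and not the implementation used in the experiments: the "model" is a one-parameter regressor y = w * x, each hypothetical "language" is a noiseless dataset with its own slope, and the function names, slopes, step sizes and iteration counts are all made up for illustration. The chain-rule factor in step 2 is exact only because the toy loss is quadratic.

```python
import random


def loss(w, batch):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def grad(w, batch):
    """Derivative of the loss with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)


def curvature(batch):
    """Second derivative of the loss with respect to w (constant in w)."""
    return sum(2 * x * x for x, _ in batch) / len(batch)


def make_language(slope, n, rng):
    """Hypothetical 'language': noiseless data drawn from y = slope * x."""
    return [(x, slope * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]


def x_maml(high_resource, aux_dev_sets, alpha=0.1, beta=0.05,
           pretrain_steps=200, meta_iterations=50, batch_size=16, seed=0):
    rng = random.Random(seed)

    # Step 1: pre-train on the high-resource language.
    w = 0.0
    for _ in range(pretrain_steps):
        w -= 0.1 * grad(w, high_resource)
    w_pretrained = w

    # Step 2: meta-learning on the development sets of the auxiliary languages.
    for _ in range(meta_iterations):
        meta_grad = 0.0
        for dev in aux_dev_sets:
            picked = rng.sample(dev, 2 * batch_size)
            # Disjoint training and test examples for this task.
            d_train, d_test = picked[:batch_size], picked[batch_size:]
            w_i = w - alpha * grad(w, d_train)  # inner step (Eq. 1)
            # Chain rule: d w_i / d w = 1 - alpha * curvature(d_train).
            meta_grad += grad(w_i, d_test) * (1.0 - alpha * curvature(d_train))
        w -= beta * meta_grad  # meta-update (Eq. 3)
    return w_pretrained, w


data_rng = random.Random(1)
english = make_language(2.0, 200, data_rng)    # high-resource language
auxiliary = make_language(2.5, 200, data_rng)  # auxiliary language (dev set)
target = make_language(2.4, 200, data_rng)     # target language (test set)

w_pre, w_meta = x_maml(english, [auxiliary])
# Step 3 (zero-shot): evaluate on the target test set without fine-tuning.
zero_shot_baseline = loss(w_pre, target)
zero_shot_x_maml = loss(w_meta, target)
```

The toy target is deliberately closer to the auxiliary language than to the high-resource one, so the sketch only illustrates the mechanics of the procedure; it says nothing about when transfer helps on real languages. For the few-shot variant, step 3 would add a few ordinary gradient steps on the target development set before evaluation.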

3.1 Natural Language Inference

Natural language inference (NLI) is the task of predicting whether a ‘hypothesis’ is true (entailment), false (contradiction), or undetermined (neutral) given a ‘premise’. The Multi-Genre Natural Language Inference (MultiNLI) dataset introduced by Williams et al. (2018) has 433k sentence pairs annotated with textual entailment information. It covers a range of different genres of spoken and written text, and supports a cross-genre evaluation. The NLI premise sentences are provided in 10 different genres: facetoface, telephone, verbatim, state, government, fiction, letters, nineeleven, travel and oup. All of the genres appear in the test and development sets, but only five are included in the training set. To verify our learning routine more generally, we define each task $\mathcal{T}_i$ as an NLI task in one genre. We exploit MAML, in its original setting, to investigate whether meta-learning encourages the model to learn a good initialisation for all target genres, which can then be fine-tuned with limited supervision on each genre’s development instances (2,000 examples) to achieve good performance on its test set.

The Cross-lingual Natural Language Inference (XNLI) dataset (Conneau et al., 2018) consists of 5,000 test and 2,500 development hypothesis-premise pairs with their textual entailment labels for English. Translations of the pairs are provided in 14 languages: French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw) and Urdu (ur). The XNLI corpus provides a multilingual benchmark to evaluate how to perform inference in low-resource languages such as Swahili or Urdu, for which only English data from MultiNLI (Williams et al., 2018) is available as high-resource training data. Following our X-MAML framework, we study the impact of meta-learning, with one low-resource language providing the auxiliary tasks, on the performance of a cross-lingual NLI model on the other target languages provided in the XNLI dataset. Performance on the NLI benchmarks is evaluated by accuracy on the test set.

3.2 Question Answering (QA)

Levy et al. (2017) frame the Relation Extraction (RE) task as reading comprehension / question answering using natural language question templates. For example, a relation type like author is transformed into at least one natural language question template such as who is the author of x?, where x is an entity.
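The template idea can be illustrated with a few lines of code. This is a hedged sketch only: the `author` template is the one quoted above, while the `birthplace` entry, the `{x}` placeholder syntax and the function name `relation_to_questions` are hypothetical and not taken from the original dataset.

```python
# Question templates keyed by relation type. "{x}" marks the entity slot.
# The "author" template is quoted in the text; "birthplace" is a made-up example.
TEMPLATES = {
    "author": ["who is the author of {x}?"],
    "birthplace": ["where was {x} born?"],
}


def relation_to_questions(relation, entity):
    """Instantiate every template of a relation type with the given entity."""
    return [template.format(x=entity) for template in TEMPLATES[relation]]


if __name__ == "__main__":
    print(relation_to_questions("author", "Hamlet"))
```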

Following this idea, Abdou et al. (2019) introduce a new dataset for multilingual reading comprehension-based relation extraction in five languages (i.e., English, French, Spanish, Italian and German). Each instance in the dataset contains a question, a context and an answer. The question is a transformation of a target relation, and the context may contain the answer. If the answer is not present, it is marked as NIL. In the QA task, we evaluate the performance of our method on the UnENT setting of the X-WikiRE dataset (i.e., the goal is to generalise to unseen entities). The evaluation methodology follows Kundu and Ng (2018), where exact match (EM) and F1 scores are computed for questions with valid answers.
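As a rough illustration of these two metrics, the sketch below computes exact match and a token-overlap F1, skipping instances whose gold answer is NIL. It assumes lower-cased whitespace tokenisation in the style of common extractive QA evaluation scripts; the exact normalisation used by Kundu and Ng (2018) is not specified here, and the function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the two strings are identical after trimming and lower-casing."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs):
    """Average EM and F1 over (prediction, gold) pairs with a valid (non-NIL) gold answer."""
    valid = [(p, g) for p, g in pairs if g != "NIL"]
    if not valid:
        return 0.0, 0.0
    em = sum(exact_match(p, g) for p, g in valid) / len(valid)
    f1 = sum(token_f1(p, g) for p, g in valid) / len(valid)
    return em, f1
```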

4 Experiments

We conduct experiments on the MultiNLI, XNLI and X-WikiRE datasets. We report results for few-shot as well as zero-shot cross-domain and cross-lingual learning. To examine the model- and task-agnostic features of MAML and X-MAML, we conduct experiments with various models and tasks.

4.1 Experimental Setup

We implement meta-learning in X-MAML with the ‘higher’ library. We use the Adam optimiser (Kingma and Ba, 2014) with a batch size of 32 for both meta-learning and few-shot learning. Following Algorithm 1, the step size and learning rate are set to and , respectively. We train for as meta-learning iterations (lines 3-12) in X-MAML. However, iterations provide the best results in our experiments. The sample sizes and in X-MAML (lines 6 and 7) are equal to 16 for each dataset. We report results for each experiment by taking the average over ten different runs.

NLI task:

For the NLI task, we have two different settings. For MultiNLI, as a cross-genre dataset, we employ the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2016). ESIM employs LSTM networks with deep attention fusion that links the current word to previous words stored in memory. It introduces local inference modeling, which models the inference relationship between premise and hypothesis after the two fragments are aligned locally. In the cross-lingual NLI task (XNLI), on the other hand, we employ the PyTorch version of BERT described in Devlin et al. (2018) as the underlying model M. However, since our proposed meta-learning method is model-agnostic, it can easily be extended to any other cross-lingual architecture. In cross-genre NLI we apply MAML, whereas in the cross-lingual setting we apply X-MAML to the original English BERT model (En-BERT) as well as Multilingual BERT (Multi-BERT). As the first step in X-MAML for XNLI, we fine-tune En-BERT and Multi-BERT on the MultiNLI dataset (English) to obtain the initial model parameters for each experiment.

x%     Baseline   MAML (train, 5 tasks)   MAML (dev, 10 tasks)
1      38.60      49.78                   50.92
2      37.80      48.58                   50.66
3      47.09      51.40                   52.85
5      49.88      52.22                   51.40
10     51.02      52.51                   53.95
20     59.14      61.38                   58.16
50     63.37      63.85                   61.74
100    64.35      64.99                   64.61
Table 1: Test accuracy with different settings of MAML on the MultiNLI dataset. x%: the percentage of training samples; Baseline: the test accuracy of ESIM trained on x% of the training data; MAML: the test accuracy of ESIM after the meta-learning steps; MAML (train): 5 tasks are defined in MAML using the training set; MAML (dev): 10 tasks are defined in MAML using the development set.
Figure 2: Gains and losses in test accuracies for zero-shot X-MAML on XNLI using Multi-BERT. Rows correspond to the evaluated language as target. Columns are auxiliary languages used in X-MAML. Numbers on the off-diagonal indicate the change in test accuracy after X-MAML compared to the baseline in the same row. Yellow to blue indicates stronger improvement over the baseline with X-MAML.

QA task:

We use the Nil-Aware Answer Extraction Framework (NAMANDA), proposed by Kundu and Ng (2018), as the base model M in X-MAML for our QA experiments. NAMANDA encodes the question and context sequences to compute a similarity matrix. It creates evidence vectors through joint encoding of the question and context and by applying multi-factor self-attentive encoding. Lastly, the evidence vectors are decomposed to output an answer or NIL. We set the parameters to the values used in the original work for the training and evaluation phases. The NAMANDA model M is pre-trained on the full English training set (1M instances; see step 1 in our training algorithm). The model M is then used by our meta-learning step to adapt the pre-trained QA model. We then evaluate how well the English model has been adapted by each of the auxiliary languages through X-MAML by performing either few- or zero-shot learning. In few-shot X-MAML, the meta-learned M is fine-tuned on the development set (1K instances) of the other languages (i.e., fr, es, it and de). For both few- and zero-shot learning, we evaluate on the 10K test set of each of the target languages. Following the work of Abdou et al. (2019), the Multi-BERT model is used to jointly encode texts for different languages in the QA model.


We create our baselines in two modes: (i) zero-shot baselines, obtained by evaluating the underlying model for each task on the test set of the target languages; and (ii) few-shot baselines, obtained by fine-tuning the model on the development set and evaluating on the test set of the low-resource languages.

en fr es de el bg ru tr ar vi th zh hi sw ur avg
Zero-shot cross-lingual transfer
Multi-BERT (Devlin) 81.4 - 74.3 70.5 - - - - 62.1 - - 63.8 - - 58.35 -
Multi-BERT (Wu) 82.1 73.8 74.3 71.1 66.4 68.9 69.0 61.6 64.9 69.5 55.8 69.3 60.0 50.4 58.0 66.3
Multi-BERT (Our baseline) 81.36 73.45 73.85 69.74 65.73 67.82 67.94 59.04 64.63 70.12 52.46 68.90 58.56 47.58 58.70 65.33
X-MAML (AVG) 81.69 73.86 74.43 71.00 67.16 68.39 68.90 60.41 65.33 70.95 54.08 70.09 60.51 47.97 59.94 -
X-MAML (MAX) 82.09 74.42 75.07 71.83 67.95 69.45 70.19 61.20 66.05 71.82 55.39 71.11 62.20 49.76 61.51 67.33
X-MAML (aux. hi) 81.88 74.17 74.81 71.59 67.95 68.86 69.44 60.93 65.86 71.57 55.26 70.59 - 47.12 61.51 -
Few-Shot learning
Multi-BERT (Our baseline) 81.94 75.39 75.79 73.25 69.54 71.60 70.84 64.85 67.37 73.23 61.18 73.93 64.37 57.82 63.71 69.65
X-MAML (AVG) 82.22 75.24 76.06 73.34 69.97 71.80 71.28 64.76 67.82 73.41 61.57 74.02 64.83 58.02 63.66 -
X-MAML (MAX) 82.39 75.32 76.18 73.46 70.03 71.94 71.45 64.92 67.95 73.52 61.74 74.21 64.97 58.23 63.81 70.01
X-MAML (aux. sw) 82.24 75.31 75.94 73.34 69.98 71.77 71.31 64.89 67.87 73.38 61.5 73.99 64.94 - 63.63 -
Machine translate at test (TRANSLATE-TEST)
Multi-BERT (Devlin) 81.4 - 74.9 74.4 - - - - 70.4 - - 70.1 - - 62.1 -
Machine translate at training (TRANSLATE-TRAIN)
Multi-BERT (Wu) 82.1 76.9 78.5 74.8 72.1 75.4 74.3 70.6 70.8 67.8 63.2 76.2 65.3 65.3 60.6 71.6

Table 2: Test accuracies on the XNLI cross-lingual natural language inference dataset for zero-shot and few-shot X-MAML. Columns indicate the evaluated languages. Multi-BERT (Devlin) is the original model introduced in Devlin et al. (2018); Multi-BERT (Wu) is reported by Wu and Dredze (2019), who reproduce Multi-BERT on all languages. TRANSLATE-TEST machine-translates the target-language test data into English and evaluates a model fine-tuned on English; TRANSLATE-TRAIN machine-translates the English training data into the target language and fine-tunes on this translated data. Multi-BERT in the TRANSLATE-TRAIN setup outperforms our few-shot X-MAML; however, we only use the 2.5k development set of the target languages, whereas they explicitly use 433k translated sentences in the fine-tuning phase. Multi-BERT (Our baseline) in zero-shot: evaluating the pre-trained model on the test set of the target language in the column; Multi-BERT (Our baseline) in few-shot: fine-tuning the model on the development set and evaluating on the test set of the target language. avg: the row-wise average accuracy.
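The row-wise averages and the per-language gains can be recomputed directly from the table. The short sketch below does this for the zero-shot rows Multi-BERT (Our baseline) and X-MAML (MAX); the numbers are copied from Table 2, while the variable and function names are illustrative. Because the table entries are already rounded, a recomputed average may differ from the printed one in the last digit.

```python
LANGS = ["en", "fr", "es", "de", "el", "bg", "ru", "tr",
         "ar", "vi", "th", "zh", "hi", "sw", "ur"]

# Zero-shot rows copied from Table 2.
BASELINE = [81.36, 73.45, 73.85, 69.74, 65.73, 67.82, 67.94, 59.04,
            64.63, 70.12, 52.46, 68.90, 58.56, 47.58, 58.70]
X_MAML_MAX = [82.09, 74.42, 75.07, 71.83, 67.95, 69.45, 70.19, 61.20,
              66.05, 71.82, 55.39, 71.11, 62.20, 49.76, 61.51]


def row_average(row):
    """Row-wise average accuracy, as in the 'avg' column."""
    return sum(row) / len(row)


# Gain of X-MAML (MAX) over the baseline for each target language.
GAINS = {lang: round(x - b, 2) for lang, x, b in zip(LANGS, X_MAML_MAX, BASELINE)}

if __name__ == "__main__":
    print(round(row_average(BASELINE), 2), round(row_average(X_MAML_MAX), 2))
    for lang in LANGS:
        print(lang, GAINS[lang])
```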

4.2 Few-Shot Cross-Domain NLI

We train ESIM on the MultiNLI training dataset to provide initial model parameters (step 1). We evaluate the pre-trained model on the English test set of XNLI (since the MultiNLI test set is not publicly available) to set the baseline for this scenario. Since MultiNLI is already split into genres, we use each genre as a task within MAML. To define tasks based on each genre, we include either the training set (5 genres) or the development set (10 genres) during meta-learning.

In the last phase of learning (step 3), we first initialise the model parameters with those learned by MAML. We then continue to fine-tune the model using the development set of MultiNLI and report the accuracy on the English test set of XNLI. We randomly sample x% of the training data and evaluate the model. The results obtained by training on x% of the MultiNLI dataset using ESIM (as the learner model M) are shown in Table 1. We find that, in both settings (MAML on the training set with 5 tasks and on the development set with 10 tasks), performance improves as more supervision signals (i.e., more training data) become available, and the meta-learning step improves over the baseline in most cases. However, as our experiments demonstrate, the effect of MAML is largest when very limited training data is available: with 1% of the training data, accuracy rises from 38.60 to 49.78 and 50.92.

Zero-shot
Target   Auxiliary lang.                   Baseline   Abdou et al. (2019)
         es      fr      it      de                   BERT     fastText
es       -       49.01   50.11   50.59     49.85      5.49     16.17
fr       52.20   -       52.13   51.96     51.72      17.42    15.28
it       50.53   50.65   -       50.58     50.58      10.70    4.44
de       49.92   48.78   48.63   -         48.98      2.87     14.09

Few-shot
Target   Auxiliary lang.                   Baseline   Abdou et al. (2019)
                                                      BERT              fastText
         es      fr      it      de        1K         1K       10K      1K       10K
es       -       78.09   78.33   77.89     78.26      42.97    71.66    65.78    77.99
fr       80.68   -       80.81   80.74     80.67      42.69    72.43    65.67    74.15
it       82.04   81.76   -       81.77     81.78      56.25    80.06    64.02    83.45
de       78.29   78.48   78.66   -         78.63      56.01    70.43    62.47    72.17
Table 3: F1 scores (average of 10 runs) for the test set of the UnENT section of the X-WikiRE dataset using zero- and few-shot learning with X-MAML. We perform few-shot X-MAML with only 1K instances from the development set. Baseline in zero-shot: evaluating the pre-trained NAMANDA on the test set of the target language in the row; Baseline in few-shot: fine-tuning the model on the development set and evaluating on the test set of the target language.

4.3 Zero- and Few-Shot Cross-Lingual NLI

Zero-Shot Learning.

In this set of experiments, we conduct our X-MAML within a zero-shot setup, in which we do not fine-tune after the meta-learning step. Table 4 (See Appendix) shows the average accuracy over 10 runs of X-MAML on the XNLI dataset using Multi-BERT as the base model. Each column corresponds to the performance of the Multi-BERT system after meta-learning with one auxiliary language, and evaluation on the target language of the XNLI test set. Overall, we observe that our zero-shot approach with X-MAML outperforms the baselines without MAML, as well as results reported by Devlin et al. (2018), thus improving a state-of-the-art model for zero-shot cross-lingual NLI. We hypothesise that the degree of typological commonalities among the languages has an effect (i.e., positive or negative) on the performance of X-MAML. We report the impact of meta-learning for each target language as a difference between accuracy with and without (i.e., baseline) X-MAML on the test set (Figure 2). It can be observed that the proposed learning approach provides positive impacts across most of the target languages. However, including Swahili (sw) as an auxiliary in X-MAML does not show similar effects on the other target languages.

In Table 2, we compare to the original baseline reported in Devlin et al. (2018) and to Wu and Dredze (2019). We report the average and maximum of each row, which corresponds to a target language. Hindi is the most effective auxiliary language for meta-learning in the zero-shot setting, as shown in Table 2. This is perhaps due to the number of typological similarities between Hindi and other languages in the XNLI dataset. We also observe an improvement in accuracy by performing X-MAML on cross-lingual NLI using En-BERT (see Appendix: Table 7 and Figure 3). However, English as an auxiliary language has a negative impact (i.e., a decrease in performance) in most cases. Moreover, using any language as an auxiliary does not lead to any improvement on the English test dataset.

Few-Shot Learning.

A similar approach is followed for few-shot learning. However, in step 3 of X-MAML, the meta-learned model is fine-tuned on the development set (2.5k) of the target languages and then evaluated on the test set. The results of these experiments (see Appendix: Tables 5 and 6, Figures 4 and 5) show improvements over the baseline accuracies. As presented in Table 2, the proposed learning steps reduce the need for the machine translation step (TRANSLATE-TEST) from the foreign language into English in the Multi-BERT setting. The results show that X-MAML boosts Multi-BERT performance on cross-lingual NLI. We also report the results of using Swahili (sw), the most effective auxiliary language in X-MAML using Multi-BERT in few-shot learning mode.

4.4 Zero- and Few-Shot Cross-Lingual QA

A similar approach is followed for the cross-lingual QA task on the X-WikiRE dataset. Table 3 shows the results of both zero- and few-shot X-MAML for the UnENT part of the X-WikiRE dataset. We compare our results for the UnENT scenario on the X-WikiRE dataset to those reported in the original paper. It can be observed that each target language benefits from at least one of the auxiliary languages when the model is adapted using X-MAML. We were not able to directly reproduce the result for the zero-shot scenario of the original paper, so we set our own baseline for the task. However, we find that (i) our zero-shot results with X-MAML improve on those without meta-learning (i.e., baselines) and (ii) we outperform Abdou et al. (2019) for the UnENT scenario of zero-shot cross-lingual QA. Furthermore, for the few-shot scenario, adapting the QA model using few-shot X-MAML with only 1K development instances outperforms their cross-lingual transfer model, where they explicitly use 10K instances in the fine-tuning phase.

5 Related Work

The main motivation for this work is the low availability of labelled training datasets for most of the world’s languages. To alleviate this issue, a number of methods, including so-called ‘few-shot learning’ approaches, have been proposed. Few-shot learning methods were initially introduced within the area of image classification (see Vinyals et al. (2016); Ravi and Larochelle (2017); Finn et al. (2017)), but have recently also been applied to NLP tasks such as relation extraction (Han et al., 2018), text classification (Yu et al., 2018) and machine translation (Gu et al., 2018). Specifically, in NLP, these few-shot learning approaches include either (i) the transformation of the problem into a different task (e.g., relation extraction is transformed to question answering; Abdou et al., 2019; Levy et al., 2017) or (ii) meta-learning (see Andrychowicz et al. (2016); Finn et al. (2017)).

5.1 Meta-Learning

Meta-learning, or learning-to-learn, has recently received a lot of attention from the NLP community. First-order MAML has been applied to machine translation by Gu et al. (2018), who frame low-resource translation as a meta-learning problem and learn to adapt to target languages based on multilingual high-resource language tasks. However, their framework includes eighteen high-resource languages as auxiliary languages and five diverse low-resource languages as target languages, whereas we only assume access to English as a rich-resource language in our target tasks. For the task of dialogue generation, Qian and Yu (2019) address domain adaptation using a meta-learning approach. Dou et al. (2019) explore MAML and its variants for low-resource NLU tasks in the GLUE dataset (Wang et al., 2018). They consider different high-resource NLU tasks such as MultiNLI (Williams et al., 2018) and QNLI (Rajpurkar et al., 2016) as auxiliary tasks to learn meta-parameters using MAML. They then fine-tune on the low-resource tasks using the adapted parameters from the meta-learning phase.

All the aforementioned works on meta-learning in NLP assume that there are multiple high-resource tasks or languages, which are then adapted to new target tasks or languages with a handful of training samples. However, in a cross-lingual natural language inference and question answering setting, the available rich-resource language is usually only English. Our work thus fills an important gap in the literature, as we only require a single source language.

5.2 Cross-Lingual Natural Language Understanding

Cross-lingual learning has a fairly short history in natural language processing, and has mainly been restricted to traditional NLP tasks, such as PoS tagging and parsing. In contrast, there has been relatively little work on cross-lingual natural language understanding, partly due to a lack of benchmark datasets. Existing work has mainly been on natural language inference (Conneau et al., 2018; Agic and Schluter, 2018), and to a lesser degree on relation extraction (Verga et al., 2016; Faruqui and Kumar, 2015) and question answering (Abdou et al., 2019). Previous research generally reports that cross-lingual learning is challenging and that it is hard to beat a machine translation baseline (e.g., Conneau et al. (2018)). Such a baseline is for instance suggested by Faruqui and Kumar (2015), where the text in the target language is automatically translated to English, the target task (in this case relation extraction) is then performed, and cross-lingual projections are finally employed to translate back to the source language. We achieve competitive performance compared to an MT baseline (for XNLI and X-WikiRE), and propose a method that requires no training instances for the target task in the target language. Furthermore, our method is model-agnostic, and can be used to extend any pre-existing model, such as that introduced by Conneau and Lample (2019).

6 Discussion and Analysis

6.1 Cross-Lingual Transfer

Somewhat surprisingly, we find that cross-lingual transfer with meta-learning yields improved results even when languages strongly differ from one another. For instance, in the case of zero-shot meta-learning on XNLI, we observe gains for almost all auxiliary languages, with the exception of Swahili (sw). This indicates that the meta-parameters learned with X-MAML are sufficiently language-agnostic, as we otherwise would not expect to see any benefits in transferring from, e.g., Russian to Hindi (one of the strongest results in Figure 2). This is dependent on having access to a pre-trained multilingual model such as BERT; however, even using monolingual BERT (En-BERT) yields overwhelmingly positive gains in some target/auxiliary settings (Figure 3 in the Appendix). In the case of few-shot learning, our findings are similar, as almost all combinations of auxiliary and target languages lead to improvements when using multilingual BERT (Figure 4 in the Appendix). Moreover, when we have access to a handful of training instances, as in few-shot learning, even the English BERT model mostly leads to improvements (Figure 5 in the Appendix).

6.2 Typological Correlations

In order to better explain our results in cross-lingual zero-shot and few-shot learning, we investigate typological features, and their overlap between main and auxiliary languages. We evaluate on the World Atlas of Language Structure (WALS, Dryer and Haspelmath (2013), which is the largest openly available typological database. It comprises approximately 200 linguistic features with annotations for more than 2,500 languages, which have been made by expert typologists through study of grammars and field work. We draw inspiration from Bjerva and Augenstein (2018a, b), who attempt to predict typological features based on language representations learned under various NLP tasks. Similarly, we experiment with two conditions: (i) we attempt to predict typological features based on the mutual gain/loss in performance using X-MAML; (ii) we investigate whether sharing between two typologically similar languages is beneficial for performance using X-MAML. We train one simple logistic regression classifier per condition above, for each WALS feature. In the first condition (i), the task is to predict the exact WALS feature value of a language, given the change in accuracy in combination with other languages. In the second condition (ii), the task is to predict whether a main and auxiliary language have the same WALS feature value, given the change in accuracy when the two languages are used in X-MAML. We compare with two simple baselines, one based on always predicting the most frequent feature value in the training set, and one based on predicting feature values with respect to the distribution of feature values in the training set. We then investigate whether any features could be consistently predicted above baseline levels, given different test-training splits. We apply a simple paired t-test to compare our models predictions to the baselines. 
As we are running a large number of tests (one per WALS feature), we apply Bonferroni correction, dividing our cut-off p-value by the number of tests. We first investigate few-shot X-MAML when using multilingual BERT, as reported in Table 5 (Appendix). In this case, we find that sharing the feature value for WALS feature 67A The Future Tense is beneficial. This feature encodes whether or not a language has an inflectional marking of future tense, and can be considered a morphosyntactic feature. We next look at the case of zero-shot X-MAML with multilingual BERT, as reported in Table 4 (Appendix). In this case, we find that languages sharing a feature value for WALS feature 25A Locus of Marking: Whole-language Typology typically help each other. This feature describes whether the morphosyntactic marking in a language is on the syntactic heads or on the dependents of a phrase. For example, en, de, ru, and zh are "Dependent-marking" in this feature, and the results in Figure 2 show that they obtain the largest mutual gains from each other in zero-shot X-MAML. In both cases, we thus find that languages with similar morphosyntactic properties can be beneficial to one another when using X-MAML.
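The Bonferroni correction used in this analysis amounts to dividing the significance cut-off by the number of tests. A minimal sketch follows; the function name, the feature names, the p-values, and the uncorrected cut-off of 0.05 are all illustrative assumptions and are not taken from our experiments.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the corrected threshold and the tests that pass it.

    p_values maps a test name to its p-value; alpha is the uncorrected
    cut-off (0.05 here is an assumed, conventional default).
    """
    threshold = alpha / len(p_values)
    passed = sorted(name for name, p in p_values.items() if p < threshold)
    return threshold, passed


# Hypothetical p-values for four tests, for illustration only.
example_p_values = {
    "feature_1": 0.0001,
    "feature_2": 0.004,
    "feature_3": 0.03,
    "feature_4": 0.2,
}
threshold, significant = bonferroni_significant(example_p_values)
```

With four tests the corrected threshold is 0.05 / 4, so `feature_3` would pass the uncorrected cut-off but not the corrected one.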

7 Conclusions

In this work, we show that meta-learning allows one to leverage training data from an auxiliary language to perform zero-shot and few-shot cross-lingual transfer. We evaluated this on two challenging natural language understanding tasks (NLI and QA), and on a total of 16 languages. We improve on the performance of state-of-the-art baseline models for zero-shot XNLI, as well as for both few-shot and zero-shot QA on the X-WikiRe dataset. Furthermore, our typological analysis shows that languages which share certain morphosyntactic features tend to benefit from this type of transfer. Future work will extend this approach to other cross-lingual NLP tasks and more languages.

8 Acknowledgements

This research has received funding from the Swedish Research Council under grant agreement No 2019-04129, as well as the Research Foundation - Flanders (FWO). This work was also funded by UiO: Energy to support international mobility. We are grateful to the Nordic Language Processing Laboratory (NLPL) for providing access to its supercluster infrastructure.

Appendix A

Figure 3: Gain/Loss in Zero-shot X-MAML using En-BERT
Figure 4: Gain/Loss in Few-shot X-MAML using Multi-BERT
Figure 5: Gain/Loss in Few-shot X-MAML using En-BERT
Rows: target language; columns: auxiliary language; the last column is the baseline
target ar bg de el en es fr hi ru sw th tr ur vi zh baseline
ar - 65.76 65.48 66.05 64.41 65.27 65.24 65.86 65.31 63.66 65.25 65.58 65.56 65.84 65.32 64.63
bg 68.36 - 68.79 68.39 67.95 68.45 68.80 68.86 69.41 66.10 67.62 67.95 68.63 68.67 69.45 67.82
de 70.88 71.46 - 71.26 71.09 71.12 71.11 71.59 71.83 68.65 70.29 70.37 71.42 71.15 71.83 69.74
el 67.53 67.58 67.25 - 66.11 67.13 67.39 67.95 67.71 65.11 67.12 67.15 67.69 67.19 67.34 65.73
en 81.68 81.79 82.02 81.77 - 81.88 81.91 81.88 82.03 80.44 81.18 81.43 81.80 81.73 82.09 81.36
es 74.48 74.51 74.63 74.58 74.41 - 74.95 74.81 74.63 72.66 73.91 74.12 74.51 74.71 75.07 73.85
fr 74.13 74.02 74.22 74.11 73.75 74.18 - 74.17 74.34 71.87 73.04 73.41 74.15 74.21 74.42 73.45
hi 60.75 61.59 60.84 60.61 59.31 60.18 60.66 - 61.75 57.10 59.39 60.47 62.20 60.76 61.56 58.56
ru 68.78 69.47 69.47 68.93 68.64 68.89 69.25 69.44 - 66.11 68.18 68.72 69.52 69.02 70.19 67.94
sw 48.71 48.53 47.36 49.13 46.70 48.43 47.81 47.11 47.28 - 49.20 49.76 46.61 48.43 46.50 47.58
th 54.65 55.39 53.80 54.98 51.14 54.09 54.15 55.26 53.82 52.90 - 55.24 53.79 54.99 52.85 52.46
tr 60.94 61.20 60.22 61.09 58.66 60.60 60.32 60.93 60.29 59.98 60.53 - 60.82 60.68 59.47 59.04
ur 60.30 60.87 60.34 60.20 58.82 59.81 60.12 61.51 61.02 56.37 59.38 60.02 - 59.87 60.46 58.70
vi 71.27 71.56 71.32 71.14 70.35 71.22 71.42 71.57 71.73 68.11 69.87 70.53 71.43 - 71.82 70.12
zh 70.24 70.68 70.65 70.12 69.91 70.29 70.47 70.59 71.11 67.47 69.33 69.50 70.29 70.54 - 68.90
Table 4: Average of test accuracies on XNLI across 10 runs with zero-shot learning and X-MAML, using multilingual BERT. The auxiliary language is not included in the evaluation phase.
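As a reading aid for these tables, the gain or loss for a target/auxiliary pair can be obtained by subtracting the baseline (last column) from the X-MAML accuracy. The sketch below assumes this definition of gain and uses a few values copied from the hi row of Table 4; the function and variable names are hypothetical.

```python
def gains(row, baseline):
    """Accuracy difference to the baseline for each auxiliary language."""
    return {aux: round(acc - baseline, 2) for aux, acc in row.items()}


# Selected entries from the hi (Hindi) row of Table 4.
hi_row = {"en": 59.31, "ru": 61.75, "sw": 57.10, "ur": 62.20}
hi_baseline = 58.56

hi_gains = gains(hi_row, hi_baseline)
best_aux = max(hi_gains, key=hi_gains.get)  # auxiliary language with largest gain
```

Under this assumption, Russian as auxiliary language improves Hindi by about 3.19 points, whereas Swahili lowers it by about 1.46 points.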
Rows: target language; columns: auxiliary language; the last column is the baseline
target ar bg de el en es fr hi ru sw th tr ur vi zh baseline
ar - 67.84 67.73 67.85 67.62 67.84 67.80 67.81 67.85 67.87 67.86 67.83 67.71 67.89 67.95 67.37
bg 71.79 - 71.76 71.80 71.72 71.77 71.80 71.74 71.94 71.77 71.78 71.78 71.77 71.79 71.92 71.60
de 73.36 73.23 - 73.37 73.30 73.30 73.33 73.46 73.27 73.34 73.38 73.32 73.37 73.34 73.43 73.25
el 69.95 69.98 69.97 - 69.94 69.99 69.91 69.93 69.95 69.98 70.03 70.02 69.90 69.95 70.03 69.54
en 82.24 82.21 82.13 82.22 - 82.15 82.27 82.26 82.24 82.24 82.19 82.39 82.25 82.14 82.20 81.94
es 76.07 76.12 76.14 76.02 76.06 - 76.18 76.14 76.10 75.94 76.03 75.91 76.10 76.00 76.09 75.79
fr 75.32 75.23 75.16 75.24 75.23 75.18 - 75.19 75.22 75.31 75.28 75.19 75.28 75.19 75.28 75.39
hi 64.95 64.82 64.78 64.89 64.64 64.63 64.90 - 64.87 64.94 64.73 64.84 64.79 64.97 64.83 64.37
ru 71.19 71.27 71.17 71.33 71.19 71.19 71.33 71.28 - 71.31 71.34 71.45 71.18 71.29 71.38 70.84
sw 58.14 58.23 57.95 57.99 57.53 57.97 57.94 58.10 58.04 - 58.00 58.22 58.08 58.01 58.09 57.82
th 61.59 61.64 61.57 61.71 61.40 61.51 61.51 61.68 61.54 61.50 - 61.58 61.41 61.56 61.74 61.18
tr 64.74 64.79 64.69 64.82 64.59 64.82 64.76 64.83 64.70 64.89 64.92 - 64.74 64.73 64.66 64.85
ur 63.67 63.58 63.69 63.63 63.55 63.63 63.68 63.61 63.72 63.63 63.72 63.81 - 63.67 63.60 63.71
vi 73.51 73.52 73.46 73.35 73.36 73.29 73.39 73.31 73.51 73.38 73.39 73.41 73.42 - 73.41 73.23
zh 74.04 73.97 74.02 74.02 73.74 74.01 74.02 74.10 74.11 73.99 74.01 74.21 74.06 73.95 - 73.93
Table 5: Average of test accuracies on XNLI across 10 runs with few-shot X-MAML, using multilingual BERT. The auxiliary language is not included in the evaluation phase.
Rows: target language; columns: auxiliary language; the last column is the baseline
target ar bg de el en es fr hi ru sw th tr ur vi zh baseline
ar - 50.44 51.02 50.56 50.19 51.06 50.98 50.47 50.44 50.69 50.13 50.91 50.05 50.63 50.46 50.38
bg 51.68 - 52.77 51.35 50.22 52.70 52.94 51.27 52.37 51.66 51.02 52.33 51.31 51.95 51.61 51.44
de 56.03 55.97 - 56.03 56.13 56.92 57.23 56.02 55.93 56.42 56.04 56.60 55.67 56.48 56.28 55.09
el 47.00 47.05 48.25 - 46.52 48.04 48.81 47.14 47.09 47.74 46.97 47.84 47.00 47.81 47.23 50.48
en 84.96 84.96 84.97 85.02 - 84.98 85.02 84.97 84.98 84.99 84.91 85.01 84.97 84.96 84.92 85.35
es 61.04 61.01 61.11 60.98 60.93 - 61.09 61.02 61.03 60.79 60.75 61.02 61.02 61.06 60.97 60.74
fr 61.33 61.27 61.48 61.28 61.33 61.70 - 61.29 61.30 61.31 61.19 61.42 61.28 61.37 61.41 60.60
hi 47.63 47.56 48.24 47.28 46.35 48.14 48.45 - 47.82 47.43 46.51 47.96 47.38 47.73 47.77 49.06
ru 51.90 52.16 52.46 51.89 51.49 52.50 52.52 51.62 - 51.86 51.51 52.07 51.71 51.94 51.83 52.12
sw 49.42 49.21 50.24 49.25 48.70 50.10 50.46 49.14 49.38 - 49.14 49.54 49.29 49.95 49.37 49.82
th 37.38 37.34 37.54 37.41 37.53 37.58 37.53 37.44 37.38 37.60 - 37.60 37.41 37.57 37.48 37.13
tr 53.41 53.47 53.79 53.33 53.28 53.85 53.80 53.38 53.47 53.54 53.28 - 53.23 53.58 53.43 50.68
ur 48.09 48.06 48.86 48.17 46.68 48.60 48.98 47.89 48.38 47.23 47.61 48.26 - 48.18 48.01 44.45
vi 56.20 56.33 56.62 56.17 56.24 56.61 56.75 56.15 56.27 56.19 56.10 56.07 56.13 - 56.23 56.91
zh 48.25 48.15 48.57 48.23 47.85 48.58 48.64 48.12 48.13 48.57 47.98 48.59 48.00 48.71 - 48.56
Table 6: Average of test accuracies on XNLI across 10 runs with few-shot X-MAML, using En-BERT (monolingual). The auxiliary language is not included in the evaluation phase.
Rows: target language; columns: auxiliary language; the last column is the baseline
target ar bg de el en es fr hi ru sw th tr ur vi zh baseline
ar - 39.09 37.32 40.90 34.48 36.49 36.65 39.24 39.10 38.09 35.48 38.36 39.79 37.46 37.03 34.47
bg 42.33 - 38.29 41.92 35.17 37.55 37.58 40.04 38.93 38.32 36.37 38.72 40.90 37.81 37.41 35.23
de 41.88 42.77 - 41.59 37.68 46.41 46.43 40.90 42.70 44.89 39.42 45.70 41.05 45.03 40.30 38.52
el 40.08 38.50 38.70 - 35.18 37.65 37.80 40.15 38.72 39.42 35.91 39.82 41.06 38.73 37.99 35.15
en 81.95 81.87 81.89 82.22 - 82.12 82.05 82.23 81.88 82.23 82.52 82.01 82.03 82.29 82.27 83.45
es 47.41 47.64 53.24 46.59 42.86 - 53.18 45.79 47.56 51.10 44.69 52.04 46.30 50.83 45.87 43.95
fr 45.55 46.40 49.81 44.81 40.08 49.89 - 44.14 46.30 48.05 42.13 48.54 44.24 48.67 43.58 41.04
hi 39.61 38.91 36.91 39.32 34.46 36.87 36.78 - 39.08 37.14 35.88 37.15 39.98 37.20 37.40 34.69
ru 41.87 38.73 39.10 41.98 35.05 38.02 38.13 40.73 - 38.89 36.11 39.51 41.12 38.51 37.69 35.09
sw 39.05 37.55 40.07 39.00 36.41 40.33 39.85 38.45 37.57 - 37.26 42.01 38.82 42.70 38.72 37.96
th 36.16 35.41 36.46 36.17 35.63 36.43 36.32 35.64 35.43 36.74 - 36.91 36.05 36.63 36.67 35.73
tr 39.33 37.62 41.44 39.42 37.34 42.07 41.26 38.83 37.63 44.12 38.23 - 38.97 43.42 39.86 38.84
ur 36.85 38.46 36.27 39.55 34.16 35.72 35.63 39.09 38.64 36.80 35.33 36.94 - 36.91 36.85 33.93
vi 41.85 39.35 42.97 41.62 38.53 43.85 42.52 40.53 39.38 45.46 39.89 45.11 41.63 - 41.84 40.72
zh 37.21 36.09 37.18 36.68 34.48 36.33 36.55 35.25 36.16 37.73 36.64 37.70 35.99 37.66 - 34.63
Table 7: Average of test accuracies on XNLI across 10 runs with zero-shot X-MAML, using En-BERT (monolingual). The auxiliary language is not included in the evaluation phase.




  1. Mostafa Abdou, Cezar Sas, Rahul Aralikatte, Isabelle Augenstein, and Anders Søgaard. 2019. X-WikiRE: A Large, Multilingual Resource for Relation Extraction as Machine Comprehension. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 265–274, Hong Kong, China. Association for Computational Linguistics.
  2. Zeljko Agic and Natalie Schluter. 2018. Baselines and Test Data for Cross-Lingual Inference. In LREC. European Language Resources Association (ELRA).
  3. Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981–3989.
  4. Emily M. Bender. 2019. The #BenderRule: On Naming the Languages We Study and Why It Matters.
  5. Johannes Bjerva and Isabelle Augenstein. 2018a. From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 907–916, New Orleans, Louisiana. Association for Computational Linguistics.
  6. Johannes Bjerva and Isabelle Augenstein. 2018b. Tracking Typological Features of Uralic Languages in Distributed Language Representations. In IWCLUL.
  7. Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. 2016. Enhancing and combining sequential and tree LSTM for natural language inference. CoRR, abs/1609.06038.
  8. Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067.
  9. Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  10. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
  11. Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. 2019. Investigating meta-learning algorithms for low-resource natural language understanding tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1192–1197, Hong Kong, China. Association for Computational Linguistics.
  12. Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  13. David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2019. Ethnologue: Languages of the world. Accessed: 2019-05-25.
  14. Manaal Faruqui and Shankar Kumar. 2015. Multilingual Open Relation Extraction Using Cross-lingual Projection. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1351–1356. Association for Computational Linguistics.
  15. Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org.
  16. Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3622–3631, Brussels, Belgium. Association for Computational Linguistics.
  17. Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4803–4809.
  18. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.
  19. Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop.
  20. Souvik Kundu and Hwee Tou Ng. 2018. A nil-aware answer extraction framework for question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4243–4252, Brussels, Belgium. Association for Computational Linguistics.
  21. Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.
  22. Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.
  23. Kun Qian and Zhou Yu. 2019. Domain adaptive dialog generation via meta learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2639–2649, Florence, Italy. Association for Computational Linguistics.
  24. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.
  25. Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In International Conference on Learning Representations.
  26. Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850.
  27. Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2016. Multilingual Relation Extraction using Compositional Universal Schema. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 886–896. Association for Computational Linguistics.
  28. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
  29. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461.
  30. Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  31. Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. CoRR, abs/1904.09077.
  32. Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1206–1215.