A Survey of Cross-lingual Word Embedding Models

Sebastian Ruder (sebastian@ruder.io)
Insight Research Centre, National University of Ireland, Galway, Ireland
Aylien Ltd., Dublin, Ireland

Ivan Vulić (iv250@cam.ac.uk)
Language Technology Lab, University of Cambridge, UK

Anders Søgaard (soegaard@di.ku.dk)
University of Copenhagen, Copenhagen, Denmark
Abstract

Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.

Journal of Artificial Intelligence Research 65 (2019) 569-631. Submitted 10/2017; published 08/2019.

1 Introduction

In recent years, (monolingual) vector representations of words, so-called word embeddings [Mikolov2013a, Pennington2014], have proven extremely useful across a wide range of natural language processing (NLP) applications. In parallel, the public awareness of the digital language divide (see, e.g., http://labs.theguardian.com/digital-language-divide/), as well as the availability of multilingual benchmarks [e.g., Nivre et al., 2016a; hovy2006ontonotes, sylak2015language], has made cross-lingual transfer a popular NLP research topic. The need to transfer lexical knowledge across languages has given rise to cross-lingual word embedding models, i.e., cross-lingual representations of words in a joint embedding space, as illustrated in Figure 1.

Cross-lingual word embeddings are appealing for two reasons: First, they enable us to compare the meaning of words across languages, which is key to bilingual lexicon induction, machine translation, or cross-lingual information retrieval, for example. Second, cross-lingual word embeddings enable model transfer between languages, e.g., between resource-rich and low-resource languages, by providing a common representation space. This duality is also reflected in how cross-lingual word embeddings are evaluated, as discussed in Section 10.

Figure 1: Unaligned monolingual word embeddings (left) and word embeddings projected into a joint cross-lingual embedding space (right). Embeddings are visualized with t-SNE.

Many models for learning cross-lingual embeddings have been proposed in recent years. In this survey, we will give a comprehensive overview of existing cross-lingual word embedding models. One of the main goals of this survey is to show the similarities and differences between these approaches. To facilitate this, we first introduce a common notation and terminology in Section 2. Over the course of the survey, we then show that existing cross-lingual word embedding models can be seen as optimizing very similar objectives, where the main source of variation is due to the data used, the monolingual and regularization objectives employed, and how these are optimized. As many cross-lingual word embedding models are inspired by monolingual models, we introduce the most commonly used monolingual embedding models in Section 3. We then motivate and introduce one of the main contributions of this survey, a typology of cross-lingual embedding models in Section 4. The typology is based on the main differentiating aspect of cross-lingual embedding models: the nature of the data they require, in particular the type of alignment across languages (alignment of words, sentences, or documents), and whether data is assumed to be parallel or just comparable (about the same topic). The typology allows us to outline similarities and differences more concisely, but also starkly contrasts focal points of research with fruitful directions that have so far gone mostly unexplored.

Since the idea of cross-lingual representations of words pre-dates word embeddings, we provide a brief history of cross-lingual word representations in Section 5. Subsequent sections are dedicated to each type of alignment. We discuss cross-lingual word embedding algorithms that rely on word-level alignments in Section 6. Such methods can be further divided into mapping-based approaches, approaches based on pseudo-bilingual corpora, and joint methods. We show that these approaches, modulo optimization strategies and hyper-parameters, are nevertheless often equivalent. We then discuss approaches that rely on sentence-level alignments in Section 7, and models that require document-level alignments in Section 8. In Section 9, we describe how many bilingual approaches that deal with a pair of languages can be extended to the multilingual setting. We subsequently provide an extensive discussion of the tasks, benchmarks, and challenges of the evaluation of cross-lingual embedding models in Section 10 and outline applications in Section 11. We present general challenges and future research directions in learning cross-lingual word representations in Section 12. Finally, we provide our conclusions in Section 13.

This survey makes the following contributions:

  1. It proposes a general typology that characterizes the differentiating features of cross-lingual word embedding models and provides a compact overview of these models.

  2. It standardizes terminology and notation and shows that many cross-lingual word embedding models can be cast as optimizing nearly the same objective functions.

  3. It provides a proof that connects the three types of word-level alignment models and shows that these models are optimizing roughly the same objective.

  4. It critically examines the standard ways of evaluating cross-lingual embedding models.

  5. It describes multilingual extensions for the most common types of cross-lingual embedding models.

  6. It outlines outstanding challenges for learning cross-lingual word embeddings and provides suggestions for fruitful and unexplored research directions.

2 Notation and Terminology

For clarity, we list all notation used throughout this survey in Table 1. We use bold lower case letters ($\mathbf{x}$) to denote vectors, bold upper case letters ($\mathbf{X}$) to denote matrices, and standard weight letters ($x$) for scalars. We use subscripts with bold letters ($\mathbf{X}_i$) to refer to entire rows or columns and subscripts with standard weight letters for specific elements ($X_{ij}$).

Let $\mathbf{X}^\ell \in \mathbb{R}^{|V^\ell| \times d}$ be a word embedding matrix that is learned for the $\ell$-th of $L$ languages, where $V^\ell$ is the corresponding vocabulary and $d$ is the dimensionality of the word embeddings. We will furthermore refer to the word embedding of the $i$-th word in language $\ell$ with the shorthand $\mathbf{x}_i^\ell$, or $\mathbf{x}_i$ if the language is clear from context. We will refer to the word corresponding to the $i$-th word embedding $\mathbf{x}_i$ as $w_i$, where $w_i$ is a string. To make this correspondence clearer, we will in some settings slightly abuse index notation and use $\mathbf{x}_{w_i}$ to indicate the embedding corresponding to word $w_i$. We will use $i$ to index words based on their order in the vocabulary $V$, while we will use $t$ to index words based on their order in a corpus $\mathcal{C}$.

Some monolingual word embedding models use a separate embedding for words that occur in the context of other words. We will use $\tilde{\mathbf{x}}_i$ as the embedding of the $i$-th context word and detail its meaning in the next section. Most approaches only deal with two languages, a source language $s$ and a target language $t$.

Symbol | Meaning
$\mathbf{X}^\ell$ | word embedding matrix in language $\ell$
$V$ | vocabulary
$d$ | word embedding dimensionality
$\mathbf{x}_i^\ell$ / $\mathbf{x}_i$ / $\mathbf{x}_{w_i}$ | word embedding of the $i$-th word in language $\ell$
$\tilde{\mathbf{x}}_i$ | word embedding of the $i$-th context word
$w_i$ | word pertaining to embedding $\mathbf{x}_i$
$\mathcal{C}$ | corpus of words / aligned sentences used for training
$w_t$ | the $t$-th word in a corpus
$s$ | source language
$t$ | target language
$\mathbf{W}^{s \to t}$ / $\mathbf{W}$ | learned transformation matrix between space of $s$ and $t$
$n$ | number of words used as seed words for learning $\mathbf{W}$
$\tau$ | function mapping from source words to their translations
$\mathbf{C}^\ell$ | monolingual co-occurrence matrix in language $\ell$
$C$ | size of context window around a center word
$\mathbf{A}^{s \to t}$ | cross-lingual co-occurrence matrix / alignment matrix
$\text{sent}_i^\ell$ | $i$-th sentence in language $\ell$
$\mathbf{y}_i^\ell$ | representation of $i$-th sentence in language $\ell$
$\text{doc}_i^\ell$ | $i$-th document in language $\ell$
$\mathbf{z}_i^\ell$ | representation of $i$-th document in language $\ell$
$\underline{\mathbf{X}}$ | $\mathbf{X}$ is kept fixed during optimization
$\underbrace{\;\cdot\;}_{1}\;\underbrace{\;\cdot\;}_{2}$ | the term under brace 1 is optimized before the term under brace 2
Table 1: Notation used throughout this survey.

Some approaches learn a matrix that can be used to transform the word embedding matrix $\mathbf{X}^s$ of the source language $s$ to that of the target language $t$. We will designate such a matrix by $\mathbf{W}^{s \to t}$, or $\mathbf{W}$ if the language pairing is unambiguous. These approaches often use $n$ source words and their translations as seed words. In addition, we will use $\tau$ as a function that maps from source words $w_i^s$ to their translations $w_i^t$: $\tau(w_i^s) = w_i^t$. Approaches that learn a transformation matrix are usually referred to as offline or mapping methods. As one of the goals of this survey is to standardize nomenclature, we will use the term mapping in the following to designate such approaches.

Some approaches require a monolingual word-word co-occurrence matrix $\mathbf{C}^\ell$ in language $\ell$. In such a matrix, every row corresponds to a word $w_i$ and every column corresponds to a context word $w_j$. $C^\ell_{ij}$ then captures the number of times word $w_i$ occurs with context word $w_j$, usually within a window of size $C$ to the left and right of word $w_i$. In a cross-lingual context, we obtain a matrix of alignment counts $\mathbf{A}^{s \to t}$, where each element $A^{s \to t}_{ij}$ captures the number of times the $i$-th source language word was aligned with the $j$-th target language word, with each row normalized to sum to $1$.

Finally, as some approaches rely on pairs of aligned sentences, we use $\text{sent}_1^s, \ldots, \text{sent}_n^s$ to designate sentences in source language $s$ with representations $\mathbf{y}_1^s, \ldots, \mathbf{y}_n^s$, where $\mathbf{y}_i^s \in \mathbb{R}^d$. We analogously refer to their aligned sentences in the target language as $\text{sent}_1^t, \ldots, \text{sent}_n^t$ with representations $\mathbf{y}_1^t, \ldots, \mathbf{y}_n^t$. We adopt an analogous notation for representations obtained by approaches based on alignments of documents in $s$ and $t$: $\text{doc}_1^s, \ldots, \text{doc}_n^s$ and $\text{doc}_1^t, \ldots, \text{doc}_n^t$ with document representations $\mathbf{z}_1^s, \ldots, \mathbf{z}_n^s$ and $\mathbf{z}_1^t, \ldots, \mathbf{z}_n^t$ respectively, where $\mathbf{z}_i \in \mathbb{R}^d$.

Different notations make similar approaches appear different. Using the same notation across our survey facilitates recognizing similarities between the various cross-lingual word embedding models. Specifically, we intend to demonstrate that cross-lingual word embedding models are trained by minimizing roughly the same objective functions, and that differences in objective are unlikely to explain the observed performance differences [Levy2017].

The class of objective functions minimized by most (if not all) cross-lingual word embedding methods can be formulated as follows:

$$J = \mathcal{L}_1 + \ldots + \mathcal{L}_L + \Omega \qquad (1)$$

where $\mathcal{L}_\ell$ is the monolingual loss of the $\ell$-th language and $\Omega$ is a regularization term. A similar loss was also defined by Upadhyay et al. (2016). As recent work [Levy2014, Levy2015a] shows that many monolingual losses are very similar, one of the main contributions of this survey is to condense the difference between approaches into a regularization term and to detail the assumptions that underlie different regularization terms.

Importantly, how this objective function is optimized is a key characteristic and differentiating factor between different approaches. The joint optimization of multiple non-convex losses is difficult. Most approaches thus take a step-wise approach and optimize one loss at a time while keeping certain variables fixed. Such a step-wise approach is approximate, as it does not guarantee to reach even a local optimum. (Other strategies, such as alternating optimization methods, e.g., EM [dempster1977maximum], could be used with the same objective.) In most cases, we will use a longer formulation such as the one below in order to make explicit in which order the losses are optimized and which variables they depend on:

$$J = \underbrace{\mathcal{L}_1(\mathbf{X}^s) + \mathcal{L}_2(\mathbf{X}^t)}_{1} + \underbrace{\Omega(\underline{\mathbf{X}^s}, \underline{\mathbf{X}^t}, \mathbf{W})}_{2} \qquad (2)$$

The underbraces indicate that the two monolingual loss terms on the left, which depend on $\mathbf{X}^s$ and $\mathbf{X}^t$ respectively, are optimized first. Note that this term decomposes into two separate monolingual optimization problems. Subsequently, $\Omega$ is optimized, which depends on $\mathbf{W}$. Underlined variables are kept fixed during optimization of the corresponding loss.

The monolingual losses are optimized by training one of several monolingual embedding models on a monolingual corpus. These models are outlined in the next section.

3 Monolingual Embedding Models

The majority of cross-lingual embedding models take inspiration from and extend monolingual word embedding models to bilingual settings, or explicitly leverage monolingually trained models. As an important preliminary, we thus briefly review monolingual embedding models that have been used in the cross-lingual embeddings literature.

Latent Semantic Analysis (LSA)

Latent Semantic Analysis [Deerwester1990] has been one of the most widely used methods for learning dense word representations. LSA is typically applied to factorize a sparse word-word co-occurrence matrix $\mathbf{C}$ obtained from a corpus. A common preprocessing method is to replace every entry in $\mathbf{C}$ with its pointwise mutual information (PMI) [Church1990] score:

$$\text{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} = \log \frac{\#(w_i, w_j) \cdot |\mathcal{C}|}{\#(w_i) \cdot \#(w_j)} \qquad (3)$$

where $\#(\cdot)$ counts the number of (co-)occurrences in the corpus $\mathcal{C}$. As $\text{PMI}(w_i, w_j) = \log 0 = -\infty$ for unobserved word pairs, such values are often set to $0$, which is also known as positive PMI.

The PMI matrix $\mathbf{P}$, where $P_{ij} = \text{PMI}(w_i, w_j)$, is then factorized using singular value decomposition (SVD), which decomposes $\mathbf{P}$ into the product of three matrices:

$$\mathbf{P} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{\top} \qquad (4)$$

where $\mathbf{U}$ and $\mathbf{V}$ are in column orthonormal form and $\boldsymbol{\Sigma}$ is a diagonal matrix of singular values. We subsequently obtain the word embedding matrix $\mathbf{X}$ by reducing the word representations to dimensionality $d$ the following way:

$$\mathbf{X} = \mathbf{U}_d \boldsymbol{\Sigma}_d \qquad (5)$$

where $\boldsymbol{\Sigma}_d$ is the diagonal matrix containing the top $d$ singular values and $\mathbf{U}_d$ is obtained by selecting the corresponding columns from $\mathbf{U}$.
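To make the LSA recipe above concrete, the following minimal sketch computes positive PMI scores from a dense co-occurrence count matrix and truncates the SVD to $d$ dimensions. It is an illustrative toy implementation (dense matrices, no smoothing), not the exact pipeline of any particular paper.

```python
import numpy as np

def ppmi_svd_embeddings(C, d=2):
    """Toy LSA-style embeddings: positive PMI weighting followed by truncated SVD.

    C: dense |V| x |V| matrix of raw co-occurrence counts.
    Returns a |V| x d word embedding matrix X = U_d Sigma_d.
    """
    total = C.sum()                                   # total co-occurrence mass |C|
    word_counts = C.sum(axis=1, keepdims=True)        # #(w_i)
    context_counts = C.sum(axis=0, keepdims=True)     # #(w_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C * total / (word_counts * context_counts))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)   # positive PMI
    U, S, _ = np.linalg.svd(ppmi)                     # P = U Sigma V^T
    return U[:, :d] * S[:d]                           # keep the top-d singular values
```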

Max-margin loss (MML)
Collobert and Weston (2008) learn word embeddings by training a model on a corpus $\mathcal{C}$ to output a higher score for a correct word sequence than for an incorrect one. For this purpose, they use a max-margin or hinge loss. (Equations in the literature slightly differ in how they handle corpus boundaries. To make comparing different monolingual methods easier, we define the sum as starting with the $(C+1)$-th word in the corpus, so that the first window includes the first word $w_1$, and ending with the $(|\mathcal{C}|-C)$-th word, so that the final window includes the last word $w_{|\mathcal{C}|}$.)

$$\mathcal{L}_{\text{MML}} = \sum_{t=C+1}^{|\mathcal{C}|-C} \sum_{w' \in V} \max\big(0,\, 1 - f(w_{t-C}, \ldots, w_t, \ldots, w_{t+C}) + f(w_{t-C}, \ldots, w', \ldots, w_{t+C})\big) \qquad (6)$$

The outer sum iterates over all words in the corpus $\mathcal{C}$, while the inner sum iterates over all words $w'$ in the vocabulary. Each word sequence consists of a center word $w_t$ and a window of $C$ words to its left and right. The neural network, which is given by the function $f(\cdot)$, consumes the sequence of word embeddings corresponding to the window of words and outputs a scalar. Using this max-margin loss, it is trained to produce a higher score for a word window occurring in the corpus (the first score in Equation 6) than for a word sequence where the center word is replaced by an arbitrary word $w'$ from the vocabulary (the second score).

Skip-gram with negative sampling (SGNS)

Skip-gram with negative sampling [Mikolov2013a] is arguably the most popular method to learn monolingual word embeddings due to its training efficiency and robustness [Levy2015a]. SGNS approximates a language model but focuses on learning efficient word representations rather than accurately modeling word probabilities. It induces representations that are good at predicting surrounding context words $w_{t+j}$ given a target word $w_t$. To this end, it minimizes the negative log-likelihood of the training data under the following skip-gram objective:

$$\mathcal{L}_{\text{SGNS}} = -\frac{1}{|\mathcal{C}|} \sum_{t=1}^{|\mathcal{C}|} \sum_{-C \leq j \leq C,\, j \neq 0} \log P(w_{t+j} \mid w_t) \qquad (7)$$

$P(w_{t+j} \mid w_t)$ is computed using the softmax function:

$$P(w_{t+j} \mid w_t) = \frac{\exp(\tilde{\mathbf{x}}_{w_{t+j}}^{\top} \mathbf{x}_{w_t})}{\sum_{k=1}^{|V|} \exp(\tilde{\mathbf{x}}_{w_k}^{\top} \mathbf{x}_{w_t})} \qquad (8)$$

where $\mathbf{x}_w$ and $\tilde{\mathbf{x}}_w$ are the word and context word embeddings of word $w$ respectively. The skip-gram architecture can be seen as a simple neural network: The network takes as input a one-hot representation of a word and produces a probability distribution over the vocabulary $V$. The embedding matrix $\mathbf{X}$ and the context embedding matrix $\tilde{\mathbf{X}}$ are simply the input-hidden and (transposed) hidden-output weight matrices respectively. The neural network has no nonlinearity, so the model is equivalent to a matrix product (similar to Equation 4) followed by a softmax.

As the partition function in the denominator of the softmax is expensive to compute, SGNS uses Negative Sampling, which approximates the softmax to make it computationally more efficient. Negative sampling is a simplification of Noise Contrastive Estimation [Gutmann2012], which was applied to language modeling by Mnih and Teh (2012). Similar to noise contrastive estimation, negative sampling trains the model to distinguish a target word $w_t$ from negative samples drawn from a 'noise distribution' $P_n$. In this regard, it is similar to MML as defined above, which ranks true sentences above noisy sentences. Negative sampling is defined as follows:

$$\mathcal{L}_{\text{SGNS}} = -\frac{1}{|\mathcal{C}|} \sum_{t=1}^{|\mathcal{C}|} \sum_{-C \leq j \leq C,\, j \neq 0} \Big( \log \sigma(\tilde{\mathbf{x}}_{w_{t+j}}^{\top} \mathbf{x}_{w_t}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n} \log \sigma(-\tilde{\mathbf{x}}_{w_i}^{\top} \mathbf{x}_{w_t}) \Big) \qquad (9)$$

where $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ and $k$ is the number of negative samples. The distribution $P_n$ is empirically set to the unigram distribution raised to the $3/4$-th power. Levy and Goldberg (2014) observe that negative sampling does not in fact minimize the negative log-likelihood of the training data as in Equation 7, but rather implicitly factorizes a shifted PMI matrix similar to LSA.
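As a concrete illustration of Equation 9, the following toy sketch evaluates the negative-sampling term for a single (target, context) pair and draws negatives from the unigram distribution raised to the 3/4 power. Function names and the dense numpy representation are illustrative choices, not part of any reference implementation.

```python
import numpy as np

def sample_negatives(unigram_counts, k, rng):
    """Draw k negative word indices from the unigram distribution raised to the 3/4 power."""
    p = unigram_counts.astype(float) ** 0.75
    p /= p.sum()
    return rng.choice(len(p), size=k, p=p)

def sgns_pair_loss(x_target, x_context, X_neg_contexts):
    """Negative-sampling loss for one (target, context) pair.

    x_target: (d,) embedding of the center word.
    x_context: (d,) context embedding of the observed context word.
    X_neg_contexts: (k, d) context embeddings of k sampled noise words.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    pos = np.log(sigmoid(x_context @ x_target))               # log sigma(x~_c . x_t)
    neg = np.log(sigmoid(-X_neg_contexts @ x_target)).sum()   # sum_i log sigma(-x~_i . x_t)
    return -(pos + neg)                                       # minimized during training
```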

Continuous bag-of-words (CBOW)

While skip-gram predicts each context word separately from the center word, continuous bag-of-words jointly predicts the center word based on all context words. The model receives as input a window of $2C$ context words and seeks to predict the target word $w_t$ by minimizing the CBOW objective:

$$\mathcal{L}_{\text{CBOW}} = -\frac{1}{|\mathcal{C}|} \sum_{t=1}^{|\mathcal{C}|} \log P(w_t \mid w_{t-C}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+C}) \qquad (10)$$

$$P(w_t \mid w_{t-C}, \ldots, w_{t+C}) = \frac{\exp(\tilde{\mathbf{x}}_{w_t}^{\top} \bar{\mathbf{x}}_t)}{\sum_{k=1}^{|V|} \exp(\tilde{\mathbf{x}}_{w_k}^{\top} \bar{\mathbf{x}}_t)} \qquad (11)$$

where $\bar{\mathbf{x}}_t$ is the sum of the word embeddings of the words surrounding $w_t$, i.e. $\bar{\mathbf{x}}_t = \sum_{-C \leq j \leq C,\, j \neq 0} \mathbf{x}_{w_{t+j}}$. The CBOW architecture is typically also trained with negative sampling for the same reason as the skip-gram model.

Global vectors (GloVe)

Global vectors [Pennington2014] allows us to learn word representations via matrix factorization. GloVe minimizes the difference between the dot product of the embeddings of a word $w_i$ and its context word $w_j$ and the logarithm of their number of co-occurrences within a certain window size (GloVe favors slightly larger window sizes, up to 10 words to the right and to the left of the target word, compared to SGNS [Levy2015a]):

$$\mathcal{L}_{\text{GloVe}} = \sum_{i, j = 1}^{|V|} f(C_{ij}) \big( \mathbf{x}_{w_i}^{\top} \tilde{\mathbf{x}}_{w_j} + b_i + \tilde{b}_j - \log C_{ij} \big)^2 \qquad (12)$$

where $b_i$ and $\tilde{b}_j$ are the biases corresponding to word $w_i$ and word $w_j$, $C_{ij}$ captures the number of times word $w_i$ occurs with word $w_j$, and $f(\cdot)$ is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences. If we fix $b_i = \log \#(w_i)$ and $\tilde{b}_j = \log \#(w_j)$, then GloVe is equivalent to factorizing a PMI matrix, shifted by $\log |\mathcal{C}|$ [Levy2015a].
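A minimal sketch of the GloVe objective in Equation 12 is given below, assuming dense co-occurrence counts and the standard weighting function with a cap at 100 co-occurrences; variable names and defaults are illustrative.

```python
import numpy as np

def glove_weight(C, c_max=100.0, alpha=0.75):
    """Weighting f(.): down-weights rare co-occurrences, caps the weight of frequent ones."""
    return np.minimum((C / c_max) ** alpha, 1.0)

def glove_loss(X, X_tilde, b, b_tilde, C):
    """GloVe objective over a dense |V| x |V| co-occurrence count matrix C.

    X, X_tilde: (|V|, d) word and context embedding matrices.
    b, b_tilde: (|V|,) bias vectors.
    """
    observed = C > 0                                   # f(0) = 0, so only observed pairs count
    pred = X @ X_tilde.T + b[:, None] + b_tilde[None, :]
    log_counts = np.log(np.where(observed, C, 1.0))    # avoid log(0) on unobserved pairs
    err = np.where(observed, pred - log_counts, 0.0)
    return np.sum(glove_weight(C) * err ** 2)
```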

4 Cross-Lingual Word Embedding Models: Typology

Recent work on cross-lingual embedding models suggests that the actual choice of bilingual supervision signal—that is, the data a method requires to learn to align a cross-lingual representation space—is more important for the final model performance than the actual underlying architecture [Upadhyay2016, Levy2017]. In other words, large differences between models typically stem from their data requirements, while other fine-grained differences are artifacts of the chosen architecture, hyper-parameters, and additional tricks and fine-tuning employed. This directly mirrors the argument raised by Levy et al. (2015) regarding monolingual embedding models: They observe that the architecture is less important as long as the models are trained under identical conditions on the same type (and amount) of data.

We therefore base our typology on the data requirements of the cross-lingual word embedding methods, as this accounts for much of the variation in performance. In particular, methods differ with regard to the data they employ along the following two dimensions:

  1. Type of alignment: Methods use different types of bilingual supervision signals (at the level of words, sentences, or documents), which introduce stronger or weaker supervision.

  2. Comparability: Methods require either parallel data sources, that is, exact translations in different languages, or comparable data, which is only similar in some way (e.g., about the same topic).

Parallel Comparable
Word Dictionaries Images
Sentence Translations Captions
Document - Wikipedia
Table 2: Nature and alignment level of bilingual data sources required by cross-lingual embedding models.
(a) Word, par.
(b) Word, comp.
(c) Sentence, par.
(d) Sentence, comp.
(e) Doc., comp.
Figure 2: Examples for the nature and type of alignment of data sources. Par.: parallel. Comp.: comparable. Doc.: document. From left to right, word-level parallel alignment in the form of a bilingual lexicon (a), word-level comparable alignment using images obtained with Google search queries (b), sentence-level parallel alignment with translations (c), sentence-level comparable alignment using translations of several image captions (d), and document-level comparable alignment using similar documents (e).

In particular, there are three different types of alignments that are possible, which are required by different methods. We discuss the typical data sources for both parallel and comparable data based on the following alignment signals:

  1. Word alignment: Most approaches use parallel word-aligned data in the form of a bilingual or cross-lingual dictionary with pairs of translations between words in different languages [Mikolov2013b, Faruqui2014]. Parallel word alignment can also be obtained by automatically aligning words in a parallel corpus (see below), which can be used to produce bilingual dictionaries. Throughout this survey, we thus do not differentiate between the source of word alignment, whether it comes from type-aligned (dictionaries) or token-aligned data (automatically aligned parallel corpora). Comparable word-aligned data, even though more plentiful, has been leveraged less often and typically involves other modalities such as images [Bergsma2011, Kiela2015b].

  2. Sentence alignment: Sentence alignment requires a parallel corpus, as commonly used in machine translation (MT). Methods typically use the Europarl corpus [Koehn2005], which consists of sentence-aligned text from the proceedings of the European parliament, and is perhaps the most common source of training data for MT models [Hermann2013, Lauly2013]. Other methods use available word-level alignment information [Zou2013, Shi2015]. There has been some work on extracting parallel data from comparable corpora [Munteanu2006], but no-one has so far trained cross-lingual word embeddings on such data. Comparable data with sentence alignment may again leverage another modality, such as captions of the same image or similar images in different languages, which are not translations of each other [Calixto2017, Gella2017].

  3. Document alignment: Parallel document-aligned data requires documents in different languages that are translations of each other. This is rare, as parallel documents typically consist of aligned sentences [Hermann2014]. Comparable document-aligned data is more common and can occur in the form of documents that are topic-aligned (e.g. Wikipedia) or class-aligned (e.g. sentiment analysis and multi-class classification datasets) [Vulic:2013emnlp, Mogadala2016].

We summarize the different types of data required by cross-lingual embedding models along these two dimensions in Table 2 and provide examples for each in Figure 2. Over the course of this survey we will show that models that use a particular type of data are mostly variations of the same or similar architectures. We present our complete typology of cross-lingual embedding models in Table 3, aiming to provide an exhaustive overview by classifying each model (we are aware of) into one of the corresponding model types. We also provide a more detailed overview of the monolingual objectives and regularization terms used by every approach towards the end of this survey in Table 5.

Word — Mapping
  Parallel: Mikolov et al. (2013b); Faruqui and Dyer (2014); Lazaridou et al. (2015); Dinu et al. (2015); Xing et al. (2015); Lu et al. (2015); Vulić and Korhonen (2016); Ammar et al. (2016b); Zhang et al. (2016b, 2017ab); Artetxe et al. (2016, 2017, 2018ab); Smith et al. (2017); Hauer et al. (2017); Mrkšić et al. (2017b); Conneau et al. (2018a); Joulin et al. (2018); Alvarez-Melis and Jaakkola (2018); Ruder et al. (2018); Glavaš et al. (2019)
  Comparable: Bergsma and Van Durme (2011); Kiela et al. (2015); Vulić et al. (2016)
Word — Pseudo-bilingual
  Parallel: Xiao and Guo (2014); Gouws and Søgaard (2015); Duong et al. (2016); Adams et al. (2017)
  Comparable: Duong et al. (2015)
Word — Joint
  Parallel: Klementiev et al. (2012); Kočiský et al. (2014)
Sentence — Matrix factorization
  Parallel: Zou et al. (2013); Shi et al. (2015); Gardner et al. (2015); Guo et al. (2015); Vyas and Carpuat (2016)
Sentence — Compositional
  Parallel: Hermann and Blunsom (2013, 2014); Soyer et al. (2015)
Sentence — Autoencoder
  Parallel: Lauly et al. (2013); Chandar et al. (2014)
Sentence — Skip-gram
  Parallel: Gouws et al. (2015); Luong et al. (2015); Coulmance et al. (2015); Pham et al. (2015)
Sentence — Other
  Parallel: Levy et al. (2017); Rajendran et al. (2016)
  Comparable: Calixto et al. (2017); Gella et al. (2017)
Document
  Comparable: Vulić and Moens (2013a, 2014, 2016); Søgaard et al. (2015); Mogadala and Rettinger (2016)
Table 3: Cross-lingual embedding models ordered by data requirements.

5 A Brief History of Cross-Lingual Word Representations

We provide a brief overview of the historical lineage of cross-lingual word embedding models. In brief, while cross-lingual word embeddings are a relatively recent phenomenon, many of the high-level ideas that motivate current research in this area can be found in work that pre-dates the popular introduction of word embeddings. This includes work on learning cross-lingual word representations from seed lexica, parallel data, or document-aligned data, as well as ideas on learning from limited bilingual supervision.

Language-independent representations have been proposed for decades, many of which rely on abstract linguistic labels instead of lexical features [Aone:1993acl, Schultz:2001sc]. This is also the strategy used in early work on so-called delexicalized cross-lingual and domain transfer [Zeman:2008ijcnlp, Soegaard:11:dp, McDonald:2011emnlp, Cohen:ea:11, Tackstrom:2012naacl, Henderson:2014slt], as well as in work on inducing cross-lingual word clusters [Tackstrom:2012naacl, Faruqui:2013acl], and cross-lingual word embeddings relying on syntactic/POS contexts [Duong:2015conll, Dehouck:2017eacl]. (Along the same line, the recent initiative on providing cross-linguistically consistent sets of such labels [e.g., Universal Dependencies; Nivre:2015ud] facilitates cross-lingual transfer and offers further support to the induction of word embeddings across languages [Vulic:2017eacl, Vulic:2017conll].) The ability to represent lexical items from two different languages in a shared cross-lingual space supplements seminal work in cross-lingual transfer by providing fine-grained word-level links between languages; see work in cross-lingual dependency parsing [Ammar:2016tacl, Zeman:2017conll] and natural language understanding systems (Mrkšić et al., 2017b).

Similar to our typology of cross-lingual word embedding models outlined in Table 3 based on bilingual data requirements from Table 2, one can also arrange older cross-lingual representation architectures into similar categories. A traditional approach to cross-lingual vector space induction was based on high-dimensional context-counting vectors where each dimension encodes the (weighted) co-occurrences with a specific context word in each of the languages. The vectors are then mapped into a single cross-lingual space using a seed bilingual dictionary containing paired context words from both sides [Rapp:1999acl, Gaussier:2004acl, Laroche:2010coling, Tamura:2012emnlp, inter alia]. This approach is an important predecessor to the cross-lingual word embedding models described in Section 6. Similarly, the bootstrapping technique developed for traditional context-counting approaches [Peirsman:2010naacl, Vulic:2013emnlp] is an important predecessor to recent iterative self-learning techniques used to limit the bilingual dictionary seed supervision needed in mapping-based approaches [Hauer2017, Artetxe:2017acl]. The idea of CCA-based word embedding learning [see later in Section 6; Faruqui2014, Lu:2015naacl] was also introduced a decade earlier [Haghighi:2008acl]; their work additionally discussed the idea of combining orthographic subword features with distributional signatures for cross-lingual representation learning. This idea re-entered the literature recently [Heyman:2017eacl], only now with much better performance.

Cross-lingual word embeddings can also be directly linked to the work on word alignment for statistical machine translation [Brown:1993cl, Och:2003cl]. Levy et al. (2017) stress that word translation probabilities extracted from sentence-aligned parallel data by IBM alignment models can also act as the cross-lingual semantic similarity function in lieu of the cosine similarity between word embeddings. Such word translation tables are then used to induce bilingual lexicons. For instance, aligning each word in a given source language sentence with the most similar target language word from the target language sentence is exactly the same greedy decoding algorithm that is implemented in IBM Model 1. Bilingual dictionaries and cross-lingual word clusters derived from word alignment links can be used to boost cross-lingual transfer for applications such as syntactic parsing [Tackstrom:2012naacl, Durrett:2012emnlp], POS tagging [Agic:2015acl], or semantic role labeling [Kozhevnikov:2013acl] by relying on shared lexical information stored in the bilingual lexicon entries. Exactly the same functionality can be achieved by cross-lingual word embeddings. However, cross-lingual word embeddings have another advantage in the era of neural networks: the continuous representations can be plugged into current end-to-end neural architectures directly as sets of lexical features.

A large body of work on multilingual probabilistic topic modeling [Vulic:2015ipm, Boyd:2017book] also extracts shared cross-lingual word spaces, now by means of conditional latent topic probability distributions: two words with similar distributions over the induced latent variables/topics are considered semantically similar. The learning process is again steered by the data requirements. The early days witnessed the use of pseudo-bilingual corpora constructed by merging aligned document pairs, and then applying a monolingual representation model such as LSA [Landauer:1997] or LDA [Blei:2003jmlr] on top of the merged data [Littman:1998, DeSmet:2011pakdd]. This approach is very similar to the pseudo-cross-lingual approaches discussed in Section 6 and Section 8. More recent topic models learn on the basis of parallel word-level information, enforcing word pairs from seed bilingual lexicons (again!) to obtain similar topic distributions [Boyd:2009uai, Zhang:2010acl, Boyd:2010emnlp, Jagarlamudi:2010ecir]. In consequence, this also influences topic distributions of related words not occurring in the dictionary. Another group of models utilizes alignments at the document level [Mimno:2009emnlp, Platt:2010emnlp, Vulic:2011acl, Fukumasu:2012nips, Heyman:2016dami] to induce shared topical spaces. The very same level of supervision (i.e., document alignments) is used by several cross-lingual word embedding models, surveyed in Section 8. Another embedding model based on the document-aligned Wikipedia structure [Sogaard2015] bears resemblance with the cross-lingual Explicit Semantic Analysis model [Gabrilovich:2006aaai, Hassan:2009emnlp, Sorg:2012dke].

All these “historical” architectures measure the strength of cross-lingual word similarities through metrics defined in the cross-lingual space: e.g., Kullback-Leibler or Jensen-Shannon divergence (in a topic space), or vector inner products (in a sparse context-counting vector space), and are therefore applicable to NLP tasks that rely on cross-lingual similarity scores. The pre-embedding architectures and more recent cross-lingual word embedding methods have been applied to an overlapping set of evaluation tasks and applications, ranging from bilingual lexicon induction to cross-lingual knowledge transfer, including cross-lingual parser transfer [Tackstrom:2012naacl, Ammar:2016tacl], cross-lingual document classification [Gabrilovich:2006aaai, DeSmet:2011pakdd, Klementiev2012, Hermann2014], cross-lingual relation extraction [Faruqui:2015naaclshort], etc. In summary, while sharing the goals and assumptions of older cross-lingual architectures, cross-lingual word embedding methods have capitalized on the recent methodological and algorithmic advances in the field of representation learning, owing their wide use to their simplicity, efficiency and handling of large corpora, as well as their relatively robust performance across domains.

6 Word-Level Alignment Models

In the following, we discuss the different types of the current generation of cross-lingual embedding models, starting with models based on word-level alignment. Among these, models based on parallel data are more common.

6.1 Word-level Alignment Methods with Parallel Data

We distinguish three methods that use parallel word-aligned data:

  • Mapping-based approaches that first train monolingual word representations independently on large monolingual corpora and then seek to learn a transformation matrix that maps representations in one language to the representations of the other language. They learn this transformation from word alignments or bilingual dictionaries (we do not see a need to distinguish between the two).

  • Pseudo-multi-lingual corpora-based approaches that use monolingual word embedding methods on automatically constructed (or corrupted) corpora that contain words from both the source and the target language.

  • Joint methods that take parallel text as input and minimize the source and target language monolingual losses jointly with the cross-lingual regularization term.

We will show that modulo optimization strategies, these approaches are equivalent. Before discussing the first category of methods, we briefly introduce two concepts that are of relevance in these and the subsequent sections.

Bilingual lexicon induction

Bilingual lexicon induction is the intrinsic task that is most commonly used to evaluate current cross-lingual word embedding models. Briefly, given a list of source language word forms $w_1^s, \ldots, w_n^s$, the goal is to determine the most appropriate translation $w_i^t$ for each query form $w_i^s$. This is commonly accomplished by finding the target language word whose embedding $\mathbf{x}^t$ is the nearest neighbour to the source word embedding $\mathbf{x}^s$ in the shared semantic space, where similarity is usually computed as the cosine similarity between their embeddings. See Section 10 for more details.

Hubness

Hubness [radovanovic2010hubs] is a phenomenon observed in high-dimensional spaces where some points (known as hubs) are the nearest neighbours of many other points. As translations are assumed to be nearest neighbours in cross-lingual embedding space, hubness has been reported to affect cross-lingual word embedding models.

6.1.1 Mapping-based Approaches

Mapping-based approaches are by far the most prominent category of cross-lingual word embedding models and—due to their conceptual simplicity and ease of use—are currently the most popular. Mapping-based approaches aim to learn a mapping from the monolingual embedding spaces to a joint cross-lingual space. Approaches in this category differ along multiple dimensions:

  1. The mapping method that is used to transform the monolingual embedding spaces into a cross-lingual embedding space.

  2. The seed lexicon that is used to learn the mapping.

  3. The refinement of the learned mapping.

  4. The retrieval of the nearest neighbours.

Mapping Methods

There are four types of mapping methods that have been proposed:

  1. Regression methods map the embeddings of the source language to the target language space by maximizing their similarity.

  2. Orthogonal methods map the embeddings in the source language to maximize their similarity with the target language embeddings, but constrain the transformation to be orthogonal.

  3. Canonical methods map the embeddings of both languages to a new shared space, which maximizes their similarity.

  4. Margin methods map the embeddings of the source language to maximize the margin between correct translations and other candidates.

Figure 3: Similar geometric relations between numbers and animals in English and Spanish [Mikolov2013b]. Word embeddings are projected to two dimensions using PCA and were manually rotated to emphasize similarities.
Regression methods

One of the most influential methods for learning a mapping is the linear transformation method by Mikolov et al. (2013b). The method is motivated by the observation that words and their translations show similar geometric constellations in monolingual embedding spaces after an appropriate linear transformation is applied, as illustrated in Figure 3. This suggests that it is possible to transform the vector space of a source language $s$ to the vector space of the target language $t$ by learning a linear projection with a transformation matrix $\mathbf{W}^{s \to t}$. We use $\mathbf{W}$ in the following if the direction is unambiguous.

Using the $n$ most frequent words from the source language and their translations as seed words, they learn $\mathbf{W}$ using stochastic gradient descent by minimising the squared Euclidean distance (mean squared error, MSE) between the previously learned monolingual representation of the source seed word $\mathbf{x}_i^s$, which is transformed using $\mathbf{W}$, and its translation $\mathbf{x}_i^t$ in the bilingual dictionary:

$$\Omega_{\text{MSE}} = \sum_{i=1}^{n} \| \mathbf{W}\mathbf{x}_i^s - \mathbf{x}_i^t \|_2^2 \qquad (13)$$

This can also be written in matrix form as minimizing the squared Frobenius norm of the residual matrix:

$$\Omega_{\text{MSE}} = \| \mathbf{W}\mathbf{X}^s - \mathbf{X}^t \|_F^2 \qquad (14)$$

where $\mathbf{X}^s$ and $\mathbf{X}^t$ are the embedding matrices of the $n$ seed words in the source and target language respectively, with one embedding per column. Analogously, the problem can be seen as finding a least squares solution to a system of linear equations with multiple right-hand sides:

$$\mathbf{W}\mathbf{X}^s = \mathbf{X}^t \qquad (15)$$

A common solution to this problem enables calculating $\mathbf{W}$ analytically as $\mathbf{W} = \mathbf{X}^t \mathbf{X}^{s+}$, where $\mathbf{X}^{s+}$ is the Moore-Penrose pseudoinverse of $\mathbf{X}^s$.

In our notation, the MSE mapping approach can be seen as optimizing the following objective:

$$J = \underbrace{\mathcal{L}_{\text{SGNS}}(\mathbf{X}^s) + \mathcal{L}_{\text{SGNS}}(\mathbf{X}^t)}_{1} + \underbrace{\Omega_{\text{MSE}}(\underline{\mathbf{X}^s}, \underline{\mathbf{X}^t}, \mathbf{W})}_{2} \qquad (16)$$

First, each of the monolingual losses is optimized independently. Second, the regularization term $\Omega_{\text{MSE}}$ is optimized while keeping the induced monolingual embeddings fixed. The basic approach of Mikolov et al. (2013b) has later been adopted by many others who for instance incorporated $\ell_2$ regularization [Dinu2015]. A common preprocessing step that is applied to the monolingual embeddings is to normalize them to be unit length. Xing et al. (2015) argue that this normalization resolves an inconsistency between the metric used for training (dot product) and the metric used for evaluation (cosine similarity); note that for unit vectors, dot product and cosine similarity are equivalent. Artetxe et al. (2016) motivate length normalization to ensure that all training instances contribute equally to the objective.
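A minimal sketch of this regression mapping is shown below; it uses the row-vector convention (embeddings as rows, so the map is applied as Xs @ W, the transpose of the column form in Equation 14), and the function and variable names are illustrative.

```python
import numpy as np

def learn_linear_mapping(Xs, Xt):
    """Least-squares mapping in the spirit of Mikolov et al. (2013b).

    Xs, Xt: (n, d) matrices holding the source and target embeddings of n seed
    translation pairs, one pair per row. Returns W minimizing ||Xs W - Xt||_F^2.
    """
    W, *_ = np.linalg.lstsq(Xs, Xt, rcond=None)   # closed form via the pseudoinverse
    return W

def translate(x_src, W, Xt_full, target_words):
    """Map a source embedding and return the nearest target word by cosine similarity."""
    z = x_src @ W
    sims = Xt_full @ z / (np.linalg.norm(Xt_full, axis=1) * np.linalg.norm(z) + 1e-9)
    return target_words[int(np.argmax(sims))]
```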

Orthogonal methods

The most common way in which the basic regression method of the previous section has been improved is to constrain the transformation $\mathbf{W}$ to be orthogonal, i.e. $\mathbf{W}^{\top}\mathbf{W} = \mathbf{I}$. The exact solution under this constraint is $\mathbf{W} = \mathbf{U}\mathbf{V}^{\top}$ and can be efficiently computed in linear time with respect to the vocabulary size using SVD, where $\mathbf{X}^t \mathbf{X}^{s\top} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$. This constraint is motivated by Xing et al. (2015) as a way to preserve length normalization. Artetxe et al. (2016) motivate orthogonality as a means to ensure monolingual invariance. An orthogonality constraint has also been used to regularize the mapping [Zhang2016, Zhang2017] and has been motivated theoretically to be self-consistent [Smith2017].
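The orthogonal (Procrustes) solution can be computed in a few lines; the sketch below again uses the row-vector convention, so the relevant SVD is of $\mathbf{X}^{s\top}\mathbf{X}^t$. It is an illustrative toy version, not a reference implementation.

```python
import numpy as np

def procrustes_mapping(Xs, Xt):
    """Orthogonal Procrustes mapping over n seed pairs.

    Xs, Xt: (n, d) seed embeddings, one translation pair per row.
    Returns an orthogonal W such that Xs @ W approximates Xt.
    """
    U, _, Vt = np.linalg.svd(Xs.T @ Xt)   # SVD of the d x d cross-correlation matrix
    return U @ Vt                          # orthogonal by construction: W.T @ W = I
```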

Canonical methods

Canonical methods map the embeddings in both languages to a shared space using Canonical Correlation Analysis (CCA). Haghighi et al. (2008) were the first to use this method for learning translation lexicons for the words of different languages. Faruqui and Dyer (2014) later applied CCA to project words from two languages into a shared embedding space. Whereas linear projection only learns one transformation matrix $\mathbf{W}^{s \to t}$ to project the embedding space of a source language into the space of a target language, CCA learns a transformation matrix for the source and target language, $\mathbf{W}^{s \to}$ and $\mathbf{W}^{t \to}$ respectively, to project them into a new joint space that is different from both the space of $s$ and of $t$. We can write the correlation between a projected source language embedding vector $\mathbf{W}^{s \to}\mathbf{x}_i^s$ and its corresponding projected target language embedding vector $\mathbf{W}^{t \to}\mathbf{x}_i^t$ as:

$$\rho\big(\mathbf{W}^{s \to}\mathbf{x}_i^s,\, \mathbf{W}^{t \to}\mathbf{x}_i^t\big) = \frac{\text{cov}\big(\mathbf{W}^{s \to}\mathbf{x}_i^s,\, \mathbf{W}^{t \to}\mathbf{x}_i^t\big)}{\sqrt{\text{var}\big(\mathbf{W}^{s \to}\mathbf{x}_i^s\big)\,\text{var}\big(\mathbf{W}^{t \to}\mathbf{x}_i^t\big)}} \qquad (17)$$

where $\text{cov}(\cdot, \cdot)$ is the covariance and $\text{var}(\cdot)$ is the variance. CCA then aims to maximize the correlation (or analogously minimize the negative correlation) between the projected vectors $\mathbf{W}^{s \to}\mathbf{x}_i^s$ and $\mathbf{W}^{t \to}\mathbf{x}_i^t$:

$$\Omega_{\text{CCA}} = -\sum_{i=1}^{n} \rho\big(\mathbf{W}^{s \to}\mathbf{x}_i^s,\, \mathbf{W}^{t \to}\mathbf{x}_i^t\big) \qquad (18)$$

We can write their objective in our notation as the following:

$$J = \underbrace{\mathcal{L}_1(\mathbf{X}^s) + \mathcal{L}_2(\mathbf{X}^t)}_{1} + \underbrace{\Omega_{\text{CCA}}(\underline{\mathbf{X}^s}, \underline{\mathbf{X}^t}, \mathbf{W}^{s \to}, \mathbf{W}^{t \to})}_{2} \qquad (19)$$

Faruqui and Dyer (2014) propose to use only the projection vectors with the highest correlations. Lu et al. (2015) incorporate a non-linearity into the canonical method by training two deep neural networks to maximize the correlation between the projections of both monolingual embedding spaces. Ammar et al. (2016) extend the canonical approach to multiple languages.
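For illustration, a compact numpy sketch of CCA over the seed pairs is given below; it computes the projection matrices via whitening and an SVD of the cross-covariance matrix, with a small ridge term for numerical stability. All names and defaults are illustrative assumptions, not the exact setup of Faruqui and Dyer (2014).

```python
import numpy as np

def cca_projections(Xs, Xt, k):
    """CCA projections into a shared k-dimensional space.

    Xs, Xt: (n, d) embeddings of n seed translation pairs, one pair per row.
    Returns Ws, Wt; project with (Xs - mean) @ Ws and (Xt - mean) @ Wt.
    """
    Xs = Xs - Xs.mean(axis=0)
    Xt = Xt - Xt.mean(axis=0)
    n = Xs.shape[0]
    reg = 1e-8 * np.eye(Xs.shape[1])
    Css = Xs.T @ Xs / n + reg              # within-language covariances
    Ctt = Xt.T @ Xt / n + reg
    Cst = Xs.T @ Xt / n                    # cross-covariance
    # Whiten each space, then take the SVD of the whitened cross-covariance.
    Rs = np.linalg.inv(np.linalg.cholesky(Css)).T
    Rt = np.linalg.inv(np.linalg.cholesky(Ctt)).T
    U, _, Vt = np.linalg.svd(Rs.T @ Cst @ Rt)
    Ws = Rs @ U[:, :k]                     # source projection matrix
    Wt = Rt @ Vt.T[:, :k]                  # target projection matrix
    return Ws, Wt
```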

Artetxe et al. (2016) show that the canonical method is similar to the orthogonal method with dimension-wise mean centering. Artetxe et al. (2018) show that regression methods, canonical methods, and orthogonal methods can be seen as instances of a framework that includes optional whitening and de-whitening steps, which further demonstrates the similarity of existing approaches.

Margin methods
Lazaridou et al. (2015) optimize a max-margin based ranking loss instead of MSE to reduce hubness. This max-margin based ranking loss is essentially the same as the MML [Collobert2008] used for learning monolingual embeddings. Instead of assigning higher scores to correct sentence windows, we now try to assign a higher cosine similarity to word pairs that are translations of each other ($\mathbf{x}_i^s$, $\mathbf{x}_i^t$; first term below) than to random word pairs ($\mathbf{x}_i^s$, $\mathbf{x}_j^t$; second term):

$$\Omega_{\text{MML}} = \sum_{i=1}^{n} \sum_{j \neq i} \max\big(0,\, 1 - \cos(\mathbf{W}\mathbf{x}_i^s, \mathbf{x}_i^t) + \cos(\mathbf{W}\mathbf{x}_i^s, \mathbf{x}_j^t)\big) \qquad (20)$$

The choice of the negative examples against which we compare the translations is crucial. Dinu et al. (2015) propose to select negative examples that are more informative by being near the current projected vector $\mathbf{W}\mathbf{x}_i^s$ but far from the actual translation vector $\mathbf{x}_i^t$. Unlike random intruders, such intelligently chosen intruders help the model identify training instances where the model considerably fails to approximate the target function. In the formulation adopted in this article, their objective becomes:

$$J = \underbrace{\mathcal{L}_1(\mathbf{X}^s) + \mathcal{L}_2(\mathbf{X}^t)}_{1} + \underbrace{\Omega_{\text{MML-I}}(\underline{\mathbf{X}^s}, \underline{\mathbf{X}^t}, \mathbf{W})}_{2} \qquad (21)$$

where $\Omega_{\text{MML-I}}$ designates $\Omega_{\text{MML}}$ with intruders as negative examples. More recently, Joulin et al. (2018) proposed a margin-based method, which replaces cosine similarity with CSLS, a distance function more suited to bilingual lexicon induction that will be discussed in the retrieval section.

Among the presented mapping approaches, orthogonal methods are the most commonly adopted as the orthogonality constraint improves over the basic regression method.

The Seed Lexicon

The seed lexicon is another core component of any mapping-based approach. In the past, three types of seed lexicons have been used to learn a joint cross-lingual word embedding space:

  1. An off-the-shelf bilingual lexicon.

  2. A weakly supervised bilingual lexicon.

  3. A learned bilingual lexicon.

Off-the-shelf

Most early approaches [Mikolov2013b] employed off-the-shelf or automatically generated bilingual lexicons of frequent words. While Mikolov et al. (2013b) used as many as 5,000 pairs, later approaches reduce the number of seed pairs, demonstrating that it is feasible to learn a cross-lingual word embedding space with as few as 25 seed pairs [Artetxe:2017acl].

Weak supervision

Other approaches employ weak supervision to create seed lexicons based on cognates [Smith2017], shared numerals [Artetxe:2017acl], or identically spelled strings [Soegaard2018]. Such weak supervision is easy to obtain and has been shown to produce results that are competitive with off-the-shelf lexicons.

Learned

Recently, approaches have been proposed that learn an initial seed lexicon in a completely unsupervised way. Interestingly, so far, all unsupervised cross-lingual word embedding methods are based on the mapping approach. Conneau et al. (2018) learn an initial mapping in an adversarial way by additionally training a discriminator to differentiate between projected and actual target language embeddings. Artetxe et al. (2018b) propose to use an initialisation method based on the heuristic that translations have similar similarity distributions across languages. Hoshen and Wolf (2018) first project vectors of the most frequent words to a lower-dimensional space with PCA. They then aim to find an optimal transformation that minimizes the sum of Euclidean distances by learning $\mathbf{W}^{s \to t}$ and $\mathbf{W}^{t \to s}$, and enforce cyclical consistency constraints that force vectors round-projected to the other language space and back to remain unchanged. Alvarez-Melis and Jaakkola (2018) solve an optimal transport problem in order to learn an alignment between the monolingual word embedding spaces.

The Refinement

Many mapping-based approaches propose to refine the mapping to improve the quality of the initial seed lexicon. Vulić and Korhonen (2016) propose to learn a first shared bilingual embedding space based on an existing cross-lingual embedding model. They retrieve the translations of frequent source words in this cross-lingual embedding space, which they use as seed words to learn a second mapping. To ensure that the retrieved translations are reliable, they propose a symmetry constraint: Translation pairs are only retained if their projected embeddings are mutual nearest neighbours in the cross-lingual embedding space. This constraint is meant to reduce hubness and has been adopted later in many subsequent methods that rely heavily on refinement [Conneau2018, Artetxe2018b].

Rather than just performing one step of refinement, Artetxe et al. (2017) propose a method that iteratively learns a new mapping by using translation pairs from the previous mapping. Training is terminated when the improvement on the average dot product for the induced dictionary falls below a given threshold from one iteration to the next. Ruder et al. (2018) solve a sparse linear assignment problem in order to refine the mapping. As discussed in Section 5, the refinement idea is conceptually similar to the work of Peirsman and Padó (2010, 2011) and Vulić and Moens (2013), with the difference that earlier approaches were developed within the traditional cross-lingual distributional framework (mapping vectors into the count-based space using a seed lexicon). Glavaš et al. (2019) propose to learn a matrix $\mathbf{W}^{s \to t}$ and a matrix $\mathbf{W}^{t \to s}$. They then use the intersection of the translation pairs obtained from both mappings in the subsequent iteration. In practice, one step of refinement is often sufficient, as in the second iteration a large number of noisier word translations are automatically generated [Glavas2019].

While refinement is less crucial when a large seed lexicon is available, approaches that learn a mapping from a small seed lexicon or in a completely unsupervised way rely on refinement [Conneau2018, Artetxe2018b].
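The sketch below combines the two ideas above: an iterative self-learning loop in the spirit of Artetxe et al. (2017), using the orthogonal Procrustes solution and the mutual-nearest-neighbour symmetry constraint as the (assumed) dictionary update rule; a fixed iteration count stands in for the dot-product-based stopping criterion.

```python
import numpy as np

def normalize(X):
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-9)

def self_learning(Xs, Xt, seed_pairs, n_iter=5):
    """Iterative refinement of a mapping from a (small) seed dictionary.

    Xs, Xt: full monolingual embedding matrices (rows are words).
    seed_pairs: list of (source_index, target_index) seed translations.
    """
    Xs, Xt = normalize(Xs), normalize(Xt)
    pairs = list(seed_pairs)
    W = np.eye(Xs.shape[1])
    for _ in range(n_iter):
        A = Xs[[i for i, _ in pairs]]
        B = Xt[[j for _, j in pairs]]
        U, _, Vt = np.linalg.svd(A.T @ B)       # orthogonal Procrustes on current dictionary
        W = U @ Vt
        sims = (Xs @ W) @ Xt.T                  # cosine similarities (rows are unit length)
        s2t = sims.argmax(axis=1)               # best target for every source word
        t2s = sims.argmax(axis=0)               # best source for every target word
        # keep only mutual nearest neighbours as the new, less noisy dictionary
        pairs = [(i, j) for i, j in enumerate(s2t) if t2s[j] == i]
    return W, pairs
```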

The Retrieval

Most existing methods retrieve translations as the nearest neighbours of the source word embeddings in the cross-lingual embedding space based on cosine similarity. Dinu et al. (2015) propose to use a globally corrected neighbour retrieval method instead to reduce hubness. Smith et al. (2017) propose a similar solution to the hubness issue: they invert the softmax used for finding the translation of a word at test time and normalize the probability over source words rather than target words. Conneau et al. (2018) propose an alternative similarity measure called cross-domain similarity local scaling (CSLS), which is defined as:

$$\text{CSLS}(\mathbf{W}\mathbf{x}^s, \mathbf{x}^t) = 2 \cos(\mathbf{W}\mathbf{x}^s, \mathbf{x}^t) - r_T(\mathbf{W}\mathbf{x}^s) - r_S(\mathbf{x}^t) \qquad (22)$$

where $r_T(\mathbf{W}\mathbf{x}^s)$ is the mean similarity of the projected source word to its target language neighbourhood, defined as $r_T(\mathbf{W}\mathbf{x}^s) = \frac{1}{K} \sum_{\mathbf{x}^t \in \mathcal{N}_T(\mathbf{W}\mathbf{x}^s)} \cos(\mathbf{W}\mathbf{x}^s, \mathbf{x}^t)$, where $\mathcal{N}_T(\mathbf{W}\mathbf{x}^s)$ is the neighbourhood of the projected source word; $r_S(\mathbf{x}^t)$ is defined analogously over the neighbourhood of the target word among projected source words. Intuitively, CSLS increases the similarity of isolated word vectors and decreases the similarity of hubs. CSLS has been shown to significantly increase the accuracy of bilingual lexicon induction and is nowadays mostly used in lieu of cosine similarity for nearest neighbour retrieval. Joulin et al. (2018) propose to optimize this metric directly when learning the mapping, as noted above. Recently, Artetxe et al. (2019) propose an alternative retrieval method that relies on building a phrase-based MT system from the cross-lingual word embeddings. The MT system is used to generate a synthetic parallel corpus, from which the bilingual lexicon is extracted. The approach has been shown to outperform CSLS retrieval significantly.
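A minimal numpy sketch of CSLS scoring over length-normalized embeddings follows; the neighbourhood size k=10 and the batch-matrix formulation are illustrative choices.

```python
import numpy as np

def csls_scores(Xs_mapped, Xt, k=10):
    """CSLS scores between all projected source and target words.

    Xs_mapped: (ns, d) projected, length-normalized source embeddings.
    Xt: (nt, d) length-normalized target embeddings.
    Returns an (ns, nt) score matrix; the translation of source i is argmax of row i.
    """
    cos = Xs_mapped @ Xt.T                               # cosine similarities
    r_T = np.sort(cos, axis=1)[:, -k:].mean(axis=1)      # mean sim. to k nearest targets
    r_S = np.sort(cos, axis=0)[-k:, :].mean(axis=0)      # mean sim. to k nearest sources
    return 2 * cos - r_T[:, None] - r_S[None, :]
```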

Cross-lingual embeddings via retro-fitting

While not strictly a mapping-based approach as it fine-tunes specific monolingual word embeddings, another way to leverage word-level supervision is through the framework of retro-fitting [Faruqui2015]. The main idea behind retro-fitting is to inject knowledge from semantic lexicons into pre-trained word embeddings. Retro-fitting creates a new word embedding matrix $\hat{\mathbf{X}}$ whose embeddings $\hat{\mathbf{x}}_i$ are both close to the corresponding learned monolingual word embeddings $\mathbf{x}_i$ as well as close to their neighbors in a knowledge graph:

$$\Omega_{\text{retro}} = \sum_{i=1}^{|V|} \Big[ \alpha_i \| \hat{\mathbf{x}}_i - \mathbf{x}_i \|^2 + \sum_{(i, j) \in E} \beta_{ij} \| \hat{\mathbf{x}}_i - \hat{\mathbf{x}}_j \|^2 \Big] \qquad (23)$$

where $E$ is the set of edges in the knowledge graph and $\alpha$ and $\beta$ control the strength of the contribution of each term.
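Equation 23 admits a simple closed-form coordinate update that is iterated to convergence; the sketch below is a toy version of that update (uniform alpha and beta, a dict-of-lists graph), not the exact implementation of Faruqui et al. (2015).

```python
import numpy as np

def retrofit(X, edges, alpha=1.0, beta=1.0, n_iter=10):
    """Retrofit pre-trained embeddings to a lexicon graph.

    X: (|V|, d) pre-trained embeddings.
    edges: dict mapping a word index to the list of its neighbour indices in the lexicon.
    """
    X_hat = X.copy()
    for _ in range(n_iter):
        for i, neighbours in edges.items():
            if not neighbours:
                continue
            # each update moves x_hat_i to a weighted average of its original vector
            # and the current vectors of its lexicon neighbours
            num = alpha * X[i] + beta * X_hat[neighbours].sum(axis=0)
            X_hat[i] = num / (alpha + beta * len(neighbours))
    return X_hat
```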

While the initial retrofitting work focused solely on monolingual word embeddings [Faruqui2015, Wieting:15], Mrkšić et al. (2017b) derive both monolingual and cross-lingual synonymy and antonymy constraints from cross-lingual BabelNet synsets. They then use these constraints to bring the monolingual vector spaces of two different languages together into a shared embedding space. Such retrofitting approaches employ a max-margin loss with a careful selection of intruders, similar to the work of Lazaridou et al. (2015). While through these external constraints retro-fitting methods can capture relations that are more complex than a linear transformation (as with mapping-based approaches), the original post-processing retrofitting approaches are limited to words that are contained in the semantic lexicons, and do not generalise to words unobserved in the external semantic databases. In other words, the goal of retrofitting methods is to refine vectors of words for which additional high-quality lexical information exists in the external resource, while the methods still back off to distributional vector estimates for all other words.

To remedy the issue with words unobserved in the external resources and learn a global transformation of the entire distributional space in both languages, several methods have been proposed. Post-specialisation approaches first fine-tune vectors of words observed in the external resources, and then aim to learn a global transformation function using the original distributional vectors and their retrofitted counterparts as training pairs. The transformation function can be implemented as a deep feed-forward neural network with non-linear transformations [Vulic:2018post], or it can be enriched by an adversarial component that tries to distinguish between distributional and retrofitted vectors [Ponti:2018emnlp]. While this is a two-step process (1. retrofitting, 2. global transformation learning), an alternative approach [Vulic2018] learns a global transformation function directly in one step using external lexical knowledge. Furthermore, Pandey et al. (2017) explored the orthogonal idea of using cross-lingual word embeddings to transfer the regularization effect of knowledge bases using retrofitting techniques.

6.1.2 Word-level Approaches based on Pseudo-bilingual Corpora

Rather than learning a mapping between the source and the target language, some approaches use the word-level alignment of a seed bilingual dictionary to construct a pseudo-bilingual corpus by randomly replacing words in a source language corpus with their translations. Xiao and Guo (2014) propose the first such method. Using an initial seed lexicon, they create a joint cross-lingual vocabulary, in which each translation pair occupies the same vector representation. They train this model using MML [Collobert2008] by feeding it context windows of both the source and target language corpus.

Other approaches explicitly create a pseudo-bilingual corpus: Gouws and Søgaard (2015) concatenate the source and target language corpus and replace each word that is part of a translation pair with its translation equivalent, with a probability that is inversely proportional to the total number of possible translation equivalents for the word, and train CBOW on this corpus. Ammar et al. (2016) extend this approach to multiple languages: Using bilingual dictionaries, they determine clusters of synonymous words in different languages. They then concatenate the monolingual corpora of different languages and replace tokens in the same cluster with the cluster ID. Finally, they train SGNS on the concatenated corpus. A sketch of this kind of corpus construction is given below.
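The following toy sketch builds such a pseudo-bilingual corpus by concatenating the two corpora and randomly swapping in translations; the replacement probability and the uniform choice among translation candidates are illustrative defaults, not the exact scheme of any single paper.

```python
import random

def pseudo_bilingual_corpus(src_tokens, tgt_tokens, dictionary, p_replace=0.5, seed=0):
    """Build a mixed corpus for training a monolingual model (e.g. CBOW or SGNS) on.

    dictionary: dict mapping a word (in either language) to a list of its translations.
    """
    rng = random.Random(seed)
    mixed = []
    for token in src_tokens + tgt_tokens:           # concatenate both monolingual corpora
        translations = dictionary.get(token)
        if translations and rng.random() < p_replace:
            mixed.append(rng.choice(translations))  # swap in a randomly chosen translation
        else:
            mixed.append(token)
    return mixed
```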

Duong et al. (2016) propose a similar approach. Instead of randomly replacing every word in the corpus with its translation, they replace each center word with a translation on-the-fly during CBOW training. In addition, they handle polysemy explicitly by proposing an EM-inspired method that chooses as replacement the translation $w^t$ whose representation is most similar to the sum of the source word representation $\mathbf{x}_{w_t}$ and the sum of the context embeddings $\bar{\mathbf{x}}_t$ as in Equation 11:

$$w^t = \operatorname*{argmax}_{w \in \tau(w_t)} \; \cos\big(\mathbf{x}_{w_t} + \bar{\mathbf{x}}_t,\; \mathbf{x}_w\big) \qquad (24)$$

They jointly learn to predict both the words and their appropriate translations, using PanLex as the seed bilingual dictionary. PanLex covers around 1,300 languages with about 12 million expressions. Consequently, translations are high coverage but often noisy. Adams et al. (2017) use the same approach for pre-training cross-lingual word embeddings for low-resource language modeling.

As we will show shortly, methods based on pseudo-bilingual corpora optimize a similar objective to the mapping-based methods we have previously discussed. In practice, however, pseudo-bilingual methods are more expensive as they require training cross-lingual word embeddings from scratch based on the concatenation of large monolingual corpora. In contrast, mapping-based approaches are much more computationally efficient as they leverage pretrained monolingual word embeddings, while the mapping can be learned very efficiently.

6.1.3 Joint Models

While the previous approaches either optimize a set of monolingual losses and then the cross-lingual regularization term (mapping-based approaches), or optimize a monolingual loss and implicitly—via data manipulation—a cross-lingual regularization term, joint models optimize monolingual and cross-lingual objectives at the same time. In what follows, we discuss a few illustrative example models which sparked this sub-line of research.

Bilingual language model
\citeauthor

Klementiev2012 (2012) cast learning cross-lingual representations as a multi-task learning problem. They jointly optimize a source language and target language model together with a cross-lingual regularization term that encourages words that are often aligned with each other in a parallel corpus to be similar. The monolingual objective is the classic LM objective of minimizing the negative log likelihood of the current word given its previous context words:

(25)

For the cross-lingual regularization term, they first obtain an alignment matrix that indicates how often each source language word was aligned with each target language word from parallel data such as the Europarl corpus [koehn2009statistical]. The cross-lingual regularization term then encourages the representations of source and target language words that are often aligned in to be similar:

(26)

where is the identity matrix and is the Kronecker product, which intuitively “blows up” each element of to the size of . The final regularization term will be the sum of and the analogous term for the other direction (). Note that Equation (24) is a weighted (by word alignment scores) average of inner products, and hence, for unit length normalized embeddings, equivalent to approaches that maximize the sum of the cosine similarities of aligned word pairs. Using to encode interaction is borrowed from linear multi-task learning models [cavallanti2010linear]. There, an interaction matrix encodes the relatedness between tasks. The complete objective is the following:

(27)
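To make the alignment-weighted regularization term described above concrete, the following sketch computes a squared-distance penalty weighted by alignment counts. The function and variable names are ours, and the Kronecker-product formulation of the original multi-task objective is only mirrored implicitly here.

```python
import numpy as np

def alignment_regularizer(X_src, X_trg, A):
    """Alignment-weighted penalty: sum_ij A[i, j] * ||x_i - y_j||^2, where
    A[i, j] counts (or normalizes) how often source word i was aligned to
    target word j. For unit-length rows this is, up to constants, equivalent
    to maximizing the alignment-weighted sum of cosine similarities."""
    sq_dists = (
        (X_src ** 2).sum(1)[:, None]      # ||x_i||^2
        + (X_trg ** 2).sum(1)[None, :]    # ||y_j||^2
        - 2.0 * X_src @ X_trg.T           # -2 <x_i, y_j>
    )
    return float((A * sq_dists).sum())

# Toy usage: 3 source words, 4 target words, 5-dimensional embeddings.
rng = np.random.default_rng(0)
X_s, X_t = rng.standard_normal((3, 5)), rng.standard_normal((4, 5))
A = rng.random((3, 4))
A /= A.sum(axis=1, keepdims=True)  # row-normalized alignment counts
print(alignment_regularizer(X_s, X_t, A))
```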
Joint learning of word embeddings and word alignments
\citeauthorKocisky2014a (2014) simultaneously learn word embeddings and word-level alignments using a distributed version of FastAlign [Dyer2013] together with a language model.777FastAlign is a fast and effective variant of IBM Model 2. Similar to other bilingual approaches, they use the word in the source language sentence of an aligned sentence pair to predict the word in the target language sentence.

They replace the standard multinomial translation probability of FastAlign with an energy function that tries to bring the representation of a target word close to the sum of the representations of the context words around its aligned word in the source sentence:

(28)

where and are the representations of source word and target word respectively, is a projection matrix, and and are representation and word biases respectively. The method is trained via Expectation Maximization. Note that this model is conceptually very similar to bilingual models that discard word-level alignment information and learn solely on the basis of sentence-aligned information, which we discuss in Section 7.1.

6.1.4 Sometimes Mapping, Joint and Pseudo-bilingual Approaches are Equivalent

Below we show that while mapping, joint, and pseudo-bilingual approaches seem intuitively very different, they can sometimes be very similar and, in fact, equivalent. We demonstrate this by first defining a pseudo-bilingual approach that is equivalent to an established joint learning technique, and by then showing that the same joint learning technique is equivalent to a popular mapping-based approach (for a particular hyper-parameter setting).

We define Constrained Bilingual SGNS. First, recall that in the negative sampling objective of SGNS in Equation 9, the probability of observing a word with a context word with representations and respectively is given as , where denotes the sigmoid function. We now sample a set of negative examples, that is, contexts with which does not occur, as well as actual contexts consisting of pairs, and try to maximize the above for actual contexts and minimize it for negative samples. Second, recall that \citeauthorMikolov2013b \citeyearMikolov2013b obtain cross-lingual embeddings by running SGNS over two monolingual corpora of two different languages at the same time with the constraint that words known to be translation equivalents, according to some dictionary , have the same representation. We will refer to this as Constrained Bilingual SGNS. This is also the approach taken in \citeauthorXiao2014 (2014). is a function from words into their translation equivalents with the representation . With some abuse of notation, we can write the Constrained Bilingual SGNS objective for the source language (idem for the target language):

(29)

In pseudo-bilingual approaches, we instead sample sentences from the corpora in the two languages. When we encounter a word for which we have a translation, we flip a coin and, if heads, replace it with a randomly selected one of its translation equivalents. In the case where the dictionary is bijective, as in the work of \citeauthorXiao2014 (2014), it is easy to see that the two approaches are equivalent in the limit: sampling mixed word-context pairs, both will converge to the same representations. We can loosen the requirement that the dictionary is bijective. To see this, assume a small set of word-context pairs over a source and a target vocabulary. To obtain a mixed corpus such that running SGNS directly on it will, in the limit, induce the same representations, simply enumerate all combinations of each word, its translation equivalents, and their contexts. Note that this is exactly the mixed corpus one would obtain in the limit with the approach of \citeauthorGouws2015a (2015). Since this reduction generalizes to all such examples, this construction provides a constructive demonstration that for any Constrained Bilingual SGNS model, there exists a corpus such that pseudo-bilingual sampling learns the same embeddings as this model. To complete the demonstration, we need to establish equivalence in the other direction: since the mixed corpus constructed using the method of \citeauthorGouws2015a (2015) samples from all replacements licensed by the dictionary, all words in the dictionary become, in the limit, distributionally similar and are represented by the same vector. This is exactly Constrained Bilingual SGNS. It thus follows that:

Lemma 1.

Pseudo-bilingual sampling is, in the limit, equivalent to Constrained Bilingual SGNS.
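To make the constraint behind Constrained Bilingual SGNS concrete, here is a minimal sketch (names are ours) that builds a joint vocabulary in which a word and its dictionary translations share a single embedding index, so that any standard SGNS implementation run over the mixed corpus updates one shared vector per translation cluster. It assumes a roughly one-to-one dictionary, as in the bijective case discussed above.

```python
def tie_translation_pairs(src_vocab, trg_vocab, dictionary):
    """Map each word and its dictionary translations to one shared index.
    Assumes a (roughly) one-to-one dictionary; words without translations
    get their own index."""
    index, next_id = {}, 0
    for word in src_vocab + trg_vocab:
        if word in index:
            continue
        index[word] = next_id
        for translation in dictionary.get(word, []):
            index.setdefault(translation, next_id)
        next_id += 1
    return index

# Toy usage: "house"/"haus" and "red"/"rot" share a row in the embedding matrix.
print(tie_translation_pairs(
    ["house", "red", "blue"], ["haus", "rot", "blau"],
    {"house": ["haus"], "red": ["rot"]},
))
```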

While mapping-based and joint approaches seem very different at first sight, they can also be very similar—and, in fact, sometimes equivalent. We give an example of this by demonstrating that two methods in the literature are equivalent under some hyper-parameter settings:

Consider the mapping approach in \citeauthorFaruqui2015 (2015) (retro-fitting) and Constrained Bilingual SGNS [Xiao2014]. Retro-fitting requires two pretrained monolingual embeddings. Let us assume these embeddings were induced using SGNS with a set of hyper-parameters . Retro-fitting minimizes the weighted sum of the Euclidean distances between the seed words and their translation equivalents and their neighbors in the monolingual embeddings, with a parameter that controls the strength of the regularizer. As this parameter goes to infinity, translation equivalents will be forced to have the same representation. As is the case in Constrained Bilingual SGNS, all word pairs in the seed dictionary will be associated with the same vector representation.

Since retro-fitting only affects words in the seed dictionary, the representation of the words not seen in the seed dictionary is determined entirely by the monolingual objectives. Again, this is the same as in Constrained Bilingual SGNS. In other words, if we fix the SGNS hyper-parameters for retro-fitting and Constrained Bilingual SGNS and let the regularization strength in retro-fitting go to infinity, retro-fitting is equivalent to Constrained Bilingual SGNS.

Lemma 2.

Retro-fitting of SGNS vector spaces with the regularization strength going to infinity is equivalent to Constrained Bilingual SGNS.888All other hyper-parameters are shared and equal, including the dimensionality of the vector spaces.

Proof.

We provide a simple bidirectional constructive proof, defining a translation function from each retro-fitting model , with and source and target SGNS embeddings, and an equivalence relation between source and target embeddings , with , to a Constrained Bilingual SGNS model , and back.

Retro-fitting minimizes the weighted sum of the Euclidean distances between the seed words and their translation equivalents and their neighbors in the monolingual embeddings, with a parameter that controls the strength of the regularizer. As this parameter goes to infinity, translation equivalents will be forced to have the same representation. In both retro-fitting and Constrained Bilingual SGNS, only words in the seed dictionary are directly affected by the regularization; the other words are affected only indirectly, by being penalized for not being close to distributionally similar words.

We therefore define the translation function so that translation equivalents are assigned the same representation. Since this function is bijective, it also provides the backward function from Constrained Bilingual SGNS models to retro-fitting models. This completes the proof that retro-fitting of SGNS vector spaces and Constrained Bilingual SGNS are equivalent when the regularization strength goes to infinity.
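The limit behaviour used in the proof can be observed directly in a minimal retro-fitting sketch such as the one below. The iterative update and the names alpha (attachment to the original vector) and beta (regularization strength) are our illustrative choices; as beta grows, linked vectors collapse onto a shared representation, as the lemma requires.

```python
import numpy as np

def retrofit(E, lexicon, alpha=1.0, beta=1.0, iterations=10):
    """Iteratively pull each word towards its lexicon links (strength beta)
    while staying close to its original embedding (strength alpha).
    E: dict word -> vector; lexicon: dict word -> list of linked words."""
    Q = {w: v.copy() for w, v in E.items()}
    for _ in range(iterations):
        for w, links in lexicon.items():
            nbrs = [n for n in links if n in Q]
            if not nbrs:
                continue
            Q[w] = (alpha * E[w] + beta * sum(Q[n] for n in nbrs)) / (alpha + beta * len(nbrs))
    return Q

# Toy usage: with a very large beta, translation equivalents become (near-)identical.
E = {"house": np.array([1.0, 0.0]), "haus": np.array([0.0, 1.0])}
Q = retrofit(E, {"house": ["haus"], "haus": ["house"]}, alpha=1.0, beta=1e6, iterations=50)
print(Q["house"], Q["haus"])
```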

6.2 Word-Level Alignment Methods with Comparable Data

All previous methods required word-level parallel data. We categorize methods that employ word-level alignment with comparable data into two types:

  • Language grounding models anchor language in images and use image features to obtain information with regard to the similarity of words in different languages.

  • Comparable feature models rely on the comparability of some other features. The main feature that has been explored in this context is part-of-speech (POS) tag equivalence.

Grounding language in images

Most methods employing word-aligned comparable data ground words from different languages in image data. The idea in all of these approaches is to use the image space as the shared cross-lingual signal. For instance, bicycles always look like bicycles even if they are referred to as “fiets”, “Fahrrad”, “bicikl”, “bicicletta”, or “velo”. A set of images for each word is typically retrieved using Google Image Search. \citeauthorBergsma2011 (2011) calculate a similarity score for a pair of words based on the visual similarity of their associated image sets. They propose two strategies to calculate the cosine similarity between the color and SIFT features of two image sets: They either take the average of the maximum similarity scores (AvgMax) or the maximum of the maximum similarity scores (MaxMax). \citeauthorKiela2015b (2015) propose to do the same but use CNN-based image features. \citeauthorVulic2016 (2016) in addition propose to combine image and word representations either by interpolating and concatenating them or by interpolating the linguistic and visual similarity scores.
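The two aggregation strategies can be sketched in a few lines; the code below assumes pre-extracted, L2-normalized image feature vectors (e.g., from a CNN), and the function name is illustrative.

```python
import numpy as np

def image_set_similarity(F_a, F_b, strategy="AvgMax"):
    """Word similarity from two image feature sets (one row per image).
    AvgMax averages, MaxMax takes the maximum of, each image's best cosine
    match in the other set; features are assumed to be L2-normalized."""
    cos = F_a @ F_b.T                 # pairwise cosine similarities
    best = cos.max(axis=1)            # best match in the other set per image
    return best.mean() if strategy == "AvgMax" else best.max()

# Toy usage with random, normalized 4- and 5-image sets of 128-dim features.
rng = np.random.default_rng(0)
F_a = rng.standard_normal((4, 128)); F_a /= np.linalg.norm(F_a, axis=1, keepdims=True)
F_b = rng.standard_normal((5, 128)); F_b /= np.linalg.norm(F_b, axis=1, keepdims=True)
print(image_set_similarity(F_a, F_b, "AvgMax"), image_set_similarity(F_a, F_b, "MaxMax"))
```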

A similar idea of grounding language for learning multimodal multilingual representations can be applied to sensory signals beyond vision, e.g., auditory or olfactory signals [Kiela2015a]. This line of work, however, is currently under-explored. Moreover, signals from other modalities typically seem more useful as an additional source of information that complements the uni-modal signals from text than as the sole source of information.

POS tag equivalence

Other approaches rely on comparability between certain features of a word, such as its part-of-speech tag. \citeauthorGouws2015a (2015) create a pseudo-cross-lingual corpus by replacing words based on part-of-speech equivalence, that is, words with the same part-of-speech in different languages are replaced with one another. Instead of using the POS tags of the source and target words as a bridge between the two languages, we can also use the POS tags of their contexts. This makes strong assumptions about the similarity of word order in the two languages, but \citeauthorDuong:2015conll (2015) use this idea to obtain cross-lingual word embeddings for several language pairs. They use POS tags as context features and run SGNS on the concatenation of two monolingual corpora. Note that under the (overly simplistic) assumptions that all instances of a part-of-speech tag have the same distribution and that each word belongs to a single part-of-speech class, this approach is equivalent to the pseudo-cross-lingual corpus approach described before.

Summary

Overall, parallel data on the word level is generally preferred over comparable data, as it is relatively easy to obtain for most language pairs, and methods relying on parallel data have been shown to outperform methods leveraging comparable data. Among methods relying on word-aligned parallel data, even though they optimize similar objectives, mapping-based approaches are the current tool of choice for learning cross-lingual word embeddings due to their conceptual simplicity, ease of use, and relatively low computational cost. As monolingual word embeddings have already been learned from large amounts of unlabelled data, the mapping can typically be produced in tens of minutes on a CPU. While unsupervised mapping-based approaches are particularly promising, they still fail for distant language pairs [Soegaard2018] and generally, despite some claims to the contrary, underperform their supervised counterparts [Glavas2019]. At this point, the most robust unsupervised method is the heuristics-based initialisation method by \citeauthorArtetxe2018b \citeyearArtetxe2018b, while the most robust supervised method is the extension of the Procrustes method with mutual nearest neighbours by \citeauthorGlavas2019 (2019). We discuss challenges in Section 12.

7 Sentence-Level Alignment Methods

Thanks to research in MT, large amounts of sentence-aligned parallel data are available for European languages, which has led to much work focusing on learning cross-lingual representations from sentence-aligned parallel data. For low-resource languages or new domains, sentence-aligned parallel data is more expensive to obtain than word-aligned data as it requires fine-grained supervision. Only recently have methods started leveraging sentence-aligned comparable data.

7.1 Sentence-Level Methods with Parallel data

Methods leveraging sentence-aligned data are generally extensions of successful monolingual models. We have detected four main types:

  • Word-alignment based matrix factorization approaches apply matrix factorization techniques to the bilingual setting and typically require additional word alignment information.

  • Compositional sentence models use word representations to construct sentence representations of aligned sentences, which are trained to be close to each other.

  • Bilingual autoencoder models reconstruct source and target sentences using an autoencoder.

  • Bilingual skip-gram models use the skip-gram objective to predict words in both source and target sentences.

Word-alignment based matrix factorization

Several methods directly leverage the information contained in an alignment matrix between the source and the target language. Such an alignment matrix is generally derived automatically from sentence-aligned parallel text using an unsupervised word alignment model such as FastAlign [Dyer2013]. Its entries capture the number of times a given source language word was aligned with a given target language word, with each row normalized to sum to one. The intuition is that if a word in the source language is only aligned with one word in the target language, then those words should have the same representation. If the target word is aligned with more than one source word, then its representation should be a combination of the representations of its aligned words. \citeauthorZou2013 (2013) represent the embeddings in the target language as the product of the source embeddings with the corresponding alignment matrix. They then minimize the squared difference between these two terms in both directions:

(30)

Note that can be seen as a variant of , which incorporates soft weights from alignments. In contrast to mapping-based approaches, the alignment matrix, which transforms source to target embeddings, is fixed in this case, while the corresponding source embeddings are learned:

(31)
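As a minimal sketch of this alignment-based objective (for one direction), the snippet below measures how far the target embeddings are from the alignment-projected source embeddings; the names and the row normalization are illustrative assumptions.

```python
import numpy as np

def alignment_projection_loss(X_src, X_trg, A):
    """Squared error ||X_trg - A X_src||_F^2 with A row-normalized, so each
    target embedding is compared to a convex combination of the embeddings of
    its aligned source words."""
    return float(((X_trg - A @ X_src) ** 2).sum())

# Toy usage: 4 target words aligned to 3 source words, 5-dimensional embeddings.
rng = np.random.default_rng(0)
X_src = rng.standard_normal((3, 5))
A = rng.random((4, 3))
A /= A.sum(axis=1, keepdims=True)        # rows sum to one
print(alignment_projection_loss(X_src, A @ X_src, A))  # perfectly projected -> 0.0
```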
\citeauthorShi2015 (2015) also take into account monolingual data by placing cross-lingual constraints on the monolingual representations and propose two alignment-based cross-lingual regularization objectives. The first one treats the alignment matrix as a cross-lingual co-occurrence matrix and factorizes it using the GloVe objective. The second one is similar to the objective by \citeauthorZou2013 (2013) and minimizes the squared distance of the representations of words in two languages weighted by their alignment probabilities.

\citeauthorGardner2015 (2015) extend LSA as translation-invariant LSA to learn cross-lingual word embeddings. They factorize a multilingual co-occurrence matrix with the restriction that it should be invariant to translation, i.e., it should stay the same if multiplied with the respective word or context dictionary.

\citeauthorVyas2016 (2016) propose another method based on matrix factorization that enables learning sparse cross-lingual embeddings. As the sparse cross-lingual embeddings are different from the monolingual embeddings , we diverge slightly from our notation and designate them as . They propose two constraints: The first constraint induces monolingual sparse representations from pre-trained monolingual embedding matrices and by factorizing each embedding matrix into two matrices and with an additional constraint on for sparsity:

(32)

To learn bilingual embeddings, they add another constraint based on the alignment matrix that minimizes the reconstruction error between words that were strongly aligned to each other in a parallel corpus:

(33)

The complete optimization then consists of first pre-training monolingual embeddings and with GloVe and in a second step factorizing the monolingual embeddings with the cross-lingual constraint to induce cross-lingual sparse representations and :

(34)
\citeauthorGuo2015 (2015) similarly create a target language word embedding of a source word by taking the average of the embeddings of its translations weighted by their alignment probability with the source word:

(35)

They propagate alignments to out-of-vocabulary (OOV) words using edit distance as an approximation for morphological similarity and set the target word embedding of an OOV source word as the average of the projected vectors of source words that are similar to it based on the edit distance measure:

(36)

where is the target language word embedding of a source word as created above, , and is set empirically to .
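A minimal sketch of this projection step and the OOV heuristic is given below; difflib's similarity ratio stands in for the edit-distance criterion, and the threshold and function names are illustrative assumptions rather than the values used in the original work.

```python
import numpy as np
from difflib import SequenceMatcher  # stand-in for an edit-distance-based similarity

def project_by_alignment(align_probs, trg_embeddings):
    """Target-language embedding of a source word: the average of its
    translations' embeddings weighted by alignment probability.
    align_probs: dict target_word -> alignment probability."""
    vecs = np.array([trg_embeddings[t] for t in align_probs])
    weights = np.array(list(align_probs.values()))
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

def project_oov(oov_word, projected, threshold=0.6):
    """OOV source word: average the projected vectors of source words whose
    surface form is sufficiently similar (morphological-similarity proxy)."""
    similar = [v for w, v in projected.items()
               if SequenceMatcher(None, oov_word, w).ratio() >= threshold]
    return np.mean(similar, axis=0) if similar else None

# Toy usage.
trg_emb = {"haus": np.array([1.0, 0.0]), "gebäude": np.array([0.8, 0.2])}
projected = {"house": project_by_alignment({"haus": 0.9, "gebäude": 0.1}, trg_emb)}
print(projected["house"], project_oov("houses", projected))
```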

Compositional sentence model
\citeauthorHermann2013 (2013) train a model to bring the sentence representations of aligned sentences and in source and target language respectively close to each other. The representation of a sentence is the sum of the embeddings of its words:

(37)

They seek to minimize the distance between aligned sentences and :

(38)

They optimize this distance using MML by learning to bring aligned sentences closer together than randomly sampled negative examples:

(39)

where is the number of negative examples. In addition, they use an regularization term for each language so that the final loss they optimize is the following:

(40)

Note that compared to most previous approaches, there is no dedicated monolingual objective and all loss terms are optimized jointly. Note that in this case, the norm is applied to representations , which are computed as the difference of sentence representations.

This regularization term approximates minimizing the mean squared error between the pair-wise interacting source and target words, in a way similar to \citeauthorGouws2015 (2015). To see this, note that we minimize the squared error between source and target representations, this time not with regard to word embeddings but with regard to sentence representations. As we saw, these sentence representations are just the sums of their constituent word embeddings. In the limit of infinite data, we therefore implicitly optimize over word representations.
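A minimal sketch of this noise-contrastive objective on additive sentence representations is shown below; the embedding lookups, the margin value, and the shared toy vocabulary are illustrative assumptions.

```python
import numpy as np

def compose(sentence, emb):
    """Additive sentence representation: the sum of the word embeddings."""
    return np.sum([emb[w] for w in sentence], axis=0)

def margin_loss(src_sents, trg_sents, neg_trg_sents, emb_src, emb_trg, margin=1.0):
    """Hinge loss: an aligned sentence pair should be closer (in squared
    distance) than a pair with a randomly sampled target sentence, by `margin`."""
    loss = 0.0
    for src, trg, neg in zip(src_sents, trg_sents, neg_trg_sents):
        a, b, n = compose(src, emb_src), compose(trg, emb_trg), compose(neg, emb_trg)
        loss += max(0.0, margin + ((a - b) ** 2).sum() - ((a - n) ** 2).sum())
    return loss

# Toy usage with a shared two-dimensional toy vocabulary.
emb = {"the": np.array([0.1, 0.2]), "house": np.array([1.0, 0.0]),
       "das": np.array([0.1, 0.2]), "haus": np.array([1.0, 0.0]),
       "auto": np.array([0.0, 1.0])}
print(margin_loss([["the", "house"]], [["das", "haus"]], [["das", "auto"]], emb, emb))
```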

\citeauthorHermann2014 (2014) extend this approach to documents by applying the composition and objective function recursively to compose sentences into documents. They additionally propose a non-linear composition function based on bigram pairs, which outperforms simple addition on large training datasets but underperforms it on smaller data:

(41)
\citeauthorSoyer2015 (2015) augment this model with a monolingual objective that operates on the phrase level. The objective uses MML and is based on the assumption that phrases are typically more similar to their sub-phrases than to randomly sampled phrases:

(42)

where is a margin, is a phrase of length sampled from a sentence, is a sub-phrase of of length , and is a phrase sampled from a random sentence. The additional loss terms are meant to reduce the influence of the margin as a hyperparameter and to compensate for the differences in phrase and sub-phrase length.

Bilingual autoencoder

Instead of minimizing the distance between two sentence representations in different languages, \citeauthorLauly2013 (2013) aim to reconstruct the target sentence from the original source sentence. Analogously to \citeauthorHermann2013 (2013), they also encode a sentence as the sum of its word embeddings. They then train an auto-encoder with language-specific encoder and decoder layers and hierarchical softmax to reconstruct from each sentence the sentence itself and its translation. In this case, the encoder parameters are the word embedding matrices and , while the decoder parameters are transformation matrices that project the encoded representation to the output language space. The loss they optimize can be written as follows:

(43)

where denotes the loss for reconstructing from a sentence in language to a sentence in language . Aligned sentences are sampled from parallel text and all losses are optimized jointly.

\citeauthorChandar2014 (2014) use a binary BOW instead of the hierarchical softmax. To address the increase in complexity due to the higher dimensionality of the BOW, they propose to merge the bags-of-words in a mini-batch into a single BOW and to perform updates based on this merged bag-of-words. They also add a term to the objective function that encourages correlation between the source and target sentence representations by summing the scalar correlations between all dimensions of the two vectors.

Bilingual skip-gram

Several authors propose extensions of the monolingual skip-gram with negative sampling (SGNS) model to learn cross-lingual embeddings. We show their similarities and differences in Table 4. All of these models jointly optimize monolingual SGNS losses for each language together with one or more cross-lingual regularization terms:

(44)

Another commonality is that these models do not require word alignments of aligned sentences. Instead, they make different assumptions about the alignment of the data.

Model Alignment model Monolingual losses Cross-lingual regularizer
BilBOWA [Gouws2015] Uniform
Trans-gram [Coulmance2015] Uniform
BiSkip [Luong2015b] Monotonic
Table 4: A comparison of similarities and differences of the three bilingual skip-gram variants.

Bilingual Bag-of-Words without Word Alignments [<]BilBOWA;>Gouws2015 assumes that each word in a source sentence is aligned with every word in the target sentence. If we knew the word alignments, we would try to bring the embeddings of aligned words in source and target sentences close together. Instead, under a uniform alignment model, which matches the intuition behind the simplest (lexical) word alignment model, IBM Model 1 [Brown:1993cl], we try to bring the averaged representations of aligned sentences close together. In other words, we use the means of the word embeddings in a sentence as the sentence representations and seek to minimize the distance between aligned sentence representations:

(45)
(46)

Note that this regularization term is very similar to the objective used in the compositional sentence model [Hermann2013, Equations 37 and 38]; the only difference is that we use the mean rather than the sum of word embeddings as sentence representations.

Trans-gram [Coulmance2015] also assumes uniform alignment but uses the SGNS objective as cross-lingual regularization term. Recall that skip-gram with negative sampling seeks to train a model to distinguish context words from negative samples drawn from a noise distribution based on a center word. In the cross-lingual case, we aim to predict words in the aligned target language sentence based on words in the source sentence. Under uniform alignment, we aim to predict all words in the target sentence based on each word in the source sentence:

(47)

where is computed via negative sampling as in Equation 9.

BiSkip [Luong2015b] uses the same cross-lingual regularization terms as Trans-gram, but only aims to predict monotonically aligned target language words: Each source word at position in the source sentence is aligned to the target word at position in the target sentence . In practice, all these bilingual skip-gram models are trained by sampling a pair of aligned sentences from a parallel corpus and minimizing for the source and target language sentence the respective loss terms.
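The data side of the uniform-alignment assumption shared by BilBOWA and Trans-gram can be sketched in one line: every source token is paired with every token of the aligned target sentence, and the resulting pairs are fed to the usual SGNS objective (the function name is ours).

```python
def crosslingual_pairs_uniform(src_sent, trg_sent):
    """Uniform alignment: pair each source token with every target token."""
    return [(s, t) for s in src_sent for t in trg_sent]

print(crosslingual_pairs_uniform(["the", "red", "house"], ["das", "rote", "haus"]))
```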

In a similar vein, \citeauthorPham2015 (2015) propose an extension of paragraph vectors [Le2014a] to the multilingual setting by forcing aligned sentences of different languages to share the same vector representation.

Other sentence-level approaches
\citeauthorLevy2017 (2017) use another sentence-level bilingual signal: IDs of the aligned sentence pairs in a parallel corpus. Their model provides a strong baseline for cross-lingual embeddings that is inspired by the Dice aligner commonly used for producing word alignments for MT. Observing that sentence IDs are already a powerful bilingual signal, they propose to apply SGNS to the word-sentence ID matrix. They show that this method can be seen as a generalization of the Dice Coefficient.
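For intuition, a minimal sketch of the Dice coefficient computed directly from sentence IDs is given below; it illustrates the signal this baseline builds on rather than the SGNS factorization itself, and all names are ours.

```python
from collections import defaultdict

def dice_scores(parallel_corpus):
    """Dice coefficient between source and target words based on the IDs of
    the sentences they occur in: 2 * |shared sentence IDs| / (|IDs of s| + |IDs of t|)."""
    src_ids, trg_ids = defaultdict(set), defaultdict(set)
    for sent_id, (src_sent, trg_sent) in enumerate(parallel_corpus):
        for w in src_sent:
            src_ids[w].add(sent_id)
        for w in trg_sent:
            trg_ids[w].add(sent_id)
    return {(s, t): 2 * len(src_ids[s] & trg_ids[t]) / (len(src_ids[s]) + len(trg_ids[t]))
            for s in src_ids for t in trg_ids}

corpus = [(["the", "house"], ["das", "haus"]), (["the", "car"], ["das", "auto"])]
print(dice_scores(corpus)[("house", "haus")])  # -> 1.0
```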

\citeauthorRajendran2016 (2016) propose a method that exploits the idea of using pivot languages, also tackled in previous work, e.g., by \citeauthorShezaf2010 (2010). The model requires parallel data between each language and a pivot language and is able to learn a shared embedding space for two languages without any direct alignment signal, as the alignment is implicitly learned via each language's alignment with the pivot language. The model optimizes a correlation term with neural network encoders and decoders that is similar to the objective of the CCA-based approaches [Faruqui2014, Lu:2015naacl]. We discuss the importance of pivoting for learning multilingual word embeddings later in Section 9.

In practice, sentence-level supervision is a lot more expensive to obtain than word-level supervision, which is available in the form of bilingual lexicons even for many low-resource languages. For this reason, recent work has largely focused on word-level supervised approaches for learning cross-lingual embeddings. Nevertheless, word-level supervision only enables learning cross-lingual word representations, while for more complex tasks we are often interested in cross-lingual sentence representations.

Recently, fueled by work on pretrained language models [Howard2018, Devlin2018], there have been several extensions of language models to the massively cross-lingual setting, learning cross-lingual representations for many languages at once. \citeauthorArtetxe2018e \citeyearArtetxe2018e train a BiLSTM encoder with a shared vocabulary on parallel data of many languages. \citeauthorLample2019 \citeyearLample2019 extend the bilingual skip-gram approach to masked language modelling [Devlin2018]: instead of predicting words in the source and target language sentences via skip-gram, they predict randomly masked words in both sentences with a deep language model. Alternatively, their approach can also be trained without parallel data on concatenated monolingual datasets only. Similar to results for word-level alignment-based methods [Soegaard2018], the weak supervision induced by sharing a vocabulary across languages provides an inductive bias that is strong enough to learn useful cross-lingual representations.

7.2 Sentence Alignment with Comparable Data

Grounding language in images

Similarly to approaches based on word-level aligned comparable data, methods that learn cross-lingual representations using sentence alignment with comparable data do so by associating sentences with images [Kadar:2017arxiv]. The associated image captions/annotations can be direct translations of each other, but are not expected to be in general. The images are then used as pivots to induce a shared multimodal embedding space. These approaches typically use Multi30k [Elliott2016], a multilingual extension of the Flickr30k dataset [Young2014], which contains 30k images and 5 English sentence descriptions and their German translations for each image. \citeauthorCalixto2017 (2017) represent images using features from a pre-trained CNN and model sentences using a GRU. They then use MML to assign a higher score to image-description pairs compared to images with a random description. \citeauthorGella2017 (2017) augment this objective with another MML term that also brings the representations of translated descriptions closer together, thus effectively combining learning signals from both visual and textual modality.

Summary

While recent work on word-level methods has almost exclusively focused on mapping-based approaches, language model-based approaches [Artetxe2018e, Lample2019] have started to incorporate parallel resources. Mapping-based approaches, which have been shown to rely on the assumption that the embedding spaces of the two languages are approximately isomorphic, struggle when mapping between a high-resource language and a distant low-resource language [Soegaard2018]. As research increasingly considers this more realistic setting, the additional supervision and context provided by sentence-level alignment may prove to be a valuable resource that is complementary to word-level alignment. Early results in this direction indicate that joint training on parallel corpora yields embeddings that are more isomorphic and less sensitive to hubness than those of mapping-based approaches [Ormazabal2019]. Consequently, we expect a resurgence of interest in sentence-level alignment methods.

8 Document-Level Alignment Models

Models that require parallel document alignment presuppose that sentence-level parallel alignment is also present. Such models thus reduce to the parallel sentence-level alignment methods discussed in the previous section. Comparable document-level alignment, on the other hand, is more appealing as it is often cheaper to obtain. Existing approaches generally use Wikipedia documents, which they either align automatically or which are already theme-aligned, i.e., discuss similar topics.

8.1 Document Alignment with Comparable Data

We divide models using document alignment with comparable data into three types, some of which employ similar general ideas to previously discussed word and sentence-level parallel alignment models:

  • Approaches based on pseudo-bilingual document-aligned corpora automatically construct a pseudo-bilingual corpus containing words from the source and target language by mixing words from aligned documents.

  • Concept-based methods leverage information about the distribution of latent topics or concepts in document-aligned data to represent words.

  • Extensions of sentence-aligned models extend methods using sentence-aligned parallel data to also work without parallel data.

Pseudo-bilingual document-aligned corpora

The approach of \citeauthorVulicMoens2016 (2016) is similar to the pseudo-bilingual corpora approaches discussed in Section 6. In contrast to previous methods, they propose a Merge and Shuffle strategy to merge two aligned documents of different languages into a pseudo-bilingual document. This is done by concatenating the documents and then randomly shuffling them by permuting words. The intuition is that as most methods rely on learning word embeddings based on their context, shuffling the documents will lead to robust bilingual contexts for each word. As the shuffling step is completely random, it might lead to sub-optimal configurations.

For this reason, they propose another strategy for merging the two aligned documents, called Length-Ratio Shuffle. It assumes that the structures of the two documents are similar: words are inserted into the pseudo-bilingual document by alternating between the source and the target document relying on the order in which they appear in their monolingual document and based on the monolingual documents’ length ratio.
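The two merging strategies can be sketched as follows; the interleaving rule in the second function is our reading of the length-ratio idea and is meant as an illustration, not as the exact procedure of the original work.

```python
import random

def merge_and_shuffle(doc_a, doc_b, seed=0):
    """Merge two theme-aligned documents and randomly permute the tokens,
    so that every word ends up with bilingual context."""
    merged = doc_a + doc_b
    random.Random(seed).shuffle(merged)
    return merged

def length_ratio_shuffle(doc_a, doc_b):
    """Interleave the two documents in their original order, taking tokens
    from the longer document proportionally to the length ratio."""
    merged, i, j = [], 0, 0
    ratio = len(doc_a) / max(len(doc_b), 1)
    while i < len(doc_a) or j < len(doc_b):
        if j >= len(doc_b) or (i < len(doc_a) and i < ratio * (j + 1)):
            merged.append(doc_a[i]); i += 1
        else:
            merged.append(doc_b[j]); j += 1
    return merged

print(length_ratio_shuffle(["a1", "a2", "a3", "a4"], ["b1", "b2"]))
```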

Concept-based models

Some methods for learning cross-lingual word embeddings leverage the insight that words in different languages are similar if they are used to talk about or evoke the same multilingual concepts or topics. \citeauthorVulic2013a \citeyearVulic2013a base their method on the cognitive theory of semantic word responses. Their method centers on the intuition that words in source and target language are similar if they are likely to generate similar words as their top semantic word responses. They utilise a probabilistic multilingual topic model again trained on aligned Wikipedia documents to learn and quantify semantic word responses. The embedding of source word is the following vector:

(48)

where represents concatenation and is the probability of given under the induced bilingual topic model. The sparse representations may be turned into dense vectors by factorizing the constructed word-response matrix.

\citeauthorSogaard2015 (2015) propose an approach that relies on the structure of Wikipedia itself. Their method is based on the intuition that similar words are used to describe the same concepts across different languages. Instead of representing every Wikipedia concept with the terms that are used to describe it, they use an inverted index and represent a word by the concepts it is used to describe. As a post-processing step, they perform dimensionality reduction on the resulting word-concept matrix. A very similar model by [Vulic:2011acl] uses a bilingual topic model to perform the dimensionality reduction step and learns a shared cross-lingual topical space.
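A minimal sketch of the inverted-index idea is shown below: words are represented by the aligned Wikipedia concepts whose articles mention them, and a truncated SVD serves as a generic stand-in for the dimensionality-reduction step (the exact technique is not specified above).

```python
import numpy as np

def inverted_index_embeddings(concept_docs, vocab, dim=2):
    """Build a word-by-concept indicator matrix (a word is linked to every
    cross-lingually shared concept whose article mentions it) and reduce it
    with a truncated SVD. concept_docs: concept ID -> list of tokens."""
    concepts = sorted(concept_docs)
    M = np.zeros((len(vocab), len(concepts)))
    for j, concept in enumerate(concepts):
        for w in concept_docs[concept]:
            if w in vocab:
                M[vocab[w], j] = 1.0
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim] * S[:dim]

# Toy usage: translation pairs end up with identical rows.
vocab = {"house": 0, "haus": 1, "car": 2, "auto": 3}
docs = {"Building": ["house", "haus"], "Vehicle": ["car", "auto"]}
print(inverted_index_embeddings(docs, vocab))
```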

Extensions of sentence-alignment models
\citeauthorMogadala2016 (2016) extend the approach of \citeauthorPham2015 (2015) to also work without parallel data and adjust the regularization term based on the nature of the training corpus. Similar to previous work [Hermann2013, Gouws2015], they use the mean of the word embeddings of a sentence as the sentence representation and constrain these to be close together. In addition, they propose to constrain the sentence paragraph vectors of aligned sentences to be close to each other. These vectors are learned via paragraph vectors [Le2014a] for each sentence and stored in embedding matrices. The complete regularizer then uses elastic net regularization to combine both terms: