
A Survey of Cross-lingual Word Embedding Models

Sebastian Ruder sebastian@ruder.io
Insight Research Centre, National University of Ireland, Galway, Ireland
Aylien Ltd., Dublin, Ireland

Ivan Vulić iv250@cam.ac.uk
Language Technology Lab, University of Cambridge, UK

Anders Søgaard soegaard@di.ku.dk
University of Copenhagen, Copenhagen, Denmark
Abstract

Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.


1 Introduction

In recent years, monolingual representations of words, so-called word embeddings (?, ?), have captured the imagination of the natural language processing (NLP) community. In addition, reasoning about and applying NLP to multilingual scenarios is receiving increasing interest, fueled by, among other things, the availability of multilingual benchmarks (?, ?, ?). The need to represent meaning and transfer knowledge in cross-lingual applications has given rise to cross-lingual embedding models, which learn cross-lingual representations of words in a joint embedding space, as illustrated in Figure 1.

Cross-lingual word embeddings are appealing due to two main factors: First, they enable cross-lingual and multilingual semantics, that is, reasoning about word meaning in multilingual contexts. An intriguing goal in itself, this also enables the computation of cross-lingual word similarities, which is relevant for many tasks and applications, such as bilingual lexicon induction or cross-lingual information retrieval. Second, cross-lingual word embeddings enable knowledge transfer between languages, most importantly between resource-rich and resource-lean languages. This duality is illustrated by common evaluation tasks for cross-lingual embeddings, which either evaluate their capacity to capture similarity or their ability to serve as features for cross-lingual transfer, as discussed in Section 9.

Figure 1: A shared embedding space between two languages (?)

Many models for learning cross-lingual embeddings have been proposed in recent years. In this survey, we will give a comprehensive overview of existing cross-lingual embedding models. One of the main goals of this survey is to show the similarities and differences between these approaches. To facilitate this, we first introduce a common notation and terminology in Section 2. Over the course of the survey, we then show that existing cross-lingual embedding models can be seen as optimizing very similar objectives, where the only source of variation is due to the data used and the monolingual and regularization objectives employed. As many cross-lingual embedding models are inspired by monolingual models, we introduce the most commonly used monolingual embedding models in Section 3. We then motivate and introduce one of the main contributions of this survey, a typology of cross-lingual embedding models, in Section 4. The typology is based on the main differentiating aspect of cross-lingual embedding models: the nature of the data they require, in particular the type of alignment and whether the data is parallel or comparable. The typology allows us to outline similarities and differences more concisely, and also contrasts the focal points of existing research with fruitful directions that have so far gone mostly unexplored.

Since the idea of cross-lingual representations of words pre-dates word embeddings, we provide a brief history of cross-lingual word representations in Section 5. Subsequent sections are dedicated to each type of alignment. We discuss cross-lingual word embedding algorithms that rely on word-level alignments in Section 6. Such methods can be further divided into mapping-based approaches, approaches based on pseudo-bilingual corpora, and joint methods. We show that these approaches, modulo optimization strategies, are equivalent. We then discuss approaches that rely on sentence-level alignments in Section 7, and models that require document-level alignments in Section 8. We subsequently provide an extensive discussion of the tasks, benchmarks, and challenges of the evaluation of cross-lingual embedding models in Section 9. In Section 10, we furthermore describe how many bilingual approaches that deal with a pair of languages can be extended to the multilingual setting. General challenges in learning cross-lingual word representations are outlined in Section 11. Finally, we present our conclusion in Section 12.

This survey makes the following contributions:

  1. It proposes a general typology that characterizes the differentiating features of cross-lingual word embedding models and provides a compact overview of these models.

  2. It standardizes terminology and notation and shows that many cross-lingual word embedding models can be cast as optimizing nearly the same objective functions.

  3. It provides a proof that connects the three types of word-level alignment models and shows that these models are optimizing roughly the same objective.

  4. It critically examines the standard ways of evaluating cross-lingual embedding models.

  5. It describes multilingual extensions for the most common types of cross-lingual embedding models.

  6. It outlines outstanding challenges for learning cross-lingual word embeddings and provides suggestions for fruitful and unexplored research directions.

Disclaimer

Neural Machine Translation (NMT) is another area that has received increasing interest. NMT approaches implicitly learn a shared cross-lingual embedding space by optimizing for the Machine Translation (MT) objective, whereas we will focus on models that explicitly learn cross-lingual word representations throughout this survey. These methods generally do so at a much lower cost than MT and, in terms of speed and efficiency, can be considered to be to MT what word embedding models (?, ?) are to language modeling.

2 Notation and Terminology

Let $\mathbf{X}^{\ell} \in \mathbb{R}^{|V^{\ell}| \times d}$ be a word embedding matrix that is learned for the $\ell$-th of $L$ languages, where $V^{\ell}$ is the corresponding vocabulary and $d$ is the dimensionality of the word embeddings. We will furthermore refer to $\mathbf{X}^{\ell}_{i}$, that is, the word embedding of the $i$-th word in language $\ell$, with the shorthand $\mathbf{x}^{\ell}_i$ or $\mathbf{x}_i$ if the language is unambiguous. We will refer to the word corresponding to the $i$-th word embedding $\mathbf{x}_i$ as $w_i$. For calculating co-occurrences, each word $w_i$ is associated with context words $c_j$. Some approaches differentiate between the representations of words and words that appear in the context of other words. We will denote the embeddings of context words as $\tilde{\mathbf{x}}$. Most approaches only deal with two languages, a source language $s$ and a target language $t$.

Some approaches learn a matrix that can be used to transform the word embedding matrix $\mathbf{X}^s$ of the source language $s$ to that of the target language $t$. We will designate such a matrix by $\mathbf{W}^{s \to t}$, and by $\mathbf{W}$ if the language pairing is unambiguous. These approaches often use $n$ source words $w^s_1, \ldots, w^s_n$ and their translations $w^t_1, \ldots, w^t_n$ as seed words. In addition, we will use $\tau$ as a function that maps a source word $w^s_i$ to its translation $w^t_i$: $\tau(w^s_i) = w^t_i$. Approaches that learn a transformation matrix are usually referred to as offline or mapping methods. As one of the goals of this survey is to standardize nomenclature, we will use the term mapping in the following to designate such approaches.

Some approaches require a monolingual word-word co-occurrence matrix $\mathbf{C}^{\ell}$ in language $\ell$. In such a matrix, every row corresponds to a word $w_i$ and every column corresponds to a context word $c_j$. $C_{ij}$ then captures the number of times word $w_i$ occurs with context word $c_j$. In a cross-lingual context, we obtain a matrix of alignment counts $\mathbf{A}^{s \to t} \in \mathbb{R}^{|V^t| \times |V^s|}$, where each element $A_{ij}$ captures the number of times the $i$-th word in language $t$ was aligned with the $j$-th word in language $s$, with each row normalized to sum to $1$.

Finally, as some approaches rely on pairs of aligned sentences, we designate $\mathrm{sent}^s_1, \ldots, \mathrm{sent}^s_n$ as sentences in the source language $s$ with representations $\mathbf{y}^s_1, \ldots, \mathbf{y}^s_n$, and analogously refer to their aligned sentences in the target language $t$ as $\mathrm{sent}^t_1, \ldots, \mathrm{sent}^t_n$ with representations $\mathbf{y}^t_1, \ldots, \mathbf{y}^t_n$. We adopt an analogous notation for representations obtained by approaches based on alignments of documents in $s$ and $t$: $\mathbf{z}^s_1, \ldots, \mathbf{z}^s_n$ and $\mathbf{z}^t_1, \ldots, \mathbf{z}^t_n$.

Different notations make similar approaches appear different. Using the same notation across our survey facilitates recognizing similarities between the various cross-lingual word embedding models. Specifically, we intend to demonstrate that cross-lingual word embedding models are trained by minimizing roughly the same objective function, and that differences in objective are unlikely to explain the observed performance differences (?).

The (class of) objective function(s) minimized by most cross-lingual word embedding methods (if not all) can be formulated as follows:

$J = \mathcal{L}_1 + \ldots + \mathcal{L}_L + \Omega$ (1)

where $\mathcal{L}_\ell$ is the monolingual loss of the $\ell$-th language and $\Omega$ is a regularization term. A similar loss was also defined by ? (2016). As recent work (?, ?) shows that many monolingual objectives are in fact equivalent, one of the main contributions of this survey is to condense the difference between approaches into a regularization term and to detail the assumptions that underlie different regularization terms. The monolingual objectives are optimized by training one of several monolingual embedding models on a monolingual corpus. These models are outlined in the next section.

3 Monolingual Embedding Models

A large majority of cross-lingual embedding models take inspiration from and extend monolingual word embedding models to bilingual settings, or explicitly leverage monolingually trained models. As an important preliminary, we thus briefly introduce monolingual embedding models that have been used in the cross-lingual embeddings literature.

LSA

Latent Semantic Analysis (LSA) (?) has been one of the most widely used methods for learning dense word representations. Given a sparse word-word co-occurrence matrix $\mathbf{C}$ obtained from a corpus, we replace every entry in $\mathbf{C}$ with its pointwise mutual information (PMI) (?) score, thus yielding a PMI matrix $\mathbf{P}$. We factorize $\mathbf{P}$ using singular value decomposition (SVD), which decomposes $\mathbf{P}$ into the product of three matrices:

$\mathbf{P} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$ (2)

where $\mathbf{U}$ and $\mathbf{V}$ are in column orthonormal form and $\boldsymbol{\Sigma}$ is a diagonal matrix of singular values. We subsequently obtain the word embedding matrix $\mathbf{X}$ by reducing the word representations to dimensionality $k$ the following way:

$\mathbf{X} = \mathbf{U}_k \boldsymbol{\Sigma}_k$ (3)

where $\boldsymbol{\Sigma}_k$ is the diagonal matrix containing the top $k$ singular values and $\mathbf{U}_k$ is obtained by selecting the corresponding columns from $\mathbf{U}$.
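
A minimal sketch of this pipeline with NumPy is shown below: it builds a (positive) PMI matrix from a toy co-occurrence matrix and factorizes it with a truncated SVD. The counts, the dimensionality k, and the clipping of negative PMI values to zero are illustrative choices, not part of any reference implementation.

```python
import numpy as np

C = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])                      # word-by-context co-occurrence counts

total = C.sum()
p_w = C.sum(axis=1, keepdims=True) / total        # P(word)
p_c = C.sum(axis=0, keepdims=True) / total        # P(context)
p_wc = C / total                                  # P(word, context)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
P = np.maximum(pmi, 0.0)                          # positive PMI; zero counts map to 0

U, S, Vt = np.linalg.svd(P)                       # P = U diag(S) V^T
k = 2
X = U[:, :k] * S[:k]                              # k-dimensional word embeddings
print(X)
```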

Max-margin hinge loss (MMHL)

? (2008) learn word embeddings by training a model to output a higher score for a correct word sequence than for an incorrect one. For this purpose, they use a max-margin hinge loss:

$J = \sum_{x \in \mathcal{S}} \sum_{w \in V} \max(0, 1 - f_\theta(x) + f_\theta(x^{(w)}))$ (4)

where $\mathcal{S}$ is the set of sentence windows of size $n$ in the training data, $f_\theta$ is a neural network that outputs a score given an input sentence window $x$, and $x^{(w)}$ is a sentence window where the middle word has been replaced by the word $w$.

SGNS

Skip-gram with negative sampling (SGNS) (?) is arguably the most popular method to learn monolingual word embeddings due to its training efficiency and robustness (?). SGNS approximates a language model but focuses on learning efficient word representations rather than accurately modeling word probabilities. It induces representations that are good at predicting surrounding context words given a target word $w_t$. The model is depicted in Figure 2. To this end, it minimizes the negative log-likelihood of the training data under the following skip-gram objective:

$J_{\mathrm{SGNS}} = - \frac{1}{T} \sum_{t=1}^{T} \sum_{-C \leq j \leq C, j \neq 0} \log P(w_{t+j} \mid w_t)$ (5)

where $T$ is the number of words in the training corpus and $C$ is the size of the context window. $P(w_{t+j} \mid w_t)$ is computed using the softmax function:

$P(w_{t+j} \mid w_t) = \frac{\exp(\tilde{\mathbf{x}}_{w_{t+j}}^\top \mathbf{x}_{w_t})}{\sum_{i=1}^{|V|} \exp(\tilde{\mathbf{x}}_{w_i}^\top \mathbf{x}_{w_t})}$ (6)

where $\mathbf{x}_w$ and $\tilde{\mathbf{x}}_w$ are the word and context word embeddings of word $w$ respectively.

Figure 2: The SGNS monolingual embedding model (?)
Figure 3: The CBOW monolingual embedding model (?)

As the partition function in the denominator of the softmax is expensive to compute, SGNS uses Negative Sampling, which approximates the softmax to make it computationally more efficient.

Negative sampling is a simplification of Noise Contrastive Estimation (?), which was applied to language modeling by ? (2012). Similar to noise contrastive estimation, negative sampling trains the model to distinguish a target word $w_t$ from negative samples drawn from a noise distribution $P_n$. In this regard, it is similar to MMHL as defined above, which ranks true sentences above noisy sentences. Negative sampling is defined as follows:

$P(w_{t+j} \mid w_t) = \log \sigma(\tilde{\mathbf{x}}_{w_{t+j}}^\top \mathbf{x}_{w_t}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n} \left[ \log \sigma(- \tilde{\mathbf{x}}_{w_i}^\top \mathbf{x}_{w_t}) \right]$ (7)

where $\sigma$ is the sigmoid function and $k$ is the number of negative samples. The distribution $P_n$ is empirically set to the unigram distribution raised to the $3/4$ power.
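
The following toy sketch illustrates the negative-sampling update of Equation 7 with plain NumPy and SGD. The corpus and hyper-parameters are invented for illustration; real implementations (e.g. word2vec) additionally use frequent-word subsampling, pre-computed sampling tables, and other optimizations.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
ids = [w2i[w] for w in corpus]

d, k, C, lr = 16, 5, 2, 0.05                 # dim, negatives, window, learning rate
X = 0.01 * rng.standard_normal((len(vocab), d))        # word (target) embeddings
X_ctx = 0.01 * rng.standard_normal((len(vocab), d))    # context embeddings

counts = np.bincount(ids, minlength=len(vocab)).astype(float)
noise = counts ** 0.75                       # unigram distribution to the 3/4 power
noise /= noise.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(50):
    for t, w in enumerate(ids):
        for j in range(max(0, t - C), min(len(ids), t + C + 1)):
            if j == t:
                continue
            c = ids[j]
            xw, xc = X[w].copy(), X_ctx[c].copy()
            # positive pair: push sigma(x_c . x_w) towards 1
            g = 1.0 - sigmoid(xc @ xw)
            X[w] += lr * g * xc
            X_ctx[c] += lr * g * xw
            # negative samples: push sigma(x_n . x_w) towards 0
            for n in rng.choice(len(vocab), size=k, p=noise):
                xn = X_ctx[n].copy()
                g_neg = -sigmoid(xn @ xw)
                X[w] += lr * g_neg * xn
                X_ctx[n] += lr * g_neg * xw

print(X[w2i["cat"]][:4])                     # first dimensions of the "cat" embedding
```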

CBOW

Continuous bag-of-words (CBOW) can be seen as the inverse of SGNS: the model receives as input a window of $C$ context words and seeks to predict the target word $w_t$ by minimizing the CBOW objective:

$J_{\mathrm{CBOW}} = - \frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-C}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+C})$ (8)

$P(w_t \mid w_{t-C}, \ldots, w_{t+C}) = \frac{\exp(\tilde{\mathbf{x}}_{w_t}^\top \bar{\mathbf{x}}_t)}{\sum_{i=1}^{|V|} \exp(\tilde{\mathbf{x}}_{w_i}^\top \bar{\mathbf{x}}_t)}$ (9)

where $\bar{\mathbf{x}}_t$ is the sum of the word embeddings of the words $w_{t-C}, \ldots, w_{t+C}$, i.e. $\bar{\mathbf{x}}_t = \sum_{-C \leq j \leq C, j \neq 0} \mathbf{x}_{w_{t+j}}$. This model is depicted in Figure 3.

GloVe

Global vectors (GloVe) (?) allows us to learn word representations via matrix factorization. GloVe minimizes the difference between the dot product of the embeddings of a word $w_i$ and its context word $w_j$ and the logarithm of their number of co-occurrences within a certain window size (GloVe favors slightly larger window sizes, up to 10 words to the right and to the left of the target word, compared to SGNS (?)):

$J_{\mathrm{GloVe}} = \sum_{i,j=1}^{|V|} f(C_{ij}) \left( \mathbf{x}_i^\top \tilde{\mathbf{x}}_j + b_i + \tilde{b}_j - \log C_{ij} \right)^2$ (10)

where $b_i$ and $\tilde{b}_j$ are the biases corresponding to word $w_i$ and its context word $w_j$, $C_{ij}$ captures the number of times word $w_i$ occurs with context word $w_j$, and $f(\cdot)$ is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.
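
As a brief illustration, the sketch below evaluates the GloVe objective of Equation 10 on a random toy co-occurrence matrix. The weighting function uses commonly reported GloVe defaults (x_max = 100, exponent 0.75); these are assumptions for the example, not requirements of the objective itself.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8
Cooc = rng.integers(0, 20, size=(V, V)).astype(float)   # toy co-occurrence counts

X = 0.1 * rng.standard_normal((V, d))       # word embeddings
X_ctx = 0.1 * rng.standard_normal((V, d))   # context embeddings
b = np.zeros(V)                             # word biases
b_ctx = np.zeros(V)                         # context word biases

def weight(c, x_max=100.0, alpha=0.75):
    return min((c / x_max) ** alpha, 1.0)

def glove_loss():
    loss = 0.0
    for i in range(V):
        for j in range(V):
            if Cooc[i, j] > 0:
                diff = X[i] @ X_ctx[j] + b[i] + b_ctx[j] - np.log(Cooc[i, j])
                loss += weight(Cooc[i, j]) * diff ** 2
    return loss

print(glove_loss())
```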

4 Cross-Lingual Word Embedding Models: Typology

Recent work on cross-lingual embedding models suggests that the actual choice of bilingual signal, that is, the data a model requires is more important for the final model performance than the actual underlying architecture (?). Similar conclusions can be drawn from empirical work in comparing different cross-lingual embedding models (?). In particular, large differences between models stem from their data requirements, while other fine-grained differences are artifacts of the chosen architecture, hyper-parameters, and additional tricks and fine-tuning employed. This directly mirrors the argument raised by ? (2015) regarding monolingual embedding models: they observe that the architecture is less important as long as the models are trained under identical conditions on the same type (and amount) of data.

We therefore base our typology on the data requirements of the cross-lingual embedding models as this accounts for much of the variation in performance. In particular, methods differ with regard to the data they employ along the following two dimensions:

  1. Type of alignment: Models use different types of bilingual signals, which introduce stronger or weaker supervision.

  2. Comparability: Models require either parallel data sources, that is, exact translations in different languages or comparable data that is only similar in some way.

            Parallel        Comparable
Word        Dictionaries    Images
Sentence    Translations    Captions
Document    -               Wikipedia
Table 1: Nature and alignment level of bilingual data sources required by cross-lingual embedding models.
Figure 4: Examples for the nature and type of alignment of data sources. Par.: parallel. Comp.: comparable. Doc.: document. From left to right: (a) word-level parallel alignment in the form of a bilingual lexicon, (b) word-level comparable alignment using images obtained with Google search queries, (c) sentence-level parallel alignment with translations, (d) sentence-level comparable alignment using translations of several image captions, and (e) document-level comparable alignment using similar documents.

In particular, there are three different types of alignments that are possible, which are required by different methods. We will discuss the typical data sources for both parallel and comparable data based on the following alignment signals:

  1. Word alignment: Most approaches use parallel word-aligned data in the form of a bilingual or cross-lingual dictionary with pairs of translations between words in different languages. Comparable word-aligned data, even though more plentiful, has been leveraged less often and involves other modalities such as images.

  2. Sentence alignment: Sentence alignment requires a parallel corpus that is commonly used for MT. Models typically use the Europarl corpus (?) consisting of sentence-aligned text from the proceedings of the European parliament, the standard dataset for training MT models. Some work in addition uses available word-level alignment information.

    There has been some work on extracting parallel data from comparable corpora (?), but no one has so far trained cross-lingual word embeddings on such data. In addition, it is not clear whether there is a clear benefit to using such data, even for truly low-resource settings. Comparable data with sentence alignment may again leverage another modality, such as captions of the same image or similar images in different languages, which are not translations of each other.

  3. Document alignment: Parallel document-aligned data requires documents in different languages that are translations of each other. This is rare, as document-level translation presupposes that sentences are also aligned. Comparable document-aligned data is thus more common and can occur in the form of documents that are topic-aligned (e.g. Wikipedia) or label/class-aligned (e.g. sentiment analysis and multi-class classification datasets).

We summarize the different types of data required by cross-lingual embedding models along these two dimensions in Table 1 and provide examples for each in Figure 4. Over the course of this survey we will show that models that use a particular type of data are mostly variations of the same or similar architectures. We present our complete typology of cross-lingual embedding models in Table 2, aiming to provide an exhaustive overview by classifying each model (we are aware of) into one of the corresponding model types. We also provide a more detailed overview of the monolingual objectives and regularization terms used by every approach towards the end of this survey in Table 4.

Word-level alignment, parallel data: Mikolov et al. (2013), Dinu et al. (2015), Lazaridou et al. (2015), Xing et al. (2015), Zhang et al. (2016), Artetxe et al. (2016), Smith et al. (2016), Vulić and Korhonen (2016), Artetxe et al. (2017), Hauer et al. (2017), Mrkšić et al. (2017), Faruqui and Dyer (2014), Lu et al. (2015), Ammar et al. (2016), Xiao and Guo (2014), Gouws and Søgaard (2015), Duong et al. (2016), Adams et al. (2017), Klementiev et al. (2012), Kočiský et al. (2014)

Word-level alignment, comparable data: Bergsma and Van Durme (2011), Vulić et al. (2016), Kiela et al. (2015), Vulić et al. (2016), Gouws and Søgaard (2015), Duong et al. (2015)

Sentence-level alignment, parallel data: Zou et al. (2013), Shi et al. (2015), Gardner et al. (2015), Vyas and Carpuat (2016), Guo et al. (2015), Hermann and Blunsom (2013), Hermann and Blunsom (2014), Soyer et al. (2015), Lauly et al. (2013), Chandar et al. (2014), Gouws et al. (2015), Luong et al. (2015), Coulmance et al. (2015), Pham et al. (2015), Levy et al. (2017), Rajendran et al. (2016)

Sentence-level alignment, comparable data: Calixto et al. (2017), Gella et al. (2017)

Document-level alignment, comparable data: Vulić and Moens (2016), Vulić and Moens (2013), Vulić and Moens (2014), Søgaard et al. (2015), Mogadala and Rettinger (2016)

Table 2: Cross-lingual embedding models ordered by data requirements.

5 A Brief History of Cross-Lingual Word Representations

Given the consensus, supported by empirical outcomes (also discussed in this survey in Section 9), that cross-lingual embedding models are the current state-of-the-art representation architectures for enabling and supporting cross-lingual NLP, they are naturally the focus of this survey. However, it is essential to deepen our understanding of the most recent methodology by providing a brief overview of, and acknowledging, its historical lineage. Simply put, whereas the algorithmic design has been substantially revamped with the advent of cross-lingual embeddings, e.g., by replacing probabilistic graphical models with advanced neural-network-inspired architectures and vector space models, plenty of high-level ideas persist from the “pre-embedding times”. These include cross-lingual representation learning relying on exactly the same data requirements (e.g., seed lexicons, parallel data, document-aligned data), the idea of language-independent word/feature representations for cross-lingual NLP, and learning from limited bilingual supervision, among others.

Language-independent methods have been proposed for decades, many of which rely on abstract linguistic labels instead of lexical features (?, ?). This is also the strategy used in early work on so-called delexicalized cross-lingual and domain transfer (?, ?, ?, ?), as well as in work on inducing cross-lingual word clusters (?, ?) and cross-lingual word embeddings relying on syntactic/POS contexts (?, ?). (Along the same lines, the recent initiative to provide cross-linguistically consistent sets of such labels, e.g., Universal Dependencies (?), facilitates cross-lingual transfer and offers further support to the induction of word embeddings across languages (?, ?).) The ability to represent lexical items from two different languages in a shared cross-lingual space takes NLP research on cross-lingual knowledge transfer one step further: it now provides fine-grained word-level links between languages which are exploited in, e.g., transferred statistical parsing (?, ?) or language understanding systems (?).

Similar to our typology of cross-lingual word embedding models outlined in Table 2 based on the bilingual data requirements from Table 1, one can also arrange older cross-lingual representation architectures into similar categories. A traditional approach to cross-lingual vector space induction was based on high-dimensional context-counting vectors, where each dimension encodes the (weighted) co-occurrences with a specific context word in each of the languages. The vectors are then mapped into a single cross-lingual space using a seed bilingual dictionary containing paired context words from both sides (?, ?, ?, ?, inter alia). This approach is closely aligned with the cross-lingual word embedding models described in Section 6. Even the extensions of the core idea leading to very recent self-learning iterative word embedding models which require only inexpensive and small seed bilingual dictionaries (?, ?) have roots in the same principled bootstrapping idea developed for the traditional context-counting approaches (?, ?). The idea of CCA-based word embedding learning (see later in Section 6) (?, ?) was also tackled in prior work (?). The work of Haghighi et al. additionally discusses the idea of combining orthographic subword features with distributional signatures for cross-lingual representation learning; this idea has recently resurfaced within the cross-lingual word embedding framework (?), now with much better performance.

A large body of work on multilingual probabilistic topic modeling (?, ?) also extracts shared cross-lingual word spaces, now by means of conditional latent topic probability distributions: two words with similar distributions over the induced latent variables/topics are considered semantically similar. The learning process is again steered by the data requirements. The early days witnessed the use of pseudo-bilingual corpora constructed by merging aligned document pairs, and then applying a monolingual representation model such as LSA (?) or LDA (?) on top of the merged data (?, ?). This approach is fundamentally very similar to pseudo-cross-lingual approaches to word embedding induction discussed later in Section 6 and Section 8. More recent topic models learn on the basis of parallel word-level information, enforcing word pairs from seed bilingual lexicons (again!) to obtain similar topic distributions (?, ?, ?, ?). In consequence, this also influences topic distributions of related words not occurring in the dictionary. Another group of models utilizes alignments at the document level (?, ?, ?, ?, ?) to induce shared topical spaces. The very same level of supervision (i.e., document alignments) is used by several cross-lingual word embedding models, surveyed in Section 8. Another embedding model based on the document-aligned Wikipedia structure (?) bears resemblance with the cross-lingual Explicit Semantic Analysis model (?, ?, ?).

All these “historical” architectures are in fact able to quantify the strength of cross-lingual similarity through various similarity functions applied to the induced shared cross-lingual space: e.g., Kullback-Leibler or Jensen-Shannon divergence applied to topic distributions, or the inner product on sparse context-counting vectors. Along the same lines, ? (2017) stress that word translation probabilities extracted from sentence-aligned parallel data by IBM alignment models can also act as the cross-lingual semantic similarity function in lieu of the cosine similarity between word embeddings. This effectively indicates that cross-lingual architectures are interchangeable within applications that rely solely on such similarity scores. One clear link between the previous work on cross-lingual representation learning and modern cross-lingual word embeddings is their overlapping set of evaluation tasks and applications, ranging from bilingual lexicon induction to cross-lingual knowledge transfer, as in cross-lingual parser transfer (?, ?), cross-lingual document classification (?, ?, ?), or relation extraction (?). In summary, while sharing the same set of high-level modeling assumptions and final goals with the older alternative cross-lingual architectures, cross-lingual word embeddings have capitalized on the recent methodological and algorithmic advances in the field of representation learning, owing their wide use to their simplicity, efficiency, and handling of large corpora, as well as their solid and robust performance across the board.

6 Word-Level Alignment Models

In the following, we discuss the different types of the current generation of cross-lingual embedding models, starting with models based on word-level alignment. Among these, models based on parallel data are more common.

6.1 Word-level Alignment Methods with Parallel Data

We distinguish three methods that use parallel word-aligned data:

  • Mapping-based approaches that first train monolingual word representations independently on large monolingual corpora and then seek to learn a transformation matrix that maps representations in one language to the representations of the other language. They learn this transformation from word alignments or bilingual dictionaries (we do not see a need to distinguish between the two).

  • Pseudo-multi-lingual corpora-based approaches that use monolingual word embedding methods on automatically constructed (or corrupted) corpora that contain words from both the source and the target language.

  • Joint methods that take parallel text as input and minimize the source and target language monolingual losses jointly with the cross-lingual regularization term.

We will show that modulo optimization strategies, these approaches are equivalent.

6.1.1 Mapping-based approaches

Minimizing mean squared error

? (2013) notice that the geometric relations that hold between words are similar across languages: for instance, numbers and animals in English show a similar geometric constellation to their Spanish counterparts, as illustrated in Figure 5. This suggests that it is possible to transform the vector space of a source language $s$ into the vector space of the target language $t$ by learning a linear projection with a transformation matrix $\mathbf{W}^{s \to t}$. We use $\mathbf{W}$ in the following if the direction is unambiguous.

Figure 5: Similar geometric relations between numbers and animals in English and Spanish (?)

They use the most frequent words from the source language and their translations as seed words. They then learn $\mathbf{W}$ using stochastic gradient descent by minimising the squared Euclidean distance (mean squared error, MSE) between the previously learned monolingual representation of the source seed word $\mathbf{x}^s_i$, transformed using $\mathbf{W}$, and its translation $\mathbf{x}^t_i$ in the bilingual dictionary:

$\Omega_{\mathrm{MSE}} = \sum_{i=1}^{n} \| \mathbf{W} \mathbf{x}^s_i - \mathbf{x}^t_i \|^2$ (11)

This can also be written in matrix form as minimizing the squared Frobenius norm of the residual matrix:

$\Omega_{\mathrm{MSE}} = \| \mathbf{W} \mathbf{X}^s - \mathbf{X}^t \|_F^2$ (12)

where $\mathbf{X}^s$ and $\mathbf{X}^t$ are the matrices whose columns are the embeddings of the seed words in the source and target language respectively. $\mathbf{W}$ can be computed more efficiently in closed form as $\mathbf{W} = \mathbf{X}^t \mathbf{X}^{s+}$, where $\mathbf{X}^{s+}$ is the Moore-Penrose pseudoinverse of $\mathbf{X}^s$ (?).
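
A minimal sketch of this mapping step is given below in the row-vector convention, i.e. solving min_W ||X^s W - X^t||_F^2 in closed form via the pseudoinverse and then translating by nearest-neighbour search. The matrices are random placeholders for pre-trained monolingual embeddings of seed translation pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50                                    # seed pairs, embedding dimension
X_src = rng.standard_normal((n, d))                # source seed embeddings (rows)
W_true = rng.standard_normal((d, d))
X_trg = X_src @ W_true + 0.01 * rng.standard_normal((n, d))   # their translations

W = np.linalg.pinv(X_src) @ X_trg                  # least-squares solution
print(np.linalg.norm(X_src @ W - X_trg))           # residual is small

def translate(x_src, target_matrix, W, top_k=1):
    """Map a source vector with W and return the nearest target word indices."""
    mapped = x_src @ W
    sims = target_matrix @ mapped / (
        np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return np.argsort(-sims)[:top_k]

print(translate(X_src[0], X_trg, W))               # should recover index 0
```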

In our notation, the MSE mapping approach can be seen as optimizing the following objective with an initial pre-training phase:

$J = \mathcal{L}_1 + \mathcal{L}_2 + \Omega_{\mathrm{MSE}}$ (13)

First, the sum of the monolingual losses $\mathcal{L}_1 + \mathcal{L}_2$ is optimized while ignoring the regularization term. $\Omega_{\mathrm{MSE}}$ is then optimized in the second step. Several extensions to the basic mapping model as framed by ? (2013) have been proposed. We discuss these developments in the following.

Max-margin with intruders

? (2015) discover that using MSE as the sub-objective for learning a projection matrix leads to the issue of hubness: some words tend to appear as nearest neighbours of many other words (i.e., they are hubs). They propose a globally corrected neighbour retrieval method to overcome this issue. ? (2015) show that optimizing MMHL (?) instead of MSE reduces hubness and consequently improves performance. In addition, they propose to select negative examples that are more informative by being near the current projected vector but far from the actual translation vector (“cat”), as depicted in Figure 6. Unlike random intruders, such intelligently chosen intruders help the model identify training instances where the model considerably fails to approximate the target function. In the formulation adopted in this article, their objective becomes:

$\Omega_{\mathrm{MMHL}} = \sum_{i=1}^{n} \sum_{j \neq i} \max\left\{ 0, \gamma + \cos(\mathbf{W}\mathbf{x}^s_i, \mathbf{x}^t_j) - \cos(\mathbf{W}\mathbf{x}^s_i, \mathbf{x}^t_i) \right\}$ (14)

where $\gamma$ is a margin and the $\mathbf{x}^t_j$ are the representations of the selected intruders.

Figure 6: The intruder “truck” is selected over “dog” as the negative example for “cat” (?).

? (2017) propose a similar solution to the hubness issue in the framework of mapping-based approaches: they invert the softmax used for finding the translation of a word at test time and normalize the probability over source words rather than target words.

Normalization and orthogonality constraint

? (2015) argue that there is a mismatch between the comparison function used for training word embeddings with SGNS, that is, the dot product, and the function used for evaluation, which is cosine similarity. They suggest normalizing word embeddings to unit length to address this discrepancy. In order to preserve unit length after mapping, they propose, in addition, to constrain $\mathbf{W}$ to be orthogonal: $\mathbf{W}^\top \mathbf{W} = \mathbf{I}$. The exact solution under this constraint is $\mathbf{W} = \mathbf{U}\mathbf{V}^\top$, where $\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$ is the SVD of $\mathbf{X}^t \mathbf{X}^{s\top}$, and it can be computed efficiently in time linear in the vocabulary size. An orthogonality constraint is also used by ? (2016) for learning cross-lingual embeddings for POS tagging.

? (2016) further demonstrate the similarity between different approaches by showing that the mapping model variant of ? (2015) optimizes the same loss as ? (2013) with an orthogonality constraint and unit vectors. In addition, they empirically show that orthogonality is more important for performance than length normalization. They also propose dimension-wise mean centering in order to capture the intuition that two randomly selected words are generally expected not to be similar and their cosine similarity should thus be zero. ? (2017) also learn a linear transformation with an orthogonality constraint and use identical character strings as their seed lexicon.
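
Under the orthogonality constraint the mapping has a closed-form solution from a single SVD (the orthogonal Procrustes solution). The sketch below, again with random placeholder seed embeddings in row-vector convention, normalizes the vectors to unit length and computes the orthogonal W.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 50
X_src = rng.standard_normal((n, d))
X_trg = rng.standard_normal((n, d))

# normalize embeddings to unit length, as the approach above suggests
X_src /= np.linalg.norm(X_src, axis=1, keepdims=True)
X_trg /= np.linalg.norm(X_trg, axis=1, keepdims=True)

# solve min_W ||X_src W - X_trg||_F  subject to  W^T W = I
U, _, Vt = np.linalg.svd(X_src.T @ X_trg)
W = U @ Vt
print(np.allclose(W @ W.T, np.eye(d), atol=1e-8))   # True: W is orthogonal
```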

Using highly reliable seed lexicon entries

The previous mapping approaches used a bilingual dictionary as an inherent component of their model, but did not pay much attention to the quality of the dictionary entries, using either automatic translations of frequent words or word alignments of all words. ? (2016) emphasize the role of the seed lexicon that is used for learning the projection matrix. They propose a hybrid model that initially learns a first shared bilingual embedding space based on an existing cross-lingual embedding model. They then use this initial vector space to obtain translations for a list of frequent source words by projecting them into the space and using the nearest neighbor in the target language as translation. They then use these translation pairs as seed words for learning a mapping. In addition, they propose a symmetry constraint: it enforces that words are included in the seed lexicon if and only if their projections are nearest neighbors of each other in the first embedding space.

Bootstrapping from few seed entries

Recently, there have been initiatives towards enabling embedding induction using only a small number of seed translation pairs. If effective, such approaches would boost the induction process for truly low-resource language pairs, where only very limited amounts of bilingual data are at hand. The core idea behind these bootstrapping approaches is to start from a few seed words initially, which they then iteratively expand. ? (2017) propose a mapping model that relies only on a small set of shared words (e.g., identically spelled words or only shared numbers) to seed the procedure. The model has multiple bootstrapping rounds where it gradually adds more and more bilingual translation pairs to the original seed lexicon and refines it.

? (2017) and ? (2017) propose a method that creates seed lexicons by identifying cognates in the vocabularies of related languages. In contrast to ? (2013), they learn not only a transformation matrix $\mathbf{W}^{s \to t}$ that transforms embeddings of language $s$ into embeddings of language $t$, but also a matrix $\mathbf{W}^{t \to s}$ that transforms embeddings in the opposite direction. Starting from a small set of automatically extracted seed translation pairs, they iteratively expand the size of the lexicon.

As discussed in Section 5, the bootstrapping idea is conceptually similar to the work of ? (2010, 2011) and ? (2013), with the difference that earlier approaches were developed within the traditional cross-lingual distributional framework (mapping vectors into the count-based space using a seed lexicon).

Cross-lingual embeddings via retro-fitting

Learning a mapping between monolingual embedding spaces can also be framed as retro-fitting (?), which is used to inject knowledge from semantic lexicons into pre-trained word embeddings. ? (2017) similarly derive cross-lingual synonymy and antonymy constraints from BabelNet. They then use these constraints to bring the monolingual vector spaces of two different languages together into a shared embedding space. Such retro-fitting approaches employ MMHL with a careful selection of intruders, similar to the work of ? (2015).
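
As a rough illustration of the retro-fitting idea, the sketch below iteratively pulls a cross-lingual synonym pair together while keeping each vector anchored to its original monolingual position. It uses plain squared-distance attraction instead of MMHL with carefully selected intruders, so it is a simplified stand-in for the approach described above; the word pair, weights, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = {"good": rng.standard_normal(4), "gut": rng.standard_normal(4)}
synonym_pairs = [("good", "gut")]          # cross-lingual constraints, e.g. from a lexicon
original = {w: v.copy() for w, v in vectors.items()}
alpha, beta, steps = 1.0, 0.5, 20          # anchor weight vs. constraint weight

for _ in range(steps):
    for a, b in synonym_pairs:
        # closed-form update: weighted average of the original vector and the partner
        vectors[a] = (alpha * original[a] + beta * vectors[b]) / (alpha + beta)
        vectors[b] = (alpha * original[b] + beta * vectors[a]) / (alpha + beta)

print(np.linalg.norm(original["good"] - original["gut"]),
      np.linalg.norm(vectors["good"] - vectors["gut"]))   # the distance shrinks
```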

CCA-based mapping

? (2014) propose to use another technique to learn a linear mapping between languages. They use canonical correlation analysis (CCA) to project words from two languages into a shared embedding space. In contrast to a single linear projection, CCA learns one transformation matrix per language, as can be seen in Figure 7: the transformation matrix $\mathbf{W}^{s\to}$ is used to project word representations from the source embedding space into a new space $\mathbf{X}^{s*}$, while $\mathbf{W}^{t\to}$ transforms words from the target embedding space into $\mathbf{X}^{t*}$. Note that $\mathbf{X}^{s*}$ and $\mathbf{X}^{t*}$ can be seen as the same shared embedding space.

Figure 7: Cross-lingual projection using CCA (?)

Given two seed vectors $\mathbf{x}^s$ and $\mathbf{x}^t$ from the embedding matrices of seed word translation pairs in the source language and in the target language respectively, and two projection vectors $\mathbf{v}$ and $\mathbf{w}$, we obtain the following projected values $x^{s*}$ and $x^{t*}$:

$x^{s*} = \mathbf{v}^\top \mathbf{x}^s \qquad x^{t*} = \mathbf{w}^\top \mathbf{x}^t$ (15)

Their correlation can be written as:

$\rho(x^{s*}, x^{t*}) = \frac{\mathrm{cov}(x^{s*}, x^{t*})}{\sqrt{\mathrm{var}(x^{s*}) \, \mathrm{var}(x^{t*})}}$ (16)

where $\mathrm{cov}(x^{s*}, x^{t*})$ is the covariance of $x^{s*}$ and $x^{t*}$. CCA then aims to minimize the following:

$\Omega_{\mathrm{CCA}} = - \rho(x^{s*}, x^{t*})$ (17)

We can write their objective in our notation as the following:

$J = \mathcal{L}_1 + \mathcal{L}_2 + \Omega_{\mathrm{CCA}}$ (18)

As CCA sorts the projection vectors in descending order of correlation, ? find that using only the projection vectors with the highest correlations generally yields the highest performance, and that CCA helps to separate synonyms and antonyms in the source language, as can be seen in Figure 8. The figure shows the unprojected antonyms of “beautiful” in two clusters at the top, whereas the CCA-projected vectors of the synonyms and antonyms form two distinct clusters at the bottom.

Figure 8: Monolingual (top) and multi-lingual (bottom; marked with apostrophe) projections of the synonyms and antonyms of “beautiful” (?)

? (2015) extend Bilingual CCA to Deep Bilingual CCA by introducing non-linearity into the mapping process: they train two deep neural networks to maximize the correlation between the projections of both monolingual embedding spaces. Finally, ? (2016) show that their objective, which builds on the original or “standard” Mikolov-style mapping idea and uses dimension-wise mean centering, is directly related to bilingual CCA. The only fundamental difference is that the CCA-based model does not guarantee monolingual invariance, while this property is enforced in the model of Artetxe et al.
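
For a concrete example of CCA-based projection, the sketch below uses scikit-learn's CCA to map two seed embedding matrices into a shared space with one transformation per language. Random matrices stand in for pre-trained monolingual embeddings of seed translation pairs, and the number of canonical components kept is an arbitrary choice here.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, d = 300, 40
X_src = rng.standard_normal((n, d))    # source-language seed embeddings (rows)
X_trg = rng.standard_normal((n, d))    # aligned target-language seed embeddings

cca = CCA(n_components=20, max_iter=2000)
X_src_star, X_trg_star = cca.fit_transform(X_src, X_trg)   # shared space
print(X_src_star.shape, X_trg_star.shape)                  # (300, 20) (300, 20)
```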

6.1.2 Word-level approaches based on pseudo-bilingual corpora

Rather than learning a mapping between the source and the target language, some approaches use the word-level alignment of a seed bilingual dictionary to construct a pseudo-bilingual corpus by randomly replacing words in a source language corpus with their translations. ? (2014) propose the first such method. They first construct a seed bilingual dictionary by translating all words that appear in the source language corpus into the target language using Wiktionary, filtering polysemous words as well as translations that do not appear in the target language corpus. From this seed dictionary, they create a joint vocabulary, in which each translation pair occupies the same vector representation. They train this model using MMHL (?) by feeding it context windows of both the source and target language corpus.

Other approaches explicitly create a pseudo-bilingual corpus: ? (2015) concatenate the source and target language corpora and replace each word that is part of a translation pair with one of its translation equivalents, with a probability that is inversely proportional to the total number of possible translation equivalents for that word, and train CBOW on this corpus. ? (2016) extend this approach to multiple languages: using bilingual dictionaries, they determine clusters of synonymous words in different languages. They then concatenate the monolingual corpora of different languages and replace tokens in the same cluster with the cluster ID. Finally, they train SGNS on the concatenated corpus.
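
A toy version of such pseudo-bilingual corpus construction is sketched below: words with dictionary translations are swapped for a translation with some probability, and a standard monolingual model (e.g. CBOW or SGNS) would then be trained on the mixed corpus. The seed dictionary, corpus, and fixed replacement probability are invented simplifications rather than the exact scheme of any particular paper.

```python
import random

random.seed(0)
seed_dict = {"dog": ["hund"], "cat": ["katze"], "house": ["haus"]}

def pseudo_bilingual(tokens, dictionary, replace_prob=0.5):
    mixed = []
    for tok in tokens:
        translations = dictionary.get(tok)
        if translations and random.random() < replace_prob:
            mixed.append(random.choice(translations))   # swap in a translation
        else:
            mixed.append(tok)
    return mixed

corpus = "the dog chased the cat into the house".split()
print(pseudo_bilingual(corpus, seed_dict))
# e.g. ['the', 'hund', 'chased', 'the', 'cat', 'into', 'the', 'haus']
```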

? (2016) propose a similar approach. Instead of randomly replacing every word in the corpus with its translation, they replace each center word with a translation on-the-fly during CBOW training. In addition, they handle polysemy explicitly by proposing an EM-inspired method that chooses as replacement the translation $\bar{w}_t$ whose representation is most similar to the sum of the source word representation $\mathbf{x}_{w_t}$ and the context vector $\bar{\mathbf{x}}_t$ of Equation 9:

$\bar{w}_t = \operatorname*{argmax}_{w \in \tau(w_t)} \cos(\mathbf{x}_w, \ \mathbf{x}_{w_t} + \bar{\mathbf{x}}_t)$ (19)

They jointly learn to predict both the words and their appropriate translations using PanLex as the seed bilingual dictionary. PanLex covers around 1,300 languages with about 12 million expressions. Consequently, translations have high coverage but are often noisy. ? (2017) use the same approach for pre-training cross-lingual word embeddings for low-resource language modeling.

In what follows, we now show that these pseudo-bilingual models are in fact optimizing for the same objective as the mapping models discussed earlier (?).

Proof for the occasional equivalence of mapping and pseudo-bilingual approaches

Recall that in the negative sampling objective of SGNS in Equation 7, the probability of observing a word $w$ with a context word $c$, with representations $\mathbf{x}_w$ and $\tilde{\mathbf{x}}_c$ respectively, is given as $P(c \mid w) = \sigma(\tilde{\mathbf{x}}_c^\top \mathbf{x}_w)$, where $\sigma$ denotes the sigmoid function. We now sample a set of negative examples, that is, contexts $c'$ with which $w$ does not occur, as well as actual contexts consisting of $(w, c)$ pairs, and try to maximize the above for actual contexts and minimize it for negative samples. Recall that ? (2013) obtain cross-lingual embeddings by running SGNS over two monolingual corpora of two different languages at the same time with the constraint that words known to be translation equivalents have the same representation. We will refer to this as Constrained Bilingual SGNS. Further, recall that $\tau$ is a function from words into their translation equivalents $\tau(w)$ (if any are available in the dictionary seed or word alignments) with the representation $\mathbf{x}_{\tau(w)}$. With some abuse of notation, this can be written as the following joint objective for the source language (idem for the target language):

$\sum_{(w, c)} \left[ \log \sigma(\tilde{\mathbf{x}}_c^\top \mathbf{x}_w) + \log \sigma(\tilde{\mathbf{x}}_c^\top \mathbf{x}_{\tau(w)}) \right] + \sum_{(w, c')} \left[ \log \sigma(-\tilde{\mathbf{x}}_{c'}^\top \mathbf{x}_w) + \log \sigma(-\tilde{\mathbf{x}}_{c'}^\top \mathbf{x}_{\tau(w)}) \right]$ (20)

where the first sum runs over the actual $(w, c)$ contexts and the second over the sampled negative contexts $(w, c')$.

Alternatively, we can sample sentences from the corpora in the two languages. When we encounter a word $w$ for which we have a translation, that is, $\tau(w)$ is non-empty, we flip a coin and, if heads, replace $w$ with a randomly selected member of $\tau(w)$.

In the case where $\tau$ is bijective, as in the work of ? (2014), it is easy to see that the two approaches are equivalent. In the limit of sampling such mixed pairs, $w$ and $\tau(w)$ will converge to the same representation. If we loosen the requirement that $\tau$ is bijective, establishing equivalence becomes more complicated. It suffices for now to show that, regardless of the nature of $\tau$, methods based on mixed corpora can be equivalent to methods based on regularization, and can as such also be presented and implemented as joint, regularized optimization problems.

We provide an example and a simple proof sketch here. In Constrained Bilingual SGNS, we can conflate translation equivalents; in fact, this is a common way of implementing the method. So assume a set of word-context pairs in which each source language word has exactly one translation equivalent, and let the source and target vocabularies consist of these words and their translations respectively. To obtain a mixed corpus such that running SGNS directly on it will, in the limit, induce the same representations, simply enumerate all combinations of each word-context pair with the words either kept or replaced by their translation equivalents. Note that this is exactly the mixed corpus we would obtain in the limit with the approach of ? (2014). Since this reduction generalizes to all examples where $\tau$ is bijective, this suffices to show that the two approaches are equivalent.

6.1.3 Joint models

The third group of word-level cross-lingual embedding models directly adopts the standard joint formulation of the learning objective provided as:

$J = \mathcal{L}_1 + \mathcal{L}_2 + \Omega$ (21)

where $\mathcal{L}_1$ and $\mathcal{L}_2$ are the monolingual objectives of the two languages, optimized jointly with the cross-lingual regularization objective $\Omega$. The models in this group differ on the basis of the actual methods and objectives used for the $\mathcal{L}$-s and $\Omega$. In what follows, we discuss a few illustrative example models which sparked this sub-line of research.

Bilingual language model

? (2012) train a neural language model (LM) for the source and the target language (providing the monolingual objectives $\mathcal{L}_1$ and $\mathcal{L}_2$) and jointly optimize the monolingual maximum likelihood (MLE) objective of each LM together with a word-alignment based regularization term. The monolingual objective is the classic LM objective of minimizing the negative log-likelihood of the current word $w_t$ given its $n-1$ previous context words:

$\mathcal{L}_\ell = - \sum_t \log P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$ (22)

At the same time, the cross-lingual regularization term encourages the representations of words that are often aligned to each other in a learned alignment matrix $\mathbf{A}^{s \to t}$ to be similar:

$\Omega_{s \to t} = \frac{1}{2} \, \mathbf{x}^\top \left( \mathbf{A}^{s \to t} \otimes \mathbf{I} \right) \mathbf{x}$ (23)

where $\mathbf{x}$ is the concatenation of the word embeddings of both languages, $\mathbf{I}$ is the identity matrix, and $\otimes$ is the Kronecker product. The interaction is borrowed from linear multi-task learning models (?). There, an interaction matrix $\mathbf{A}$ encodes the relatedness between tasks. At a certain time step, an update is thus not only performed for the current task, but also for related tasks. This effectively enables the interaction between the monolingual objectives and the cross-lingual regularization term. In other words, in the bilingual setting, $\mathbf{A}^{s \to t}$ mediates the update of words that were often aligned with the target word. The complete objective is thus the following:

$J = \mathcal{L}_1 + \mathcal{L}_2 + \Omega_{s \to t} + \Omega_{t \to s}$ (24)
Joint learning of word embeddings and word alignments

? (2014) simultaneously learn word embeddings and word-level alignments using a distributed version of FastAlign (?) together with a language model. Similar to other bilingual approaches, they use the word in the source language sentence of an aligned sentence pair to predict the word in the target language sentence.

They replace the standard multinomial translation probability of FastAlign with an energy function that tries to bring the representation of a target word $w^t_j$ close to the sum of the context words around its aligned word $w^s_i$ in the source sentence:

$E(w^s_i, w^t_j) = - \Big( \sum_{k \in C(i)} \mathbf{x}^s_k \Big)^{\top} \mathbf{P} \, \mathbf{x}^t_j - \mathbf{b}^\top \mathbf{x}^t_j - b_{w^t_j}$ (25)

where $\mathbf{x}^s_k$ and $\mathbf{x}^t_j$ are the representations of source word $w^s_k$ and target word $w^t_j$ respectively, $C(i)$ is the set of positions in a window around the aligned source word, $\mathbf{P}$ is a projection matrix, and $\mathbf{b}$ and $b_{w^t_j}$ are representation and word biases respectively. The authors speed up training by using a class factorization strategy similar to the hierarchical softmax, predicting frequency-based class representations instead of word representations. For training, they use EM but fix the alignment counts of the E-step to those learned by FastAlign, which was initially trained for 5 epochs. They then optimize the word embeddings in the M-step only. Note that this model is conceptually very similar to bilingual models which discard word-level alignment information and learn solely on the basis of sentence-aligned information, which we discuss in Section 7.1.

Proof for the occasional equivalence of mapping and joint approaches

We provide another proof, which now demonstrates that the joint models in fact optimize equivalent objectives as the mapping-based approaches.

First recall Constrained Bilingual SGNS from above. This model is a simple application of SGNS with the constraint that word pairs that are translation equivalents in our dictionary seed use the same representation. Now loosen the constraint that translation equivalents must have the same representation, and say instead that the distance between (the vector representation of) a word $w$ and its translation $\tau(w)$ must be smaller than some $\epsilon$. This introduces a sphere around the null model. Fitting to the monolingual objectives now becomes a matter of finding the optimum within this bounding sphere.

Intuitively, we can think of mapping approaches as projecting our embeddings back into this bounding sphere, after fitting the monolingual objectives. Note also that this approach introduces an additional inductive bias, from the mapping algorithm itself. While joint approaches are likely to find the optimum within the bounding sphere, it is not clear that there is a projection (within the class of possible projections) from the fit to the monolingual objectives and into the optimum within the bounding sphere. It is not hard to come up with examples where such an inductive bias would hurt performance, but it remains an empirical question whether mapping-based approaches are therefore inferior on average.

In some cases, however, it is possible to show an equivalence between mapping approaches and joint approaches. Consider the mapping approach in ? (2015) (retro-fitting) and Constrained Bilingual SGNS (?).

Retro-fitting requires two sets of monolingual embeddings. Let us assume these embeddings were induced using SGNS with a set of hyper-parameters $\mathcal{H}$. Retro-fitting minimizes the weighted sum of the Euclidean distances between the seed words and their translation equivalents and between the seed words and their neighbors in the monolingual embeddings, with a parameter $\alpha$ that controls the strength of the regularizer. As this parameter goes to infinity, translation equivalents will be forced to have the same representation. As is the case in Constrained Bilingual SGNS, all word pairs in the seed dictionary will then be associated with the same vector representation.

Since retro-fitting only affects words in the seed dictionary, the representations of words not in the seed dictionary are determined entirely by the monolingual objectives. Again, this is the same as in Constrained Bilingual SGNS. In other words, if we fix the hyper-parameters $\mathcal{H}$ for retro-fitting and Constrained Bilingual SGNS, and let the regularization strength $\alpha$ in retro-fitting go to infinity, retro-fitting is equivalent to Constrained Bilingual SGNS.

6.2 Word-Level Alignment Methods with Comparable Data

All previous methods required word-level parallel data. We categorize methods that employ word-level alignment with comparable data into two types:

  • Language grounding models anchor language in images and use image features to obtain information with regard to the similarity of words in different languages.

  • Comparable feature models that rely on the comparability of some other features. The main feature that has been explored in this context is part-of-speech (POS) tag equivalence.

Grounding language in images

Most methods employing word-aligned comparable data ground words from different languages in image data. The idea in all of these approaches is to use the image space as the shared cross-lingual signal. For instance, bicycles always look like bicycles even if they are referred to as “fiets”, “Fahrrad”, “bicikl”, “bicicletta”, or “velo”. A set of images for each word is typically retrieved using Google Image Search. ? (2011) calculate a similarity score for a pair of words based on the visual similarity of their associated image sets. They propose two strategies to calculate the cosine similarity between the color and SIFT features of two image sets: they either take the average of the maximum similarity scores (AvgMax) or the maximum of the maximum similarity scores (MaxMax). ? (2015) propose to do the same but use CNN-based image features. ? (2016) additionally propose to combine image and word representations either by interpolating and concatenating them or by interpolating the linguistic and visual similarity scores.

A similar idea of grounding language for learning multimodal multilingual representations can be applied to sensory signals beyond vision, e.g. auditory or olfactory signals (?). This line of work, however, is currently under-explored. Moreover, it seems that signals from other modalities are typically more useful as an additional source of information to complement the uni-modal signals from text, rather than as the single source of information.

POS tag equivalence

Other approaches rely on comparability between certain features of a word, such as its part-of-speech tag. ? (2015) create a pseudo-cross-lingual corpus by replacing words based on part-of-speech equivalence, that is, words with the same part-of-speech in different languages are replaced with one another. Instead of using the POS tags of the source and target words as a bridge between two languages, we can also use the POS tags of their contexts. This makes strong assumptions about the word orders in the two languages, and their similarity, but ? (2015) use this to obtain cross-lingual word embeddings for several language pairs. They use POS tags as context features and run SGNS on the concatenation of two monolingual corpora. Note that under the (too simplistic) assumptions that all instances of a part-of-speech have the same distribution, and each word belongs to a single part-of-speech class, this approach is equivalent to the pseudo-cross-lingual corpus approach described before.

7 Sentence-Level Alignment Methods

Sentence-aligned training data is among the more expensive types of data to obtain, as it requires fine-grained supervision. Due to the availability of large amounts of sentence-aligned parallel data thanks to research in MT, much work has focused on learning cross-lingual representations from parallel data, with only recent work investigating comparable data.

7.1 Sentence-Level Methods with Parallel data

Methods leveraging sentence-aligned data are generally extensions of successful monolingual models. We have detected four main types:

  • Word-alignment based matrix factorization approaches apply matrix factorization techniques to the bilingual setting and typically require additional word alignment information.

  • Compositional sentence models use word representations to construct sentence representations of aligned sentences, which are trained to be close to each other.

  • Bilingual autoencoder models reconstruct source and target sentences using an autoencoder.

  • Bilingual skip-gram models use the skip-gram objective to predict words in both source and target sentences.

Word-alignment based matrix factorization

Several methods directly leverage the information contained in the alignment matrices $\mathbf{A}^{s \to t}$ and $\mathbf{A}^{t \to s}$ between the source language $s$ and the target language $t$ respectively. If a word in the source language is only aligned with one word in the target language, then those words should have the same representation. If the target word is aligned with more than one source word, then its representation should be a combination of the representations of its aligned words. ? (2013) represent the embeddings $\mathbf{X}^t$ in the target language as the product of the source embeddings $\mathbf{X}^s$ with the corresponding alignment matrix $\mathbf{A}^{s \to t}$. They then minimize the squared difference between these two terms in both directions:

$\Omega_{s \to t} = \| \mathbf{X}^t - \mathbf{A}^{s \to t} \mathbf{X}^s \|^2$ (26)

$\Omega_{t \to s} = \| \mathbf{X}^s - \mathbf{A}^{t \to s} \mathbf{X}^t \|^2$ (27)

Note that $\Omega_{s \to t}$ can be seen as a variant of $\Omega_{\mathrm{MSE}}$ that incorporates soft weights from alignments. They optimize the following joint objective during training:

$J = \mathcal{L}_1 + \mathcal{L}_2 + \Omega_{s \to t} + \Omega_{t \to s}$ (28)
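
The alignment-based regularizer of Equation 26 is simple to compute once the alignment counts are row-normalized; the sketch below does so for random placeholder embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
Vs, Vt, d = 6, 5, 8
X_src = rng.standard_normal((Vs, d))            # source embeddings, one row per word
X_trg = rng.standard_normal((Vt, d))            # target embeddings

A_st = rng.random((Vt, Vs))                     # alignment counts: target x source
A_st /= A_st.sum(axis=1, keepdims=True)         # row-normalize to alignment weights

omega_s2t = np.linalg.norm(X_trg - A_st @ X_src) ** 2   # || X^t - A^{s->t} X^s ||^2
print(omega_s2t)
```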

? (2015) also take into account monolingual data by placing cross-lingual constraints on the monolingual representations as can be seen in Figure 9 and propose two alignment-based cross-lingual regularization objectives. The first one treats the alignment matrix as a cross-lingual co-occurrence matrix and factorizes it using the GloVe objective. The second one is similar to the objective by ? (2013) and minimizes the squared distance of the representations of words in two languages weighted by their alignment probabilities.

Figure 9: Cross-lingual word embeddings via matrix factorisation (?)

? (2015) extend LSA as translation-invariant LSA to learn cross-lingual word embeddings. They factorize a multilingual co-occurrence matrix with the restriction that it should be invariant to translation, i.e., it should stay the same if multiplied with the respective word or context dictionary.

? (2016) propose another method based on matrix factorization that allows learning sparse cross-lingual embeddings. They first learn monolingual sparse representations from monolingual embedding matrices $\mathbf{X}^s$ and $\mathbf{X}^t$ that were pre-trained using GloVe. To do this, they decompose each matrix $\mathbf{X}$ into a sparse coefficient matrix $\mathbf{S}$ and a dictionary matrix $\mathbf{D}$ such that the reconstruction error is minimized, with an additional $\ell_1$ constraint on $\mathbf{S}$ for sparsity:

$\mathcal{L}_\ell = \sum_{i=1}^{|V^\ell|} \| \mathbf{S}_i \mathbf{D} - \mathbf{X}_i \|_2^2 + \lambda \| \mathbf{S}_i \|_1$ (29)

To learn bilingual embeddings, they add another constraint based on an automatically learned word alignment matrix $\mathbf{A}^{s \to t}$ that minimizes the reconstruction error between words that were strongly aligned to each other in a parallel corpus:

$\Omega = \lambda_x \sum_{i=1}^{|V^s|} \sum_{j=1}^{|V^t|} A^{s \to t}_{ji} \, \| \mathbf{S}^s_i - \mathbf{S}^t_j \|_2^2$ (30)

With an initial pre-training phase for the GloVe embeddings, the complete objective that they optimize may be formulated as follows:

$J = \mathcal{L}_1 + \mathcal{L}_2 + \Omega$ (31)

? (2015) similarly create target-language representations of a source word by taking the average of the embeddings of its translations weighted by their alignment probability with the source word:

$\mathbf{x}^t_{w^s_i} = \sum_{w^t_j \in \tau(w^s_i)} \frac{a_{ij}}{\sum_k a_{ik}} \cdot \mathbf{x}_{w^t_j}$ (32)

where $a_{ij}$ is the number of times the $i$-th source word was aligned with the $j$-th target word. They propagate alignments to out-of-vocabulary (OOV) words using edit distance as an approximation for morphological similarity and set the projected vector of an OOV source word $w^s$ as the average of the projected vectors of source words that are similar to it based on the edit distance measure:

$\mathbf{x}^t_{w^s} = \frac{1}{|E_{w^s}|} \sum_{w \in E_{w^s}} \mathbf{x}^t_{w}$ (33)

where $E_{w^s} = \{ w \mid \mathrm{EditDist}(w^s, w) \leq \kappa \}$ and the threshold $\kappa$ is set empirically.
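
The sketch below illustrates the two steps just described: alignment-weighted averaging of translation embeddings for known source words, and an edit-distance fallback for OOV words. The embeddings, alignment counts, and threshold are toy values.

```python
import numpy as np

def edit_distance(a, b):
    # simple dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

trg_emb = {"hund": np.array([1.0, 0.0]), "koeter": np.array([0.8, 0.2])}
align_counts = {"dog": {"hund": 9, "koeter": 1}}   # source word -> aligned target counts

def project(word):
    counts = align_counts[word]
    total = sum(counts.values())
    return sum(c / total * trg_emb[t] for t, c in counts.items())

projected = {w: project(w) for w in align_counts}

def project_oov(word, threshold=1):
    similar = [v for w, v in projected.items() if edit_distance(word, w) <= threshold]
    return np.mean(similar, axis=0) if similar else None

print(projected["dog"])
print(project_oov("dogs"))    # averages projections of words within edit distance 1
```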

Compositional sentence model

? (2013) train a model to bring the sentence representations of aligned sentences $\mathrm{sent}^s$ and $\mathrm{sent}^t$ in source and target language $s$ and $t$ respectively close to each other, as can be seen in Figure 10. The representation $\mathbf{y}^s$ of a sentence $\mathrm{sent}^s$ in language $s$ is the sum of the embeddings of its words:

$\mathbf{y}^s = \sum_{i=1}^{|\mathrm{sent}^s|} \mathbf{x}^s_i$ (34)

where $|\mathrm{sent}^s|$ is the length of the sentence. They seek to minimize the distance between aligned sentences $\mathrm{sent}^s$ and $\mathrm{sent}^t$:

$E_{\mathrm{dist}}(\mathrm{sent}^s, \mathrm{sent}^t) = \| \mathbf{y}^s - \mathbf{y}^t \|^2$ (35)

Figure 10: A compositional sentence model (?)

They optimize this distance using MMHL by learning to bring aligned sentences closer together than negative examples:

(36)

where the outer sum runs over a corpus of aligned sentence pairs and a fixed number of negative examples is sampled for each pair. In addition, they use an L2 regularization term for each language, so that the final loss they optimize is the following:

(37)

In this case, the norm is applied to the difference between the source and target sentence representations. In this sense, we are again minimizing the squared error between source and target representations, this time not with regard to word embeddings but with regard to sentence representations. As we saw, these sentence representations are just the sum of their constituent word embeddings; in the limit, we can therefore see this objective as optimizing over word representations as well.
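A minimal sketch of additive sentence composition combined with a max-margin hinge loss (MMHL) over negative examples is shown below; the toy vocabularies, margin, and random embeddings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# Toy word embeddings for both languages (untrained, random).
emb_en = {w: rng.normal(size=d) for w in ["the", "cat", "sat"]}
emb_de = {w: rng.normal(size=d) for w in ["die", "katze", "sass", "hund", "lief"]}

def compose(sentence, emb):
    """Additive composition: a sentence is the sum of its word embeddings."""
    return np.sum([emb[w] for w in sentence], axis=0)

def dist(a, b):
    return np.sum((a - b) ** 2)

def mmhl(src, tgt, negatives, margin=1.0):
    """Max-margin hinge loss: aligned sentences should be closer than negatives."""
    pos = dist(src, tgt)
    return sum(max(0.0, margin + pos - dist(src, n)) for n in negatives)

src = compose(["the", "cat", "sat"], emb_en)
tgt = compose(["die", "katze", "sass"], emb_de)
neg = [compose(["hund", "lief"], emb_de)]
print(f"hinge loss: {mmhl(src, tgt, neg):.3f}")
```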

? (2014) extend this approach to documents by applying the composition and objective function recursively to compose sentences into documents. They additionally propose a non-linear composition function based on bigram pairs, which outperforms simple addition on large training datasets but underperforms it on smaller data:

(38)

? (2015) augment this model with a monolingual objective that operates on the phrase level. The model can be seen in Figure 11. The objective uses MMHL and is based on the assumption that phrases are typically more similar to their sub-phrases than to randomly sampled phrases:

(39)

where the loss involves a margin, a phrase sampled from a sentence, a shorter sub-phrase of that phrase, and a phrase sampled from a random sentence. The additional loss terms are meant to reduce the influence of the margin as a hyperparameter and to compensate for the differences in phrase and sub-phrase length.

Figure 11: A compositional model with monolingual inclusion (?)
Bilingual autoencoder

Instead of minimizing the distance between two sentence representations in different languages, ? (2013) aim to reconstruct the target sentence from the original source sentence. Analogously to ? (2013), they also encode a sentence as the sum of its word embeddings. They then train an autoencoder with language-specific encoder and decoder layers and a hierarchical softmax to reconstruct from each sentence both the sentence itself and its translation, as can be seen in Figure 12. The loss they optimize can be written as follows:

Figure 12: A bilingual autoencoder (?)
(40)

where each term denotes the loss for reconstructing a sentence in one language from a sentence in the other (or the same) language.

? (2014) use a binary bag-of-words (BOW) instead of the hierarchical softmax, as shown in Figure 13. To address the increase in complexity due to the higher dimensionality of the BOW, they propose to merge the bags-of-words in a mini-batch into a single BOW and to perform updates based on this merged bag-of-words. They also add a term to the objective function that encourages correlation between the source and target sentence representations by summing the scalar correlations between all dimensions of the two vectors.

Figure 13: A bilingual autoencoder with binary reconstruction error (?)
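The sketch below illustrates the overall shape of such a bilingual autoencoder loss with binary bag-of-words reconstruction and a dimension-wise correlation term. The sigmoid decoder, binary cross-entropy, toy dimensions, and batch handling are simplifying assumptions; the original models differ in their exact parameterization and training procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
V_s, V_t, d, B = 6, 5, 3, 4     # toy vocab sizes, embedding dim, batch size

# Untrained encoder embeddings and decoder weights for each language.
W_enc = {"s": rng.normal(size=(V_s, d)), "t": rng.normal(size=(V_t, d))}
W_dec = {"s": rng.normal(size=(d, V_s)), "t": rng.normal(size=(d, V_t))}

def encode(bow, lang):
    """Sentence encoding = sum of embeddings of the words present in the BOW."""
    return bow @ W_enc[lang]

def decode(code, lang):
    """Sigmoid decoder giving a probability for each vocabulary entry."""
    return 1.0 / (1.0 + np.exp(-(code @ W_dec[lang])))

def bce(pred, target):
    eps = 1e-9
    return -np.mean(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))

# Binary bags-of-words for a mini-batch of aligned sentence pairs.
bow_s = (rng.random((B, V_s)) < 0.5).astype(float)
bow_t = (rng.random((B, V_t)) < 0.5).astype(float)
code_s, code_t = encode(bow_s, "s"), encode(bow_t, "t")

# Reconstruct both languages from both encodings (four reconstruction terms).
recon = sum(bce(decode(c, lang), bow)
            for c in (code_s, code_t)
            for lang, bow in (("s", bow_s), ("t", bow_t)))

# Correlation term: sum of dimension-wise correlations between the encodings.
zs = (code_s - code_s.mean(0)) / (code_s.std(0) + 1e-9)
zt = (code_t - code_t.mean(0)) / (code_t.std(0) + 1e-9)
corr = np.sum(zs * zt) / B

loss = recon - corr             # maximise correlation, minimise reconstruction error
print(f"loss: {loss:.3f}")
```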
Bilingual skip-gram

? (2015) propose Bilingual Bag-of-Words without Word Alignments (BilBOWA), an extension of SGNS to learning cross-lingual embeddings. Instead of relying on expensive word alignments, they assume that each word in a source sentence is aligned with every word in the target sentence under a uniform alignment model. Thus, instead of minimizing the distance between words that were aligned to each other, they minimize the distance between the means of the word representations in the aligned sentences, which is shown in Figure 14.

Figure 14: Approximating word alignments with uniform alignments (?)

The cross-lingual objective in the BilBOWA model is thus:

(41)

where the cross-lingual term compares the mean of the word embeddings of the source sentence with the mean of the word embeddings of the aligned target sentence. Using SGNS as the monolingual objective, we obtain the complete loss shown in Figure 15:

(42)
Figure 15: The BilBOWA model (?)
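The cross-lingual part of BilBOWA thus reduces to a squared distance between the mean word embeddings of an aligned sentence pair, as in the following sketch; the toy vocabularies and random embeddings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
emb_en = {w: rng.normal(size=d) for w in ["the", "cat", "sat", "on", "mat"]}
emb_fr = {w: rng.normal(size=d) for w in ["le", "chat", "assis"]}

def bilbowa_loss(sent_s, sent_t, emb_s, emb_t):
    """Squared distance between the mean word embeddings of an aligned
    sentence pair, i.e. a uniform-alignment stand-in for word alignments."""
    mean_s = np.mean([emb_s[w] for w in sent_s], axis=0)
    mean_t = np.mean([emb_t[w] for w in sent_t], axis=0)
    return np.sum((mean_s - mean_t) ** 2)

print(bilbowa_loss(["the", "cat", "sat", "on", "mat"],
                   ["le", "chat", "assis"], emb_en, emb_fr))
```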

? (2015) propose BiSkip, a model that uses the skip-gram objective with negative sampling to predict the surrounding words of aligned word pairs in both languages. In contrast to ? (2015), they thus also use SGNS as the cross-lingual objective. If no word alignment is provided, they assume a monotonic rather than a uniform alignment: each source word is aligned to the target word at the corresponding relative position, determined by the ratio of the source and target sentence lengths. Their model optimizes the following loss:

(43)
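One plausible instantiation of such a monotonic alignment simply scales a source position by the ratio of the sentence lengths, as sketched below; the exact rounding and indexing conventions of the original model may differ, and the toy sentences are assumptions.

```python
def monotonic_alignment(i: int, src_len: int, tgt_len: int) -> int:
    """Map source position i to a target position by scaling with the
    sentence length ratio (one plausible instantiation of monotonic
    alignment; the original work may round or index differently)."""
    return round(i * tgt_len / src_len)

src = ["the", "cat", "sat", "on", "the", "mat"]
tgt = ["le", "chat", "assis", "sur", "le", "tapis", "."]
for i, w in enumerate(src):
    # Clamp to the last target position; the mapping is only approximate.
    j = min(monotonic_alignment(i, len(src), len(tgt)), len(tgt) - 1)
    print(w, "->", tgt[j])
```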

? (2015) propose Trans-gram, which requires no alignment information and assumes uniform alignment, predicting for each word in a source sentence its context words as well as all words in the aligned target language sentence, and vice versa. Their model can be seen in Figure 16. As the models of ?, ?, and ? are very similar in their design (i.e., they are cross-lingual extensions of the original SGNS model in one way or another), we detail their core differences and similarities in Table 3.

Figure 16: The Trans-gram model by ? (2015)
Model Alignment model Monolingual losses Cross-lingual regularizer
BilBOWA (?) Uniform SGNS Distance between sentence means
BiSkip (?) Monotonic SGNS Cross-lingual SGNS
Trans-gram (?) Uniform SGNS Cross-lingual SGNS
Table 3: A comparison of similarities and differences of the three bilingual skip-gram variants.

In a similar vein, ? (2015) propose an extension of paragraph vectors (?) to the multilingual setting by forcing aligned sentences of different languages to share the same vector representation.

Other sentence-level approaches

? (2017) use another sentence-level bilingual signal: IDs of the aligned sentence pairs in a parallel corpus. Their model provides a strong baseline for cross-lingual embeddings that is inspired by the Dice aligner commonly used for producing word alignments for MT. Observing that sentence IDs are already a powerful bilingual signal, they propose to apply SGNS to the word-sentence ID matrix. They show that this method can be seen as a generalization of the Dice Coefficient.

? (2016) propose a method that exploits the idea of using pivot languages, also tackled in previous work, e.g., ? (2010). The model requires parallel data between each language and a pivot language and is able to learn a shared embedding space for two languages without any direct alignment signal, as the alignment is implicitly learned via the pivot language. The model optimizes a correlation term with neural network encoders and decoders that is similar to the objective of the CCA-based approaches (?, ?). We discuss the importance of pivoting for learning multilingual word embeddings later in Section 10.

7.2 Sentence Alignment with Comparable Data

Grounding language in images

Similarly to approaches based on word-level aligned comparable data, methods that learn cross-lingual representations using sentence alignment with comparable data do so by associating sentences with images (?). The images are then used as pivots to induce a shared multimodal embedding space. These approaches typically use Multi30k (?), a multilingual extension of the Flickr30k dataset (?), which contains 30k images and 5 English sentence descriptions and their German translations for each image. ? (2017) represent images using features from a pre-trained CNN and model sentences using a GRU. They then use MMHL to assign a higher score to image-description pairs compared to images with a random description. Their loss is thus the following:

(44)

? (2017) augment this objective with another MMHL term that also brings the representations of translated descriptions closer together, thus effectively combining learning signals from both visual and textual modality:

(45)

8 Document-Level Alignment Models

Models that require parallel document alignment presuppose that sentence-level parallel alignment is also present. Such models thus reduce to parallel sentence-level alignment methods, which have been discussed in the previous section. Comparable document-level alignment, on the other hand, is more appealing as it is often cheaper to obtain. Existing approaches generally use Wikipedia documents, which they either align automatically or take as already theme-aligned documents discussing similar topics.

8.1 Document Alignment with Comparable Data

We divide models using document alignment with comparable data into three types, some of which employ similar general ideas to previously discussed word and sentence-level parallel alignment models:

  • Approaches based on pseudo-bilingual document-aligned corpora automatically construct a pseudo-bilingual corpus containing words from the source and target language by mixing words from aligned document pairs.

  • Concept-based methods leverage information about the distribution of latent topics or concepts in document-aligned data to represent words.

  • Extensions of sentence-aligned models extend methods using sentence-aligned parallel data to also work without parallel data.

Pseudo-bilingual document-aligned corpora

The approach of ? (2016) is similar to the pseudo-bilingual corpora approaches discussed in Section 6. In contrast to previous methods, they propose a Merge and Shuffle strategy to merge two aligned documents of different languages into a pseudo-bilingual document. This is done by concatenating the documents and then randomly shuffling them by permuting words. The intuition is that as most methods rely on learning word embeddings based on their context, shuffling the documents will lead to robust bilingual contexts for each word. As the shuffling step is completely random, it might lead to sub-optimal configurations.

For this reason, they propose another strategy for merging the two aligned documents, called Length-Ratio Shuffle. It assumes that the structures of the two documents are similar: words are inserted into the pseudo-bilingual document by alternating between the source and the target document, preserving the order in which they appear in their respective monolingual documents and respecting the monolingual documents' length ratio. The whole process can be seen in Figure 17.

Figure 17: The Length-Ratio Shuffle strategy (?)
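The following sketch interleaves two aligned documents roughly in proportion to their lengths, in the spirit of the Length-Ratio Shuffle strategy; the exact interleaving rule used in the original work may differ, and the example documents are toy data.

```python
def length_ratio_shuffle(doc_src, doc_tgt):
    """Merge two aligned documents into one pseudo-bilingual document,
    interleaving words in their original order proportionally to the
    documents' length ratio (a sketch of the general idea only)."""
    merged, i, j = [], 0, 0
    ratio = len(doc_src) / max(len(doc_tgt), 1)
    while i < len(doc_src) or j < len(doc_tgt):
        # Emit source words until the source "pace" falls behind the target's.
        if j >= len(doc_tgt) or (i < len(doc_src) and i < ratio * (j + 1)):
            merged.append(doc_src[i]); i += 1
        else:
            merged.append(doc_tgt[j]); j += 1
    return merged

print(length_ratio_shuffle("the cat sat on the mat".split(),
                           "die katze sass".split()))
```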
Concept-based models

Some methods for learning cross-lingual word embeddings leverage the insight that words in different languages are similar if they are used to talk about or evoke the same multilingual concepts or topics. ? (2013) base their method on the cognitive theory of semantic word responses. Their method centers on the intuition that words in the source and target language are similar if they are likely to generate similar words as their top semantic word responses. They utilise a probabilistic multilingual topic model, again trained on aligned Wikipedia documents, to learn and quantify semantic word responses. The embedding of a source word is then the following vector:

(46)

where the vector concatenates, for every response word, the probability of that word given the source word under the induced bilingual topic model. The sparse representations may be turned into dense vectors by factorizing the constructed word-response matrix.

? (2015) propose an approach that relies on the structure of Wikipedia itself. Their method is based on the intuition that similar words are used to describe the same concepts across different languages. Instead of representing every Wikipedia concept with the terms that are used to describe it, they invert the index and represent a word by the concepts it is used to describe. As a post-processing step, dimensionality reduction on the produced word representations in the word-concept matrix is performed. A very similar model by (?) uses a bilingual topic model to perform the dimensionality reduction step and learns a shared cross-lingual topical space.

Extensions of sentence-alignment models

? (2016) extend the approach of ? (2015) to also work without parallel data and adjust the regularization term based on the nature of the training corpus. In addition to regularizing the mean of the word vectors in a sentence to be close to the mean of the word vectors in the aligned sentence, similar to ? (2015) (the second term in the equation below), they also regularize the paragraph vectors of aligned sentences to be close to each other. The complete regularizer then uses elastic net regularization to combine both terms:

(47)

Their complete objective is thus:

(48)

where the monolingual loss is the objective of the paragraph vector extension of SGNS.

Approach Monolingual Regularizer Comment
Klementiev et al. (2012) Joint
Mikolov et al. (2013) Projection-based
Zou et al. (2013) Matrix factorization
Hermann and Blunsom (2013) Sentence-level, joint
Hermann and Blunsom (2014) Sentence-level + bigram composition
Soyer et al. (2015) Phrase-level
Shi et al. (2015) Matrix factorization
Dinu et al. (2015) Better neighbour retrieval
Gouws et al. (2015) Sentence-level
Vyas and Carpuat (2016) Sparse matrix factorization
Hauer et al. (2017) Cognates
Mogadala and Rettinger (2016) Elastic net, Procrustes analysis
Xing et al. (2015) Normalization, orthogonality
Zhang et al. (2016) Orthogonality constraint
Artetxe et al. (2016) Normalization, orthogonality, mean centering
Smith et al. (2017) Orthogonality, inverted softmax, identical character strings
Artetxe et al. (2017) Normalization, orthogonality, mean centering, bootstrapping
Lazaridou et al. (2015) Max-margin with intruders
Mrkšić et al. (2017) Semantic specialization
Calixto et al. (2017) Image-caption pairs
Gella et al. (2017) Image-caption pairs
Faruqui and Dyer (2014) -
Lu et al. (2015) Neural CCA
Rajendran et al. (2016) Pivots
Ammar et al. (2016) Multilingual CCA
Søgaard et al. (2015) - Inverted indexing
Levy et al. (2017)
Levy et al. (2017) - Inverted indexing
Lauly et al. (2013) Autoencoder
Chandar et al. (2014) Autoencoder
Vulić and Moens (2013a) Document-level
Vulić and Moens (2014) Document-level
Xiao and Guo (2014) Translation pairs assigned same embeddings
Gouws and Søgaard (2015) Pseudo-multi-lingual
Luong et al. (2015) Monotonic alignment
Gardner et al. (2015) Matrix factorization
Pham et al. (2015) Paragraph vectors
Guo et al. (2015) Weighted by word alignments
Coulmance et al. (2015) Sentence-level
Ammar et al. (2016) Pseudo-multi-lingual
Vulić and Korhonen (2016) Highly reliable seed entries
Duong et al. (2016) Pseudo-multi-lingual, polysemy
Vulić and Moens (2016) Pseudo-multilingual documents
Adams et al. (2017) Pseudo-multi-lingual, polysemy
Bergsma and Van Durme (2011) - - SIFT image features, similarity
Kiela et al. (2015) - - CNN image features, similarity
Vulić et al. (2016) - - CNN features, similarity, interpolation
Gouws and Søgaard (2015) POS-level Pseudo-multi-lingual
Duong et al. (2015) POS-level Pseudo-multi-lingual
Table 4: Overview of approaches with monolingual objectives and regularization terms.

To leverage data that is not sentence-aligned, but where an alignment is still present at the document level, they propose a two-step approach: They use Procrustes analysis (https://en.wikipedia.org/wiki/Procrustes_analysis), a method for statistical shape analysis, to find, for each document in one language, the most similar document in the other language. This is done by first learning monolingual representations of the documents in each language using paragraph vectors on each corpus. Subsequently, Procrustes analysis learns a transformation between the two vector spaces by translating, rotating, and scaling the embeddings in the first space until they most closely align with the document representations in the second space. In the second step, they then simply use the previously described method to learn cross-lingual word representations from the aligned documents, this time treating the entire documents as paragraphs.
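The rotation component of Procrustes analysis has a closed-form solution via the singular value decomposition, as the following sketch shows for two synthetic document-embedding spaces; translation and scaling, which the full analysis also handles, are omitted, and the data is invented.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes: the rotation W minimising ||X W - Y||_F.
    Assumes the rows of X and Y are paired (e.g. source/target document
    vectors believed to describe the same documents); a sketch only."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(4)
Y = rng.normal(size=(10, 5))                  # target-language document vectors
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # hidden rotation
X = Y @ R.T                                   # source vectors = rotated targets
W = procrustes(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))       # True: the two spaces are aligned
```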

As a final overview, we list all approaches with their monolingual objectives and regularization terms in Table 4. The table is meant to reveal the high-level objectives and losses each model is optimizing; it thus obscures smaller differences and implementation details, which can be found in the corresponding sections of this survey or by consulting the original papers. In the table, an infinitely strong regularizer enforces equality between representations, while other regularizers achieve this only in the limit: e.g., in the pseudo-bilingual case, where examples are randomly sampled based on some equivalence, we obtain the same representation in the limit without strictly enforcing it.

As we have demonstrated, most approaches can be seen as optimizing a combination of monolingual losses with a regularization term. Some approaches do not employ a regularization term at all; notably, a small number of approaches, i.e., those that ground language in images, do not optimize a loss but rather use pre-trained image features and a set of similarity heuristics to retrieve translations.

9 Evaluation

Having a wide array of cross-lingual models at our disposal is of little use if we are not able to evaluate them. In the following, we discuss the most common tasks that have been used to test cross-lingual embeddings and outline associated challenges. In addition, we also present resources that can be used for evaluating cross-lingual embeddings and the most important findings of two recent empirical benchmark studies.

9.1 Tasks

The first two widely used tasks are intrinsic evaluation tasks: they evaluate cross-lingual embeddings in a controlled in vitro setting that is geared towards revealing certain characteristics of the representations. The major downside of these tasks is that good performance on them does not necessarily generalize to good performance on downstream tasks (?, ?).

Word similarity

This task evaluates how well the notion of word similarity according to humans is emulated in the vector space. Multilingual word similarity datasets are extensions of datasets that have been used for evaluating English word representations. Many of these originate from psychology research and consist of word pairs – ranging from synonyms (e.g., car - automobile) to unrelated terms (e.g., noon - string) – that have been annotated with a relatedness score by human subjects. The most commonly used of these human judgement datasets are: a) the RG dataset (?); b) the MC dataset (?); c) the WordSim-353 dataset (?), a superset of MC; and d) the SimLex-999 dataset (?). Extending them to the multilingual setting then mainly involves translating the word pairs into different languages: WordSim-353 has been translated to Spanish, Romanian, and Arabic (?) and to German, Italian, and Russian (?); RG was translated to German (?), French (?), Spanish and Farsi (?); and SimLex-999 was translated to German, Italian and Russian (?) and to Hebrew and Croatian (?). Other prominent datasets for word embedding evaluation such as MEN (?), RareWords (?), and SimVerb-3500 (?) have only been used in monolingual contexts.

The SemEval 2017 task on cross-lingual and multilingual word similarity (?) has introduced cross-lingual word similarity datasets between five languages: English, German, Italian, Spanish, and Farsi, yielding 10 new datasets in total. Each cross-lingual dataset is of reasonable size, containing between 888 and 978 word pairs.

Cross-lingual embeddings are evaluated on these datasets by first computing the cosine similarity of the representations of the cross-lingual word pairs. Spearman's rank correlation coefficient (?) is then computed between the cosine similarity scores and the human judgement scores. Cross-lingual word similarity datasets are affected by the same problems as their monolingual variants (?): the annotated notion of word similarity is subjective and is often confused with relatedness; the datasets evaluate semantic rather than task-specific similarity, which would arguably be more useful; they do not have standardized splits; they correlate only weakly with performance on downstream tasks; past work rarely tests for statistical significance; and they do not account for polysemy, which is even more important in the cross-lingual setting.
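Concretely, the evaluation boils down to a cosine similarity per word pair followed by a Spearman correlation against the human scores, as in this sketch; the vectors and judgement scores are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy shared embedding space and human judgements for cross-lingual pairs.
emb = {"dog": np.array([1.0, 0.2]), "Hund": np.array([0.9, 0.3]),
       "car": np.array([0.1, 1.0]), "Auto": np.array([0.2, 0.9])}
pairs = [("dog", "Hund", 9.5), ("dog", "Auto", 1.5),
         ("car", "Auto", 9.0), ("car", "Hund", 2.0)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman's rho: {rho:.2f}")
```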

multiQVEC/multiQVEC+

multiQVEC+ is a multilingual extension of QVEC (?), a method that seeks to quantify the linguistic content of word embeddings by maximizing the correlation with a manually annotated linguistic resource. A semantic linguistic matrix is first constructed from a semantic database. The word embedding matrix is then aligned with the linguistic matrix and the correlation is measured using cumulative dimension-wise correlation. ? (2016) propose QVEC+, which computes the correlation using CCA, and extend QVEC to the multilingual setting (multiQVEC) by using supersense tag annotations in multiple languages to construct the linguistic matrix. While QVEC has been shown to correlate well with certain semantic downstream tasks, as an intrinsic evaluation task it can only approximate downstream performance.

The two following tasks are not intrinsic in the sense that they measure performance on tasks that are potentially of real-world importance. The tasks, however, are constrained to evaluating cross-lingual word embeddings as they rely on nearest neighbor search in the cross-lingual embedding space to identify the most similar target word given a source word.

Word alignment prediction

For word alignment prediction, each word in a given source language sentence is aligned with the most similar target language word from the target language sentence. If a source language word is out of vocabulary, it is not aligned with anything, whereas target language out-of-vocabulary words are given a default minimum similarity score and are never aligned to any candidate source language word in practice (?). The inverse of the alignment error rate (1-AER) (?) is typically used to measure performance, where higher scores mean better alignments. ? (2017) use alignment data from Hansards (https://www.isi.edu/natural-language/download/hansard/) and from four other sources (?, ?, ?, ?).

Bilingual dictionary induction

Bilingual dictionary induction is appealing as an evaluation task, as high-quality, freely available, wide-coverage manually constructed dictionaries are rare, especially for non-European languages. ? (2016) obtain evaluation sets for the task across 26 languages from the Open Multilingual WordNet (?), while ? (2017) obtain bilingual dictionaries from Wiktionary for Arabic, Finnish, Hebrew, Hungarian, and Turkish. Most previous work (?, ?, ?) filters source and target words based on part-of-speech, though this simplifies the task and introduces bias in the evaluation. Each cross-lingual embedding model is then evaluated on its ability to select the closest target language word to a given source language word as the translation of choice and measured based on precision-at-one (P@1). Note that in this setting, 100% is unattainable as many words have multiple translations.
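A minimal sketch of P@1-based evaluation follows, assuming the embeddings already live in a shared space and that a single gold translation is stored per source word; the vocabularies and vectors are toy data.

```python
import numpy as np

def p_at_1(src_emb, tgt_emb, tgt_words, gold):
    """Precision-at-one for bilingual dictionary induction: for each source
    word, pick the nearest target word by cosine similarity and check it
    against the gold translation."""
    T = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    hits = 0
    for src_word, vec in src_emb.items():
        sims = T @ (vec / np.linalg.norm(vec))
        if tgt_words[int(np.argmax(sims))] == gold[src_word]:
            hits += 1
    return hits / len(src_emb)

# Toy example in an already-shared space.
tgt_words = ["Hund", "Katze", "Auto"]
tgt_emb = np.array([[1.0, 0.1], [0.1, 1.0], [0.5, 0.5]])
src_emb = {"dog": np.array([0.9, 0.2]), "cat": np.array([0.2, 0.9])}
gold = {"dog": "Hund", "cat": "Katze"}
print(p_at_1(src_emb, tgt_emb, tgt_words, gold))  # 1.0
```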

A recent work of ? (2017) follows up on the work of ? (2017) and shows that bilingual dictionary induction can be framed as a classification task. The classification framework allows for the combination of heterogeneous translation evidence (e.g., spatio-temporal information, subword-level similarity, topical similarity) leading to improved results in the task. For instance, Heyman et al. (?) show that the combination of word embeddings with character-level orthographic features strongly boosts the results in expert domains such as biomedicine, but also for general-domain dictionary induction.

Features for cross-lingual transfer

Both word alignment prediction and bilingual dictionary induction rely on (constrained) nearest neighbor search in the cross-lingual word embedding graph based on computed similarity scores. However, cross-lingual word embeddings can also be used directly as features in NLP models. Such models are then defined for several languages and can be used to facilitate cross-lingual transfer. In other words, the main idea is to train a model on data from one language and then to apply it to another, relying on shared cross-lingual features. Extrinsic evaluation on such downstream tasks is often preferred, as it directly evaluates the usefulness of the cross-lingual embedding model for the respective task. We briefly describe the cross-lingual tasks that have been used to evaluate cross-lingual embeddings:

  • Document classification is the task of classifying documents with respect to topic, sentiment, relevance, etc. The task is commonly used following the setup of ? (2012): it uses the RCV2 Reuters multilingual corpus (http://trec.nist.gov/data/reuters/reuters.html). A document classifier is trained to predict topics on the document representations derived from word embeddings in the source language and then tested on the documents of the target language. Such representations typically do not take word order into account, and the standard embedding-based representation is to represent documents by the TF-IDF weighted average over the embeddings of the individual words, with an averaged perceptron model (or some other standard off-the-shelf classification model) acting as the document classifier (see the sketch after this list). The task is therefore by design slightly suboptimal for evaluation of word-level embeddings, since it only evaluates topical associations and only provides a signal for sets of co-occurring words, not for the individual words.

  • Dependency parsing was first evaluated in a cross-lingual setting by ? (2012), who employed cross-lingual similarity measures. Similar to cross-lingual document classification, a dependency parsing model is trained using the embeddings for a source language and is then evaluated on a target language. In the setup of ? (2015), a transition-based dependency parser with a non-linear activation function is trained on Universal Dependencies data (?), with the source-side embeddings as lexical features (https://github.com/jiangfeng1124/acl15-clnndep).

  • POS tagging is usually evaluated using the Universal Dependencies treebanks (?), as these are annotated with the same universal tag set. ? (2016) furthermore map proper nouns to nouns and map symbol markers (e.g. “-”, “/”) and interjections to an X tag, as it is hard and unnecessary to disambiguate them in a low-resource setting. ? (2017) use data from the CoNLL-X datasets of European languages (?), from CoNLL 2003 (http://www.cnts.ua.ac.be/conll2003/ner/), and from ? (2011), the latter of which is also used by ? (2015).

  • Named entity recognition (NER) is the task of tagging entities with their appropriate type in a text. ? (2013) perform NER experiments for English and Chinese on OntoNotes (?), while ? (2016) use English data from CoNLL 2003 (?) and Spanish and Dutch data from CoNLL 2002 (?).

  • Super-sense tagging is a task that involves annotating each significant entity in a text (e.g., nouns, verbs, adjectives and adverbs) within a general semantic taxonomy defined by the WordNet lexicographer classes (called super-senses). The cross-lingual variant of the task is used by ? (2015) for evaluating their embeddings. They use the English data from SemCor (http://web.eecs.umich.edu/~mihalcea/downloads.html#semcor) and publicly available Danish data (https://github.com/coastalcph/noda2015_sst).

  • Semantic parsing is the task of automatically identifying semantically salient targets in the text. Frame-semantic parsing, in particular, disambiguates the targets by assigning a sense (frame) to them, identifies their arguments, and labels the arguments with appropriate roles. ? (2015) create a frame-semantic parsing corpus that covers five topics, two domains (Wikipedia and Twitter), and nine languages and use it to evaluate cross-lingual word embeddings.

  • Discourse parsing is the task of segmenting text into elementary discourse units (mostly clauses), which are then recursively connected via discourse relations to form complex discourse units. The segmentation is usually done according to Rhetorical Structure Theory (RST) (?). ? (2017) and ? (2017) perform experiments using a diverse range of RST discourse treebanks for English, Portuguese, Spanish, German, Dutch, and Basque.

  • Dialog state tracking (DST) is the component of statistical task-oriented dialogue systems that keeps track of the belief state, that is, the system's internal distribution over the possible states of the dialogue. A recent state-of-the-art DST model of ? (2017) is based exclusively on word embeddings fed into the model as its input. This property of the model enables a straightforward adaptation to cross-lingual settings by simply replacing input monolingual word embeddings with cross-lingual embeddings. Although DST is still an under-explored task, we believe that it serves as a useful proxy task which shows the capability of induced word embeddings to support more complex language understanding tasks. ? (2017) use DST for evaluating cross-lingual embeddings on the Multilingual WOZ 2.0 dataset (?), available in English, German, and Italian. Their results suggest that cross-lingual word embeddings boost the construction of dialog state trackers in German and Italian even without any German and Italian training data, as the model is able to also exploit English training data through the embedding space. Further, a multilingual DST model which uses training data from all three languages combined with a multilingual embedding space improves tracking performance in all three languages.

  • Entity linking or wikification is another task tackled using cross-lingual word embeddings (?). The purpose of the task is to ground mentions written in non-English documents to entries in the English Wikipedia, facilitating the exploration and understanding of foreign texts without full-fledged translation systems (?). Such wikifiers, i.e., entity linkers are a valuable component of several NLP and IR tasks across different domains (?, ?).

  • Sentiment analysis is the task of determining the sentiment polarity (e.g. positive and negative) of a text. ? (2016) evaluate their embeddings on the multilingual Amazon product review dataset of ? (2010).

  • Machine translation is used to translate entire texts into other languages. This is in contrast to bilingual dictionary induction, which focuses on the translation of individual words. ? (2013) used phrase-based machine translation to evaluate their embeddings. Cross-lingual embeddings are incorporated into the phrase-based MT system by adding them as a feature to bilingual phrase pairs. For each phrase, its word embeddings are averaged to obtain a feature vector.

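As an illustration of the document classification setup referenced above, the following sketch trains a standard off-the-shelf classifier on averaged embeddings of English documents and applies it directly to German documents. The toy "shared" space is simulated by adding noise to the English vectors, TF-IDF weighting is omitted, and logistic regression stands in for the averaged perceptron.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
d = 4
# Toy English embeddings; German vectors are noisy copies of their translations,
# standing in for a learned shared bilingual space.
base = {w: rng.normal(size=d) for w in ["football", "match", "bank", "market"]}
translations = {"Fussball": "football", "Spiel": "match",
                "Bank": "bank", "Markt": "market"}
emb = dict(base)
for de, en in translations.items():
    emb[de] = base[en] + 0.05 * rng.normal(size=d)

def doc_vec(doc):
    """Average of word embeddings; TF-IDF weighting is omitted for brevity."""
    return np.mean([emb[w] for w in doc if w in emb], axis=0)

# Train on English documents ...
X_train = np.stack([doc_vec(["football", "match"]), doc_vec(["bank", "market"])])
y_train = ["sports", "finance"]
clf = LogisticRegression().fit(X_train, y_train)

# ... and test directly on German documents (cross-lingual transfer).
X_test = np.stack([doc_vec(["Fussball", "Spiel"]), doc_vec(["Bank", "Markt"])])
print(clf.predict(X_test))   # expected: ['sports' 'finance']
```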
Information retrieval

Word embeddings in general and cross-lingual word embeddings in particular have naturally found application beyond core NLP tasks. They also offer support to information retrieval (IR) (?, ?, inter alia), serving as useful features which can link the semantics of the query to the semantics of the target document collection, even when query terms are not explicitly mentioned in the relevant documents (e.g., the query can talk about cars while a relevant document may contain the near-synonym automobile). A shared cross-lingual embedding space provides the means for more general cross-lingual and multilingual IR models without any substantial change in the algorithmic design of the retrieval process (?). Semantic similarity between query and document representations, obtained through the composition process as in the document classification task, is computed in the shared space irrespective of their actual languages: the similarity score may be used as a measure of document relevance to the information need formulated in the issued query.

Multi-modal and cognitive approaches to evaluation

Evaluation of monolingual word embeddings is a controversial topic. Monolingual word embeddings are useful downstream (?), but in order to argue that one set of embeddings is better than another, we would like a robust evaluation metric. Metrics have been proposed based on co-occurrences (perplexity or word error rate), based on ability to discriminate between contexts (e.g., topic classification), and based on lexical semantics (predicting links in lexical knowledge bases). ? (2016) argues that such metrics are not valid, because co-occurrences, contexts, and lexical knowledge bases are also used to induce word embeddings, and that downstream evaluation is the best way to evaluate word embeddings. The only task-independent evaluation of embeddings that is reasonable, he claims, is to evaluate word embeddings by how well they predict behavioral observations, e.g. gaze or fMRI data.

For cross-lingual word embeddings, it is easier to come up with valid metrics, e.g., precision in word alignment and bilingual dictionary induction. Note that these metrics only evaluate cross-lingual neighbors, not whether monolingual distances between words reflect synonymy relations. In other words, a random pairing of translation equivalents in vector space would score perfect precision in bilingual dictionary induction tasks. In addition, if we intend to evaluate the ability of cross-lingual word embeddings to allow for generalizations within languages, we inherit the problem of finding valid metrics from monolingual word representation learning.

9.2 Resources and Benchmarks

Resources

In light of the plethora of both intrinsic and extrinsic evaluation tasks and datasets, a rigorous evaluation of cross-lingual embeddings across many benchmark datasets can often be cumbersome and practically infeasible. To the best of our knowledge, there are two resources available which facilitate comparison of cross-lingual embedding models: ? propose wordvectors.org (http://wordvectors.org/), a website for evaluating word representations, which allows the upload and evaluation of learned word embeddings. The website, however, focuses mainly on evaluating monolingual word representations and only evaluates them on word similarity datasets.

The second resource is by ? (2016), who make a website available (http://128.2.220.95/multilingual) where monolingual and cross-lingual word representations can be uploaded and automatically evaluated on some of the tasks we discussed. In particular, their evaluation suite includes word similarity, multiQVEC, bilingual dictionary induction, document classification, and dependency parsing. As a good practice in general, we recommend evaluating cross-lingual word embeddings on an intrinsic task that is cheap to compute and on at least one downstream NLP task besides document classification.

Benchmark studies

To conclude this section, we summarize the findings of two recent benchmark studies of cross-lingual embeddings: ? (2016) evaluate cross-lingual embedding models that require different forms of supervision on various tasks. They find that on word similarity datasets, models with cheaper supervision (sentence-aligned and document-aligned data) are almost as good as models with more expensive supervision in the form of word alignments. For cross-lingual classification and bilingual dictionary induction, more informative supervision is more beneficial: word-alignment and sentence-level models score better. Finally, for dependency parsing, models with word-level alignment are able to capture syntax more accurately and thus perform better overall. The findings by ? strengthen our hypothesis that the choice of the data is more important than the choice of the algorithm learning from that data.

? (2017) evaluate cross-lingual word embedding models on bilingual dictionary induction and word alignment. Similarly to our hypothesis, they argue that whether or not an algorithm uses a particular feature set is more important than the choice of the algorithm. In their experiments, they achieve the best results using sentence IDs as features to represent words, which outperforms using word-level source and target co-occurrence information. These findings lend further evidence and credibility to our typology that is based on the data requirements of cross-lingual embedding models. Models that learn from word-level and sentence-level information typically outperform other approaches, especially for finer-grained tasks such as bilingual dictionary induction. These studies furthermore raise awareness that we should not only focus on developing better cross-lingual embedding models, but also work on unlocking new data sources and new ways to leverage comparable data, particularly for languages and domains with only limited amounts of parallel training data.

10 From Bilingual to Multilingual Training

So far, for the sake of simplicity and brevity of presentation, we have focused on models which induce cross-lingual word embeddings in a shared space comprising only two languages. This standard bilingual setup is also the focus of almost all research in the field of cross-lingual embedding learning. However, notable exceptions such as the recent work of ? (2017) and ? (2017) demonstrate that there are clear benefits to extending the learning process from bilingual to multilingual settings, with improved results reported on standard tasks such as word similarity prediction, bilingual dictionary induction, document classification and dependency parsing.

The usefulness of multilingual training for NLP is already discussed by, e.g., ? (2009) and ? (2010). They corroborate the hypothesis that “variations in ambiguity” may be used as a form of naturally occurring supervision. Put simply, what one language leaves implicit, another defines explicitly, and the target language is thus useful for resolving ambiguity in the source language (?). While this already holds for bilingual settings, using additional languages introduces additional supervision signals, which in turn leads to better word embedding estimates (?).

In most of the literature focused on bilingual settings, English is typically one of the two languages, owing to the wealth of monolingual resources available for English as well as bilingual resources pairing English with many other languages. However, one would ideally want to also exploit cross-lingual links between other language pairs, reaching beyond English. For instance, typologically/structurally similar languages such as Finnish and Estonian are excellent candidates for transfer learning. Yet only a few readily available parallel resources exist between Finnish and Estonian that could facilitate the direct induction of a shared bilingual embedding space for these two languages.

A multilingual embedding model which maps Finnish and Estonian to the same embedding space shared with English (i.e., English is used as a resource-rich pivot language) would also enable exploring and utilizing links between Finnish and Estonian lexical items in the space (?). Further, multilingual shared embedding spaces enable multi-source learning and multi-source transfers: this results in a more general model and is less prone to data sparseness (?, ?, ?, ?, ?).

(a) Starting spaces: monolingual
(b) Starting spaces: bilingual
Figure 18: Learning shared multilingual embedding spaces via linear mapping. (a) Starting from monolingual spaces in several languages, one linearly maps each of them into one chosen pivot monolingual space (typically English); (b) starting from bilingual spaces sharing a language (typically English), one learns mappings from the English subspaces of all other bilingual spaces into one chosen pivot English subspace and then applies these mappings to the remaining subspaces.

The purpose of this section is not to demonstrate the multilingual extension of every single bilingual model discussed in previous sections, as these extensions are not always straightforward and include additional modeling work. However, we will briefly present and discuss several best practices and multilingual embedding models already available in the literature, again following the typology of models established in Table 2.

10.1 Multilingual Word Embeddings from Word-Level Information

Mapping-based approaches

Mapping different monolingual spaces into a single shared multilingual space is achieved by: (1) selecting one space as the pivot space, and then (2) mapping the remaining spaces into this pivot space. This approach, illustrated in Figure 18(a), requires a monolingual space for each language and a seed translation dictionary between each language and the pivot. Labeling one language as the pivot, we can formulate the induction of a multilingual embedding space as follows:

(49)

This means that through pivoting one is able to induce a shared bilingual space for a language pair without having any directly usable bilingual resources for that pair. Exactly this multilingual mapping procedure (based on minimizing mean squared errors) has been constructed by ? (2017): English is naturally selected as the pivot, and 89 other languages are then mapped into the pivot English space. Seed translation pairs are obtained through the Google Translate API by translating the 5,000 most frequent words in each language to English. The recent work of, e.g., ? (2017) holds promise that seed lexicons of similar sizes may also be bootstrapped for resource-lean languages from very small seed lexicons (see again Section 6). ? use the original fastText vectors available in 90 languages (?) (the latest release of fastText vectors covers 204 languages; they are available at https://github.com/facebookresearch/fastText) and effectively construct a multilingual embedding space spanning 90 languages (i.e., 4,005 language pairs using only 89 seed translation dictionaries) in their software and experiments (https://github.com/Babylonpartners/fastText_multilingual). The distances in all monolingual spaces remain preserved by constraining the transformations to be orthogonal.
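The pivoting idea can be sketched as follows: two languages are independently mapped into the English space using seed dictionaries with English only, after which translation pairs between the two non-English languages coincide without any direct dictionary between them. The orthogonal Procrustes solution is used here as a stand-in for the mean-squared-error mapping with an orthogonality constraint; all data is synthetic.

```python
import numpy as np

def procrustes_map(X_src, X_pivot):
    """Orthogonal map W with X_src @ W ~ X_pivot, learned from seed pairs."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_pivot)
    return U @ Vt

rng = np.random.default_rng(6)
d, n_seed = 5, 50
EN = rng.normal(size=(n_seed, d))               # pivot (English) seed vectors
R_fi, _ = np.linalg.qr(rng.normal(size=(d, d))) # hidden "Finnish" rotation
R_et, _ = np.linalg.qr(rng.normal(size=(d, d))) # hidden "Estonian" rotation
FI, ET = EN @ R_fi, EN @ R_et                   # seed translations of the same words

W_fi = procrustes_map(FI, EN)                   # Finnish -> English map
W_et = procrustes_map(ET, EN)                   # Estonian -> English map

# No Finnish-Estonian dictionary was used, yet mapped vectors of translation
# pairs now coincide in the shared (English-pivot) space.
print(np.allclose(FI @ W_fi, ET @ W_et, atol=1e-8))
```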

Figure 19: Illustration of the joint multilingual model of Duong et al. (2017) based on the modified CBOW objective; instead of predicting only the English word given the English context, the model also tries to predict its translations in all the remaining languages (i.e., in languages for which the translations exist in any of the input bilingual lexicons).

Along the same lines, ? (2016) introduce a multilingual extension of the CCA-based mapping approach. They extend bilingual CCA projection to the multilingual setting by again using the English embedding space as the pivot for multiple bilingual CCA projections between English and each of the remaining languages.

As demonstrated by Smith et al. (?), the multilingual space now enables reasoning for language pairs not represented in the seed lexicon data. They verify this hypothesis by examining the bilingual lexicon induction task for all language pairs: e.g., BLI scores for Spanish-Catalan without any seed Spanish-Catalan lexicon are 0.82, while the average score for Spanish-English and Catalan-English bilingual spaces is 0.70. Other striking findings include scores for Russian-Ukrainian (0.84 vs. 0.59), Czech-Slovak (0.82 vs. 0.59), Serbian-Croatian (0.78 vs. 0.56), or Danish-Norwegian (0.73 vs. 0.67).

A similar approach to constructing a multilingual embedding space is discussed by Duong et al. (?). However, their mapping approach is tailored for another scenario frequently encountered in practice: one has to align bilingual embedding spaces where English is one of the two languages in each bilingual space. In other words, the starting embedding spaces are now not monolingual as in the previous mapping approach, but bilingual. An overview of the approach is given in Figure 18(b). This approach first selects a pivot bilingual space (e.g., the EN-IT space in Figure 18(b)), and then learns a linear mapping/transformation from the English subspace of all other bilingual spaces into the pivot English subspace. The learned linear mapping is then applied to the other subspaces (i.e., “foreign” subspaces such as FI, FR, NL, or RU in Figure 18(b)) to transform them into the shared multilingual space.

Figure 20: A multilingual extension of the sentence-level TransGram model of Coulmance et al. (2015). Since the model bypasses the word alignment step in its SGNS-style objective, for each center word (e.g., the EN word cat in this example) the model predicts all words in each sentence (from all other languages) which is aligned to the current sentence (e.g., the EN sentence the cat sat on the mat).
Pseudo-bilingual and joint approaches

The two other sub-groups of word-level models also assume the existence of monolingual data plus multiple bilingual dictionaries covering translations of the same term in multiple languages. The main idea behind pseudo-multilingual approaches is to “corrupt” the monolingual data available for each of the languages so that words from all languages appear as context words for every center word in all monolingual corpora. A standard monolingual word embedding model (e.g., CBOW or SGNS) is then used to induce a shared multilingual space. First, for each word in each vocabulary one collects all translations of that word in all other languages. The sets of translations may be incomplete, as they depend on their presence in the input dictionaries. Following that, we use all monolingual corpora in all languages and proceed as in the original model of ? (2015): (i) for each word for which a set of translations exists, we flip a coin and decide whether to retain the original word or to substitute it with one of its translations; (ii) in case we have to perform a substitution, we choose one translation uniformly at random from the set. In the limit, this method yields “hybrid” pseudo-multilingual sentences with each word surrounded by words from different languages. Despite its obvious simplicity, the only work that generalizes pseudo-bilingual approaches to the multilingual setting that we are aware of is that of ? (2016), who replace all tokens in monolingual corpora with their corresponding translation cluster ID, thereby restricting them to have the same representation. Note again that we do not need lexicons for all language pairs if one resource-rich language (e.g., English) is selected as the pivot language.
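A sketch of this pseudo-multilingual corpus construction with a toy dictionary and a coin-flip substitution is given below; the keep probability and the data are illustrative assumptions.

```python
import random

random.seed(0)

# Toy multilingual dictionary: word -> translations in other languages.
translations = {
    "cat": ["Katze", "chat"],
    "sat": ["sass"],
    "mat": ["Matte", "tapis"],
}

def pseudo_multilingual(corpus, translations, keep_prob=0.5):
    """Randomly substitute words with one of their translations (chosen
    uniformly), yielding 'hybrid' sentences with mixed-language contexts."""
    out = []
    for sentence in corpus:
        new = []
        for w in sentence:
            if w in translations and random.random() > keep_prob:
                new.append(random.choice(translations[w]))
            else:
                new.append(w)
        out.append(new)
    return out

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
print(pseudo_multilingual(corpus, translations))
```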

Joint multilingual models rely on exactly the same input data (i.e., monolingual data plus multiple bilingual dictionaries), and the core idea is again to exploit multilingual word contexts. An extension of the joint modeling paradigm to multilingual settings, illustrated in Figure 19, is discussed by Duong et al. (?). The core model is an extension of their bilingual model (?) based on the CBOW-style objective: in the multilingual scenario, for each language the training procedure consists of predicting the center word given its monolingual context, plus predicting all translations of the center word in all other languages, subject to their presence in the input bilingual dictionaries. Note that effectively this MultiCBOW model may be seen as a combination of multiple monolingual and cross-lingual CBOW-style sub-objectives as follows:

(50)

where the cross-lingual part of the objective again serves as the cross-lingual regularizer. By replacing the CBOW-style objective with the SGNS objective, the model described by Equation (50) effectively turns into a straightforward multilingual extension of the bilingual BiSkip model (?). Exactly this model, called MultiSkip, is described in the work of Ammar et al. (?). Instead of summing the contexts around the center word as in CBOW, the model now tries to predict the surrounding context words of the center word in its own language, plus its translations and all surrounding context words of its translations in all other languages. Translations are again obtained from input dictionaries or extracted from word alignments as in the original BiSkip and MultiSkip models. The pivot language idea is also applicable to the MultiSkip and MultiCBOW models.
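The structure of the MultiCBOW-style objective can be sketched as a set of prediction tasks: each centre word is predicted from its monolingual context, and so is every available translation of that word. The dictionary, window size, and sentence below are toy assumptions; the real models score these pairs with CBOW/SGNS losses rather than merely enumerating them.

```python
# Toy dictionary: word -> {language: translation}.
translations = {"cat": {"de": "Katze", "fr": "chat"}, "mat": {"de": "Matte"}}

def multicbow_tasks(sentence, window=2):
    """Enumerate (context, target) prediction tasks for one sentence."""
    tasks = []
    for i, w in enumerate(sentence):
        ctx = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        tasks.append((ctx, w))                      # monolingual CBOW term
        for lang, t in translations.get(w, {}).items():
            tasks.append((ctx, t))                  # cross-lingual term
    return tasks

for ctx, target in multicbow_tasks(["the", "cat", "sat", "on", "the", "mat"]):
    print(ctx, "->", target)
```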

10.2 Multilingual Word Embeddings from Sentence-Level and Document-Level Information

Extending bilingual embedding models which learn on the basis of aligned sentences and documents closely follows the principles already established for word-level models in Section 10.1. For instance, the multilingual extension of the TransGram model from Coulmance et al. (?) may be seen as MultiSkip without word alignment information (see again Table 3). In other words, instead of predicting only words in the neighborhood of the word aligned to the center word, TransGram predicts all words in the sentences aligned to the sentence which contains the current center word (i.e., the model assumes uniform word alignment). This idea is illustrated by Figure 20. English is again used as the pivot language to reduce bilingual data requirements.

The same idea of pivoting, that is, learning multiple bilingual spaces linked through the shared pivot English space, is directly applicable to other prominent bilingual word embedding models such as those of (?), (?), (?), (?), (?).

The document-level model of ? (2016) may be extended to the multilingual setting using the same idea as in previously discussed word-level pseudo-multilingual approaches. ? (2015) and ? (2017) exploit the structure of a multi-comparable Wikipedia dataset and a multi-parallel Bible dataset respectively to directly build sparse cross-lingual representations using the same set of shared indices (i.e., the former uses the indices of aligned Wikipedia articles while the latter relies on the indices of aligned sentences in the multi-parallel corpus). Dense word embeddings are then obtained by factorizing the multilingual word-concept matrix containing all words from all vocabularies.

11 General Challenges and Future Directions

Subword-level information

In morphologically rich languages, words can have complex internal structures, and some word forms can be rare. For such languages, it makes sense to compose representations from representations of lemmas and morphemes. Neural network models increasingly leverage subword-level information (?, ?) and character-based input has been found useful for sharing knowledge in multilingual scenarios (?, ?). Subword-level information has also been used for learning word representations (?, ?) but has so far not been incorporated in learning cross-lingual word representations.

Multi-word expressions

Just like words can be too coarse units for representation learning in morphologically rich languages, words also combine in non-compositional ways to form multi-word expressions such as ad hoc or kick the bucket, the meaning of which cannot be derived from standard representations of their constituents. Dealing with multi-word expressions remains a challenge for monolingual applications and has only received scarce attention in the cross-lingual setting.

Function words

Models for learning cross-lingual representations share weaknesses with other vector space models of language: while they are very good at modeling the conceptual aspect of meaning evaluated in word similarity tasks, they fail to properly model the functional aspect of meaning, e.g. to distinguish whether one remarks “Give me a pencil” or “Give me that pencil”. Modeling the functional aspect of language is of particular importance in scenarios such as dialogue, where the pragmatics of language must be taken into account.

Polysemy

While conflating multiple senses of a word is already problematic for learning monolingual word representations, this issue is amplified in a cross-lingual embedding space: if polysemy leads to bad word embeddings in the source language and bad word embeddings in the target language, we can derive false nearest neighbors from our cross-lingual embeddings. While recent work on learning cross-lingual multi-sense embeddings (?) is extremely interesting, it is still an open question whether modern NLP models can infer from context what they need in order to resolve lexical ambiguities.

Embeddings for specialized domains

There are many domains for which cross-lingual applications would be particularly useful, such as bioinformatics or social media. However, parallel data is scarce in many such domains, as well as for low-resource languages. Creating robust cross-lingual word representations with as few parallel examples as possible is thus an important research avenue. An important related direction is to leverage comparable corpora, which are often more plentiful, and to incorporate other signals, such as from multi-modal contexts.

For many domains or tasks, we also might want not only word embeddings, but the ability to compose those representations into accurate sentence and document representations. Besides existing methods that sum word embeddings, not much work has been done on learning better higher-level cross-lingual representations.

Feasibility

Learning a general shared vector space for words that reliably captures inter-language and intra-language relations may seem slightly optimistic. Languages are very different, and it is not clear whether there is even a definition of words that makes words commensurable across languages. Note that while this is related to whether it is possible to translate between the world's languages in the first place, the possibility of translation (at the document level) does not necessarily entail that it is possible to devise embeddings such that translation equivalents in two languages end up as nearest neighbors.

There is also the question of the computational complexity of finding an embedding that obeys all our inter-lingual and intra-lingual constraints, for example translation equivalence and synonymy. Currently, many approaches to cross-lingual word embeddings, as shown in this survey, minimize a loss that penalizes models for violating such constraints, but there is no guarantee that the final model satisfies all constraints.

Checking whether all such constraints are satisfied in a given model is trivially done in time linear in the number of constraints, but finding out whether such a model exists is much harder. While the problem’s decidability follows from the decidability of two-variable first order logic with equivalence/symmetry closure, determining whether such a graph exists is in fact NP-hard (?).

12 Conclusion

This survey has focused on providing an overview of cross-lingual word embedding models. It has introduced standardized notation and a typology that demonstrated the similarity of many of these models. It provided proofs that connect different word-level embedding models and has described ways to evaluate cross-lingual word embeddings as well as how to extend them to the multilingual setting. It finally outlined challenges and future directions.

References

  • Adams et al. Adams, O., Makarucha, A., Neubig, G., Bird, S., and Cohn, T. (2017). Cross-lingual word embeddings for low-resource language modeling.  In Proceedings of EACL, pp. 937–947.
  • Agić et al. Agić, Z., Johannsen, A., Plank, B., Alonso, H. M., Schluter, N., and Søgaard, A. (2016). Multilingual projection for parsing truly low-resource languages.  Transactions of the ACL, 4, 301–312.
  • Ammar et al. Ammar, W., Mulcaire, G., Ballesteros, M., Dyer, C., and Smith, N. A. (2016a). Many languages, one parser.  Transactions of the ACL, 4, 431–444.
  • Ammar et al. Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., and Smith, N. A. (2016b). Massively Multilingual Word Embeddings.  CoRR, abs/1602.01925.
  • Aone and McKee Aone, G., and McKee, D. (1993). A language-independent anaphora resolution system for understanding multilingual texts.  In Proceedings of ACL, pp. 156–163.
  • Artetxe et al. Artetxe, M., Labaka, G., and Agirre, E. (2016). Learning principled bilingual mappings of word embeddings while preserving monolingual invariance.  In Proceedings of EMNLP, pp. 2289–2294.
  • Artetxe et al. Artetxe, M., Labaka, G., and Agirre, E. (2017). Learning bilingual word embeddings with (almost) no bilingual data.  In Proceedings of ACL, pp. 451–462.
  • Bergsma and Van Durme Bergsma, S., and Van Durme, B. (2011). Learning bilingual lexicons using the visual similarity of labeled Web images.  In Proceedings of IJCAI, pp. 1764–1769.
  • Bhatia et al. Bhatia, P., Guthrie, R., and Eisenstein, J. (2016). Morphological priors for probabilistic neural word embeddings.  In Proceedings of EMNLP.
  • Blei et al. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation.  Journal of Machine Learning Research, 3, 993–1022.
  • Bojanowski et al. Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information.  Transactions of the ACL, 5, 135–146.
  • Bond and Foster Bond, F., and Foster, R. (2013). Linking and extending an Open Multilingual WordNet.  In Proceedings ACL, pp. 1352–1362.
  • Boyd-Graber et al. Boyd-Graber, J., Hu, Y., and Mimno, D. (2017). Applications of Topic Models, Vol. 11 of Foundations and Trends in Information Retrieval.
  • Boyd-Graber and Resnik Boyd-Graber, J., and Resnik, P. (2010). Holistic sentiment analysis across languages: Multilingual supervised Latent Dirichlet Allocation.  In Proceedings of EMNLP, pp. 45–55.
  • Boyd-Graber and Blei Boyd-Graber, J. L., and Blei, D. M. (2009). Multilingual topic models for unaligned text.  In Proceedings of UAI, pp. 75–82.
  • Braud et al. Braud, C., Coavoux, M., and Søgaard, A. (2017a). Cross-lingual RST discourse parsing.  In Proceedings EACL, pp. 292–304.
  • Braud et al. Braud, C., Lacroix, O., and Søgaard, A. (2017b). Cross-lingual and cross-domain discourse segmentation of entire documents.  In Proceedings of ACL, pp. 237–243.
  • Bruni et al. Bruni, E., Tran, N.-K., and Baroni, M. (2014). Multimodal distributional semantics.  Journal of Artificial Intelligence Research, 49, 1–47.
  • Buchholz and Marsi Buchholz, S., and Marsi, E. (2006). CoNLL-X Shared task on multilingual dependency parsing.  In Proceedings of CoNLL, pp. 149–164.
  • Calixto et al. Calixto, I., Liu, Q., and Campbell, N. (2017). Multilingual Multi-modal Embeddings for Natural Language Processing.  CoRR, abs/1702.01101.
  • Camacho-Collados et al. Camacho-Collados, J., Pilehvar, M. T., Collier, N., and Navigli, R. (2017). SemEval-2017 Task 2: Multilingual and cross-lingual semantic word similarity.  In Proceedings of SEMEVAL, pp. 15–26.
  • Camacho-Collados et al. Camacho-Collados, J., Pilehvar, M. T., and Navigli, R. (2015). A framework for the construction of monolingual and cross-lingual word similarity datasets.  In Proceedings of ACL, pp. 1–7.
  • Cavallanti et al. Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. (2010). Linear algorithms for online multitask classification.  Journal of Machine Learning Research, 11, 2901–2934.
  • Chandar et al. Chandar, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations.  In Proceedings of NIPS, pp. 1853–1861.
  • Cheng and Roth Cheng, X., and Roth, D. (2013). Relational inference for wikification.  In Proceedings of EMNLP, pp. 1787–1796.
  • Church and Hanks Church, K. W., and Hanks, P. (1990). Word association norms, mutual information, and lexicography.  Computational Linguistics, 16(1), 22–29.
  • Collobert and Weston Collobert, R., and Weston, J. (2008). A unified architecture for Natural Language Processing.  In Proceedings of ICML, pp. 160–167.
  • Coulmance et al. Coulmance, J., Marty, J.-M., Wenzek, G., and Benhalloum, A. (2015). Trans-gram, fast cross-lingual word-embeddings.  In Proceedings of EMNLP, pp. 1109–1113.
  • Das and Petrov Das, D., and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections.  In Proceedings of ACL, pp. 600–609.
  • De Smet et al. De Smet, W., Tang, J., and Moens, M. (2011). Knowledge transfer across multilingual corpora via latent topics.  In Proceedings of PAKDD, pp. 549–560.
  • Deerwester et al. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis.  Journal of the American Society for Information Science, 41(6), 391–407.
  • Dehouck and Denis Dehouck, M., and Denis, P. (2017). Delexicalized word embeddings for cross-lingual dependency parsing.  In Proceedings of EACL, pp. 241–250.
  • Dinu et al. Dinu, G., Lazaridou, A., and Baroni, M. (2015). Improving zero-shot learning by mitigating the hubness problem.  In Proceedings of ICLR (Workshop Track).
  • Duong et al. Duong, L., Cohn, T., Bird, S., and Cook, P. (2015). Cross-lingual transfer for unsupervised dependency parsing without parallel data.  In Proceedings of CoNLL, pp. 113–122.
  • Duong et al. Duong, L., Kanayama, H., Ma, T., Bird, S., and Cohn, T. (2016). Learning crosslingual word embeddings without bilingual corpora.  In Proceedings of EMNLP, pp. 1285–1295.
  • Duong et al. Duong, L., Kanayama, H., Ma, T., Bird, S., and Cohn, T. (2017). Multilingual training of crosslingual word embeddings.  In Proceedings of EACL, pp. 894–904.
  • Dyer et al. Dyer, C., Chahuneau, V., and Smith, N. (2013). A simple, fast, and effective parameterization of IBM Model 2.  In Proceedings of NAACL-HLT, pp. 644–649.
  • Eades and Whitesides Eades, P., and Whitesides, S. (1995). Nearest neighbour graph realizability is np-hard.  In Latin American Symposium on Theoretical Informatics, pp. 245–256. Springer.
  • Elliott et al. Elliott, D., Frank, S., Sima’an, K., and Specia, L. (2016). Multi30k: Multilingual English-German image descriptions.  In Proceedings of the 5th Workshop on Vision and Language, pp. 70–74.
  • Elliott and Kádár Elliott, D., and Kádár, Á. (2017). Imagination improves multimodal translation.  CoRR, abs/1705.04350.
  • Fang and Cohn Fang, M., and Cohn, T. (2017). Model transfer for tagging low-resource languages using a bilingual dictionary.  In Proceedings of ACL, pp. 587–593.
  • Faruqui et al. Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015). Retrofitting word vectors to semantic lexicons.  In Proceedings of NAACL-HLT, pp. 1606–1615.
  • Faruqui and Dyer Faruqui, M., and Dyer, C. (2013). An information theoretic approach to bilingual word clustering.  In Proceedings of ACL, pp. 777–783.
  • Faruqui and Dyer Faruqui, M., and Dyer, C. (2014a). Community evaluation and exchange of word vectors at wordvectors.org.  In Proceedings of ACL: System Demonstrations, pp. 19–24.
  • Faruqui and Dyer Faruqui, M., and Dyer, C. (2014b). Improving vector space word representations using multilingual correlation.  In Proceedings of EACL, pp. 462–471.
  • Faruqui and Kumar Faruqui, M., and Kumar, S. (2015). Multilingual open relation extraction using cross-lingual projection.  In Proceedings of NAACL-HLT, pp. 1351–1356.
  • Faruqui et al. Faruqui, M., Tsvetkov, Y., Rastogi, P., and Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks.  In Proceedings of REPEVAL, pp. 30–35.
  • Finkelstein et al. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing search in context: The concept revisited.  ACM Transactions on Information Systems, 20(1), 116–131.
  • Firat et al. Firat, O., Cho, K., and Bengio, Y. (2016). Multi-way, multilingual neural machine translation with a shared attention mechanism.  In Proceedings of NAACL-HLT, pp. 866–875.
  • Fukumasu et al. Fukumasu, K., Eguchi, K., and Xing, E. P. (2012). Symmetric correspondence topic models for multilingual text analysis.  In Proceedings of NIPS, pp. 1295–1303.
  • Gabrilovich and Markovitch Gabrilovich, E., and Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge.  In Proceedings of AAAI, pp. 1301–1306.
  • Gardner et al. Gardner, M., Huang, K., Papalexakis, E., Fu, X., Talukdar, P., Faloutsos, C., Sidiropoulos, N., and Mitchell, T. (2015). Translation invariant word embeddings.  In Proceedings of EMNLP, pp. 1084–1088.
  • Gaussier et al. Gaussier, É., Renders, J.-M., Matveeva, I., Goutte, C., and Déjean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora.  In Proceedings of ACL, pp. 526–533.
  • Gella et al. Gella, S., Sennrich, R., Keller, F., and Lapata, M. (2017). Image pivoting for learning multilingual multimodal representations.  In Proceedings of EMNLP, pp. 2829–2835.
  • Gerz et al. Gerz, D., Vulić, I., Hill, F., Reichart, R., and Korhonen, A. (2016). SimVerb-3500: A large-scale evaluation set of verb similarity.  In Proceedings of EMNLP, pp. 2173–2182.
  • Gillick et al. Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. (2016). Multilingual language processing from bytes.  In Proceedings of NAACL-HLT, pp. 1296–1306.
  • Gouws et al. Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments.  In Proceedings of ICML, pp. 748–756.
  • Gouws and Søgaard Gouws, S., and Søgaard, A. (2015). Simple task-specific bilingual word embeddings.  In Proceedings of NAACL-HLT, pp. 1302–1306.
  • Graça et al. Graça, J., Pardal, J. P., Coheur, L., and Caseiro, D. (2008). Building a golden collection of parallel multi-language word alignment.  In Proceedings of LREC, pp. 986–993.
  • Guo et al. Guo, J., Che, W., Yarowsky, D., Wang, H., and Liu, T. (2015). Cross-lingual dependency parsing based on distributed representations.  In Proceedings of ACL, pp. 1234–1244.
  • Guo et al. Guo, J., Che, W., Yarowsky, D., Wang, H., and Liu, T. (2016). A representation learning framework for multi-source transfer parsing.  In Proceedings of AAAI, pp. 2734–2740.
  • Gurevych Gurevych, I. (2005). Using the structure of a conceptual network in computing semantic relatedness.  In Proceedings of IJCNLP, pp. 767–778.
  • Gutmann and Hyvärinen Gutmann, M. U., and Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.  Journal of Machine Learning Research, 13(1), 307–361.
  • Haghighi et al. Haghighi, A., Liang, P., Berg-Kirkpatrick, T., and Klein, D. (2008). Learning bilingual lexicons from monolingual corpora.  In Proceedings of ACL, pp. 771–779.
  • Hassan and Mihalcea Hassan, S., and Mihalcea, R. (2009). Cross-lingual semantic relatedness using encyclopedic knowledge.  In Proceedings of EMNLP, pp. 1192–1201.
  • Hauer et al. Hauer, B., Nicolai, G., and Kondrak, G. (2017). Bootstrapping unsupervised bilingual lexicon induction.  In Proceedings of EACL, pp. 619–624.
  • Henderson et al. Henderson, M., Thomson, B., and Young, S. J. (2014). Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation.  In Proceedings of IEEE SLT, pp. 360–365.
  • Hermann and Blunsom Hermann, K. M., and Blunsom, P. (2013). Multilingual distributed representations without word alignment.  In Proceedings of ICLR (Conference Track).
  • Hermann and Blunsom Hermann, K. M., and Blunsom, P. (2014). Multilingual models for compositional distributed semantics.  In Proceedings of ACL, pp. 58–68.
  • Heyman et al. Heyman, G., Vulić, I., and Moens, M. (2016). C-BiLDA: Extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content.  Data Mining and Knowledge Discovery, 30(5), 1299–1323.
  • Heyman et al. Heyman, G., Vulić, I., and Moens, M.-F. (2017). Bilingual lexicon induction by learning to combine word-level and character-level representations.  In Proceedings of EACL, pp. 1085–1095.
  • Hill et al. Hill, F., Reichart, R., and Korhonen, A. (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation.  Computational Linguistics, 41(4), 665–695.
  • Holmqvist and Ahrenberg Holmqvist, M., and Ahrenberg, L. (2011). A gold standard for English–Swedish word alignment.  In Proceedings of NODALIDA, pp. 106–113.
  • Hovy et al. Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: The 90% solution.  In Proceedings of NAACL-HLT, pp. 57–60.
  • Irvine and Callison-Burch Irvine, A., and Callison-Burch, C. (2017). A comprehensive analysis of bilingual lexicon induction.  Computational Linguistics, 43(2), 273–310.
  • Jagarlamudi and Daumé III Jagarlamudi, J., and Daumé III, H. (2010). Extracting multilingual topics from unaligned comparable corpora.  In Proceedings of ECIR, pp. 444–456.
  • Ji et al. Ji, H., Nothman, J., Hachey, B., and Florian, R. (2015). Overview of the TAC-KBP2015 entity discovery and linking tasks.  In Proceedings of Text Analysis Conference.
  • Johannsen et al. Johannsen, A., Alonso, H. M., and Søgaard, A. (2015). Any-language frame-semantic parsing.  In Proceedings of EMNLP, pp. 2062–2066.
  • Joubarne and Inkpen Joubarne, C., and Inkpen, D. (2011). Comparison of semantic similarity for different languages using the Google N-gram corpus and second-order co-occurrence measures.  In Proceedings of the Canadian Conference on Artificial Intelligence, pp. 216–221.
  • Kiela and Clark Kiela, D., and Clark, S. (2015). Multi- and cross-modal semantics beyond vision: Grounding in auditory perception.  In Proceedings of EMNLP, pp. 2461–2470.
  • Kiela et al. Kiela, D., Vulić, I., and Clark, S. (2015). Visual bilingual lexicon induction with transferred ConvNet features.  In Proceedings of EMNLP, pp. 148–158.
  • Klementiev et al. Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed representations of words.  In Proceedings of COLING, pp. 1459–1474.
  • Kočiský et al. Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning bilingual word representations by marginalizing alignments.  In Proceedings of ACL, pp. 224–229.
  • Koehn Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation.  In Proceedings of MT Summit, pp. 79–86.
  • Koehn Koehn, P. (2009). Statistical Machine Translation. Cambridge University Press.
  • Lambert et al. Lambert, P., De Gispert, A., Banchs, R., and Mariño, J. B. (2005). Guidelines for word alignment evaluation and manual alignment.  Language Resources and Evaluation, 39(4), 267–285.
  • Lample et al. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition.  CoRR, abs/1603.01360.
  • Landauer and Dumais Landauer, T. K., and Dumais, S. T. (1997). Solutions to Plato’s problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge.  Psychological Review, 104(2), 211–240.
  • Laroche and Langlais Laroche, A., and Langlais, P. (2010). Revisiting context-based projection methods for term-translation spotting in comparable corpora.  In Proceedings of COLING, pp. 617–625.
  • Lauly et al. Lauly, S., Boulanger, A., and Larochelle, H. (2013). Learning multilingual word representations using a bag-of-words autoencoder.  In Proceedings of the NIPS Workshop on Deep Learning, pp. 1–8.
  • Lazaridou et al. Lazaridou, A., Dinu, G., and Baroni, M. (2015). Hubness and pollution: Delving into cross-space mapping for zero-shot learning.  In Proceedings of ACL, pp. 270–280.
  • Le and Mikolov Le, Q. V., and Mikolov, T. (2014). Distributed representations of sentences and documents.  In Proceedings of ICML, pp. 1188–1196.
  • Leviant and Reichart Leviant, I., and Reichart, R. (2015). Separated by an un-common language: Towards judgment language informed vector space modeling.  CoRR, abs/1508.00106.
  • Levy and Goldberg Levy, O., and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization.  In Proceedings of NIPS, pp. 2177–2185.
  • Levy et al. Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings.  Transactions of the ACL, 3, 211–225.
  • Levy et al. Levy, O., Søgaard, A., and Goldberg, Y. (2017). A strong baseline for learning cross-lingual word embeddings from sentence alignments.  In Proceedings of EACL, pp. 765–774.
  • Li and Jurafsky Li, J., and Jurafsky, D. (2015). Do multi-sense embeddings improve natural language understanding?  CoRR, abs/1506.01070.
  • Ling et al. Ling, W., Luis, T., Marujo, L., Astudillo, R. F., Amir, S., Dyer, C., Black, A. W., and Trancoso, I. (2015). Finding function in form: Compositional character models for open vocabulary word representation.  In Proceedings of EMNLP, pp. 1520–1530.
  • Ling et al. Ling, W., Trancoso, I., Dyer, C., and Black, A. (2016). Character-based Neural Machine Translation.  In ICLR, pp. 1–11.
  • Littman et al. Littman, M., Dumais, S. T., and Landauer, T. K. (1998). Automatic cross-language information retrieval using Latent Semantic Indexing.  In Chapter 5 of Cross-Language Information Retrieval, pp. 51–62. Kluwer Academic Publishers.
  • Lu et al. Lu, A., Wang, W., Bansal, M., Gimpel, K., and Livescu, K. (2015). Deep multilingual correlation for improved word embeddings.  In Proceedings of NAACL-HLT, pp. 250–256.
  • Luong et al. Luong, M.-T., Pham, H., and Manning, C. D. (2015). Bilingual word representations with monolingual quality in mind.  In Proceedings of the Workshop on Vector Modeling for NLP, pp. 151–159.
  • Luong et al. Luong, T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive neural networks for morphology.  In Proceedings of CoNLL, pp. 104–113.
  • Mann and Thompson Mann, W. C., and Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization.  Text-Interdisciplinary Journal for the Study of Discourse, 8(3), 243–281.
  • McDonald et al. McDonald, R., Petrov, S., and Hall, K. (2011). Multi-source transfer of delexicalized dependency parsers.  In Proceedings of EMNLP, pp. 62–72.
  • McDonald et al. McDonald, R. T., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., Hall, K. B., Petrov, S., Zhang, H., Täckström, O., et al. (2013). Universal dependency annotation for multilingual parsing.  In Proceedings of ACL, pp. 92–97.
  • Mihalcea and Csomai Mihalcea, R., and Csomai, A. (2007). Wikify!: Linking documents to encyclopedic knowledge.  In Proceedings of CIKM, pp. 233–242.
  • Mihalcea and Pedersen Mihalcea, R., and Pedersen, T. (2003). An evaluation exercise for word alignment.  In Proceedings of the Workshop on Building and Using Parallel Texts, pp. 1–10.
  • Mikolov et al. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Distributed Representations of Words and Phrases and their Compositionality.  In Proceedings of NIPS, pp. 3111–3119.
  • Mikolov et al. Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation.  CoRR, abs/1309.4168.
  • Miller and Charles Miller, G. A., and Charles, W. G. (1991). Contextual correlates of semantic similarity.  Language and Cognitive Processes, 6(1), 1–28.
  • Mimno et al. Mimno, D., Wallach, H., Naradowsky, J., Smith, D. A., and McCallum, A. (2009). Polylingual topic models.  In Proceedings of EMNLP, pp. 880–889.
  • Mitra and Craswell Mitra, B., and Craswell, N. (2017). Neural models for Information Retrieval.  CoRR, abs/1705.01509.
  • Mnih and Teh Mnih, A., and Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models.  In Proceedings of ICML, pp. 1751–1758.
  • Mogadala and Rettinger Mogadala, A., and Rettinger, A. (2016). Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification.  In Proceedings of NAACL-HLT, pp. 692–702.
  • Mrkšić et al. Mrkšić, N., Ó Séaghdha, D., Wen, T.-H., Thomson, B., and Young, S. (2017a). Neural belief tracker: Data-driven dialogue state tracking.  In Proceedings of ACL, pp. 1777–1788.
  • Mrkšić et al. Mrkšić, N., Vulić, I., Ó Séaghdha, D., Leviant, I., Reichart, R., Gašić, M., Korhonen, A., and Young, S. (2017b). Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints.  Transactions of the ACL, 5, 309–324.
  • Munteanu and Marcu Munteanu, D. S., and Marcu, D. (2006). Extracting parallel sub-sentential fragments from non-parallel corpora.  In Proceedings of ACL, pp. 81–88.
  • Murthy et al. Murthy, R., Khapra, M., and Bhattacharyya, P. (2016). Sharing network parameters for crosslingual named entity recognition.  CoRR, abs/1607.00198.
  • Myers et al. Myers, J. L., Well, A., and Lorch, R. F. (2010). Research Design and Statistical Analysis. Routledge.
  • Naseem et al. Naseem, T., Snyder, B., Eisenstein, J., and Barzilay, R. (2009). Multilingual part-of-speech tagging: Two unsupervised approaches.  Journal of Artificial Intelligence Research, 36, 341–385.
  • Nivre et al. Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R. T., Petrov, S., Pyysalo, S., Silveira, N., et al. (2016a). Universal Dependencies v1: A multilingual treebank collection.  In Proceedings of LREC.
  • Nivre et al. Nivre, J., et al. (2016b). Universal Dependencies 1.4.  LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.
  • Peirsman and Padó Peirsman, Y., and Padó, S. (2010). Cross-lingual induction of selectional preferences with bilingual vector spaces.  In Proceedings of NAACL-HLT, pp. 921–929.
  • Peirsman and Padó Peirsman, Y., and Padó, S. (2011). Semantic relations in bilingual lexicons.  ACM Transactions on Speech and Language Processing (TSLP), 8(2), 3.
  • Pennington et al. Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representation.  In Proceedings of EMNLP, pp. 1532–1543.
  • Pham et al. Pham, H., Luong, M.-T., and Manning, C. D. (2015). Learning distributed representations for multilingual text sequences.  In Proceedings of the Workshop on Vector Modeling for NLP, pp. 88–94.
  • Platt et al. Platt, J. C., Toutanova, K., and Yih, W.-T. (2010). Translingual document representations from discriminative projections.  In Proceedings of EMNLP, pp. 251–261.
  • Prettenhofer and Stein Prettenhofer, P., and Stein, B. (2010). Cross-language text classification using structural correspondence learning.  In Proceedings of ACL, pp. 1118–1127.
  • Rajendran et al. Rajendran, J., Khapra, M. M., Chandar, S., and Ravindran, B. (2016). Bridge correlational neural networks for multilingual multimodal representation learning.  In Proceedings of NAACL-HLT, pp. 171–181.
  • Rapp Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora.  In Proceedings of ACL, pp. 519–526.
  • Rubenstein and Goodenough Rubenstein, H., and Goodenough, J. B. (1965). Contextual correlates of synonymy.  Communications of the ACM, 8(10), 627–633.
  • Schnabel et al. Schnabel, T., Labutov, I., Mimno, D., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings.  In Proceedings of EMNLP, pp. 298–307.
  • Schultz and Waibel Schultz, T., and Waibel, A. (2001). Language-independent and language-adaptive acoustic modeling for speech recognition.  Speech Communication, 35(1), 31–51.
  • Sennrich et al. Sennrich, R., Haddow, B., and Birch, A. (2015). Neural machine translation of rare words with subword units.  CoRR, abs/1508.07909.
  • Shezaf and Rappoport Shezaf, D., and Rappoport, A. (2010). Bilingual lexicon generation using non-aligned signatures.  In Proceedings of ACL, pp. 98–107.
  • Shi et al. Shi, T., Liu, Z., Liu, Y., and Sun, M. (2015). Learning cross-lingual word embeddings via matrix co-factorization.  In Proceedings of ACL, pp. 567–572.
  • Smith et al. Smith, S. L., Turban, D. H. P., Hamblin, S., and Hammerla, N. Y. (2017). Bilingual word vectors, orthogonal transformations and the inverted softmax.  In Proceedings of ICLR (Conference Track).
  • Snyder and Barzilay Snyder, B., and Barzilay, R. (2010). Climbing the tower of Babel: Unsupervised multilingual learning.  In Proceedings of ICML, pp. 29–36.
  • Søgaard Søgaard, A. (2016). Evaluating word embeddings with fMRI and eye-tracking.  In Proceedings of REPEVAL, p. 116.
  • Søgaard et al. Søgaard, A., Agić, Z., Alonso, H. M., Plank, B., Bohnet, B., and Johannsen, A. (2015). Inverted indexing for cross-lingual NLP.  In Proceedings of ACL, pp. 1713–1722.
  • Sorg and Cimiano Sorg, P., and Cimiano, P. (2012). Exploiting Wikipedia for cross-lingual and multilingual information retrieval.