Learning to Represent Bilingual Dictionaries

Muhao Chen, Yingtao Tian, Haochen Chen, Kai-Wei Chang, Steven Skiena, Carlo Zaniolo
Department of Computer Science, UCLA
Department of Computer Science, Stony Brook University
{muhaochen, kw2c, zaniolo}@cs.ucla.edu; {yittian, haocchen, skiena}@cs.stonybrook.edu

Bilingual word embeddings have been widely used to capture the similarity of lexical semantics in different human languages. However, many applications, such as cross-lingual semantic search and question answering, can benefit greatly from cross-lingual correspondence between sentences and lexicons. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionaries. The proposed model is trained to map literal word definitions to cross-lingual target words, for which we explore different sentence encoding techniques. To enhance the learning process on limited resources, our model adopts several critical learning strategies, including multi-task learning on different bridges of languages, and joint learning of the dictionary model with a bilingual word embedding model. Experimental evaluation focuses on two applications. The results of the cross-lingual reverse dictionary retrieval task show our model’s promising ability to comprehend bilingual concepts based on descriptions, and highlight the effectiveness of the proposed learning strategies in improving performance. Meanwhile, our model effectively addresses the bilingual paraphrase identification problem and significantly outperforms previous approaches.

1 Introduction

Bilingual word embedding models are used to capture the cross-lingual semantic relatedness of words based on their co-occurrence in parallel or seed-lexicon corpora (Chandar et al., 2014; Gouws et al., 2015; Luong et al., 2015). By collocating related words in the low-dimensional embedding spaces, these models effectively endow the representations of lexical semantics with precise cross-lingual semantic transfer (Gouws et al., 2015). Therefore, they have been widely used in many cross-lingual NLP tasks including machine translation (Devlin et al., 2014), bilingual document classification (Zhou et al., 2016), knowledge alignment (Chen et al., 2017) and multilingual named entity recognition (Feng et al., 2018).

While cross-lingual lexical semantics are well-captured, modeling the correspondence between lexical and sentential semantics across different languages remains an unresolved challenge. Modeling such cross-lingual and multi-granular correspondence of semantics is highly beneficial to many application scenarios, including cross-lingual semantic search of concepts (Tsai and Roth, 2016), agents for detecting discourse relations in bilingual dialogue utterances (Jiang et al., 2018), and multilingual text summarization (Nenkova and McKeown, 2012), as well as educational applications for foreign language learners. Meanwhile, consider that there are quite a few foreign words without a directly translated term (Goldberg, 2009) in the source language (e.g. schadenfreude in German, which means a feeling of joy that comes from knowing the troubles of other people, has no proper English counterpart word). To appropriately learn the representation of such words in bilingual embeddings, we need to capture their meanings based on their definitions as well. However, realizing such a model is a non-trivial task, inasmuch as it requires a comprehensive learning process to effectively compose the semantics of arbitrary-length sentences in one language, and associate them with single words in another language. Consequently, this objective also demands high-quality cross-lingual alignment that bridges between single words and sequences of words. Such alignment information is generally not available in the parallel and seed-lexicon corpora that are utilized by bilingual word embedding models (Gouws et al., 2015; Lample et al., 2018).

To incorporate the representations of bilingual lexical and sentential semantics, we propose an approach that leverages bilingual dictionaries. The proposed approach BilDRL for Bilingual Dictionary Representation Learning seeks to capture the correspondence between words and the composed meanings of cross-lingual word definitions. BilDRL first constructs a word embedding space with a pre-trained bilingual word embedding model. By utilizing cross-lingual word definitions, a sentence encoder is trained to realize the mapping from literal descriptions to target words in the bilingual word embedding space, for which we investigate with multiple types of encoding techniques. To enhance the cross-lingual learning process on limited resources, BilDRL conducts multi-task learning on different directions of language pairs. Moreover, we enforce a joint learning strategy of bilingual word embeddings and the sentence encoder, which seeks to gradually adjust the embedding space to better suit the characterization of cross-lingual word definitions.

Figure 1: An example illustrating the two cross-lingual tasks. The cross-lingual reverse dictionary retrieval finds cross-lingual target words based on literal descriptions. In terms of cross-lingual paraphrases, the French sentence (which means any male considered in relation to his father and mother, or only one of them) essentially describes the same meaning as the English sentence, but contains more content details.

To show the applicability of BilDRL, we conduct experimental evaluation on two useful cross-lingual tasks. (i) Cross-lingual reverse dictionary retrieval seeks to retrieve words or concepts given descriptions in another language (see Fig. 1). This task is useful for helping users find foreign words based on notions or descriptions, and is especially beneficial to users such as translators, foreign language learners and technical writers using non-native languages. We show that BilDRL achieves promising results on this task, while bilingual multi-task learning and joint learning dramatically enhance the performance. (ii) Bilingual paraphrase identification asks whether two sentences in different languages essentially express the same meaning, which is critical to question answering or dialogue systems that apprehend multilingual utterances (Bannard and Callison-Burch, 2005). This task is challenging, as it requires a model to comprehend cross-lingual paraphrases that are inconsistent in grammar, content details and word orders (as shown in Fig. 1). BilDRL maps the sentences to the bilingual word embedding space. This process reduces the problem to evaluating the similarity of lexicon embeddings, which can be easily solved by a simple classifier (Vyas and Carpuat, 2016). BilDRL performs well with even a small amount of data, and significantly outperforms previous approaches.

2 Related Work

In this section, we discuss two lines of work that are relevant to this topic.

Bilingual word embeddings. Recently, significant advancement has been made in capturing word embeddings from bilingual corpora, which spans two families of models: off-line mappings and joint training.

The off-line mapping-based approach is first introduced by Mikolov et al. (2013a). Models of this family fix the structures of pre-trained monolingual word embeddings, while inducing bilingual projections based on seed-lexicon alignment. Some variants of this approach improve the quality of bilingual projections by adding constraints such as orthogonality of transformations, normalization and mean centering of embeddings (Xing et al., 2015; Artetxe et al., 2016). Others adopt canonical correlation analysis to map separated monolingual embeddings to a shared embedding space (Faruqui and Dyer, 2014; Lu et al., 2015).

Unlike off-line mappings, joint training models simultaneously learn word embeddings and cross-lingual alignment. By jointly updating the embeddings with the alignment information, approaches of this family generally capture more precise cross-lingual semantic transfer (Upadhyay et al., 2016). While a few such models still maintain separate embedding spaces for each language (Huang et al., 2015), the majority of recent ones obtain a unified embedding space for both languages. The cross-lingual semantic transfer by these models is captured from parallel corpora with sentential or document-level alignment, using techniques such as bilingual bag-of-words distances (BilBOWA) (Gouws et al., 2015), bilingual Skip-Gram (Coulmance et al., 2015) and sparse tensor factorization (Vyas and Carpuat, 2016).

Neural sentence modeling. Neural sentence models seek to characterize the phrasal or sentential semantics from word sequences. They often adopt encoding techniques such as recurrent neural encoders (RNN) (Kiros et al., 2015), convolutional neural encoders (CNN) (Liu et al., 2017), and attentive neural encoders (Rocktäschel et al., 2016) to represent the composed semantics of a sentence as an embedding vector. Many recent works have focused on comprehending pairwise correspondence of sentential semantics by adopting multiple neural sentence models in one learning architecture. Examples of such include Siamese sentence pair models for detecting discourse relations of paraphrases or text entailment (Sha et al., 2016; Hu et al., 2014; Rocktäschel et al., 2016), and sequence-to-sequence models for tasks like style transfer (Shen et al., 2017) and abstractive summarization (Chopra et al., 2016). Meanwhile, our work is related to corresponding works of neural machine translation (NMT) (Bahdanau et al., 2015; Wu et al., 2016). However, our setting is very different from translation in the following two perspectives: (i) NMT does not capture multi-granular correspondence of semantics like our work; (ii) NMT relies on training an encoder-decoder architecture, while our approach employs joint learning of two representation models, i.e. a dictionary-based sentence encoder and a word embedding model.

On the other hand, less effort has been devoted to characterizing the associations between sentential and lexical semantics. Hill et al. (2016) and Ji et al. (2017) learn off-line mappings between monolingual descriptions and lexicons to capture such associations. Eisner et al. (2016) adopt a similar approach to capture emojis based on descriptions. To the best of our knowledge, there has been no previous approach that learns to discover the correspondence of sentential and lexicon semantics in a multilingual scenario. This is exactly the focus of our work, in which the proposed learning strategies of multi-task and joint learning are critical to the corresponding cross-lingual learning process. Utilizing the cross-lingual and multi-granular correspondence of semantics, our approach also sheds light on addressing discourse relation detection in a multilingual scenario.

3 Modeling Bilingual Dictionaries

We hereby begin our modeling with the formalization of bilingual dictionaries. We use $\mathcal{L}$ to denote the set of languages. For each language $l \in \mathcal{L}$, $V_l$ denotes its vocabulary of words, where for each word $w \in V_l$, we use bold-faced $\mathbf{w}$ to denote its embedding vector. An $(l_i, l_j)$-bilingual dictionary (or simply $D(l_i, l_j)$) contains dictionary entries $(w, D_w)$, such that $w \in V_{l_i}$, and $D_w$ is a cross-lingual word definition (or description) that describes the word $w$ with a sequence of words in language $l_j$. For example, a French-English dictionary $D(\mathrm{Fr}, \mathrm{En})$ provides English definitions for French words, where a $w$ could be appétit, and the corresponding $D_w$ could be desire for, or relish of, food or drink. Note that, for a word $w$, multiple definitions in $D(l_i, l_j)$ may coexist.

BilDRL is constructed and improved through three stages. A sentence encoder is first used to learn from a bilingual dictionary the association between words and definitions. Then in a pre-trained bilingual word embedding space, multi-task learning is conducted on both directions of a language pair. Lastly, joint learning with word embeddings is enforced to simultaneously adjust the embedding space during the training of the dictionary model, which further enhances the cross-lingual learning process.

3.1 Encoders for Bilingual Dictionaries

BilDRL models a dictionary using a neural sentence encoder $E$, which composes the meaning of a definition sentence into a latent vector representation. We hereby introduce this model component, which is designed to be a GRU encoder with self-attention, in the following subsection. Besides that, we also investigate other widely-used neural sentence modeling techniques.

Figure 2: Joint learning architecture of BilDRL.

3.1.1 Attentive Gated Recurrent Units

Gated recurrent units. The GRU encoder is an alternative to the long short-term memory network (LSTM) (Cho et al., 2014), which consecutively characterizes sequence information without using separated memory cells. Each unit consists of two types of gates to track the state of the sequence, i.e. the reset gate $r_t$ and the update gate $z_t$. Given the vector representation $\mathbf{x}_t$ of an incoming item, GRU updates the hidden state $\mathbf{h}_t$ of the sequence as a linear combination of the previous state $\mathbf{h}_{t-1}$ and the candidate state $\tilde{\mathbf{h}}_t$ of the new item, which is calculated as below.

$$\mathbf{h}_t = z_t \odot \tilde{\mathbf{h}}_t + (1 - z_t) \odot \mathbf{h}_{t-1}$$

The update gate $z_t$ balances between the information of the previous sequence and the new item:

$$z_t = \sigma(\mathbf{W}_z \mathbf{x}_t + \mathbf{U}_z \mathbf{h}_{t-1} + \mathbf{b}_z)$$

where $\mathbf{W}_z$ and $\mathbf{U}_z$ are two weight matrices, $\mathbf{b}_z$ is a bias vector, and $\sigma$ is the sigmoid function.

The candidate state $\tilde{\mathbf{h}}_t$ is calculated similarly to that in a traditional recurrent unit:

$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h \mathbf{x}_t + r_t \odot (\mathbf{U}_h \mathbf{h}_{t-1}) + \mathbf{b}_h)$$

The reset gate $r_t$ thereof controls how much information of the past sequence should contribute to the candidate state:

$$r_t = \sigma(\mathbf{W}_r \mathbf{x}_t + \mathbf{U}_r \mathbf{h}_{t-1} + \mathbf{b}_r)$$

The above defines a GRU layer which outputs a sequence of hidden state vectors $\langle \mathbf{h}_1, \ldots, \mathbf{h}_l \rangle$ given the input sequence $\langle \mathbf{x}_1, \ldots, \mathbf{x}_l \rangle$. While a GRU encoder can consist of a stack of multiple GRU layers, without an attention mechanism, the last hidden state of the last layer is extracted to represent the overall meaning of the encoded sentence. Note that in comparison to GRU, the traditional LSTM generally performs comparably, but is more complex and requires more computational resources for training (Chung et al., 2014).
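As a concrete illustration, the GRU update above can be sketched in a few lines of NumPy. The weight names and dimensions here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update, following the equations above:
    z_t = sigmoid(Wz x_t + Uz h_{t-1} + bz)          (update gate)
    r_t = sigmoid(Wr x_t + Ur h_{t-1} + br)          (reset gate)
    h~_t = tanh(Wh x_t + r_t * (Uh h_{t-1}) + bh)    (candidate state)
    h_t = z_t * h~_t + (1 - z_t) * h_{t-1}
    """
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)
    h_cand = np.tanh(Wh @ x_t + r * (Uh @ h_prev) + bh)
    return z * h_cand + (1.0 - z) * h_prev

def gru_encode(xs, params, hidden_size):
    """Run a single GRU layer over a sequence; returns all hidden states."""
    h = np.zeros(hidden_size)
    hs = []
    for x_t in xs:
        h = gru_step(x_t, h, params)
        hs.append(h)
    return np.stack(hs)
```

Without attention, only the last row of the returned state matrix would be taken as the sentence representation.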

Self-attention. The self-attention mechanism (Conneau et al., 2017) seeks to capture the overall meaning of the input sentence unevenly from each encoded item. One layer of self-attention is calculated as below.

$$\mathbf{u}_t = \tanh(\mathbf{W}_a \mathbf{h}_t + \mathbf{b}_a)$$
$$a_t = \frac{\exp(\mathbf{u}_t^\top \mathbf{u}_l)}{\sum_{t'} \exp(\mathbf{u}_{t'}^\top \mathbf{u}_l)}$$
$$\mathbf{h}'_t = l\, a_t \mathbf{h}_t$$

$\mathbf{u}_t$ thereof is the intermediary latent representation of GRU output $\mathbf{h}_t$, and $\mathbf{u}_l$ is the intermediary latent representation of the last GRU output $\mathbf{h}_l$, which can be seen as a high-level representation of the entire input sequence. By measuring the similarity of each $\mathbf{u}_t$ with $\mathbf{u}_l$, the normalized attention weight $a_t$ for $\mathbf{h}_t$ is produced through a softmax function, which highlights an input that contributes more significantly to the overall meaning. Note that the scalar $l$ (the length of the input sequence) is multiplied along with $a_t$ to $\mathbf{h}_t$ to obtain the weighted representation $\mathbf{h}'_t$, so as to keep $\mathbf{h}'_t$ from losing the original scale of $\mathbf{h}_t$. A latent representation of the sentence is calculated as the average of the last attention layer, i.e. $E(s) = \frac{1}{l}\sum_{t=1}^{l} \mathbf{h}'_t$.
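The attention layer above can be sketched as follows; the parameter names are illustrative, and Euclidean dot-product similarity against the last item's intermediary representation is used as described.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def self_attention(hs, Wa, ba):
    """Self-attention over GRU outputs hs (l x d), per the equations above:
    u_t = tanh(Wa h_t + ba); a_t = softmax over t of u_t . u_l;
    h'_t = l * a_t * h_t (rescaled by sequence length l to keep scale);
    sentence vector = average of the h'_t."""
    l = hs.shape[0]
    us = np.tanh(hs @ Wa.T + ba)   # intermediary representations u_t
    scores = us @ us[-1]           # similarity of each u_t with u_l
    a = softmax(scores)
    weighted = l * a[:, None] * hs
    return weighted.mean(axis=0)
```

Note the rescaling and averaging cancel: the output equals the attention-weighted sum of hidden states, so the multiplication by l is purely a bookkeeping device for preserving scale.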

3.1.2 Other Encoders

We also experiment with other widely used neural sentence modeling techniques, which are however outperformed by the attentive GRU encoder in our tasks. These techniques include the vanilla GRU, CNN (Kalchbrenner et al., 2014), and linear bag-of-words (BOW) (Hill et al., 2016). We briefly introduce the latter two techniques in the following.

Convolutional encoder. A convolutional encoder employs a 1-dimensional weight-sharing convolution layer to encode an input sequence, which applies a kernel $\mathbf{W}_c$ to generate a latent representation $\mathbf{h}_i$ from a $k$-gram $\mathbf{x}_{i:i+k-1}$ of the input vector sequence by

$$\mathbf{h}_i = \tanh(\mathbf{W}_c \mathbf{x}_{i:i+k-1} + \mathbf{b}_c)$$

for which $k$ is the kernel size and $\mathbf{b}_c$ is a bias vector. A convolution layer applies the kernel to all $k$-grams to produce a sequence of latent vectors $\langle \mathbf{h}_1, \ldots, \mathbf{h}_{l-k+1} \rangle$, where each latent vector leverages the significant local semantic features from its $k$-gram. Following the convention (Yin and Schütze, 2015; Liu et al., 2017), dynamic max-pooling is applied to extract the robust features from the convolution outputs, and the mean-pooling results of the last layer are used as the latent representation of the sentential semantics. Although CNN leverages local semantic features from the input sequence well, this technique does not preserve the sequential information that is critical to the representation of short sentences.
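A minimal sketch of this encoder is given below, assuming non-overlapping max-pooling windows followed by mean-pooling; the exact pooling scheme and shapes are our assumptions, not the paper's specification.

```python
import numpy as np

def conv_encode(xs, Wc, bc, pool=2):
    """1-D convolutional encoding sketch: a shared kernel Wc (out_dim x k*d)
    maps each k-gram of the input sequence xs (l x d) to a latent vector via
    tanh; max-pooling over windows of size `pool` extracts robust local
    features, and their mean serves as the sentence representation."""
    l, d = xs.shape
    k = Wc.shape[1] // d
    # latent vector for every k-gram of the sequence
    latents = np.stack([np.tanh(Wc @ xs[i:i + k].reshape(-1) + bc)
                        for i in range(l - k + 1)])
    # non-overlapping max-pooling along the sequence axis
    n = len(latents) // pool * pool
    pooled = latents[:n].reshape(-1, pool, latents.shape[1]).max(axis=1)
    return pooled.mean(axis=0)
```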

Linear bag-of-words. Following the definition in previous works (Ji et al., 2017; Hill et al., 2016), this much simpler encoding technique is realized by the sum of projected word embeddings of the input sentence, i.e. $E(s) = \sum_{t=1}^{l} \mathbf{W}_b \mathbf{x}_t$.
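For completeness, the linear bag-of-words encoder is a one-liner; the projection matrix name is illustrative.

```python
import numpy as np

def bow_encode(xs, Wb):
    """Linear bag-of-words encoder: the sentence vector is the sum of
    linearly projected word embeddings, ignoring word order entirely."""
    return (xs @ Wb.T).sum(axis=0)
```

Its order-invariance is exactly why it underperforms the recurrent encoders on short definitions, where word order carries much of the meaning.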

3.1.3 Learning Objective

The objective of learning the dictionary model is to map the encodings of cross-lingual word definitions to the target word embeddings. This is realized by minimizing the following $L_2$ loss,

$$J_D = \sum_{(w, D_w) \in D(l_i, l_j)} \left\| E(D_w) - \mathbf{w} \right\|_2^2$$

in which $E$ is the dictionary model that maps from descriptions in $l_j$ to words in $l_i$.

The above defines the basic model variants of BilDRL that learn on a single dictionary. For word representations in the learning process, BilDRL initializes the embedding space using pre-trained word embeddings. Note that, without adopting the joint learning strategy in Section 3.3, the learning process does not update the word embeddings that are used to represent the definitions and target words. While other forms of loss such as cosine proximity (Hill et al., 2016) and hinge loss (Ji et al., 2017) may also be used in the learning process, we find that the $L_2$ loss consistently leads to better performance in our experiments.
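The dictionary objective can be written out in a few lines; variable names are illustrative, and in practice the sum would be computed over mini-batches.

```python
import numpy as np

def dictionary_loss(encodings, targets):
    """L2 dictionary objective: the sum over entries of the squared
    Euclidean distance between the definition encoding E(D_w) and the
    target word embedding w."""
    diff = encodings - targets
    return float((diff ** 2).sum())
```

Minimizing this loss pulls each encoded definition toward its cross-lingual target word in the shared space; cosine proximity or a hinge loss could be substituted in the same slot, as discussed above.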

3.2 Bilingual Multi-task Learning

In cases where entries in a bilingual dictionary are not amply provided, learning the above dictionary model on one ordered language pair may suffer from insufficient alignment information. One practical solution is to conduct a bilingual multi-task learning process. In detail, given a language pair $(l_i, l_j)$, we learn the dictionary model on both dictionaries $D(l_i, l_j)$ and $D(l_j, l_i)$ with shared parameters. Correspondingly, we rewrite the learning objective function as below, in which $\hat{D} = D(l_i, l_j) \cup D(l_j, l_i)$.

$$J_D = \sum_{(w, D_w) \in \hat{D}} \left\| E(D_w) - \mathbf{w} \right\|_2^2$$

This strategy non-trivially requests the same dictionary model to represent semantic transfer between two directions of the language pair. To fulfill such a request, we initialize the embedding space using the bilingual BilBOWA embeddings trained on parallel corpora, which provides a unified bilingual embedding space that resolves both monolingual and cross-lingual semantic relatedness of words.

In practice, we find that this simple multi-task strategy brings significant improvement to our cross-lingual tasks. In addition to BilBOWA, other jointly trained bilingual word embeddings may also be used to support this strategy (Coulmance et al., 2015; Vyas and Carpuat, 2016), for which we leave the comparison to future work.

3.3 Joint Learning Objective

While the above learning strategies of BilDRL are based on a fixed embedding space, we lastly propose a joint learning strategy. During the training process, this strategy simultaneously updates the embeddings based on both the dictionary model and the bilingual word embedding model. The learning is through asynchronous minimization of the following joint objective function,

$$J = J_D + \lambda_1 \left( J_{SG}^{l_i} + J_{SG}^{l_j} \right) + \lambda_2 J_A$$

where $\lambda_1$ and $\lambda_2$ are two positive coefficients. $J_{SG}^{l_i}$ and $J_{SG}^{l_j}$ are the original Skip-Gram losses (Mikolov et al., 2013b) employed by BilBOWA to separately obtain word embeddings on monolingual corpora of languages $l_i$ and $l_j$. $J_A$ is the alignment loss that minimizes the bag-of-words distances for aligned sentence pairs $(s_i, s_j)$ from the bilingual parallel corpora $C(l_i, l_j)$, which is defined as below.

$$J_A = \sum_{(s_i, s_j) \in C(l_i, l_j)} \left\| \frac{1}{|s_i|} \sum_{w \in s_i} \mathbf{w} - \frac{1}{|s_j|} \sum_{v \in s_j} \mathbf{v} \right\|_2^2$$

The joint learning process adapts the embedding space to better suit the dictionary model, which is shown to further enhance the cross-lingual learning of BilDRL. The learning architecture of all model components is depicted in Fig. 2.

3.4 Training

To initialize the embedding space, we pre-trained BilBOWA on the parallel corpora Europarl v7 (Koehn, 2005) and monolingual corpora of tokenized Wikipedia dump (Al-Rfou et al., 2013).

For models without joint learning, we use AMSGrad (Reddi et al., 2018) to optimize the parameters. Each model without bilingual multi-task learning is trained on batched samples from one individual dictionary. Multi-task learning models are trained on batched samples from two dictionaries; within each batch, entries of different directions of languages can be mixed together. For joint learning, we follow previous works (Gouws et al., 2015; Mogadala and Rettinger, 2016) to conduct an efficient multi-threaded asynchronous training (Mnih et al., 2016) with AMSGrad. In detail, after initializing the embedding space based on pre-trained BilBOWA, parameter updating based on the four components of $J$ occurs across four worker threads. Two monolingual threads select batches of monolingual contexts from the Wikipedia dumps of the two languages for Skip-Gram, one alignment thread randomly samples parallel sentences from Europarl v7, and one dictionary thread extracts batched samples of entries for a bilingual multi-task dictionary model. Each thread asynchronously makes a batched update to the model parameters for its component of $J$. The asynchronous training of all four worker threads continues until the dictionary thread finishes its epochs.

4 Experiments

In this section we present the experimental evaluation of BilDRL on two cross-lingual tasks: the cross-lingual reverse dictionary retrieval task and the bilingual paraphrase identification task.

Datasets. The experiment of cross-lingual reverse dictionary retrieval is conducted on a trilingual dataset of dictionaries denoted Wikt3l. This dataset is extracted from Wiktionary (https://www.wiktionary.org/), which is one of the largest freely available multilingual dictionary resources on the Web. Wikt3l contains dictionary entries of language pairs (English, French) and (English, Spanish), which form En-Fr, Fr-En, En-Es and Es-En dictionaries on four bridges of languages in total. Two types of bilingual dictionary entries are extracted from Wiktionary: (i) cross-lingual definitions provided under the Translations sections of Wiktionary pages; (ii) monolingual definitions for words that are linked to a cross-lingual counterpart with an inter-language link (https://en.wikisource.org/wiki/Help:Interlanguage_links) of Wiktionary. The statistics of Wikt3l are given in Table 1. Note that the dictionary extraction process excludes all the definitions of stop words in each of the three languages.

Since existing datasets for paraphrase identification are merely monolingual, we contribute another dataset for cross-lingual sentential paraphrase identification. This dataset contains 6k bilingual sentence pairs for each of the En-Fr and En-Es settings. Within each language setting, 3k positive cases are formed as pairs of descriptions aligned by inter-language links, which are unknown to Wikt3l for training BilDRL. We create negative cases by pairing the English definition of a source word with the cross-lingual definition of another word within the 6th to 15th nearest neighbors of the source word in the embedding space. This process ensures that each negative case is endowed with limited dissimilarity of sentence meanings, which makes the decision more challenging. For each language setting, we randomly select 75% for training, 5% for validation, and the remaining 20% for testing. Note that each language setting of this dataset matches the quantity and partitioning of sentence pairs in the widely-used Microsoft Research Paraphrase Corpus benchmark for monolingual paraphrase identification (Hu et al., 2014; Yin et al., 2016; Das and Smith, 2009).

Dictionary      En-Fr   Fr-En   En-Es   Es-En
#Target words   15,666  16,857  8,004   16,986
#Definitions    50,412  58,808  20,930  56,610

Table 1: Statistics of the Wikt3l dataset.
Languages   En-Fr               Fr-En               En-Es               Es-En
            P@1   P@10  MRR     P@1   P@10  MRR     P@1   P@10  MRR     P@1   P@10  MRR
BOW         0.8   3.4   0.011   0.4   2.2   0.006   0.4   2.4   0.007   0.4   2.6   0.007
CNN         6.0   12.4  0.070   6.4   14.8  0.072   3.8   7.2   0.045   7.0   16.8  0.088
GRU         35.6  46.0  0.380   38.8  49.8  0.410   47.8  59.0  0.496   57.6  67.2  0.604
ATT         38.8  47.4  0.411   39.8  50.2  0.425   51.6  59.2  0.534   60.4  68.4  0.629
GRU-mono    21.8  33.2  0.242   27.8  37.0  0.297   34.4  41.2  0.358   36.8  47.2  0.392
ATT-mono    22.8  33.6  0.249   27.4  39.0  0.298   34.6  42.2  0.358   39.4  48.6  0.414
GRU-MTL     43.4  49.2  0.452   44.4  52.8  0.467   50.4  60.0  0.530   63.6  71.8  0.659
ATT-MTL     46.8  56.6  0.487   47.6  56.6  0.497   55.8  62.2  0.575   66.4  75.0  0.687

Table 2: Cross-lingual reverse dictionary retrieval results by BilDRL variants. We report P@1, P@10, and MRR on four groups of models: (i) basic dictionary models that adopt four different encoding techniques; (ii) models with the two best encoding techniques that enforce the monolingual retrieval approach; (iii) models adopting bilingual multi-task optimization; (iv) joint optimization that employs the best dictionary model of ATT-MTL.

4.1 Cross-lingual Reverse Dictionary Retrieval

The objective of this task is to enable cross-lingual retrieval of words based on descriptions. Besides comparing variants of BilDRL that adopt different sentence encoders and learning strategies, we also compare with the monolingual retrieval approach proposed by Hill et al. (2016). Instead of directly associating cross-lingual word definitions, this approach learns sentence-to-word mappings in a monolingual scenario. This approach is further extended to bilingual cases via retrieving semantically related words in another language using bilingual word embeddings.

Evaluation Protocol. Before training the models, we randomly select 500 defined words from each dictionary as test cases, and exclude all the definitions of these words from the remaining training data. Each of the basic BilDRL variants is trained on one bilingual dictionary. The monolingual retrieval models are trained to fit the target words in the original languages of the word definitions, which are also provided in Wiktionary. BilDRL variants with multi-task or joint learning use both dictionaries on the same language pair. In the test phase, for each test case $(w, D_w)$, the prediction by each model is to perform a kNN search from the corresponding definition encoding $E(D_w)$, and record the rank of $\mathbf{w}$ within the vocabulary of the target language. We limit the vocabularies to all the words that appear in the Wikt3l dataset, which involve around 45k English words, 44k French words and 36k Spanish words. We aggregate three metrics on test cases: the accuracy $P@1$ (%), the proportion of ranks no larger than 10, $P@10$ (%), and the mean reciprocal rank $MRR$.
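The kNN ranking and metric aggregation described above can be sketched as follows. Euclidean distance in the embedding space is assumed here as the ranking measure, and the function names are ours.

```python
import numpy as np

def retrieval_metrics(pred_vecs, target_ids, vocab_embs):
    """kNN retrieval evaluation sketch: rank the target word among the
    whole vocabulary by Euclidean distance to the definition encoding,
    then aggregate P@1, P@10 and MRR over all test cases."""
    ranks = []
    for v, t in zip(pred_vecs, target_ids):
        dists = np.linalg.norm(vocab_embs - v, axis=1)
        rank = int((dists < dists[t]).sum()) + 1   # 1-based rank of target
        ranks.append(rank)
    ranks = np.asarray(ranks)
    return {"P@1": float((ranks == 1).mean()),
            "P@10": float((ranks <= 10).mean()),
            "MRR": float((1.0 / ranks).mean())}
```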

We pre-train BilBOWA based on the original configuration in (Gouws et al., 2015) and obtain 50-dimensional initializations of bilingual word embedding spaces for the English-French and English-Spanish settings respectively. For the CNN, GRU, and attentive GRU (ATT) encoders, we stack five of each corresponding encoding layer with hidden sizes of 200, and two affine layers are applied to the final output for dimension reduction. This encoder architecture consistently represents the best performance through our tuning. Through comprehensive hyperparameter tuning, we fix the learning rate to 0.0005, the exponential decay rates $\beta_1$ and $\beta_2$ of AMSGrad to 0.9 and 0.999, the coefficients $\lambda_1$ and $\lambda_2$ both to 0.1, and the batch size to 64. Kernel size and pooling size are both set to 2 for CNN. Word definitions are zero-padded (short ones) or truncated (long ones) to a sequence length of 15. Training is limited to 1,000 epochs for all models as well as the dictionary thread of asynchronous joint learning.

Results. Results are reported in Table 2 in four groups. The first group compares four different encoding techniques for the basic dictionary models. GRU thereof consistently outperforms CNN and BOW, which fail to capture the sequential information that is important for descriptions. ATT, which weighs among the hidden states, achieves notable improvements over GRU. When we equip the two better encoding techniques with the monolingual retrieval approach, we find that learning the dictionary models towards monolingual targets and then retrieving cross-lingual related words incurs more impreciseness to the task. For models of the third group that conduct multi-task learning on two directions of a language pair, the results show significant enhancement of performance in both directions. For the final group of results, we incorporate the best variant of multi-task learnt dictionary models into the joint learning architecture, which leads to compelling improvement of the task on all settings. This demonstrates that properly adapting the word embeddings jointly with the bilingual dictionary model efficaciously constructs an embedding space that better suits the representation of both bilingual lexical and sentential semantics.

In general, this experiment has identified the proper encoding techniques of the dictionary model. The proposed strategies of multi-task and joint learning effectively contribute to the precise characterization of the cross-lingual correspondence of lexical and sentential semantics, which have led to very promising capability of cross-lingual reverse dictionary retrieval.

Languages           En&Fr   En&Es
BiBOW               54.53   55.86
BiCNN               54.13   53.73
ABCNN               56.93   56.33
BiGRU               56.33   57.93
BiATT               59.80   62.27
BilDRL-GRU-MTL      61.20   63.33
BilDRL-ATT-MTL      63.87   66.00

Table 3: Accuracy of bilingual paraphrase identification. For BilDRL, the results by three model variants are reported: BilDRL-GRU-MTL and BilDRL-ATT-MTL are models with bilingual multi-task optimization, and BilDRL-ATT-joint is the best variant that incorporates the attentive GRU-based dictionary model with multi-task and joint optimization.

4.2 Bilingual Paraphrase Identification

The next task is a binary classification problem with the goal to decide whether two sentences of different languages express the same meanings. BilDRL provides an effective solution by transferring sentential meanings to lexicon-level representations and learning a simple classifier. We evaluate three variants of BilDRL on this task, i.e. the multi-task ones with GRU and attentive GRU encoders, and the jointly optimized one with attentive encoders. We compare against several baselines of neural sentence pair models that are proposed to tackle monolingual paraphrase identification. These models include Siamese structures of CNN (BiCNN) (Yin and Schütze, 2015), RNN (BiGRU) (Mueller and Thyagarajan, 2016), attentive CNN (ABCNN) (Yin et al., 2016), attentive GRU (BiATT) (Rocktäschel et al., 2016), and linear bag-of-words (BiBOW). To support the reasoning of cross-lingual sentential semantics, we provide the baselines with the same BilBOWA embeddings.

Evaluation protocol. BilDRL transfers each sentence into a vector in the bilingual word embedding space. Then, for each sentence pair in the training set, a multi-layer perceptron (MLP) with a binary softmax loss is trained on the subtraction of the two vectors as a downstream classifier. Baseline models are trained end-to-end, each of which directly uses a parallel pair of encoders with shared parameters; an MLP is likewise stacked on the subtraction of the two sentence vectors. Note that some works use the concatenation (Yin and Schütze, 2015) or the Manhattan distance (Mueller and Thyagarajan, 2016) of sentence vectors instead of their subtraction, which we find to be less effective on this small amount of data.
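The downstream classifier described above can be sketched as follows; the hidden-layer size and weight names are hypothetical, since the paper does not specify the MLP configuration.

```python
import numpy as np

def paraphrase_features(vec_a, vec_b):
    """Classifier input: the subtraction of the two sentence vectors
    produced by the encoder, as described above."""
    return vec_a - vec_b

def mlp_predict(x, W1, b1, W2, b2):
    """Minimal MLP with a binary softmax output over
    {not-paraphrase, paraphrase}; a sketch, not the paper's exact model."""
    h = np.tanh(W1 @ x + b1)
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()
```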

We apply the configurations of the sentence encoders from the last experiment to corresponding baselines, so as to show the performance under controlled variables. Training of each classifier is terminated by early-stopping based on the validation set, so as to prevent overfitting.

Results. As shown by the results in Table 3, this task is challenging due to the heterogeneity of cross-lingual paraphrases and the limitedness of learning resources. Among all the baselines, while BiATT consistently outperforms the others, it still does not reach 60% accuracy on the En-Fr setting, and is only slightly over 60% on the En-Es setting. BilDRL, however, effectively utilizes the correspondence of lexical and sentential semantics to reduce the task to an easier entailment task in the lexical space, for which BilDRL-ATT-MTL outperforms the best baseline by 4.07% and 3.73% of accuracy respectively in the two language settings. BilDRL-ATT-joint improves the task by a further 2.66% and 2.61% of accuracy.

5 Conclusion and Future Work

In this paper, we propose BilDRL, a neural embedding model that captures the correspondence of cross-lingual lexical and sentential semantics by learning to represent bilingual dictionaries. We experiment with multiple neural encoders for cross-lingual word definitions and identify the best-performing representation techniques. The two learning strategies, bilingual multi-task learning and joint learning, are effective at enhancing the cross-lingual learning process under limited resources. By utilizing the captured correspondence of lexical and sentential semantics, our model achieves promising performance on both cross-lingual reverse dictionary retrieval and cross-lingual paraphrase identification.

An important direction of future work lies in bilingual word embeddings. Existing models use either seed lexicons or sentence alignments to capture cross-lingual semantic transfer; we are interested in exploring whether that phase of learning word embeddings can be improved by incorporating the lexicon-sentence alignment used in this work. Applying BilDRL to multilingual dialogue and question answering systems is another important direction.


  • Al-Rfou et al. (2013) Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In The SIGNLL Conference on Computational Natural Language Learning.
  • Artetxe et al. (2016) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 2289–2294.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
  • Bannard and Callison-Burch (2005) Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pages 597–604.
  • Chandar et al. (2014) Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems. pages 1853–1861.
  • Chen et al. (2017) Muhao Chen, Yingtao Tian, Mohan Yang, and Carlo Zaniolo. 2017. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. pages 1511–1517.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1724–1734.
  • Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 93–98.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, et al. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 670–680.
  • Coulmance et al. (2015) Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1109–1113.
  • Das and Smith (2009) Dipanjan Das and Noah A Smith. 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, pages 468–476.
  • Devlin et al. (2014) Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1370–1380.
  • Eisner et al. (2016) Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. 2016. emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359 .
  • Faruqui and Dyer (2014) Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. pages 462–471.
  • Feng et al. (2018) Xiaocheng Feng, Xiachong Feng, Bing Qin, Zhangyin Feng, and Ting Liu. 2018. Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.
  • Goldberg (2009) Nathaniel Goldberg. 2009. Triangulation, untranslatability, and reconciliation. Philosophia 37(2):261–280.
  • Gouws et al. (2015) Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. Bilbowa: Fast bilingual distributed representations without word alignments. In Proceedings of the 32nd International Conference on Machine Learning. pages 748–756.
  • Hill et al. (2016) Felix Hill, KyungHyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics 4:17–30.
  • Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems. pages 2042–2050.
  • Huang et al. (2015) Kejun Huang, Matt Gardner, Evangelos Papalexakis, Christos Faloutsos, Nikos Sidiropoulos, Tom Mitchell, Partha P Talukdar, and Xiao Fu. 2015. Translation invariant word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1084–1088.
  • Ji et al. (2017) Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of the AAAI International Conference on Artificial Intelligence. pages 3060–3066.
  • Jiang et al. (2018) Jyun-Yu Jiang, Francine Chen, Yan-Ying Chen, and Wei Wang. 2018. Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking.
  • Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 212–217.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems. pages 3294–3302.
  • Koehn (2005) Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit. volume 5, pages 79–86.
  • Lample et al. (2018) Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Hervé Jégou, et al. 2018. Word translation without parallel data. In International Conference on Learning Representations.
  • Liu et al. (2017) Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pages 115–124.
  • Lu et al. (2015) Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Deep multilingual correlation for improved word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 250–256.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. pages 151–159.
  • Mikolov et al. (2013a) Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 .
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, et al. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.
  • Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. pages 1928–1937.
  • Mogadala and Rettinger (2016) Aditya Mogadala and Achim Rettinger. 2016. Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 692–702.
  • Mueller and Thyagarajan (2016) Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the AAAI Conference on Artificial Intelligence. pages 2786–2792.
  • Nenkova and McKeown (2012) Ani Nenkova and Kathleen McKeown. 2012. A survey of text summarization techniques. In Mining text data, Springer, pages 43–76.
  • Reddi et al. (2018) Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the convergence of adam and beyond. In International Conference on Learning Representations.
  • Rocktäschel et al. (2016) Tim Rocktäschel, Edward Grefenstette, et al. 2016. Reasoning about entailment with neural attention. In International Conference on Learning Representations.
  • Sha et al. (2016) Lei Sha, Baobao Chang, et al. 2016. Reading and thinking: Re-read lstm unit for textual entailment recognition. In Proceedings of the International Conference on Computational Linguistics.
  • Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems. pages 6833–6844.
  • Tsai and Roth (2016) Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 589–598.
  • Upadhyay et al. (2016) Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1661–1670.
  • Vyas and Carpuat (2016) Yogarshi Vyas and Marine Carpuat. 2016. Sparse bilingual word representations for cross-lingual lexical entailment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 1187–1197.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 .
  • Xing et al. (2015) Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 1006–1011.
  • Yin and Schütze (2015) Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Yin et al. (2016) Wenpeng Yin, Hinrich Schütze, et al. 2016. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics 4(1).
  • Zhou et al. (2016) Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1403–1412.