Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task
In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching images and their relevant captions in two languages. We combine two existing objective functions to make images and captions close in a joint embedding space while adapting the alignment of word embeddings between existing languages in our model. We show that our approach enables better generalization, achieving state-of-the-art performance in text-to-image and image-to-text retrieval tasks, and in the caption-caption similarity task. Two multimodal multilingual datasets are used for evaluation: Multi30k with German and English captions and Microsoft-COCO with English and Japanese captions.
In recent years, there has been a significant amount of research on text and image retrieval tasks, which require joint modeling of both modalities. Further, a large number of image-text datasets have become available (elliott:16; hodosh:13; young:14; lin:14), and several models have been proposed to generate captions for the images in these datasets (lu:18; bernardi:16; anderson:17; jilu:16; mao:15; rennie:16). There has also been a great deal of research on learning a joint embedding space for texts and images, in order to use the model for sentence-based image search or cross-modal retrieval (frome:13; kiros:15; donahue:15; angeliki:15; socher:14; hodosh:13; karpathy:14).
Previous work on the image-caption task and on learning a joint embedding space for texts and images has mostly focused on English. Recently, however, a large amount of research has addressed other languages, owing to the availability of multilingual datasets (funaki:15; elliott:16; rajendran:16; miyazaki:16; specia:16; young:14; hitschler:16; yoshikawa:17). The aim of these models is to map images and their captions in a single language into a joint embedding space (rajendran:16; calixto:17).
Related to our work, gella:17 proposed a model that learns a multilingual multimodal embedding by using the image as a pivot between the languages of the captions. While a separate text encoder is trained for each language in gella:17, we instead propose a model that learns a shared, language-independent text encoder, yielding better generalization. It is generally important to adapt word embeddings to the task at hand. Our model enables tuning of the word embeddings while keeping the two languages aligned during training, building a task-specific shared embedding space for the existing languages.
To this end, we define a new objective function that combines a pairwise ranking loss with a loss that maintains the alignment across multiple languages. For the latter, we use the objective function proposed in joulin:18 for learning a linear mapping between languages. It is inspired by the cross-domain similarity local scaling (CSLS) retrieval criterion (conneau:17) and obtains state-of-the-art performance on the word translation task.
In the following sections, we refer to the proposed approach as Aligning Multilingual Embeddings for cross-modal retrieval (AME). With experiments on two multimodal multilingual datasets, we show that AME outperforms existing models on text-image multimodal retrieval tasks. The code we used to train and evaluate the model is available at https://github.com/alirezamshi/AME-CMR
We use two multilingual image-caption datasets to evaluate our model, Multi30k and Microsoft COCO (elliott:16; lin:14).
Multi30k is a dataset with 31’014 German translations of English captions and 155’070 independently collected German and English captions. In this paper, we use the independently collected captions, in which each image has five German and five English captions. The training set includes 29’000 images. The validation and test sets each contain 1’000 images.
MS-COCO (lin:14) contains 123’287 images and five English captions per image. yoshikawa:17 proposed a model which generates Japanese descriptions for images. We split the dataset following karpathy:15. The training set contains 113’287 images. The validation and test sets each contain 5’000 images.
3 Problem Formulation
3.1 Model for Learning a Multilingual Multimodal Representation
Assume an image $i$ and its captions $c_X$ and $c_Y$ are given in two languages, $X$ and $Y$ respectively. Our aim is to learn a model where the image $i$ and its captions $c_X$ and $c_Y$ are close in a joint embedding space of dimension $m$. AME consists of two encoders, $f_i$ and $f_c$, which encode images and captions. As the multilingual text encoder, we use a recurrent neural network with gated recurrent units (GRU). For the image encoder, we use a convolutional neural network (CNN) architecture. The similarity between a caption $c$ and an image $i$ in the joint embedding space is measured with a similarity function $s(c, i)$. The objective function is as follows (inspired by gella:17):
\[ \mathcal{L}_R = \sum_{(c_S, i)} \Big( \sum_{c'_S} \max\{0, \alpha - s(c_S, i) + s(c'_S, i)\} + \sum_{i'} \max\{0, \alpha - s(c_S, i) + s(c_S, i')\} \Big) \]
where $S \in \{X, Y\}$ stands for both languages, and $\alpha$ is the margin. $c'_S$ and $i'$ are an irrelevant caption and an irrelevant image for the gold-standard pair $(c_S, i)$.
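The pairwise ranking objective can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes cosine similarity, treats every other item in a mini-batch as an irrelevant caption or image, and uses a placeholder margin value. The function names `l2_normalize` and `pairwise_ranking_loss` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def pairwise_ranking_loss(caption_emb, image_emb, margin=0.2):
    """Sum-of-hinges ranking loss over a batch of gold (caption, image) pairs.

    caption_emb[j] and image_emb[j] form a gold pair; all other items in the
    batch act as irrelevant captions / images (an assumption of this sketch).
    The default margin is a placeholder, not a value taken from the paper.
    """
    c = l2_normalize(caption_emb)
    v = l2_normalize(image_emb)
    scores = c @ v.T                # scores[a, b] = s(caption a, image b)
    gold = np.diag(scores)          # gold[j] = s(caption j, image j)

    # Irrelevant captions: for image b, caption a != b should score lower.
    cost_captions = np.maximum(0.0, margin - gold[None, :] + scores)
    # Irrelevant images: for caption a, image b != a should score lower.
    cost_images = np.maximum(0.0, margin - gold[:, None] + scores)

    off_diagonal = ~np.eye(scores.shape[0], dtype=bool)
    return float(cost_captions[off_diagonal].sum() + cost_images[off_diagonal].sum())
```

For two languages, one would evaluate this function once per language against the same image embeddings and add the two values, which corresponds to summing over both languages in the objective.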
3.2 Alignment Model
Each word $k$ in the language $X$ is defined by a word embedding $x_k \in \mathbb{R}^d$ ($y_k \in \mathbb{R}^d$ in the language $Y$, respectively). Given a bilingual lexicon of $N$ pairs of words, we assume the first $n$ pairs $\{(x_i, y_i)\}_{i=1}^{n}$ are the initial seeds, and our aim is to augment it to all word pairs that are not in the initial lexicon. mikolov:13 proposed a model to learn a linear mapping $W$ between the source and target languages:
\[ \min_{W \in \mathbb{R}^{d \times d}} \frac{1}{n} \sum_{i=1}^{n} \ell(W x_i, y_i) \]
where $\ell$ is a square loss. One can find the translation of a source word in the target language by performing a nearest neighbor search with Euclidean distance. However, the model suffers from a "hubness problem": some word embeddings become the nearest neighbors of an unusually large number of other words (george:98; dinu:14).
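To make the mapping step concrete, the sketch below fits a linear mapping on seed pairs with a square loss and translates a source word by Euclidean nearest-neighbor search. It is an illustration only: it assumes a closed-form least-squares solution, and the names `fit_linear_mapping` and `translate_nn` are hypothetical.

```python
import numpy as np


def fit_linear_mapping(source_vecs, target_vecs):
    """Least-squares W minimising the mean of ||W x_i - y_i||^2.

    source_vecs and target_vecs are (n, d) arrays whose rows are the seed
    pairs (x_i, y_i). Solves source_vecs @ W.T ~= target_vecs.
    """
    w_transposed, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return w_transposed.T


def translate_nn(W, source_vec, target_vecs):
    """Index of the target word closest (Euclidean) to the mapped source word."""
    mapped = W @ source_vec
    distances = np.linalg.norm(target_vecs - mapped, axis=1)
    return int(np.argmin(distances))
```

Plain nearest-neighbor retrieval of this kind is where hubness shows up: a few target vectors end up being the closest match for many different mapped source words.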
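To resolve this issue, joulin:18 proposed a new objective function, inspired by the CSLS criterion, which learns the linear mapping by minimizing over $W$:
\[ \mathcal{L}_A(W) = \frac{1}{n} \sum_{i=1}^{n} \Big( -2\, x_i^\top W^\top y_i + \frac{1}{k} \sum_{y_j \in \mathcal{N}_Y(W x_i)} x_i^\top W^\top y_j + \frac{1}{k} \sum_{W x_j \in \mathcal{N}_X(y_i)} x_j^\top W^\top y_i \Big) \]
where $\mathcal{N}_X(y_i)$ means the $k$-nearest neighbors of $y_i$ in the set of source language $X$. They constrained the linear mapping $W$ to be orthogonal, and word vectors are $\ell_2$-normalized.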
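The CSLS-inspired objective can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions rather than the implementation of joulin:18: nearest neighbors are searched only among the lexicon pairs passed in (not the full vocabularies), the orthogonality constraint on the mapping is not enforced, and the name `rcsls_loss` is hypothetical.

```python
import numpy as np


def rcsls_loss(W, source_vecs, target_vecs, k=5):
    """CSLS-style alignment loss for paired, l2-normalised word vectors.

    source_vecs[i] and target_vecs[i] are a translation pair. For unit
    vectors, the k nearest neighbours are the k largest dot products.
    """
    mapped = source_vecs @ W.T          # row i is W x_i
    sim = mapped @ target_vecs.T        # sim[i, j] = x_i^T W^T y_j
    gold = np.diag(sim)                 # similarity of each gold pair

    # Mean similarity of each mapped source word to its k nearest targets.
    near_targets = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target word to its k nearest mapped sources.
    near_sources = np.sort(sim, axis=0)[-k:, :].mean(axis=0)

    return float(np.mean(-2.0 * gold + near_targets + near_sources))
```

The two neighborhood terms penalize words that are close to many others, which is how the criterion counteracts hubness: a correct mapping scores lower than one that pairs words incorrectly.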
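The whole loss function is the equally weighted sum of the aforementioned objective functions:
\[ \mathcal{L}_{total} = \mathcal{L}_R + \mathcal{L}_A \]
where $\mathcal{L}_R$ is the pairwise ranking loss of Section 3.1 and $\mathcal{L}_A$ is the alignment loss of Section 3.2.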
We use two different similarity functions, symmetric and asymmetric. For the former, we use the cosine similarity function, and for the latter, we use the metric proposed in vendrov:16, which encodes the partial order structure of the visual-semantic hierarchy. The metric similarity is defined as:
\[ S(a, b) = -\lVert \max(0, b - a) \rVert^2 \]
where $a$ and $b$ are the embeddings of the image and the caption, respectively.
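The difference between the two similarity functions is easy to see in code. The following sketch assumes the order-violation form of the asymmetric metric from vendrov:16 (a negative squared norm of the positive part of caption minus image); the function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Symmetric similarity: cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def order_similarity(image_emb, caption_emb):
    """Asymmetric similarity: minus the squared order violation.

    The score is 0 when every coordinate of the caption embedding is at most
    the corresponding image coordinate, and negative otherwise.
    """
    violation = np.maximum(0.0, caption_emb - image_emb)
    return float(-np.sum(violation ** 2))
```

Swapping the arguments of `order_similarity` generally changes the score, whereas `cosine_similarity` is unaffected by argument order.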
4 Experiment and Results
4.1 Details of Implementation
In this section, the hyper-parameters in parentheses are related to the model trained on MS-COCO.
We use mini-batches of size 128. We use the Adam optimizer with learning rate 0.00011 (0.00006) and early stopping on the validation set. We set the dimensionality of joint embedding space and the GRU hidden layer to . We utilize the pre-trained aligned word vectors of FastText for the initial word embeddings. For the Japanese word embeddings, we use the pre-trained word vectors of FastText (available at https://fasttext.cc/docs/en/crawl-vectors.html and https://fasttext.cc/docs/en/aligned-vectors.html), then align them to the English word embeddings with the same hyper-parameters used for MS-COCO. We set the margin and for symmetric and asymmetric similarity functions respectively.
We set the number of nearest neighbors to 5 (4). We set , and . We tokenize English and German captions with the Europarl tokenizer (koehn:05). For the Japanese captions, we use the MeCab analyzer (kudo:04). We train the model for 30 (20) epochs, dividing the learning rate by 10 at epoch 15 (10).
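The learning-rate schedule described above amounts to a single step decay. A minimal sketch follows, assuming epochs are counted from zero and the reduced rate applies from the decay epoch onward; both conventions, and the function name `learning_rate`, are assumptions of this example.

```python
def learning_rate(epoch, base_lr=0.00011, decay_epoch=15, factor=10.0):
    """Step schedule: base_lr before decay_epoch, base_lr / factor afterwards.

    Defaults follow the Multi30k setting in the text; for MS-COCO one would
    pass base_lr=0.00006 and decay_epoch=10.
    """
    return base_lr / factor if epoch >= decay_epoch else base_lr
```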
To extract image features, we use a ResNet152 (he:15) CNN architecture pre-trained on ImageNet and take the features from FC7, the penultimate fully connected layer. We average the features over 10 crops of the re-scaled images.
For the alignment metric, we use the bilingual lexicons of the Multilingual Unsupervised and Supervised Embeddings (MUSE) benchmark (lample:17). MUSE provides large-scale, high-quality bilingual dictionaries for training and evaluating the translation task. We extract the training words of the descriptions in the two languages. For training, we combine the "full" and "test" sections of MUSE, then filter them to the training words. For evaluation, we filter the "train" section of MUSE to the training words. (The code for building the bilingual lexicons can be found at the GitHub link.)
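Filtering a lexicon to the training words can be sketched as below. This is an illustration, not the authors' lexicon-building code: it assumes a pair is kept only when both of its words occur in the training captions, and the name `filter_lexicon` is hypothetical.

```python
def filter_lexicon(pairs, source_vocab, target_vocab):
    """Keep only the word pairs whose two words both occur in the training vocabularies.

    pairs is an iterable of (source_word, target_word) tuples, for example
    read from a bilingual dictionary file; the vocabularies are collections
    of the words seen in the training captions of each language.
    """
    source_vocab, target_vocab = set(source_vocab), set(target_vocab)
    return [(s, t) for s, t in pairs if s in source_vocab and t in target_vocab]


# Toy usage with made-up English-German entries.
toy_pairs = [("dog", "hund"), ("cat", "katze"), ("tree", "baum")]
kept = filter_lexicon(toy_pairs, {"dog", "tree"}, {"hund", "katze"})
```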
To evaluate the benefit of the proposed objective function, we compare AME with monolingual training (Mono) and with multilingual training without the alignment model described in Section 3.2. For the latter, the pre-aligned word embeddings are frozen during training (FME). We add Mono since the model proposed in gella:17 did not use pre-trained word embeddings for initialization, and its image encoder is different (ResNet152 vs. VGG19).
We compare models based on two retrieval metrics, recall at position k (R@k) and median rank (Mr).
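Both metrics can be computed from a query-by-candidate similarity matrix. The sketch below is a simplified illustration: it assumes exactly one gold candidate per query, stored at the same index as the query, whereas the datasets used here have five captions per image. The function names are hypothetical.

```python
import numpy as np


def retrieval_ranks(scores):
    """1-based rank of the gold candidate for every query.

    scores[q, j] is the similarity between query q and candidate j, and the
    gold candidate of query q is assumed to be candidate q.
    """
    order = np.argsort(-scores, axis=1)   # candidates from best to worst
    gold = np.arange(scores.shape[0])[:, None]
    return np.argmax(order == gold, axis=1) + 1


def recall_at_k(ranks, k):
    """Percentage of queries whose gold candidate is ranked in the top k."""
    return float(100.0 * np.mean(ranks <= k))


def median_rank(ranks):
    """Median of the gold-candidate ranks (lower is better)."""
    return float(np.median(ranks))


toy_scores = np.array([[0.9, 0.1, 0.0],
                       [0.8, 0.2, 0.1],
                       [0.1, 0.3, 0.2]])
toy_ranks = retrieval_ranks(toy_scores)   # gold ranks: 1, 2, 2
```

Higher R@k and lower Mr both indicate better retrieval; transposing the score matrix switches between the image-to-text and text-to-image directions.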
4.2 Multi30k Results
In Tables 1 and 2, we show the results for English and German captions. For English captions, we see a 21.28% improvement on average compared to kiros:15. There is a 1.8% boost on average compared to Mono, due to more training data and the multilingual text encoder. AME performs better than the FME model in both symmetric and asymmetric modes, which shows the advantage of fine-tuning word embeddings during training. We obtain a 25.26% boost on average compared to kiros:15 in asymmetric mode.
For German descriptions, the results are 11.05% better on average compared to gella:17 in symmetric mode. AME also achieves competitive or better results than the FME model on German descriptions.
4.3 MS-COCO Results
To compare with baselines, scores are measured by averaging 5 folds of 1K test images.
In Tables 3 and 4, we show the performance of AME and the baselines for English and Japanese captions. We achieve a 10.42% improvement on average compared to kiros:15 in symmetric mode. Adapting the word embeddings to the task at hand boosts the general performance, since the AME model significantly outperforms the FME model in both languages.
For the Japanese captions, AME reaches 6.25% and 3.66% better results on average compared to the monolingual model in symmetric and asymmetric modes, respectively.
4.4 Alignment Results
In Tables 1 and 2, we can see that the alignment ratio for AME is 6.80% lower than that of FME, which means that the translators can almost keep the languages aligned on the Multi30k dataset. On the MS-COCO dataset, the alignment ratio for AME is 8.93% lower than that of FME.
We compute the alignment ratio and recall at position 1 (R@1) at each validation step. Figure 2 shows the trade-off between the alignment and retrieval tasks. In the first few epochs, the model improves the alignment ratio, since the retrieval task has not yet seen enough instances. Then, the retrieval task starts to fine-tune the word embeddings. Finally, the two tasks reach an agreement around the midpoint of the training process. At this point, we update the learning rate of the retrieval task to improve the performance, and the alignment ratio remains constant.
We also train the AME model without the alignment objective function. In that case, the model breaks the alignment between the initially aligned word embeddings, so it is essential to add the alignment objective function to the retrieval task.
4.5 Caption-Caption Similarity Scores
Given a caption in one language, the task is to retrieve the related caption in the other language. In Table 5, we show the performance on the Multi30k dataset in asymmetric mode. AME outperforms the FME model, confirming the importance of adapting the word embeddings.
We proposed a multimodal model with a shared multilingual text encoder that adapts the alignment between languages during training for the image-description retrieval task. We introduced a loss function that combines a pairwise ranking loss with a loss that maintains the alignment of word embeddings in multiple languages. Through experiments on different multimodal multilingual datasets, we have shown that our approach yields better generalization performance on image-to-text and text-to-image retrieval tasks, as well as on the caption-caption similarity task.
In future work, we can investigate applying self-attention models such as the Transformer (vaswani2017attention) to the shared text encoder to find a more comprehensive representation for the descriptions in the dataset. Additionally, we can explore the effect of a weighted sum of the two loss functions instead of weighting them equally.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.