MULE: Multimodal Universal Language Embedding
Existing vision-language methods typically support at most two languages at a time. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate the MULE to visual data as if it were a single language. Unlike prior work, which typically learned separate branches for each language, our method is not architecture specific, so it can easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations.
Vision-language understanding has been an active area of research addressing many tasks such as image captioning [Fang et al.2015, Gu et al.2018], visual question answering [Antol et al.2015, Goyal et al.2017], image-sentence retrieval [Wang et al.2019, Nam, Ha, and Kim2017], and phrase grounding [Plummer et al.2015, Hu et al.2016]. Recently, some attention has been paid to moving beyond monolingual (typically English-only) methods by also supporting a second language in the same model (e.g., Gella et al. 2017; Hitschler, Schamoni, and Riezler 2016; Rajendran et al. 2015; Calixto, Liu, and Campbell 2017; Li et al. 2019; Lan, Li, and Dong 2017). However, these methods often learn completely separate language representations to relate to visual data, so the number of language-specific model parameters grows linearly with the number of supported languages.
In this paper, we propose a Multimodal Universal Language Embedding (MULE), an embedding that has been visually-semantically aligned across many languages. Since each language is embedded into a shared space, we can use a single task-specific multimodal model, enabling our approach to scale to support many languages. Most prior works use a vision-language model that supports at most two languages with separate language branches (e.g. Gella et al. 2017), significantly increasing the number of parameters compared to our work (see Fig. 1 for a visualization). A significant challenge of multilingual embedding learning is the considerable disparity in the availability of annotations between different languages. For English, there are many large-scale vision-language datasets to train a model such as MSCOCO [Lin et al.2014] and Flickr30K [Young et al.2014], but there are few datasets available in other languages, and some contain limited annotations (see Table 1 for a comparison of the multilingual datasets used to train MULE). One could simply use Neural Machine Translation (e.g. Bahdanau, Cho, and Bengio 2014; Sutskever, Vinyals, and Le 2014) to convert the sentence from the original language to a language with a trained model, but this has two significant limitations. First, machine translations are not perfect and introduce some noise, making vision-language reasoning more difficult. Second, even with a perfect translation, some information is lost going between languages. For example, “她们” is used to refer to a group of women in Chinese. However, it is translated to “they” in English, losing all gender information that could be helpful in a downstream task. Instead of fully relying on translations, we introduce a scalable approach that supports queries from many languages in a single model.
|Dataset||Language||# images||# descriptions|
An overview of the architecture we use to train MULE is provided in Fig. 2. For each language we use a single fully-connected layer on top of each word embedding to project it into an embedding space shared between all languages, i.e., our MULE features. Training our embedding consists of three components. First, we use an adversarial language classifier in order to align feature distributions between languages. Second, motivated by the sentence-level supervision used to train language embeddings [Devlin et al.2018, Kiela et al.2018, Lu et al.2019], we incorporate visual-semantic information by learning how to match image-sentence pairs using a multimodal network similar to Wang et al. 2019. Third, we ensure semantically similar sentences are embedded close to each other (referred to as neighborhood constraints in Fig. 2). Unlike prior work (e.g., Gella et al. 2017), MULE does not require changes to the architecture of the multimodal model, so our approach can easily be incorporated into other multimodal models.
Our experiments will show that recent multilingual embeddings like MUSE [Conneau et al.2018] (https://github.com/facebookresearch/MUSE), despite being trained to align languages using additional large corpora of text data across each supported language, perform significantly worse than our approach on tasks like multilingual image-sentence matching. In addition, sharing all the parameters of the multimodal component of our network enables languages with fewer annotations to take advantage of the stronger representation learned using more data. Thus, as our experiments will show, MULE obtains its largest performance gains on languages with less training data. This gain is boosted further by using Neural Machine Translation as a data augmentation technique to increase the available vision-language training data.
We summarize our contributions as follows:
- We propose MULE, a multilingual text representation for vision-language tasks that can transfer and learn textual representations for low-resourced languages from label-rich languages, such as English.
- We demonstrate MULE’s effectiveness on a multilingual image-sentence retrieval task, where we outperform extensions of prior work by up to 20.2% on a single language while also using fewer model parameters.
- We show that using Machine Translation is a beneficial data augmentation technique for training multilingual embeddings for vision-language tasks.
Language Representation Learning. Word embeddings, such as Word2Vec [Mikolov, Yih, and Zweig2013] and FastText [Bojanowski et al.2017], play an important role in vision-language tasks. These word embeddings provide a mapping function from a word to an n-dimensional vector where semantically similar words are embedded close to each other and are typically trained using language-only data. However, recent work has demonstrated a disconnect between how these embeddings are evaluated and the needs of vision-language tasks [Burns et al.2019]. Thus, several recent methods have obtained significant performance gains across many tasks over language-only trained counterparts by learning the visual-semantic meaning of words specifically for use in vision-language problems [Kottur et al.2016, Kiela et al.2018, Burns et al.2019, Lu et al.2019, Gupta, Schwing, and Hoiem2019, Nguyen and Okatani2019, Tan and Bansal2019]. All these methods have addressed embedding learning only in the monolingual (English-only) setting, however, and none of the methods that align representations across many languages were designed specifically for vision-language tasks (e.g. Conneau et al. 2018; Rajendran et al. 2015; Calixto, Liu, and Campbell 2017). Thus, just as in the monolingual setting, and verified in our experiments, these multilingual, language-only trained embeddings do not generalize as well to vision-language tasks as the visually-semantically aligned multilingual embeddings in our approach.
Image-Sentence Retrieval. The goal of this task is to retrieve relevant images given a sentence query and vice versa. Although this task has received considerable attention, nearly all methods have focused on supporting queries in a single language, which is nearly always English (e.g. Nam, Ha, and Kim 2017; Wang et al. 2019). These models tend to either learn an embedding between image and text features (e.g. Plummer et al. 2015; Wang et al. 2019) or sometimes directly learn a similarity function (e.g. Wang et al. 2019). Most relevant to our work is Gella et al. 2017, who propose a cross-lingual model which uses an image as a pivot and enforces the sentence representations from English and German to be similar to the pivot image representation, similar to the structure-preserving constraints of Wang et al. 2019. However, in Gella et al. 2017 each language is modeled with a completely separate language model. While this may be acceptable for modeling one or two languages, it would not scale well for representing many languages as the number of parameters would grow too large. Instead, we learn a shared representation between all languages, enabling us to scale to many languages with few additional parameters, while enabling feature sharing with lower-resource languages.
Neural Machine Translation. In Neural Machine Translation (NMT) the goal is to translate text from one language to another language with parallel text corpora [Bahdanau, Cho, and Bengio2014, Sutskever, Vinyals, and Le2014, Johnson et al.2017]. Johnson et al. 2017 proposed a multilingual NMT model, which uses a single model with an encoder-decoder architecture. They observed that translation quality on low-resourced languages can be improved when trained with label-rich languages. As discussed in the Introduction, and verified in our experiments, directly using NMT for vision-language tasks has some limitations, but it can provide additional benefits when combined with our method.
Visual-Semantic Multilingual Alignment
In this section we describe how we train MULE, a lightweight multilingual embedding which is visually-semantically aligned across many languages and can easily be incorporated into many vision-language tasks and models. Each word in some language input is encoded using a continuous vector representation, which is then projected to the shared language embedding (MULE) using a language-specific fully connected layer. In our experiments, we initialize our word embeddings from 300-dimensional monolingual FastText embeddings [Bojanowski et al.2017]. The word embeddings and these fully connected layers are the only language-specific parameters in our network. Due to their compact size, they can easily scale to a large vocabulary encompassing many languages.
To train MULE, we use paired and unpaired sentences between the languages from annotated vision-language datasets. We find that we get the best performance by first pretraining MULE with paired sentences before fine-tuning using the multimodal layers with the multi-layer neighborhood constraints described in (Eq. 1) and the adversarial language classifier described below. While our experiments focus solely on utilizing multimodal data, one could also try to integrate large text corpora with annotated language pairs (e.g. Conneau et al. 2018). However, as our experiments will show, only using generic language pairs for this alignment (i.e., not sentences related to images) results in some loss of information that is important for vision-language reasoning. We will now discuss the three major components of our loss used to train our embedding as shown in Fig. 2.
Multi-Layer Neighborhood Constraints
During training we assume we have paired sentences obtained from the vision-language annotations, i.e., sentences that describe the same image. These sentences are typically independently generated, so they may not refer to the same entities in the image, and when they do describe the same object they may be referenced in different ways (e.g., a black dog vs. a Rottweiler). However, we assume they convey the same general sentiment since they describe the same image. Thus, the multi-layer neighborhood constraints try to encourage sentences from the same image to embed near each other. These constraints are analogous to those proposed in related work on image-sentence matching [Gella et al.2017, Wang et al.2019], except that we apply the constraints at multiple layers of our network. Namely, we use the neighborhood constraints on the MULE layer as well as the multimodal embedding layer as done in prior work.
To obtain sentence representations in the MULE space, we simply average the features of each word, which we found to perform better than using an LSTM while increasing model efficiency (an observation also made by Burns et al. 2019 and Wang et al. 2019). For the multimodal embedding space, we use the same features for a multimodal sentence representation that is used to relate to the image features. We refer to the averaged representations in the MULE space as MULE sentence embeddings and to their counterparts in the multimodal space as multimodal sentence embeddings; both are shown in Fig. 2.
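The projection and averaging steps described above can be illustrated with a minimal pure-Python sketch. The function names, the toy dimensions (3-d word vectors, 2-d shared space), and all weight values are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from statistics import fmean


def project(word_vec, weights, bias):
    """Apply one fully connected layer; weights is a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, word_vec)) + b
            for row, b in zip(weights, bias)]


def mule_sentence_embedding(word_vecs, weights, bias):
    """Project every word with its language's FC layer, then mean-pool over words."""
    projected = [project(v, weights, bias) for v in word_vecs]
    return [fmean(dim) for dim in zip(*projected)]


# Hypothetical per-language FC layers as (weights, bias); values are made up.
fc_layers = {
    "en": ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]),
    "de": ([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]], [0.5, -0.5]),
}

sentence = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]  # two toy word vectors
emb_en = mule_sentence_embedding(sentence, *fc_layers["en"])
emb_de = mule_sentence_embedding(sentence, *fc_layers["de"])
```

Only the entries of `fc_layers` (and the word vectors themselves) depend on the language, which is what keeps the language-specific part of the model small.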
The neighborhood constraints are enforced using a triplet loss function. Each sentence embedding is compared against a positive pair and a negative pair, and this is done both for MULE sentence embeddings and for multimodal sentence embeddings. Positive and negative pairs may be from any language. So, for example, German and Czech sentences describing the same image are all positive pairs, while any pair of sentences from different images we assume are negatives (analogous assumptions were made in Gella et al. 2017 and Wang et al. 2019). Using a cosine distance function, we minimize a margin-based triplet loss (Eq. 1) that requires each embedding to be closer to its positive pair than to its negative pair by at least the margin.
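A sketch of the margin-based triplet loss described above, written with assumed notation: $u_i$ is a MULE sentence embedding, $s_i$ the corresponding multimodal sentence embedding, the superscripts $+$ and $-$ mark a positive and a negative pair, $d$ is the cosine distance, and $m$ is the margin. These symbols are chosen for illustration and may differ from the original notation.

```latex
\begin{equation}
L_{nc} = \max\bigl(0,\; d(u_i, u_i^{+}) - d(u_i, u_i^{-}) + m\bigr)
       + \max\bigl(0,\; d(s_i, s_i^{+}) - d(s_i, s_i^{-}) + m\bigr)
\end{equation}
```

The first term applies the constraint in the MULE space and the second in the multimodal space, which is the sense in which the constraints are multi-layer.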
Following Wang et al. 2019, we enumerate all positive and negative pairs in a minibatch and use only a fixed number of the most violated constraints.
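The selection of the most violated constraints can be sketched as follows. This is a simplified illustration, not the authors' code: it pools all triplets in the minibatch before keeping the `k` largest hinge violations, and the values of `margin` and `k` are placeholders because the values used in the experiments are not given here.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def topk_triplet_loss(embs, image_ids, margin=0.1, k=2):
    """Sum the k largest hinge violations over all (anchor, positive, negative) triplets.

    Sentences that share an image id are positives (regardless of language);
    sentences from different images are negatives.
    """
    violations = []
    n = len(embs)
    for a in range(n):
        for p in range(n):
            if p == a or image_ids[p] != image_ids[a]:
                continue  # p must be a different sentence of the same image
            for q in range(n):
                if image_ids[q] == image_ids[a]:
                    continue  # q must come from a different image
                v = (cosine_distance(embs[a], embs[p])
                     - cosine_distance(embs[a], embs[q]) + margin)
                violations.append(max(0.0, v))
    return sum(sorted(violations, reverse=True)[:k])


# Toy minibatch: sentences 0 and 1 describe image 0, sentence 2 describes image 1.
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
image_ids = [0, 0, 1]
loss = topk_triplet_loss(embs, image_ids, margin=0.1, k=1)
```

In this toy batch the anchor at index 0 is identical to the negative and orthogonal to its positive, so it produces the largest violation and is the one kept when `k=1`.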
Language Domain Alignment
Inspired by the domain adaptation approaches of Ganin and Lempitsky 2014 and Tzeng et al. 2014, we use an adversarial language classifier (LC) to align the feature distributions of the different languages supported by our model. The goal is to project each language domain into a single shared domain, so that the model transfers knowledge between languages. This classifier does not require paired language data. We use a single fully connected layer for the LC. Given a MULE sentence representation and the language it is written in, we first minimize the language classification loss with respect to the parameters of the language classifier.
Then, in order to align the language domain, we learn language-specific parameters to maximize the loss function.
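A sketch of the two-step adversarial objective described above. The symbols are assumptions for illustration: $W_{lc}$ is the language classifier layer, $u_i$ a MULE sentence representation, $l_i$ the index of its language, and $\theta_{lang}$ the language-specific parameters. The choice of softmax cross-entropy as the classification loss is also an assumption; the text only states that the classifier minimizes a loss which the language-specific parameters then maximize.

```latex
\begin{align}
L_{LC}(W_{lc}; u_i, l_i) &= \mathrm{CrossEntropy}\bigl(\mathrm{softmax}(W_{lc}\, u_i),\; l_i\bigr) \\
\hat{W}_{lc} &= \arg\min_{W_{lc}} \; L_{LC} \\
\hat{\theta}_{lang} &= \arg\max_{\theta_{lang}} \; L_{LC}
\end{align}
```

When the classifier can no longer tell which language an embedding came from, the feature distributions of the languages have been aligned in the shared space.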
To directly learn the visual meaning of words, we also use a multimodal model to relate sentences to images, which is trained along with our MULE embedding. To accomplish this, we use a two-branch network similar to that of Wang et al. 2019, except we use the last hidden state of an LSTM to obtain the final multimodal sentence representation (the multimodal sentence embedding in Fig. 2). Although Burns et al. 2019 and Wang et al. 2019 found that mean-pooled features followed by a pair of fully connected layers often perform better, we found using an LSTM to be more stable in our experiments. We also kept the image representation fixed, and only the two fully connected layers after the CNN in Fig. 2 were trained.
For the i-th image, consider its image representation and its sentence representation in the multimodal embedding space. We construct a minibatch that contains positive image-sentence pairs from different images, obtaining one pair of representations from each image-sentence pair. Note that the sentences may be written in multiple languages. We sample triplets so that both image representations and sentence representations have positive and negative pairs. Specifically, given an image representation, we sample its corresponding positive sentence representation and a negative sentence representation written in the same language. Likewise, given a sentence representation, we sample its positive image representation and a negative image representation. Using a cosine distance function, we then minimize a margin-based matching loss that requires each representation to be closer to its positive pair than to its negative pair by at least the margin.
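A sketch of the bidirectional matching loss described above, written with assumed notation: $f_i$ is the image representation and $s_i$ the sentence representation of the i-th image in the multimodal embedding space, the superscripts $+$ and $-$ mark positive and negative pairs, $d$ is the cosine distance, and $m$ is the margin. The symbols are chosen for illustration and may differ from the original notation.

```latex
\begin{equation}
L_{mm} = \max\bigl(0,\; d(f_i, s_i^{+}) - d(f_i, s_i^{-}) + m\bigr)
       + \max\bigl(0,\; d(s_i, f_i^{+}) - d(s_i, f_i^{-}) + m\bigr)
\end{equation}
```

The first term ranks sentences for a given image and the second ranks images for a given sentence, matching the two retrieval directions evaluated later.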
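As with the neighborhood constraints, the loss is computed over the most violated constraints. Finally, our overall objective is a weighted combination of the three losses. The parameters of the network other than the language classifier are trained to minimize the neighborhood constraint and matching losses while maximizing the language classifier loss, and the parameters of the language classifier are trained to minimize the language classifier loss. A weight on each loss determines its contribution.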
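A sketch of the overall objective consistent with the description above. The notation is assumed for illustration: $\theta$ denotes all network parameters except the language classifier, $W_{lc}$ the language classifier parameters, $L_{nc}$, $L_{mm}$ and $L_{LC}$ the neighborhood constraint, matching, and language classifier losses, and $\lambda_1, \lambda_2, \lambda_3$ the loss weights.

```latex
\begin{align}
\hat{\theta} &= \arg\min_{\theta} \; \lambda_1 L_{nc} + \lambda_2 L_{mm} - \lambda_3 L_{LC} \\
\hat{W}_{lc} &= \arg\min_{W_{lc}} \; \lambda_3 L_{LC}
\end{align}
```

The negative sign on $L_{LC}$ in the first line expresses the adversarial step: the embedding parameters are pushed to make the language classifier's task harder while the classifier itself is trained to do well.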
Multi30K [Elliott et al.2016, Elliott et al.2017, Barrault et al.2018]. The Multi30K dataset augments Flickr30K [Young et al.2014] with image descriptions in German, French, and Czech. Flickr30K contains 31,783 images where each image is paired with five English descriptions. There are also five sentences provided per image in German, but only one sentence per image is provided for French and Czech. French and Czech sentences are translations of their English counterparts, but German sentences were independently generated. We use the dataset’s provided splits, which use 29K/1K/1K images for training/test/validation.
MSCOCO [Lin et al.2014]. MSCOCO is a large-scale dataset which contains 123,287 images, each paired with 5 English sentences. Although this provides a much larger English training set than Multi30K, there are fewer annotated sentences in other languages. Miyazaki and Shimizu 2016 released the YJ Captions 26K dataset, which contains about 26K images in MSCOCO where each image is paired with 5 independent Japanese descriptions. Li et al. 2019 provide 22,218 independent Chinese image descriptions for 20,341 images in MSCOCO. There are only about 4K image descriptions which are shared across the three languages. Thus, in this dataset, an additional challenge is the need to use unpaired language data. We randomly selected 1K images each for the testing and validation sets from the images which contain descriptions across all three languages, for a total of 2K images, and used the rest for training. Since we use a different data split, it is not possible to compare directly with prior monolingual methods. We provide a fair comparison with our baseline and prior monolingual methods in the supplementary.
Machine Translations. As shown in Table 1, there is considerable disparity in the availability of annotations for training in different languages. As a way of augmenting these datasets, we use Google’s online translator to generate sentences in other languages. Since the sentences in other languages are independently generated, their translations can provide additional variation in the training data. This also enables us to evaluate the effectiveness of NMT. In addition, we use these translated sentences to benchmark the performance of translating queries from an unsupported language into one of the languages for which we have a trained model (e.g. translate a sentence from Chinese into English and perform the query using an English-trained model).
Image-Sentence Matching Results
Metrics. Performance on the image-sentence matching task is typically reported as Recall@K at three values of K for both image-to-sentence and sentence-to-image retrieval (e.g. as done in Gella et al. 2017; Nam, Ha, and Kim 2017; Wang et al. 2019), resulting in six values per language. Results reporting all six values for each language can be found in the supplementary. In this paper, we average them to obtain an overall score (mR) for each compared method/language.
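The averaging behind the mR score can be sketched as below. The cutoffs `(1, 5, 10)` are an assumption based on common practice for this task and are not stated in this text; the rank lists are made-up toy values, where each rank is the 1-indexed position of the first correct match for a query.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose first correct match is ranked within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def mean_recall(i2s_ranks, s2i_ranks, ks=(1, 5, 10)):
    """Average Recall@K over both retrieval directions and all cutoffs (six values)."""
    values = [recall_at_k(ranks, k)
              for ranks in (i2s_ranks, s2i_ranks)
              for k in ks]
    return sum(values) / len(values)


# Toy ranks for four queries in each direction.
i2s = [1, 3, 7, 12]   # image-to-sentence
s2i = [2, 2, 6, 9]    # sentence-to-image
mr = mean_recall(i2s, s2i)
```

Here the six recall values are 25, 50, 75 (image-to-sentence) and 0, 50, 100 (sentence-to-image), so `mr` is 50.0.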
Model Architecture. We compare the following models:
PARALLEL-EmbN. This model borrows ideas from Gella et al. 2017 to modify EmbN. Specifically, only a single image representation is trained, but the model contains separate language branches.
Multi30K Discussion. We report performance on the Multi30K dataset in Table 2. The first line of Table 2(a) reports performance when training completely separate models (i.e. no shared parameters) for each language in the dataset. The significant discrepancy between the performance of English and German compared to Czech and French can be attributed to the differences in the number of sentences available for each language (Czech and French have 1/5th the sentences as seen in Table 1). Performance improves across all languages using the PARALLEL model in Table 2(a), demonstrating that the representation learned for the languages with more available annotations can still be leveraged to the benefit of other languages.
Table 2(b) and Table 2(c) show the results of using multilingual embeddings: ML BERT [Devlin et al.2018] (https://github.com/google-research/bert/) and MUSE [Conneau et al.2018], which learns a shared FastText-style embedding space for all supported languages. This enables us to compare aligning languages using language-only data against our approach, which performs a visual-semantic language alignment. Note that a single EmbN model is trained across all languages when using MUSE rather than training separate models, since the embeddings are already aligned across languages. Comparing the numbers of Table 2(a) and Table 2(b), we observe that ML BERT, which is a state-of-the-art method in NLP, performs much worse than the monolingual FastText. In addition, we see in Table 2(c) that MUSE improves performance on low-resourced languages (i.e. French and Czech), but actually hurts performance on the language with more available annotations (i.e. English). These results indicate that some important visual-semantic knowledge is lost when relying solely on language-only data to align language embeddings, and that NLP methods do not necessarily generalize well to vision-language tasks.
Table 2(d) compares the effect that different components of MULE have on performance. Going from the last line of Table 2(a) to the first line of Table 2(d) demonstrates that using a single shared language branch can significantly improve lower-resource language performance (i.e. French and Czech), with only a minor impact on the performance of languages with more annotations. Comparing the last line of Table 2(c), which reports performance of our full model using MUSE embeddings, to the last line of Table 2(d), we see that using MUSE embeddings still hurts performance, which helps verify our earlier hypothesis that some important visual-semantic information is lost when aligning languages with only language data. This is also reminiscent of an observation in Burns et al. 2019, i.e., it is important to consider the visual-semantic meaning of words when learning a language embedding for vision-language tasks.
Breaking down the components of our model in the last three lines of Table 2(d), we show that including the multi-layer neighborhood constraints (NC), language classifier (LC), and pretraining MULE (LP) all provide significant performance improvements. In fact, they can make up for much of the lost performance on the high-resource languages when sharing a single language branch in the multimodal model, with German actually outperforming its separate language-branch counterpart. French and Czech perform even better, however, with a total improvement of 8.7% and 12.5% mean recall over our reproductions of prior work, respectively. Clearly, training multiple languages together in a single model, especially those with fewer annotations, can result in dramatic improvements to performance without having to sacrifice the performance of a single language as long as some care is taken to ensure the model learns a comparable representation between languages. Our method achieves the best performance on German, French, and Czech, while still being comparable for English.
MSCOCO Discussion. Table 3 reports results on MSCOCO. Here, the lower resource language is Chinese, while English and Japanese both have considerably more annotations (although, unlike German on Multi30K, English has considerably more annotations than Japanese on this dataset). For the most part, we see behavior on the MSCOCO dataset similar to what we saw on Multi30K: the lower resource language (Chinese) performs worse overall than the higher resource languages, but a significant portion of the performance gap is closed when using our full model. Overall, our formulation obtains a 5.9% improvement to mean recall over our baselines for Chinese, and also improves performance by 1.6% mean recall for Japanese. However, for English, we obtain a slight decrease in performance compared with the English-only model reported on the first line in Table 3(a).
The drop in performance on English could be due to the significant imbalance in the training data on this dataset, where more than 3/4 of the data contains only English captions. In our experiments we separated the data into three groups: English only, English-Japanese, and English-Chinese. We ensured each group was equally represented in the minibatch, which means some images containing Japanese or Chinese captions were sampled far more than many of the English-only images. This shift in the distribution of the training data may account for some of the loss of performance. We believe more sophisticated sampling strategies may help rectify these issues and re-gain the lost performance. That said, our model has significantly fewer parameters from learning a single language branch for all languages while also outperforming the PARALLEL model from prior work which learns separate language branches.
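The group-balanced minibatch construction described above can be sketched as follows. The group sizes and batch size are made up for illustration, and sampling with replacement within each group is an assumption; the text only states that each group was equally represented in the minibatch.

```python
import random


def balanced_minibatch(groups, batch_size, rng):
    """Draw an equal share of the batch from each group, sampling with replacement."""
    per_group = batch_size // len(groups)
    batch = []
    for name, items in groups.items():
        batch.extend((name, rng.choice(items)) for _ in range(per_group))
    rng.shuffle(batch)
    return batch


# Hypothetical image ids per group; the real group sizes are far more skewed
# toward English-only images than these placeholders suggest.
groups = {
    "en_only": list(range(1000)),
    "en_ja": list(range(1000, 1200)),
    "en_cn": list(range(1200, 1300)),
}
batch = balanced_minibatch(groups, batch_size=12, rng=random.Random(0))
```

Because every group contributes the same number of samples, images in the smaller groups are revisited far more often than English-only images, which is the distribution shift discussed above.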
| | Model | Training Data Source | Single Model | Multi30K En | Multi30K De | Multi30K Fr | Multi30K Cs | MSCOCO En | MSCOCO Cn | MSCOCO Ja |
|---|---|---|---|---|---|---|---|---|---|---|
| (a) | PARALLEL-EmbN | Human Generated Only (Tables 2&3) | Y | 71.5 | 58.7 | 47.2 | 37.0 | 70.0 | 52.9 | 68.9 |
| | MULE EmbN - Full | Human Generated Only (Tables 2&3) | Y | 70.7 | 60.3 | 55.9 | 49.5 | 72.0 | 58.8 | 70.5 |
| (b) | EmbN & Machine | Human Generated English Only | Y | 71.1 | 48.5 | 46.7 | 46.9 | 75.6 | 72.2 | 66.1 |
| | EmbN | Human Generated + Machine Translations | N | 72.0 | 60.3 | 54.8 | 46.3 | 76.8 | 71.4 | 73.2 |
| | PARALLEL-EmbN | Human Generated + Machine Translations | Y | 71.7 | 60.7 | 59.2 | 50.7 | 72.5 | 72.3 | 73.3 |
| | MULE EmbN - Full | En → Others, Machine Translations Only | Y | 71.0 | 53.3 | 60.2 | 50.6 | 70.1 | 71.6 | 73.7 |
| | MULE EmbN - Full | Human Generated + Machine Translations | Y | 71.1 | 61.0 | 60.8 | 54.9 | 73.5 | 73.1 | 76.5 |
Leveraging Machine Translations
As mentioned in the introduction, an alternative to training a model that supports every language would be to use Neural Machine Translation to convert a query sentence from an unsupported language into a language for which a trained model is available. We test this approach using an English-trained EmbN model whose performance is reported on the first lines of Table 2(a) and Table 3(a). For each non-English language, we use Google Translate to convert the sentence from the source language into English, then use an English EmbN model to compute its similarity to each image in the test set.
The first row of Table 4(b) reports the results of translating non-English queries into English and using the English-only model. On the Multi30K test set we see this performs worse on each non-English language than our MULE approach, but it does outperform some of the baselines trained on human-generated captions. Similar behavior is seen on the MSCOCO data, with Chinese-translated sentences actually performing nearly as well as human-generated English sentences. In short, using translations performs better on low-resourced languages (French, Czech, and Chinese) than the baselines. These results suggest that these translated sentences are able to capture enough information from the original language to still provide a representation that is “good enough” to be useful.
Since translations provide a good representation for performing the retrieval task, they should also be useful in training a new model. This is especially true for any sentences that were independently generated, as they might provide a novel sentence after being translated into other languages. We report the performance of using these translated sentences to augment our training set for both datasets in Table 4(b), where our model still obtains the best overall performance. We observe that the models with the augmentation (e.g. last line of Table 4(b)) always outperform the corresponding models without the augmentation (e.g. last line of Table 4(a)) on all languages. On the second line of Table 4(b) we see that these translations are useful in providing more training examples even for a monolingual EmbN model. Comparing the fourth and last lines of Table 4(b) we see the difference between training the non-English languages using translated sentences alone and training with both human-generated and translated sentences. Even though the human-generated Chinese captions account for less than 5% of the total Chinese training data, we still see a significant performance improvement using them, with similar results on all other languages. This suggests that human-generated captions still provide better training data than machine translations. Comparing our full model to the PARALLEL-EmbN model, and to models using MUSE embeddings, we also see that MULE provides performance benefits even when data is more plentiful.
The language branch in our experiments contained 6.8M parameters. The PARALLEL-EmbN model proposed by Gella et al. 2017 needs one such branch for each language on Multi30K. MULE instead shares a single language branch and adds only one FC layer per language to project word features into the universal embedding, so an EmbN model for Multi30K that uses MULE would have half the number of parameters used by Gella et al. 2017. MULE also scales better with more languages than Gella et al. 2017. ML BERT, which consists of 12 layers, is much larger than MULE.
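The parameter comparison can be worked through with a short sketch. Only the 6.8M branch size and the "half the number" relationship come from the text; the counting formulas (one branch per language for the parallel model, one shared branch plus one FC layer per language for MULE) and the resulting per-language FC size are inferences made for illustration, not reported figures.

```python
BRANCH_PARAMS_M = 6.8  # language branch size stated in the text, in millions
NUM_LANGUAGES = 4      # English, German, French, Czech on Multi30K


def parallel_params(num_languages, branch=BRANCH_PARAMS_M):
    """One full language branch per language (assumed counting rule)."""
    return num_languages * branch


def mule_params(num_languages, fc_params, branch=BRANCH_PARAMS_M):
    """One shared branch plus one small FC layer per language (assumed counting rule)."""
    return branch + num_languages * fc_params


parallel_total = parallel_params(NUM_LANGUAGES)

# FC size per language implied by "half the number", under the assumptions above.
implied_fc = (parallel_total / 2 - BRANCH_PARAMS_M) / NUM_LANGUAGES
mule_total = mule_params(NUM_LANGUAGES, implied_fc)

print(f"parallel: {parallel_total:.1f}M, MULE: {mule_total:.1f}M, "
      f"implied FC per language: {implied_fc:.1f}M")
```

Under these assumptions, adding a fifth language would cost the parallel model another full branch but cost MULE only one more FC layer, which is the scaling advantage described above.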
Fig. 3 shows qualitative results for our full model (MULE EmbN + NC + LC + LP). We pick two samples and, for each language on Multi30K, retrieve the closest sentence given the image. For the non-English languages, we provide English translations using Google Translate. The top example shows a perfect match across the languages. The bottom example shows that the model overestimates contextual information from the image in the English sentence. It captures not only the correct event (car racing) but also objects that are not present in the image (audience and fence). This sentence came from a similar image in the test set that has minor differences. However, such minor differences can be important for distinguishing between similar images. Learning how to accurately capture the details of an image may improve performance in future work.
We investigated the bidirectional multilingual image-sentence retrieval task. We proposed learning MULE, an embedding that can handle queries in multiple languages with a negligible number of language-specific parameters, unlike prior work which learned completely distinct representations for each language. In addition to being more scalable, our method enables the model to transfer knowledge between languages, resulting in especially good performance on lower-resource languages. Furthermore, to overcome limited annotations, we showed that leveraging Neural Machine Translation to augment a training dataset can significantly increase performance when training either a multilingual network or a monolingual model. Although our work primarily focused on the image-sentence retrieval task, our approach is modular and can be easily incorporated into many other vision-language models and tasks.
- [Antol et al.2015] Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. 2015. VQA: Visual Question Answering. In ICCV.
- [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
- [Barrault et al.2018] Barrault, L.; Bougares, F.; Specia, L.; Lala, C.; Elliott, D.; and Frank, S. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers.
- [Bojanowski et al.2017] Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. TACL 5:135–146.
- [Burns et al.2019] Burns, A.; Tan, R.; Saenko, K.; Sclaroff, S.; and Plummer, B. A. 2019. Language features matter: Effective language representations for vision-language tasks. In ICCV.
- [Calixto, Liu, and Campbell2017] Calixto, I.; Liu, Q.; and Campbell, N. 2017. Multilingual multi-modal embeddings for natural language processing. arXiv:1702.01101.
- [Conneau et al.2018] Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2018. Word translation without parallel data. In ICLR.
- [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
- [Devlin et al.2018] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805v1.
- [Elliott et al.2016] Elliott, D.; Frank, S.; Sima’an, K.; and Specia, L. 2016. Multi30k: Multilingual english-german image descriptions. arXiv:1605.00459.
- [Elliott et al.2017] Elliott, D.; Frank, S.; Barrault, L.; Bougares, F.; and Specia, L. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. arXiv:1710.07177.
- [Fang et al.2015] Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR.
- [Ganin and Lempitsky2014] Ganin, Y., and Lempitsky, V. 2014. Unsupervised domain adaptation by backpropagation. arXiv:1409.7495.
- [Gella et al.2017] Gella, S.; Sennrich, R.; Keller, F.; and Lapata, M. 2017. Image pivoting for learning multilingual multimodal representations. In EMNLP.
- [Goyal et al.2017] Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR.
- [Gu et al.2018] Gu, J.; Joty, S.; Cai, J.; and Wang, G. 2018. Unpaired image captioning by language pivoting. In ECCV.
- [Gupta, Schwing, and Hoiem2019] Gupta, T.; Schwing, A.; and Hoiem, D. 2019. Vico: Word embeddings from visual co-occurrences. In ICCV.
- [He et al.2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. arXiv:1512.03385.
- [Hitschler, Schamoni, and Riezler2016] Hitschler, J.; Schamoni, S.; and Riezler, S. 2016. Multimodal pivots for image caption translation. In ACL.
- [Hu et al.2016] Hu, R.; Xu, H.; Rohrbach, M.; Feng, J.; Saenko, K.; and Darrell, T. 2016. Natural language object retrieval. In CVPR.
- [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
- [Johnson et al.2017] Johnson, M.; Schuster, M.; Le, Q. V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL 5:339–351.
- [Kiela et al.2018] Kiela, D.; Conneau, A.; Jabri, A.; and Nickel, M. 2018. Learning visually grounded sentence representations. In NAACL-HLT.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
- [Kiros, Salakhutdinov, and Zemel2015] Kiros, R.; Salakhutdinov, R.; and Zemel, R. S. 2015. Unifying visual-semantic embeddings with multimodal neural language models. TACL.
- [Kottur et al.2016] Kottur, S.; Vedantam, R.; Moura, J. M. F.; and Parikh, D. 2016. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In CVPR.
- [Lan, Li, and Dong2017] Lan, W.; Li, X.; and Dong, J. 2017. Fluency-guided cross-lingual image captioning. In ACM-MM.
- [Li et al.2019] Li, X.; Xu, C.; Wang, X.; Lan, W.; Jia, Z.; Yang, G.; and Xu, J. 2019. Coco-cn for cross-lingual image tagging, captioning and retrieval. Transactions on Multimedia.
- [Lin et al.2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In ECCV.
- [Lu et al.2019] Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265.
- [Mikolov, Yih, and Zweig2013] Mikolov, T.; Yih, W.-t.; and Zweig, G. 2013. Linguistic regularities in continuous space word representations. In NAACL-HLT.
- [Miyazaki and Shimizu2016] Miyazaki, T., and Shimizu, N. 2016. Cross-lingual image caption generation. In ACL.
- [Nam, Ha, and Kim2017] Nam, H.; Ha, J.-W.; and Kim, J. 2017. Dual attention networks for multimodal reasoning and matching. In CVPR.
- [Nguyen and Okatani2019] Nguyen, D.-K., and Okatani, T. 2019. Multi-task learning of hierarchical vision-language representation. In CVPR.
- [Plummer et al.2015] Plummer, B. A.; Wang, L.; Cervantes, C. M.; Caicedo, J. C.; Hockenmaier, J.; and Lazebnik, S. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV.
- [Rajendran et al.2015] Rajendran, J.; Khapra, M. M.; Chandar, S.; and Ravindran, B. 2015. Bridge correlational neural networks for multilingual multimodal representation learning. arXiv:1510.03519.
- [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
- [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NeurIPS.
- [Tan and Bansal2019] Tan, H., and Bansal, M. 2019. Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP.
- [Tzeng et al.2014] Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474.
- [Vendrov et al.2016] Vendrov, I.; Kiros, R.; Fidler, S.; and Urtasun, R. 2016. Order embeddings of images and language. In ICLR.
- [Wang et al.2019] Wang, L.; Li, Y.; Huang, J.; and Lazebnik, S. 2019. Learning two-branch neural networks for image-text matching tasks. TPAMI 41(2):394–407.
- [Young et al.2014] Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2:67–78.
We train our models for 20 epochs using Adam [Kingma and Ba2014] with a batch size of 500 images and a learning rate of 1e-4 that we decay exponentially. The 300-dimensional FastText embeddings [Bojanowski et al.2017] are first projected into a 512-dimensional universal embedding. The universal embedding features are then fed into an LSTM with 1024 units before being projected into the final 512-dimensional multimodal embedding space. We extract our image representation using a 152-layer ResNet [He et al.2015] trained on ImageNet [Deng et al.2009]. Each image representation is averaged over 10 crops with input image dimensions of 448x448, resulting in a 2048-dimensional feature vector. We then use a pair of fully connected layers with output sizes of 2048 and 512, respectively, to project the image features into the shared multimodal embedding space. These fully connected layers are separated by a ReLU non-linearity and use batch normalization [Ioffe and Szegedy2015]. Our language classifier is implemented as a single fully connected layer that takes mean-pooled universal embedding features as input. We set all the values from Eq. 4 to 1. For all distance computations, we use cosine distance. While we keep the ResNet fixed in our experiments, we fine-tune the FastText embeddings during training. Following Burns et al. (2019), we regularize the word embeddings to help avoid catastrophic forgetting, using a regularization weight of 5e-7.
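To make the layer sizes above concrete, the following sketch counts the learnable parameters implied by the stated dimensions. It is only an estimate under explicit assumptions: every fully connected layer has a bias, the LSTM is a standard single-layer, unidirectional LSTM with one bias vector per gate, and batch normalization parameters and word-embedding tables are ignored. The helper names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights (and optional bias) of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def lstm_params(n_in, n_hidden, bias_vectors=1):
    """Four gates, each with input weights, recurrent weights, and bias."""
    return 4 * (n_hidden * (n_in + n_hidden) + bias_vectors * n_hidden)


# Language side: FastText (300) -> universal embedding (512) -> LSTM (1024) -> 512
word_to_universal = linear_params(300, 512)
lstm = lstm_params(512, 1024)
sentence_projection = linear_params(1024, 512)

# Image side: ResNet feature (2048) -> FC 2048 -> ReLU -> FC 512
image_branch = linear_params(2048, 2048) + linear_params(2048, 512)

if __name__ == "__main__":
    print("word -> universal FC:", word_to_universal)
    print("LSTM + sentence projection:", lstm + sentence_projection)
    print("image branch:", image_branch)
```

Under these assumptions the LSTM and output projection together come to roughly 6.8M parameters, while the word-to-universal projection is about 0.15M. How the figures reported elsewhere in the paper were actually counted is not stated here, so this should be read as a consistency check rather than the exact accounting.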
In this section we explain how we construct a training batch. We use a batch size of 500 different images for both Multi30K and MSCOCO. For Multi30K, every image has sentences in all four languages. However, there are five sentences per image for English and German, but only one sentence per image for Czech and French. Therefore, given an image, we randomly choose two sentences each for English and German but one sentence each for Czech and French (for a total of six sentences per image). For MSCOCO, the amount of training data available in each language is significantly unbalanced. English has 606K sentences for 121K images, Japanese has 122K sentences for 26K images, and Chinese has 20K sentences for 18K images. This results in five sentences per image for every image annotated in English or Japanese, but only 1-2 sentences per image for Chinese. In addition, only about 4K images are shared (paired) across all three languages. We therefore separate the data into three groups: English only, English-Japanese, and English-Chinese, and sample images from each group equally. Given an image, we randomly choose two sentences each for English and Japanese but only one sentence for Chinese. As a reminder, for all triplet loss functions (i.e., the multi-layer neighborhood constraints and the image-sentence matching loss), we enumerate all possible triplets in a minibatch and keep at most 10 triplets with the highest loss.
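The Multi30K sampling rule and the hard-triplet selection described above can be sketched as follows. This is a minimal illustration, not the training code: the margin value is hypothetical, the cap of 10 is applied per anchor (the text does not say whether it applies per anchor or per minibatch), and all function names are made up for the example.

```python
import math
import random

# Sentences drawn per image for each Multi30K language (2 + 2 + 1 + 1 = 6).
SENTENCES_PER_IMAGE = {"en": 2, "de": 2, "fr": 1, "cs": 1}


def sample_captions(captions_by_lang, rng):
    """Randomly pick the per-language number of captions for one image."""
    picked = []
    for lang, k in SENTENCES_PER_IMAGE.items():
        pool = captions_by_lang.get(lang, [])
        for sentence in rng.sample(pool, min(k, len(pool))):
            picked.append((lang, sentence))
    return picked


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def hardest_triplets(anchor, anchor_label, candidates, labels, margin=0.05, k=10):
    """Enumerate all (positive, negative) pairs for one anchor and keep
    at most k triplets with the highest positive hinge loss."""
    violations = []
    for p, pos in enumerate(candidates):
        if labels[p] != anchor_label:
            continue
        d_pos = cosine_distance(anchor, pos)
        for n, neg in enumerate(candidates):
            if labels[n] == anchor_label:
                continue
            loss = max(0.0, margin + d_pos - cosine_distance(anchor, neg))
            if loss > 0.0:
                violations.append((loss, p, n))
    violations.sort(reverse=True)  # largest loss first
    return violations[:k]
```

For example, `sample_captions(captions, random.Random(0))` returns six `(language, sentence)` pairs for an image that has five English, five German, one French, and one Czech caption.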
|EmbN (our implementation)||N||39.7||69.9||78.8||31.2||62.7||72.7||59.2||–||–||–||–||–||–||–|
|PARALLEL-EmbN (2 Lang)||Y||58.4||83.8||89.9||42.7||73.3||82.2||71.7||43.0||72.8||82.6||30.4||57.7||68.6||59.2|
To compute the sentence-level representation from multilingual BERT [Devlin et al.2018], we use the publicly available pretrained model and the public API "bert-as-service" (https://github.com/hanxiao/bert-as-service), which uses a mean-pooling strategy to produce a sentence embedding. After computing the sentence-level embeddings, we cache the features and use them for training. Although we might be able to improve performance by fine-tuning the ML BERT model, its large size (M parameters) makes it impossible to fit into GPU memory together with the very large number of image-sentence pairs and additional model parameters used for training.
In the main paper, we report mean recall (mR), which is the average of Recall@1, Recall@5, and Recall@10 over both directions of image-sentence retrieval. In the supplementary material, we report the scores at each threshold on Multi30K [Elliott et al.2016, Elliott et al.2017, Barrault et al.2018] and MSCOCO [Lin et al.2014]. The scores include Image-to-Sentence and Sentence-to-Image retrieval results. Our key observations are as follows: (1) our model allows low-resource languages to benefit from knowledge transferred from other languages while being more scalable than the baselines; (2) MULE performs better than MUSE and ML BERT in most cases; (3) for low-resource languages, Machine Translation provides additional supervision and can be used as data augmentation. With this augmentation, our model improves mean recall by a large margin and obtains the highest scores on MSCOCO and Multi30K. The augmentation also improves mean recall on English.
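The mean-pooling and caching steps can be sketched in a few lines. This is a generic illustration only: it does not call bert-as-service, and `encode_tokens` is a hypothetical stand-in for whatever produces per-token vectors and a padding mask for a sentence.

```python
def mean_pool(token_vectors, mask=None):
    """Average token vectors, skipping positions whose mask entry is 0."""
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


_feature_cache = {}


def cached_sentence_embedding(sentence, encode_tokens):
    """Compute the pooled embedding once per sentence and reuse it afterwards.

    encode_tokens is a HYPOTHETICAL callable returning (token_vectors, mask).
    """
    if sentence not in _feature_cache:
        vectors, mask = encode_tokens(sentence)
        _feature_cache[sentence] = mean_pool(vectors, mask)
    return _feature_cache[sentence]
```

Caching in this way keeps the encoder frozen: the pooled features are fixed inputs to the retrieval model, which is why no fine-tuning of the encoder takes place.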
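As a small illustration of the metric, the sketch below computes recall at a threshold from the rank of the first correct item for each query, and then averages over both retrieval directions. It assumes the usual thresholds of 1, 5, and 10; the function names are hypothetical.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose first correct item is ranked in the top k.

    ranks holds the 1-based rank of the first correct item for each query.
    """
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def mean_recall(image_to_sentence_ranks, sentence_to_image_ranks, ks=(1, 5, 10)):
    """Average recall over all thresholds and both retrieval directions."""
    scores = [
        recall_at_k(ranks, k)
        for ranks in (image_to_sentence_ranks, sentence_to_image_ranks)
        for k in ks
    ]
    return sum(scores) / len(scores)


def mean_of_row(recalls):
    """Mean recall from the six per-threshold scores of one table row."""
    return sum(recalls) / len(recalls)
```

The rows in the tables are consistent with this definition: averaging the six recall values of the "EmbN (our implementation)" row (39.7, 69.9, 78.8, 31.2, 62.7, 72.7) and rounding to one decimal gives the listed mR of 59.2.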
Comparison on Visual Features
As shown in Burns et al. (2019), EmbN is the state-of-the-art image-sentence model when using image-level ResNet features and good language features. However, prior work comparing cross-lingual image-sentence retrieval models only reported performance using VGG features. These include VSE [Kiros, Salakhutdinov, and Zemel2015], Order Embeddings [Vendrov et al.2016], PARALLEL-SYM/ASYM [Gella et al.2017], and our implementation of EmbN [Wang et al.2019]. Table 5 also shows the effect of going from VGG [Simonyan and Zisserman2014] to ResNet features. As a reminder, we made some minor modifications to EmbN (see the discussion in our paper), and we used the provided split for Multi30K, which differs from the one used to benchmark EmbN on Flickr30K in Wang et al. (2019). Despite this, our reported results (mR 59.2) are quite comparable to the results in Wang et al. (2019) (mR 60.0).
Non-English Languages on Multi30K
Tables 6 to 8 report performance for German, French, and Czech on Multi30K. For these three languages, our model (MULE EmbN + NC + LC + LP) outperforms the baselines, EmbN and PARALLEL-EmbN, by a large margin (German: Table 6(a) vs Table 6(d); French: Table 7(a) vs Table 7(d); Czech: Table 8(a) vs Table 8(d)). At the same time, our method is more scalable than the others because it uses a universal embedding and a shared LSTM. The results show that our model transfers knowledge between languages and that performance on the low-resource languages improves due to the alignment across languages. The improvement is more significant for the low-resource languages (i.e., French and Czech) than for German. Comparing (b, c) with (d) in Tables 6, 7, and 8, MUSE and multilingual BERT perform worse than our MULE. In addition, we observe that the low-resource languages can benefit from Machine Translation. Comparing Table 7(a) and Table 7(e), the baseline methods perform similarly to a model that is trained only on English and takes sentences translated into English as input. The advantages of Machine Translation are more evident for Czech, as shown by comparing Table 8(a) and Table 8(f). Overall, we improve mean recall by 1.6% for German, 8.7% for French, and 12.5% for Czech compared to the baseline PARALLEL-EmbN. After we augment the dataset with Machine Translation, we improve mean recall by 2.3% for German, 13.6% for French, and 17.9% for Czech compared to the baseline PARALLEL-EmbN.
|EmbN + NC + LC + LP||Y||44.8||73.6||83.3||30.2||58.3||67.8||59.7|
|EmbN + NC||Y||45.2||73.8||82.7||30.6||58.2||68.5||59.8|
|EmbN + NC + LC||Y||42.2||73.3||82.9||32.0||60.1||70.0||60.2|
|EmbN + NC + LC + LP (Full)||Y||45.7||73.3||83.2||31.1||58.7||69.0||60.3|
|(e)||Translation to English|
|EmbN - English Model||Y||34.1||60.4||71.1||19.6||47.4||58.5||48.5|
|(f)||Translation Data Augmentation|
|MULE EmbN - Full - Trans Only||Y||36.9||66.8||76.8||25.9||51.4||62.3||53.3|
|MULE EmbN - Full||Y||44.0||73.9||82.4||33.3||60.9||71.4||61.0|
|EmbN + NC + LC + LP||Y||30.3||60.5||69.5||31.1||58.4||71.7||53.6|
|EmbN + NC||Y||31.1||61.3||71.4||33.6||61.1||72.2||55.1|
|EmbN + NC + LC||Y||30.0||61.9||72.0||34.1||61.5||73.7||55.5|
|EmbN + NC + LC + LP (Full)||Y||32.3||61.7||72.6||33.3||63.3||72.0||55.9|
|(e)||Translation to English|
|EmbN - English Model||Y||22.5||52.5||63.0||25.1||53.1||63.9||46.7|
|(f)||Translation Data Augmentation|
|MULE EmbN - Full - Trans Only||Y||37.8||66.3||76.5||38.7||65.9||76.1||60.2|
|MULE EmbN - Full||Y||37.2||66.6||77.9||38.4||68.1||76.3||60.8|
|EmbN + NC + LC + LP||Y||21.3||44.1||54.2||22.9||46.3||57.0||41.0|
|EmbN + NC||Y||24.1||50.0||63.4||25.4||51.7||63.9||46.4|
|EmbN + NC + LC||Y||24.7||52.9||63.2||25.2||53.2||65.4||47.4|
|EmbN + NC + LC + LP (Full)||Y||26.3||54.5||65.3||27.9||56.3||66.7||49.5|
|(e)||Translation to English|
|EmbN - English Model||Y||23.0||50.9||64.7||25.1||53.4||64.2||46.9|
|(f)||Translation Data Augmentation|
|MULE EmbN - Full - Trans Only||Y||28.6||55.8||66.5||29.4||55.7||67.7||50.6|
|MULE EmbN - Full||Y||32.0||60.4||71.5||32.8||61.1||71.7||54.9|
Non-English Languages on MSCOCO
We see similar behavior on MSCOCO: our method performs better on low-resource languages than the baselines. The big difference from Multi30K is that Machine Translation significantly improves performance on Chinese, as shown in Table 9(d). This could be because there are far fewer Chinese annotations than annotations in the other languages. Based on this observation, we augment the dataset with Machine Translation. Our model with the augmentation obtains the highest scores, as shown in Table 9(e) and Table 10(e). Our approach with data augmentation improves mean recall by 20.2% for Chinese and 7.6% for Japanese compared to the baseline PARALLEL-EmbN.
|EmbN + NC||Y||28.7||62.2||74.4||34.2||65.5||74.5||56.6|
|EmbN + NC + LC||Y||32.9||63.9||76.7||31.7||65.3||76.7||57.9|
|EmbN + NC + LC + LP (Full)||Y||34.4||60.2||75.8||34.9||66.0||77.2||58.8|
|(d)||Translation to English|
|EmbN - English Model||Y||45.9||79.8||89.2||47.8||81.1||89.4||72.2|
|(e)||Translation Data Augmentation|
|MULE EmbN - Full - Trans Only||Y||46.5||79.9||88.5||46.1||79.7||88.7||71.6|
|MULE EmbN - Full||Y||48.5||79.6||89.9||49.5||81.2||89.8||73.1|
|EmbN + NC||Y||47.3||80.4||90.7||39.3||73.9||85.5||69.5|
|EmbN + NC + LC||Y||49.8||81.5||91.6||40.1||73.6||85.4||70.3|
|EmbN + NC + LC + LP (Full)||Y||49.9||81.4||92.0||40.4||73.8||85.5||70.5|
|(d)||Translation to English|
|EmbN - English Model||Y||44.8||74.3||85.4||36.9||71.0||84.7||66.1|
|(e)||Translation Data Augmentation|
|MULE EmbN - Full - Trans Only||Y||55.8||83.9||91.8||44.6||78.0||88.3||73.7|
|MULE EmbN - Full||Y||61.2||86.9||93.9||47.9||79.5||89.5||76.5|
English on Multi30K and MSCOCO
Table 11 shows the overall results on English. The monolingual model (EmbN) performs better on English than the multilingual models on Multi30K and MSCOCO. From Table 11(e), we observe that Machine Translation also improves performance on English.
|EmbN NC + LC + LP||Y||55.1||81.6||88.2||39.7||69.9||79.7||69.0||–||–||–||–||–||–||–|
|EmbN + NC||Y||54.3||84.2||90.0||43.1||72.5||81.9||71.0||51.1||83.5||91.0||38.7||71.5||82.8||69.8|
|EmbN + NC + LC||Y||53.5||83.6||90.2||42.3||73.3||82.6||70.9||51.3||83.0||92.9||40.1||74.2||86.2||71.3|
|EmbN + NC + LC + LP (Full)||Y||54.8||82.8||89.6||42.2||72.4||82.1||70.7||53.5||83.1||92.9||42.1||74.4||85.9||72.0|
|(e)||Translation Data Augmentation|
|MULE EmbN - Full - Trans Only||Y||56.2||82.9||89.5||42.5||72.9||82.2||71.0||48.2||80.6||89.2||40.6||75.5||86.6||70.1|
|MULE EmbN - Full||Y||56.4||81.7||89.1||43.8||73.4||82.0||71.1||53.3||84.0||91.8||44.8||78.6||88.3||73.5|