Do Neural Network Cross-Modal Mappings Really Bridge Modalities?

Guillem Collell
Department of Computer Science
KU Leuven
Marie-Francine Moens
Department of Computer Science
KU Leuven

Feed-forward networks are widely used in cross-modal applications to bridge modalities by mapping distributed vectors of one modality to the other, or to a shared space. The predicted vectors are then used to perform e.g., retrieval or labeling. Thus, the success of the whole system relies on the ability of the mapping to make the neighborhood structure (i.e., the pairwise similarities) of the predicted vectors akin to that of the target vectors. However, whether this is achieved has not been investigated yet. Here, we propose a new similarity measure and two ad hoc experiments to shed light on this issue. In three cross-modal benchmarks we learn a large number of language-to-vision and vision-to-language neural network mappings (up to five layers) using a rich diversity of image and text features and loss functions. Our results reveal that, surprisingly, the neighborhood structure of the predicted vectors consistently resembles more that of the input vectors than that of the target vectors. In a second experiment, we further show that untrained nets do not significantly disrupt the neighborhood (i.e., semantic) structure of the input vectors.

1 Introduction

Neural network mappings are widely used to bridge modalities or spaces in cross-modal retrieval (Qiao et al., 2017; Wang et al., 2016; Zhang et al., 2016), in zero-shot learning (Lazaridou et al., 2015b, 2014; Socher et al., 2013), in building multimodal representations (Collell et al., 2017) and in word translation (Lazaridou et al., 2015a), to name a few. Typically, a neural network is first trained to predict the distributed vectors of one modality (or space) from the other. At test time, some operation such as retrieval or labeling is performed based on the nearest neighbors of the predicted (mapped) vectors. For instance, in zero-shot image classification, image features are mapped to the text space and the label of the nearest neighbor word is assigned. Thus, the success of such systems relies entirely on the ability of the map to make the predicted vectors similar to the target vectors in terms of semantic or neighborhood structure. [Footnote: We use the terms semantic structure, neighborhood structure and similarity structure interchangeably. They refer to all pairwise similarities of a set of vectors, for some similarity measure (e.g., Euclidean or cosine).] However, whether neural nets achieve this goal in general has not been investigated yet. In fact, recent work suggests that considerable information about the input modality propagates into the predicted modality (Collell et al., 2017; Lazaridou et al., 2015b; Frome et al., 2013).

To shed light on these questions, we first introduce what is, to the best of our knowledge, the first measure to quantify similarity between the neighborhood structures of two sets of vectors. Second, we perform extensive experiments in three benchmarks where we learn image-to-text and text-to-image neural net mappings using a rich variety of state-of-the-art text and image features and loss functions. Our results reveal that, contrary to expectation, the semantic structure of the mapped vectors consistently resembles that of the input vectors more than that of the target vectors of interest. In a second experiment, using six concept similarity tasks, we show that the semantic structure of the input vectors is preserved after mapping them with an untrained network, further evidence that feed-forward nets naturally preserve semantic information about the input. Overall, we uncover and raise awareness of a largely ignored phenomenon relevant to a wide range of cross-modal / cross-space applications such as retrieval, zero-shot learning or image annotation.

Ultimately, this paper aims at: (1) Encouraging the development of better architectures to bridge modalities / spaces; (2) Advocating for the use of semantic-based criteria to evaluate the quality of predicted vectors such as the neighborhood-based measure proposed here, instead of purely geometric measures such as mean squared error (MSE).

2 Related Work and Motivation

Neural network and linear mappings are popular tools to bridge modalities in cross-modal retrieval systems. Lazaridou et al. (2015b) leverage a text-to-image linear mapping to retrieve images given text queries. Weston et al. (2011) map label and image features into a shared space with a linear mapping to perform image annotation. Alternatively, Frome et al. (2013), Lazaridou et al. (2014) and Socher et al. (2013) perform zero-shot image classification with an image-to-text neural network mapping. Instead of mapping to latent features, Collell et al. (2018) use a 2-layer feed-forward network to map word embeddings directly to image pixels in order to visualize spatial arrangements of objects. Neural networks are also popular in other cross-space applications such as cross-lingual tasks. Lazaridou et al. (2015a) learn a linear map from language A to language B and then translate new words by returning the nearest neighbor of the mapped vector in the B space.

Figure 1: Effect of applying a mapping to a (disconnected) manifold with three hypothetical classes.

In the context of zero-shot learning, shortcomings of cross-space neural mappings have also been identified. For instance, “hubness” (Radovanović et al., 2010) and “pollution” (Lazaridou et al., 2015a) relate to the high-dimensionality of the feature spaces and to overfitting respectively. Crucially, we do not assume that our cross-modal problem has any class labels, and we study the similarity between input and mapped vectors and between output and mapped vectors.

Recent work shows that the predicted vectors of cross-modal neural net mappings are still largely informative about the input vectors. Lazaridou et al. (2015b) qualitatively observe that abstract textual concepts are grounded with the visual input modality. Counterintuitively, Collell et al. (2017) find that the vectors “imagined” from a language-to-vision neural map outperform the original visual vectors in concept similarity tasks. The paper argued that the reconstructed visual vectors become grounded with language because the map preserves topological properties of the input. Here, we go one step further and show that the mapped vectors often resemble the input vectors more than the target vectors in semantic terms, which goes against the goal of a cross-modal map.

Well-known theoretical work shows that networks with as few as one hidden layer are able to approximate any function (Hornik et al., 1989). However, this result reveals little about either test performance or the semantic structure of the mapped vectors. Instead, the phenomenon described is more closely tied to other properties of neural networks. In particular, continuity guarantees that topological properties of the input, such as connectedness, are preserved (Armstrong, 2013). Furthermore, continuity in a topology induced by a metric also ensures that points that are close together are mapped close together. As a toy example, Fig. 1 illustrates the distortion of a manifold after being mapped by a neural net. [Footnote: Parameters of these mappings were generated at random.]

In a noiseless world with fully statistically dependent modalities, the vectors of one modality could be perfectly predicted from those of the other. However, in real-world problems this is unrealistic given the noise of the features and the fact that modalities encode complementary information (Collell and Moens, 2016). Such unpredictability, combined with the continuity and topology-preserving properties of neural nets, drives the phenomenon identified, namely that mapped vectors resemble the input vectors more than the target vectors in nearest-neighbor terms.

3 Proposed Approach

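To bridge modalities $\mathcal{X}$ and $\mathcal{Y}$, we consider two popular cross-modal mappings $f: \mathcal{X} \rightarrow \mathcal{Y}$.

(i) Linear mapping (lin):

$$f(x) = W_0 x + b_0$$

with $W_0 \in \mathbb{R}^{d_y \times d_x}$, $b_0 \in \mathbb{R}^{d_y}$, where $d_x$ and $d_y$ are the input and output dimensions respectively.

(ii) Feed-forward neural network (nn):

$$f(x) = W_1 \sigma(W_0 x + b_0) + b_1$$

with $W_1 \in \mathbb{R}^{d_y \times d_h}$, $W_0 \in \mathbb{R}^{d_h \times d_x}$, $b_0 \in \mathbb{R}^{d_h}$, $b_1 \in \mathbb{R}^{d_y}$, where $d_h$ is the number of hidden units and $\sigma(\cdot)$ the non-linearity (e.g., tanh or sigmoid). Although single hidden layer networks are already universal approximators (Hornik et al., 1989), we explored whether deeper nets with 3 and 5 hidden layers could improve the fit (see Supplement).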
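The two mapping families above can be sketched in a few lines of plain Python. This is only an illustration under the usual affine and one-hidden-layer forms; the names `affine`, `f_lin`, `f_nn`, `W0`, `b0`, `W1`, `b1` and the toy dimensions are assumptions made here, not the authors' Keras implementation.

```python
import math

def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows) and vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def f_lin(x, W0, b0):
    """Linear mapping (lin): a single affine layer from d_x to d_y."""
    return affine(W0, b0, x)

def f_nn(x, W0, b0, W1, b1, sigma=math.tanh):
    """Feed-forward mapping (nn): affine layer, non-linearity, affine layer."""
    hidden = [sigma(h) for h in affine(W0, b0, x)]
    return affine(W1, b1, hidden)

# Toy sizes (hypothetical): d_x = 3, d_h = 2, d_y = 2.
x = [1.0, 0.0, -1.0]
W0 = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
b0 = [0.0, 0.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.1, -0.1]

y_lin = f_lin(x, W0, b0)           # here W0 doubles as a 2 x 3 linear map
y_nn = f_nn(x, W0, b0, W1, b1)     # hidden units are tanh(0) = 0 for this x
```

Both functions are continuous in `x`, which is the property the preceding section relies on when it argues that nearby inputs stay nearby after mapping.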

Loss: Our primary choice is the MSE: $\frac{1}{2}\|f(x) - y\|^2$, where $y$ is the target vector. We also tested other losses such as the cosine: $1 - \cos(f(x), y)$ and the max-margin: $\max\{0, \gamma + \|f(x) - y\| - \|f(x) - \tilde{y}\|\}$, where $\tilde{y}$ belongs to a different class than $y$, and $\gamma$ is the margin. As in Lazaridou et al. (2015a) and Weston et al. (2011), we choose the first $\tilde{y}$ that violates the constraint. Notice that losses that do not require class labels such as MSE are suitable for a wider, more general set of tasks than discriminative losses (e.g., cross-entropy). In fact, cross-modal retrieval tasks often do not exhibit any class labels. Additionally, our research question concerns the cross-space mapping problem in isolation (independently of class labels).
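To make the three losses concrete, the sketch below evaluates each of them for a single predicted vector. It assumes the standard forms (half squared Euclidean distance for MSE, one minus cosine similarity, and a hinge on Euclidean distances for max-margin); the function names and the way negatives are passed in as a list are choices made for this illustration only.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def dist(u, v):
    return norm([a - b for a, b in zip(u, v)])

def mse_loss(pred, target):
    """Half the squared Euclidean distance between prediction and target."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))

def cosine_loss(pred, target):
    """1 - cos(pred, target); zero when both vectors point the same way."""
    dot = sum(p * t for p, t in zip(pred, target))
    return 1.0 - dot / (norm(pred) * norm(target))

def max_margin_loss(pred, target, negatives, margin):
    """Hinge loss computed on the first negative that violates the margin.

    `negatives` are target-space vectors from other classes (assumed to be
    supplied by the caller in the order in which they should be tried).
    """
    for neg in negatives:
        violation = margin + dist(pred, target) - dist(pred, neg)
        if violation > 0:
            return violation
    return 0.0  # no negative violates the constraint
```

Only the max-margin loss needs class information (to pick negatives), which is why the other two apply to label-free retrieval data.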

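Let us denote a set of $N$ input and output vectors by $X \in \mathbb{R}^{N \times d_x}$ and $Y \in \mathbb{R}^{N \times d_y}$ respectively. Each input vector $x_i$ is paired to the output vector $y_i$ of the same index ($i = 1, \dots, N$). Let us henceforth denote the mapped input vectors by $f(X) \in \mathbb{R}^{N \times d_y}$. In order to explore the similarity between $f(X)$ and $X$, and between $f(X)$ and $Y$, we propose two ad hoc settings below.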

3.1 Neighborhood Structure of Mapped Vectors (Experiment 1)

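To measure the similarity between the neighborhood structure of two sets of paired vectors $V$ and $Z$, we propose the mean nearest neighbor overlap measure ($mNNO^K(V, Z)$). We define the nearest neighbor overlap $NNO^K(v_i, z_i)$ as the number of $K$ nearest neighbors that two paired vectors $v_i$, $z_i$ share in their respective spaces. E.g., if the 3 (= $K$) nearest neighbors of $v_{cat}$ in $V$ are $\{v_{dog}, v_{tiger}, v_{lion}\}$ and those of $z_{cat}$ in $Z$ are $\{z_{mouse}, z_{tiger}, z_{lion}\}$, the $NNO^3(v_{cat}, z_{cat})$ is 2.

Definition 1

Let $V = \{v_i\}_{i=1}^{N}$ and $Z = \{z_i\}_{i=1}^{N}$ be two sets of $N$ paired vectors. We define:

$$mNNO^K(V, Z) = \frac{1}{KN} \sum_{i=1}^{N} NNO^K(v_i, z_i)$$

with $NNO^K(v_i, z_i) = |NN^K(v_i) \cap NN^K(z_i)|$, where $NN^K(v_i)$ and $NN^K(z_i)$ are the indexes of the $K$ nearest neighbors of $v_i$ and $z_i$, respectively.

The normalizing constant $K$ simply scales $mNNO^K(V, Z)$ between 0 and 1, making it independent of the choice of $K$. Thus, a $mNNO^K(V, Z) = 0.7$ means that the vectors in $V$ and $Z$ share, on average, 70% of their nearest neighbors. Notice that mNNO implicitly performs retrieval for some similarity measure (e.g., Euclidean or cosine), and quantifies how semantically similar two sets of paired vectors are.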
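A minimal reference sketch of the measure follows, assuming cosine similarity for the neighbor search and excluding each vector from its own neighbor list. The function names are illustrative, ties are broken by index order (an arbitrary choice made here), and the brute-force search is quadratic in the number of vectors, so it is meant for small examples only.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbors(vectors, i, k, sim=cosine_sim):
    """Indexes of the k vectors most similar to vectors[i], excluding i."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: sim(vectors[i], vectors[j]), reverse=True)
    return set(others[:k])

def mnno(V, Z, k, sim=cosine_sim):
    """Mean nearest neighbor overlap between two paired vector sets."""
    assert len(V) == len(Z) and 0 < k < len(V)
    shared = 0
    for i in range(len(V)):
        shared += len(nearest_neighbors(V, i, k, sim) & nearest_neighbors(Z, i, k, sim))
    return shared / (k * len(V))

# Toy paired sets: the pairing keeps indexes fixed but changes who is close to whom.
V = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
Z = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
same_structure = mnno(V, V, k=1)
different_structure = mnno(V, Z, k=1)
```

Note that `V` and `Z` may have different dimensionalities: only neighbor indexes are compared, never the vectors themselves, which is what allows the measure to compare an input space with a target space.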

3.2 Mapping with Untrained Networks (Experiment 2)

To complement the setting above (Sect. 3.1), it is instructive to consider the limit case of an untrained network. Concept similarity tasks provide a suitable setting to study the semantic structure of distributed representations (Pennington et al., 2014). That is, semantically similar concepts should ideally be close together. In particular, our interest is in comparing $X$ with its projection $f(X)$ through a mapping with random parameters, to understand the extent to which the mapping may disrupt or preserve the semantic structure of $X$.

4 Experimental Setup

4.1 Experiment 1

4.1.1 Datasets

To test the generality of our claims, we select a rich diversity of cross-modal tasks involving texts at three levels: word level (ImageNet), sentence level (IAPR TC-12), and document level (Wiki).

ImageNet (Russakovsky et al., 2015). Consists of 14M images, covering 22K WordNet synsets (or meanings). Following Collell et al. (2017), we take the most relevant word for each synset and keep only synsets with more than 50 images. This yields 9,251 different words (or instances).

IAPR TC-12 (Grubinger et al., 2006). Contains 20K images (18K train / 2K test) annotated with 255 labels. Each image is accompanied by a short description of one to three sentences.

Wikipedia (Pereira et al., 2014). Has 2,866 samples (2,173 train / 693 test). Each sample is a section of a Wikipedia article paired with one image.

4.1.2 Hyperparameters and Implementation

See the Supplement for details.

4.1.3 Image and Text Features

To ensure that results are independent of the choice of image and text features, we use 5 (2 image + 3 text) features of varied dimensionality (64-, 128-, 300-, 2,048-d) and two directions, text-to-image ($T \rightarrow I$) and image-to-text ($I \rightarrow T$). We make our extracted features publicly available.

Text. In ImageNet we use 300-dimensional GloVe (Pennington et al., 2014) and 300-dimensional word2vec (Mikolov et al., 2013) word embeddings. In IAPR TC-12 and Wiki, we employ state-of-the-art bidirectional gated recurrent unit (biGRU) features (Cho et al., 2014) that we learn with a classification task (see Sect. 2 of Supplement).

Image. For ImageNet, we use the publicly available VGG-128 (Chatfield et al., 2014) and ResNet (He et al., 2015) visual features from Collell et al. (2017), where we obtained 128-dimensional VGG-128 and 2,048-dimensional ResNet features from the last layer (before the softmax) of the forward pass of each image. The final representation for a word is the average feature vector (centroid) of all available images for this word. In IAPR TC-12 and Wiki, features for individual images are obtained similarly from the last layer of a ResNet and a VGG-128 model.
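As a small illustration of the word-level visual representation in ImageNet, the centroid is simply the component-wise mean of the per-image feature vectors of a word. The helper name and the 3-dimensional toy features below are hypothetical; real features would be 128- or 2,048-dimensional.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

# Hypothetical features extracted from two images of the same word.
image_features = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
word_vector = centroid(image_features)
```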

4.2 Experiment 2

4.2.1 Datasets

We include six benchmarks, comprising three types of concept similarity: (i) Semantic similarity: SemSim (Silberer and Lapata, 2014), Simlex999 (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016); (ii) Relatedness: MEN (Bruni et al., 2014) and WordSim-353 (Finkelstein et al., 2001); (iii) Visual similarity: VisSim (Silberer and Lapata, 2014) which includes the same word pairs as SemSim, rated for visual similarity instead of semantic. All six test sets contain human ratings of similarity for word pairs, e.g., (‘cat’,‘dog’).

4.2.2 Hyperparameters and Implementation

The parameters in $W_0$, $W_1$ are drawn from a random uniform distribution and $b_0$, $b_1$ are set to zero. We use a tanh activation $\sigma(z) = \tanh(z)$. [Footnote: We find that sigmoid and ReLU yield similar results.] The output dimension $d_y$ is set to 2,048 for all embeddings.
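An untrained mapping of this kind can be sketched as follows. The uniform range [-1, 1], the hidden size, the seed and the toy dimensions are all assumptions made for the example (the text above does not fix them here); only the zero biases, the tanh activation and the uniform sampling come from the description.

```python
import math
import random

def random_matrix(rows, cols, rng, low=-1.0, high=1.0):
    """Matrix with entries drawn uniformly from [low, high] (assumed range)."""
    return [[rng.uniform(low, high) for _ in range(cols)] for _ in range(rows)]

def untrained_nn(x, W0, W1):
    """Map x through tanh(W0 x) followed by W1, with all biases set to zero."""
    hidden = [math.tanh(sum(w * a for w, a in zip(row, x))) for row in W0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W1]

rng = random.Random(0)        # fixed seed so the sketch is repeatable
d_x, d_h, d_y = 4, 8, 6       # toy sizes; the paper sets d_y = 2,048
W0 = random_matrix(d_h, d_x, rng)
W1 = random_matrix(d_y, d_h, rng)
mapped = untrained_nn([0.2, -0.1, 0.4, 0.0], W0, W1)
```

Repeating this with several seeds and averaging the downstream scores mirrors the idea of reporting an average over independently generated parameter sets.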

4.2.3 Image and Text Features

Textual and visual features are the same as described in Sect. 4.1.3 for the ImageNet dataset.

4.2.4 Similarity Predictions

We compute the prediction of similarity between two vectors $x$, $y$ with both the cosine $\frac{x \cdot y}{\|x\| \|y\|}$ and the Euclidean similarity $\frac{1}{1 + \|x - y\|}$. [Footnote: Notice that papers generally use only cosine similarity (Lazaridou et al., 2015b; Pennington et al., 2014).]

4.2.5 Performance Metrics

As is common practice, we evaluate the predictions of similarity of the embeddings (Sect. 4.2.4) against the human similarity ratings with the Spearman correlation $\rho$. We report the average of 10 sets of randomly generated parameters.
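The evaluation step can be illustrated with a small standard-library implementation of the Spearman correlation, computed as the Pearson correlation of rank vectors with average ranks for ties. The function names and toy numbers are illustrative; in practice a statistics library would normally be used instead.

```python
import math

def ranks(values):
    """1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = average_rank
        pos = end + 1
    return result

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(predicted, human):
    """Spearman correlation between predicted similarities and human ratings."""
    return pearson(ranks(predicted), ranks(human))

# Hypothetical model similarities and human ratings for four word pairs.
predicted = [0.82, 0.40, 0.65, 0.10]
human = [9.1, 5.0, 7.5, 1.2]
rho = spearman(predicted, human)
```

Because only ranks matter, the score is unaffected by monotone rescaling of the predictions, so cosine and Euclidean similarities can be compared on the same footing.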

5 Results and Discussion

We test statistical significance with a two-sided Wilcoxon rank sum test adjusted with Bonferroni. The null hypothesis is that a compared pair is equal. In Tab. 1, * indicates that $mNNO^K(X, f(X))$ differs from $mNNO^K(Y, f(X))$ (p < 0.001) on the same mapping, embedding and direction. In Tab. 2, * indicates that performance of mapped and input vectors differs (p < 0.05) in the 10 runs.

5.1 Experiment 1

Results below are with cosine neighbors and $K$ = 10. Euclidean neighbors yield similar results and are thus left to the Supplement. Similarly, results in ImageNet with GloVe embeddings are shown below and word2vec results in the Supplement. The choice of $K$ had no visible effect on results. Results with 3- and 5-layer nets did not show big differences with the results below (see Supplement). The cosine and max-margin losses performed slightly worse than MSE (see Supplement). Although Lazaridou et al. (2015a) and Weston et al. (2011) find that max-margin performs the best in their tasks, we do not find our result entirely surprising given that max-margin focuses on inter-class differences while we look also at intra-class neighbors (in fact, we do not require classes).

Tab. 1 shows our core finding, namely that the semantic structure of $f(X)$ resembles more that of $X$ than that of $Y$, for both lin and nn maps.

                               ResNet                VGG-128
                          X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   lin   0.681     0.262       0.723     0.236
                     nn    0.622     0.273       0.682     0.246
             T → I   lin   0.379     0.241       0.339     0.229
                     nn    0.354     0.27        0.326     0.256
IAPR TC-12   I → T   lin   0.358     0.214       0.382     0.163
                     nn    0.336     0.219       0.331     0.18
             T → I   lin   0.48      0.2         0.419     0.167
                     nn    0.413     0.225       0.372     0.182
Wiki         I → T   lin   0.235     0.156       0.235     0.143
                     nn    0.269     0.161       0.282     0.148
             T → I   lin   0.574     0.156       0.6       0.148
                     nn    0.521     0.156       0.511     0.151

Table 1: Test mean nearest neighbor overlap. Boldface indicates the largest score at each $mNNO^K(X, f(X))$ and $mNNO^K(Y, f(X))$ pair, which are abbreviated by X, f(X) and Y, f(X).

Figure 2: Learning a nn model in Wiki (left), IAPR TC-12 (middle) and ImageNet (right).

Fig. 2 is particularly revealing. If we only looked at train performance (and allowed train MSE to reach 0) then $f(X) = Y$ and clearly train $mNNO^K(Y, f(X)) = 1$ while $mNNO^K(X, f(X))$ can only be smaller than 1. However, the interest is always on test samples, and (near-)perfect test prediction is unrealistic. Notice in fact in Fig. 2 that even if we look at train fit, MSE needs to be close to 0 for $mNNO^K(Y, f(X))$ to be reasonably large. In all the combinations from Tab. 1, the test $mNNO^K(Y, f(X))$ never surpasses test $mNNO^K(X, f(X))$ for any number of epochs, even with an oracle (not shown).

5.2 Experiment 2

Tab. 2 shows that untrained linear ($f_{lin}$) and neural net ($f_{nn}$) mappings preserve the semantic structure of the input $X$, thus complementing the findings of Experiment 1. Experiment 1 concerns learning, while, by “ablating” the learning part and randomizing weights, Experiment 2 reveals the natural tendency of neural nets to preserve semantic information about the input, regardless of the choice of the target vectors and loss function.

                     WS-353            MEN             SemSim
                   Cos     Eucl     Cos     Eucl     Cos     Eucl
f_nn(GloVe)       0.632   0.634    0.795   0.791    0.75    0.744
f_lin(GloVe)      0.63    0.606    0.798   0.781    0.763   0.712
GloVe             0.632   0.601    0.801   0.782    0.768   0.716
f_nn(ResNet)      0.402   0.408    0.556   0.554    0.512   0.513
f_lin(ResNet)     0.425   0.449    0.566   0.534    0.533   0.514
ResNet            0.423   0.457    0.567   0.535    0.534   0.516

                     VisSim           SimLex           SimVerb
                   Cos     Eucl     Cos     Eucl     Cos     Eucl
f_nn(GloVe)       0.594   0.59     0.369   0.363    0.313   0.301
f_lin(GloVe)      0.602   0.576    0.369   0.341    0.326   0.23
GloVe             0.606   0.58     0.371   0.34     0.32    0.235
f_nn(ResNet)      0.527   0.526    0.405   0.406    0.178   0.169
f_lin(ResNet)     0.541   0.498    0.409   0.404    0.198   0.182
ResNet            0.543   0.501    0.409   0.403    0.211   0.199

Table 2: Spearman correlations between human ratings and the similarities (cosine or Euclidean) predicted from the embeddings. Boldface denotes best performance per input embedding type.

6 Conclusions

Overall, we uncovered a phenomenon neglected so far, namely that neural net cross-modal mappings can produce mapped vectors more akin to the input vectors than to the target vectors, in terms of semantic structure. This finding has been possible thanks to the proposed measure, which explicitly quantifies similarity between the neighborhood structures of two sets of vectors. While other measures such as mean squared error can be misleading, our measure provides a more realistic estimate of the semantic similarity between predicted and target vectors. In fact, it is the semantic structure (or pairwise similarities) that ultimately matters in cross-modal applications.


This work has been supported by the CHIST-ERA EU project MUSTER and by the KU Leuven grant RUN/15/005.


  • Armstrong (2013) Mark Anthony Armstrong. 2013. Basic topology. Springer Science & Business Media.
  • Bruni et al. (2014) Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. JAIR 49:1–47.
  • Chatfield et al. (2014) Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. In BMVC.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Chollet et al. (2015) François Chollet et al. 2015. Keras.
  • Collell and Moens (2016) Guillem Collell and Marie-Francine Moens. 2016. Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations. In COLING. ACL, pages 2807–2817.
  • Collell et al. (2018) Guillem Collell, Luc Van Gool, and Marie-Francine Moens. 2018. Acquiring Common Sense Spatial Knowledge through Implicit Spatial Templates. In AAAI. AAAI.
  • Collell et al. (2017) Guillem Collell, Teddy Zhang, and Marie-Francine Moens. 2017. Imagined Visual Representations as Multimodal Embeddings. In AAAI. AAAI, pages 4378–4384.
  • Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In WWW. ACM, pages 406–414.
  • Frome et al. (2013) Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. Devise: A deep visual-semantic embedding model. In NIPS. pages 2121–2129.
  • Gerz et al. (2016) Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
  • Grubinger et al. (2006) Michael Grubinger, Paul Clough, Henning Müller, and Thomas Deselaers. 2006. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International workshop ontoImage. volume 5, page 10.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  • Hill et al. (2015) Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4):665–695.
  • Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural networks 2(5):359–366.
  • Lazaridou et al. (2014) Angeliki Lazaridou, Elia Bruni, and Marco Baroni. 2014. Is this a wampimuk? cross-modal mapping between distributional semantics and the visual world. In ACL. pages 1403–1414.
  • Lazaridou et al. (2015a) Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015a. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In ACL. volume 1, pages 270–280.
  • Lazaridou et al. (2015b) Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015b. Combining language and vision with a multimodal skip-gram model. arXiv preprint arXiv:1501.02598.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. pages 3111–3119.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP. volume 14, pages 1532–1543.
  • Pereira et al. (2014) Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Nikhil Rasiwasia, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. 2014. On the role of correlation and abstraction in cross-modal multimedia retrieval. TPAMI 36(3):521–535.
  • Qiao et al. (2017) Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2017. Visually aligned word embeddings for improving zero-shot learning. arXiv preprint arXiv:1707.05427.
  • Radovanović et al. (2010) Milos Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. On the existence of obstinate results in vector space models. In SIGIR. ACM, pages 186–193.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. IJCV 115(3):211–252.
  • Silberer and Lapata (2014) Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In ACL. pages 721–732.
  • Socher et al. (2013) Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In NIPS. pages 935–943.
  • Wang et al. (2016) Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. 2016. A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215.
  • Weston et al. (2011) Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI. volume 11, pages 2764–2770.
  • Zhang et al. (2016) Yang Zhang, Boqing Gong, and Mubarak Shah. 2016. Fast zero-shot image tagging. In CVPR. IEEE, pages 5985–5994.

    Supplementary Material of:

    Do Neural Network Cross-Modal Mappings Really Bridge Modalities?

Appendix A Hyperparameters and Implementation

Hyperparameters (including number of epochs) are chosen by 5-fold cross-validation (CV) optimizing for the test loss. Crucially, we ensure that all mappings are learned properly by verifying that the training loss steadily decreases. We search learning rates in {0.01, 0.001, 0.0001} and number of hidden units ($d_h$) in {64, 128, 256, 512, 1024}.

Using different numbers of hidden units (and selecting the best-performing one) is important in order to guarantee that our conclusions are not influenced by, or merely a product of, underfitting or overfitting. Similarly, we learned the mappings at different levels of dropout {0.25, 0.5, 0.75}, which did not yield any improvement w.r.t. zero dropout (shown in our results).

We use a ReLU activation, the RMSprop optimizer and a batch size of 64. We find that sigmoid and tanh yield similar results to ReLU. Our implementation is in Keras (Chollet et al., 2015).

Since ImageNet does not have any set of “test concepts”, we employ 5-fold CV. Reported results are either averages over 5 folds (ImageNet) or over 5 runs with different model weight initializations (IAPR TC-12 and Wiki).

For the max-margin loss, we choose the margin $\gamma$ by cross-validation over a set of candidate values.
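The model selection procedure of this appendix can be sketched as a grid search over learning rates and hidden units, scored by the mean held-out loss over 5 folds. This is an outline under stated assumptions: `evaluate` is a hypothetical user-supplied callback that trains one mapping and returns its held-out loss, and the fold construction shown is one simple choice, not necessarily the authors'.

```python
import itertools
import random

LEARNING_RATES = [0.01, 0.001, 0.0001]
HIDDEN_UNITS = [64, 128, 256, 512, 1024]

def k_fold_indices(n_samples, k, seed=0):
    """Split shuffled sample indexes into k (train, validation) pairs."""
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)
    folds = [indexes[i::k] for i in range(k)]
    return [(sorted(set(indexes) - set(fold)), sorted(fold)) for fold in folds]

def grid_search(evaluate, n_samples, k=5):
    """Return the (learning rate, hidden units) pair with the lowest mean loss.

    `evaluate(lr, units, train_idx, val_idx)` is a hypothetical callback that
    trains a mapping on train_idx and returns its loss on val_idx.
    """
    splits = k_fold_indices(n_samples, k)
    best_config, best_loss = None, float("inf")
    for lr, units in itertools.product(LEARNING_RATES, HIDDEN_UNITS):
        mean_loss = sum(evaluate(lr, units, tr, va) for tr, va in splits) / k
        if mean_loss < best_loss:
            best_config, best_loss = (lr, units), mean_loss
    return best_config, best_loss

# Stand-in for real training: a made-up loss surface minimized at (0.001, 256).
def fake_evaluate(lr, units, train_idx, val_idx):
    return abs(lr - 0.001) + abs(units - 256) / 1000.0

best_config, best_loss = grid_search(fake_evaluate, n_samples=10)
```

Dropout levels or the max-margin value could be added as further axes of the same `itertools.product` call.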

Appendix B Textual Feature Extraction

Unlike ImageNet, where we associate a word embedding to each concept, the textual modality in IAPR TC-12 and Wiki consists of sentences. In order to extract state-of-the-art textual features in these datasets we train the following, separate network (prior to the cross-modal mapping). First, the embedded input sentences are passed to a bidirectional GRU of 64 units, then fed into a fully-connected layer, followed by a cross-entropy loss on the vector of class labels. We collect the 64-d averaged GRU hidden states of both directions as features. The network is trained with the Adam optimizer.

In Wiki and IAPR TC-12 we verify that the extracted text and image features are indeed informative and useful by computing their mean average precision (mAP) in retrieval (considering that a document B is relevant for document A if A and B share at least one class label). In Wiki we find mAPs of: biGRU = 0.77, ResNet = 0.22 and VGG-128 = 0.21. In IAPR TC-12 we find mAPs of: biGRU = 0.77, ResNet = 0.49 and VGG-128 = 0.46. Notice that ImageNet has a single data point per class in our setting, and thus mAP cannot be computed. However, we employ standard GloVe, word2vec, VGG-128 and ResNet vectors in ImageNet, which are known to perform well.
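The retrieval check described above can be illustrated with a short mAP computation in which every item serves as a query over all other items. Several details are assumptions of this sketch rather than statements from the text: the full ranking is used (no cut-off), queries without any relevant item contribute an average precision of 0, and the similarity matrix and label sets are toy values.

```python
def average_precision(ranked_relevance):
    """AP of one ranked list of booleans (True marks a relevant item)."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(similarity, labels):
    """mAP where item j is relevant to query i if they share a class label.

    `similarity[i][j]` is any similarity score and `labels[i]` a set of labels.
    """
    ap_values = []
    for i in range(len(labels)):
        candidates = [j for j in range(len(labels)) if j != i]
        candidates.sort(key=lambda j: similarity[i][j], reverse=True)
        ap_values.append(average_precision([bool(labels[i] & labels[j]) for j in candidates]))
    return sum(ap_values) / len(ap_values)

# Toy data: items 0 and 1 share a label, item 2 has no relevant partner.
labels = [{"sport"}, {"sport", "music"}, {"art"}]
similarity = [[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]]
toy_map = mean_average_precision(similarity, labels)
```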

Appendix C Additional Results

Results with $mNNO^K(X, Y)$ (omitted in the main paper for space reasons): Interestingly, the similarity $mNNO^K(X, Y)$ between original input and output vectors is generally low (between 1.5 and 2.3), indicating that these spaces are originally quite different. However, $mNNO^K(X, Y)$ always remains lower than $mNNO^K(Y, f(X))$, indicating thus that the mapping makes a difference.

C.1 Experiment 1

C.1.1 Results with 3 and 5 layers

                                ResNet                VGG-128
                           X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   nn-3   0.571     0.279       0.602     0.258
                     nn-5   0.615     0.275       0.644     0.255
             T → I   nn-3   0.274     0.27        0.254     0.256
                     nn-3   0.286     0.274       0.273     0.259
IAPR TC-12   I → T   nn-5   0.301     0.225       0.288     0.181
                     nn-3   0.29      0.227       0.308     0.184
             T → I   nn-3   0.324     0.229       0.294     0.18
                     nn-5   0.355     0.232       0.339     0.183
Wiki         I → T   nn-3   0.227     0.159       0.247     0.144
                     nn-5   0.275     0.163       0.262     0.146
             T → I   nn-3   0.367     0.148       0.342     0.145
                     nn-5   0.412     0.152       0.428     0.147

Table 3: Test mean nearest neighbor overlap with 3- and 5-hidden layer neural networks, using cosine-based neighbors and MSE loss. Boldface indicates best performance between each $mNNO^K(X, f(X))$ and $mNNO^K(Y, f(X))$ pair, which are abbreviated by X, f(X) and Y, f(X).

It is interesting to notice that even though the difference between $mNNO^K(X, f(X))$ and $mNNO^K(Y, f(X))$ has narrowed down w.r.t. the linear and 1-hidden layer models (in the main paper) in some cases (e.g., ImageNet), this does not seem to be caused by better predictions, i.e., an increase of $mNNO^K(Y, f(X))$, but rather by a decrease of $mNNO^K(X, f(X))$. This is expected since with more layers the information about the input is less preserved.

                                ResNet                VGG-128
                           X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   nn-3   0.562     0.243       0.574     0.229
                     nn-5   0.61      0.241       0.619     0.227
             T → I   nn-3   0.252     0.263       0.23      0.244
                     nn-3   0.261     0.264       0.243     0.242
IAPR TC-12   I → T   nn-5   0.275     0.208       0.259     0.174
                     nn-3   0.262     0.207       0.276     0.174
             T → I   nn-3   0.312     0.215       0.27      0.168
                     nn-5   0.351     0.218       0.315     0.17
Wiki         I → T   nn-3   0.219     0.15        0.239     0.14
                     nn-5   0.259     0.152       0.25      0.143
             T → I   nn-3   0.375     0.145       0.363     0.134
                     nn-5   0.431     0.144       0.426     0.135

Table 4: Test mean nearest neighbor overlap with 3- and 5-hidden layer neural networks, using Euclidean neighbors and MSE loss.

C.1.2 Results with the max margin loss

                               ResNet                VGG-128
                          X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   lin   0.739     0.253       0.779     0.235
                     nn    0.769     0.233       0.736     0.238
             T → I   lin   0.526     0.252       0.454     0.241
                     nn    0.419     0.23        0.378     0.22
IAPR TC-12   I → T   lin   0.423     0.205       0.441     0.164
                     nn    0.291     0.179       0.36      0.16
             T → I   lin   0.674     0.198       0.604     0.17
                     nn    0.592     0.215       0.529     0.176
Wiki         I → T   lin   0.366     0.156       0.333     0.152
                     nn    0.236     0.153       0.399     0.153
             T → I   lin   0.725     0.151       0.723     0.146
                     nn    0.701     0.151       0.662     0.146

Table 5: Test mean nearest neighbor overlap with cosine-based neighbors and max margin loss.

                               ResNet                VGG-128
                          X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   lin   0.762     0.229       0.776     0.209
                     nn    0.776     0.213       0.724     0.214
             T → I   lin   0.49      0.241       0.418     0.225
                     nn    0.384     0.221       0.343     0.212
IAPR TC-12   I → T   lin   0.409     0.195       0.447     0.155
                     nn    0.275     0.172       0.329     0.15
             T → I   lin   0.685     0.189       0.619     0.158
                     nn    0.558     0.201       0.49      0.162
Wiki         I → T   lin   0.38      0.154       0.339     0.142
                     nn    0.232     0.144       0.398     0.141
             T → I   lin   0.789     0.143       0.773     0.135
                     nn    0.724     0.14        0.723     0.135

Table 6: Test mean nearest neighbor overlap with Euclidean-based neighbors and max margin loss.

C.1.3 Results with the cosine loss

                               ResNet                VGG-128
                          X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   lin   0.697     0.268       0.812     0.244
                     nn    0.58      0.28        0.629     0.256
             T → I   lin   0.382     0.241       0.336     0.224
                     nn    0.346     0.277       0.331     0.237
IAPR TC-12   I → T   lin   0.37      0.213       0.594     0.162
                     nn    0.35      0.234       0.516     0.158
             T → I   lin   0.469     0.205       0.405     0.169
                     nn    0.386     0.226       0.338     0.185
Wiki         I → T   lin   0.26      0.157       0.621     0.143
                     nn    0.213     0.156       0.281     0.15
             T → I   lin   0.549     0.157       0.53      0.154
                     nn    0.642     0.151       0.547     0.149

Table 7: Test mean nearest neighbor overlap with cosine-based neighbors and cosine loss.

                               ResNet                VGG-128
                          X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   lin   0.698     0.236       0.812     0.218
                     nn    0.562     0.238       0.597     0.218
             T → I   lin   0.36      0.225       0.319     0.209
                     nn    0.28      0.221       0.288     0.205
IAPR TC-12   I → T   lin   0.351     0.197       0.596     0.152
                     nn    0.295     0.201       0.452     0.144
             T → I   lin   0.475     0.184       0.409     0.153
                     nn    0.359     0.193       0.29      0.158
Wiki         I → T   lin   0.259     0.149       0.619     0.133
                     nn    0.212     0.147       0.262     0.144
             T → I   lin   0.527     0.147       0.496     0.137
                     nn    0.578     0.143       0.51      0.135

Table 8: Test mean nearest neighbor overlap with Euclidean-based neighbors and cosine loss.

C.1.4 Results with Euclidean neighbors (nn and lin models of the paper)

                               ResNet                VGG-128
                          X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   lin   0.671     0.228       0.695     0.209
                     nn    0.61      0.234       0.665     0.219
             T → I   lin   0.372     0.233       0.326     0.218
                     nn    0.332     0.258       0.298     0.242
IAPR TC-12   I → T   lin   0.341     0.194       0.385     0.156
                     nn    0.3       0.203       0.318     0.17
             T → I   lin   0.504     0.188       0.431     0.156
                     nn    0.421     0.21        0.363     0.169
Wiki         I → T   lin   0.245     0.146       0.235     0.141
                     nn    0.261     0.151       0.269     0.143
             T → I   lin   0.564     0.149       0.555     0.135
                     nn    0.539     0.149       0.529     0.14

Table 9: Test mean nearest neighbor overlap with Euclidean-based neighbors and MSE loss. Boldface indicates best performance between each $mNNO^K(X, f(X))$ and $mNNO^K(Y, f(X))$ pair, which are abbreviated by X, f(X) and Y, f(X).

                               ResNet                VGG-128
                          X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   lin   0.57      0.16        0.644     0.159
                     nn    0.546     0.179       0.64      0.171
             T → I   lin   0.325     0.206       0.283     0.2
                     nn    0.283     0.236       0.259     0.223

Table 10: Test mNNO with Euclidean-based neighbors in ImageNet dataset, using word2vec word embeddings.

C.1.5 Results with word2vec in ImageNet (cosine-based neighbors)

                               ResNet                VGG-128
                          X, f(X)   Y, f(X)     X, f(X)   Y, f(X)
ImageNet     I → T   lin   0.61      0.232       0.674     0.221
                     nn    0.578     0.253       0.666     0.236
             T → I   lin   0.364     0.213       0.348     0.21
                     nn    0.356     0.245       0.331     0.234

Table 11: Test mNNO using cosine-based neighbors in ImageNet, using word2vec word embeddings.

C.2 Experiment 2

                     WS-353            MEN             SemSim
                   Cos     Eucl     Cos     Eucl     Cos     Eucl
f_nn(word2vec)    0.665   0.636    0.782   0.781    0.729   0.719
f_lin(word2vec)   0.67    0.527    0.785   0.696    0.737   0.616
word2vec          0.669   0.533    0.787   0.701    0.742   0.62
f_nn(VGG-128)     0.44    0.433    0.588   0.585    0.521   0.513
f_lin(VGG-128)    0.445   0.301    0.593   0.496    0.531   0.344
VGG-128           0.448   0.307    0.593   0.496    0.534   0.344

                     VisSim           SimLex           SimVerb
                   Cos     Eucl     Cos     Eucl     Cos     Eucl
f_nn(word2vec)    0.566   0.567    0.419   0.379    0.309   0.232
f_lin(word2vec)   0.572   0.507    0.429   0.275    0.328   0.174
word2vec          0.576   0.51     0.435   0.279    0.308   0.15
f_nn(VGG-128)     0.551   0.547    0.404   0.399    0.231   0.235
f_lin(VGG-128)    0.56    0.404    0.406   0.335    0.23    0.316
VGG-128           0.56    0.403    0.406   0.335    0.235   0.329

Table 12: Spearman correlations between human ratings and similarities (cosine or Euclidean) predicted from the embeddings, using word2vec and VGG-128 embeddings.